Currently Being Moderated

Update: after several comments and questions, I have created a revised version.

Please find it here and enjoy:

Data Geek Challenge: Analyzing flight delays with SAP Lumira (Revised)


 

According to the United States Bureau of Transportation Statistics, 643 million passengers travel over 9 million flights every year in the US only. A significant portion of these passengers are impacted by delays and cancellations. For this SAP Data Geek challenge, we will be analyzing flight data from 2003 to 2013 (more than 172,000 records) with SAP Lumira.

 

Data Sources:

 

Importing data source

 

First, let’s import the data file into SAP Lumira / Visual Intelligence.

screen_01.png

Now, we need to handle the time dimension.

 

screen_02.png

screen_03.png

 

Then, we will convert some columns from Attributes to Measures and rename them.

 

screen_04b.png

 

Once this is done for all measures, we need to group them for easier analysis. This is done via formula.

 

screen_05.png

 

Now, we can start visualizing these data. First, we need to know if the number of flights, delays and cancelations has evolved over time.

 

screen_06.png

 

The percentage of delayed or canceled flights is hard to analyze this way. Let’s create a new formula to calculate these ratios.

 

screen_07.png

screen_08.png

 

The result is more clear, but the trend is hard to analyze. Let’s add a running average.

 

screen_09b.png

screen_10.png

 

We can now see that on average, the flight delays have increased over the last 10 years. With this information, let’s dig deeper and analyze which type of delay is the highest source of inconvenience for the passengers.

 

screen_11.png

 

From the above chart, we can easily see that the Air Carriers are the main source of flights delays in the US from 2003 to 2013.

 

If we add the Carrier Name dimension to the chart and filter the data by only the Carrier Delay, we get the following chart.

 

screen_12.png

 

Notice that American Airlines and American Eagles are both in the top 5 of the worst carriers. Since both are sharing the same hubs, their difficulties may be related to geography. However, the current dataset is missing exact location like latitude and longitude. To do this, let’s add and merge with a list of all airport codes and corresponding GPS location and create a geographical hierarchy.

 

screen_13.png

screen_14.png

screen_15b.png

screen_16.png

 

Let’s now use this new information and visualize flight delays by geography, filtered by American Airlines and American Eagles.

 

screen_17.png

 

However, when removing the filter and showing all airlines, it seems like the delays are more related to large airports like Chicago, Dallas or Atlanta.

 

screen_18.png

 

This analysis doesn’t provide enough information, but brings the idea of analyzing the geography repartition of weather related delays.

 

screen_19.png

 

Here again, the data at hand doesn’t provide us enough insight, but when weather comes in mind, seasonality is expected. Let’s redraw this chart using timelines.

 

screen_20.png

 

 

Weather delays are indeed seasonal. However, surprisingly, weather delays cannot only be limited to snow and storms since May, June and July are also very impacted, probably due to thunderstorms and / or hurricane season.

 

As a conclusion, I want to try and estimate the chances for my American Airlines flight to SAP TechEd in Las Vegas on October 2013 of being delayed or canceled. Since the dataset end on June 2013, I will need to use Predictive Analytics to estimate this.

screen_21b.png

screen_22.png

screen_23b.png

SAP Lumira's forecast feature calculates that on a flight with American Airlines to Las Vegas in October, the probability of getting delayed is 0.27%. This is of course a pretty rough estimate. With more time and understanding of the dataset, we could improve this calculation by merging the dataset with other data like weather records and forecasts, event schedules that could impact travel (think Christmas, Thanksgiving, large conferences, etc.). Also, notice that the data at hand indicates the arrival airport, but is lacking the information about the departure airport, which probably has an impact on delays and cancellations.

 

As a conclusion, this short first hands-on test of SAP Lumira shows a very effective solution for the analyzing, visualization and forecast of large datasets. Go ahead and try it for yourself!

Comments

Actions

Filter Blog

By author:
By date:
By tag: