Here is my second blog for the DataGeek Challenge....The first one can be see Here
But before I start let me thank Eduardo for being the source of inspiring this blog post..
Now without any delays let’s start with preparing the data for the analysis. For the Olympic medal data I got the medal numbers from Wikipedia...
Now I have 409 rows of data to do analysis spanning across last 5 Olympics.
Now here is a problem as the Navtech engine does not find Great Britain
The only option I am left with here is to change the name to England in the source itself…. and start over again. Unfortunately England is not identified as well. United Kingdom is the one that is working .... I will have to message the Olympic data fro it.
Now let’s start with some basic plotting of medals looking at below chart every country appears to be performing almost the same in the last 5 Olympics. This needs to be tested further however.
Let's see how the data look like in a Tag Cloud.The top three is US , China , Russia
Ok let me clarify my line of analysis for this here. I am trying to predict the number of medals Brazil would win the 2016 and then I hope I get a reasonable accuracy in the next Olympics. To do this I am more focused with the medals of the country that hosted the Olympics because Brazil will be hosting 2016 games. And I will put a filter for the countries that hosted the games.
As you can see from here all the countries (Except Australia in 2000 Sydney Olympics) did much better in their home games than the away games. Now lets term this as Home Advantage Factor (HAF). And for the sake of simplicity lets calculate this factor using their average medal percentage.
Now my intension is to find out the HAF average for any country and term its medal %. To get that I will exclude their home medals and try to predict what should have been the medal percentage had the Olympics not held in the home country. Now there is one feature which is missing here I cannot create a filter to filter out the countries when the Olympics are held in the same country. Hope SAP adds that in next releases. So the only option left is to re prepare the filtered data at excel level.
Now here I removed their home games using only their away games. So you can see the China medal tally missing in Beijing Olympics and UK medal tally missing for London Olympics and so on. Now the next step is let’s try and predict how many medal they would won in their home Olympics.
However before going into this route ; we cannot just take the medal tally alone as a gold medal can not be considered as same as Bronze medal. So let’s make the medal ration as a weighted one which should give us a better prediction. For our analysis let’s consider
Gold medal is equivalent to 3 Bronze medals
Silver medal is equivalent to 2 Bronze medals
So our weighted ratio is considered as 3*gold + 2*Silver+Bronze / ( total gold * 3 + total Silver * 2+ Bronze)
Now the data is ready to be exported for further analysis. Let’s see how they stack up in the weighted ratio analysis. The results are still consistent showing that the host countries certainly performed (except Australia) better in their Home game than away games for the last 5 Olympics.
So we are now pretty sure there is a Home Advantage Factor (HAF) in play for countries hosting the Olympic. We would need to find the approx value to be able to predict how many medals Brazil likely to win in 2016 Rio games.
Now let’s device strategies to find out the value of HAF . Couple of method I can think of to find it
- Lets take the mean in their away games and then take the difference of the weighted medal tally for the Home game. This will be the easiest.
- Assume the Home game data is missing and then predict the Home game numbers using missing data method. And then take the difference with the actual weighted medal tally. Although for this method to be accurate the sample size should be larger and unfortunately we are stuck with less observation. Here is a nice paper on finding missing values in R if anyone is interested.
I will go with this after I gather more result set into the scope of analysis may be in a different post with inclusion of other likely factors.
For the time being let’s just move with the easiest method finding a reasonable HAF value. For this analysis Australia is an outlier balking the trend so we will just exclude that from our analysis.
Now let’s prepare the data for this; we will put a filter to filtered to show only host countries except Australia for only the away games.
Now after some more formulas and computation here is how the output looks like
Here is the results
United States : 1.41
China : 1.73
United Kingdom: 2.16
Greece : 5.33 ( Yes it was really staggering in Athens Olympic)
Average : 2.66 with Greece Olympics observation which is a outlier
Average : 1.76 Excluding Greece
Economic Factors :
Now let’s set the HAF apart and see if the medal tally growth corresponds to any economic parameter. When discussing of economic parameter two things comes to mind one is Absolute GDP per capita and the other is GDP growth %.
The first factor is a well established measure and analyzed multiple time by many analysts and per analysts “The more the per capita GDP the more the excess money available to peruse and invents in professional sporting as Hobby” and this corresponds to the absolute medal tally.
So here we are interested in the other factor is the % GDP growth related to increase in Medal tally as well. Now this % GDP growth should be relative rather than absolute. So we will have to get the data for this analysis as well.
For this analysis I got the data from World Bank website below :
After preparing the GDP data here is how it looks like. Now this spreadsheet data is ready to be merged with the Olympic data for correlation analysis.
Now let's merge the datasets
Now it’s ready for Analysis. For this I m going to define a variable as GDP Growth Factor(GGF) , this is the true indicator of GDP growth of a nation vis-à-vis the entire world. Its going to be defined as the ratio of average GDP growth of the country for the 4 preceding years with respect to the Average GDP growth for the entire world.
Also I will define another variable which is just the % of weighted medal growth of a country for the Olympics vs the performance in the previous Olympics.
Now after merging the Olympic data look like this
Now after plotting the weighted medals to the GDP growth factors, the results are disappointing. No significant correlation appears to exist between these two factors. However we would need to check for one game at a time to see if the correlation actually exists.
Now just let’s look at the Sydney games for the top 15 nations.
Now let’s see if the trend is same with the developing nation and with developed nation. However in a single game the medal growth is positively correlated to the GDP growth rate (weighted) in the last 4 years. Here are the charts for each game. Please make sure to discount the host country as there is the HAF involved while checking it. I will remove couple of outliers to make the chart clearer.
Here is Greece game numbers.
Here is Beijing games with very similar results.
For analysis sake let’s just plot top medal receiving countries with Total medal with respect to GDP growth factor and here it is. China appears to be the growth leader and the graph appears to nicely correlating.
And looking into some analysis I noticed a very striking thing ; the Developing (with a limited sample size) nations positively correlating to the GDP growth factors however the Developed nations are negatively correlated many times to the GDP numbers. This is strange and beyond my comprehension. Please comment if you think of any explanation with this behavior.
Now as this one getting really long and the system does not allow me to add any more screen capture I will have to break this into another part and conclude the analysis in second part.
Here is the link to the second and concluding part of the analysis.