Exploratory Data Analysis on Olympic Games Data
Welcome, data enthusiasts and sports fanatics! Paris 2024 just started last weekend and the next two weeks all we see on social media includes such things about sports. You might already know that Paris 2024 is the one and only Olympics that made its ceremony outdoors through the 6 km part of the Seine. While I was watching the ceremony — I do not make any criticism about it — , I was curious about what happened previously and the outliers of the whole Olympics history.

Today, we’re diving into the Olympics data. We’ll be exploring the history of the Olympics through the lens of SQL queries. From the ancient Greeks to modern athletes breaking records, the Olympics have come a long way since 1896. As we approach Paris 2024, where baguettes might be used as batons in the relay races — just kidding — , let’s take a look into some of the interesting moments that have made the Olympics a global spectacle.
To point out the wild things about the Olympics, I did a quick search then I was faced with really weird things to write here. For example, I didn’t know till now that a pigeon-shooting event was actually part of the 1900 Paris Games. Or how about the 1904 marathon in St. Louis, where the gold medalist hitched a ride in a car for part of the race? Or do you know that during the 1976 Montreal Olympics, an official relit the Olympic flame with his cigarette lighter after the flame for the Olympic torch went out due to a rainstorm? This caused quite a stir, as the flame is supposed to be a continuous link from Olympia, Greece. Don’t worry, we’ll be querying our way through these quirky facts and more, faster than Usain Bolt’s 100-meter.
So, grab your coffee, and let’s embark on this data-driven journey through Olympic history. If you find the SQL part boring, I have also included the results with table visualizations below.
Note: Since the number of participants in the sports specified as GameTeam is not given, the number of participants does not reflect the actual data.
Let’s get to know our tables. You can find them through this link, along with other tables that you can check out. I only worked with olympic_results
and olympic_hosts
.
First, I started with a SELECT *
query to explore the tables and identify some columns to drop. These columns contain athlete information, which is too lengthy and detailed in text format.
SELECT * FROM olympic_results
ALTER TABLE olympic_results
DROP COLUMN athlete, athlete_url, athlete_full_name
While executing simple queries, such as grouping the main columns, I noticed something interesting. There are event and discipline titles in different columns. This caused some confusion during exploratory data analysis due to the same event names appearing in different disciplines. To resolve this, I decided to CONCATENATE them.
ALTER TABLE olympic_results
ADD discipline_event_title VARCHAR(255) NULL
GO
UPDATE olympic_results
SET discipline_event_title = CONCAT(discipline_title, ' ' , event_title)
There is one last thing to do before starting the data analysis. In the main dataset, we have a game_slug
column with values like tokyo-2020
and rio-2016
, but it does not represent the season itself. While we can extract the location or year from the game_slug
, distinguishing the season is crucial for our analysis. To address this, I used a LEFT JOIN on game_slug
with the olympic_hosts
table to incorporate the game_season
.
ALTER TABLE olympic_results
ADD game_season VARCHAR(255) NULL
GO
UPDATE olympic_results
SET game_season = (
SELECT h.game_season
FROM olympic_hosts h
WHERE h.game_slug = olympic_results.slug_game
)
Okay, let’s see how many different games took place in each Olympics. It appears that the number of games is increasing each year. We could also investigate further with queries like “Which games were added each year?” but I will leave this insight here to move forward with the data analysis.
SELECT COUNT(DISTINCT discipline_event_title) as event,
slug_game
FROM olympic_results
GROUP BY slug_game
ORDER BY event DESC

The visualization extension of SQL server does not work well so, I decided to move them from SQL to Python with the CSV format. Below code shows the scatterplot with event counts for each year and season. It is seen that there is a huge jump in Summer events on 2020, however the Winter games shows a linear increase.
trend = sns.lineplot(x='game_year', y='event', hue='game_season', data=df)
sns.scatterplot(x='game_year', y='event', hue='game_season', data=df, legend=False)
trend.grid(True)
plt.title('Event Count with Season and Years')
plt.show()

An increasing event count also means an increasing medal count. Now, let’s look at the medal counts for each country. The code below allows us to analyze different breakdowns, such as by game_season
, and I have added a ranking to limit each season to the top 10 countries. As a result of thisd analysis, the USA consistently ranks among the top countries.
WITH ranked_countries AS (
SELECT country_name,
game_season,
COUNT(medal_type) AS total_medals,
ROW_NUMBER() OVER (PARTITION BY game_season ORDER BY COUNT(medal_type) DESC) AS rank
FROM olympic_results
GROUP BY country_name, game_season
)
SELECT country_name, game_season, total_medals
FROM ranked_countries
WHERE rank <= 3
ORDER BY game_season, total_medals DESC, country_name;
After identifying which countries won the most medals, we can delve deeper by analyzing the participant_type
. This will show us whether team games or individual games generate more medals.
SELECT participant_type, game_season, 'Gold' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Gold'
GROUP BY participant_type, game_season
UNION ALL
SELECT participant_type, game_season, 'Silver' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Silver'
GROUP BY participant_type, game_season
UNION ALL
SELECT participant_type, game_season, 'Bronze' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Bronze'
GROUP BY participant_type, game_season
ORDER BY participant_type, game_season, medal_type

sns.catplot(x='medal_type',
y='medal_count',
hue='participant_type',
col='game_season', data=df,
kind='bar',height=4, aspect=.8)
plt.show()

Alternatively, we can execute another query to delve even deeper into the data by year. This allows you to query any specific year to see the types of medals won by the teams.
SELECT game_year,
country_name,
SUM(CASE WHEN medal_type = 'Gold' THEN 1 ELSE 0 END) AS gold_medals,
SUM(CASE WHEN medal_type = 'Silver' THEN 1 ELSE 0 END) AS silver_medals,
SUM(CASE WHEN medal_type = 'Bronze' THEN 1 ELSE 0 END) AS bronze_medals,
SUM(CASE WHEN medal_type IN ('Gold', 'Silver', 'Bronze') THEN 1 ELSE 0 END) AS total_medals
FROM olympic_results
WHERE game_season = 'Summer'
GROUP BY game_year, country_name
HAVING SUM(CASE WHEN medal_type IN ('Gold', 'Silver', 'Bronze') THEN 1 ELSE 0 END) > 10
ORDER BY game_year, country_name

So far, we’ve mostly discussed years and countries. Now, it’s time to take a look at the disciplines. First, we will execute the query below to see “how many different games have been held over the years and how many participants have competed?” Even though Paris 2024 is the 33rd Summer Olympics, three Olympics were canceled due to World War I and II (1916, 1940, and 1944). The results show that Athletics, Swimming, and Fencing have taken place in all Olympics up to today.
SELECT TOP 20 discipline_title, participant_type,
COUNT(DISTINCT slug_game) AS number_of_games,
COUNT(*) AS total_participations
FROM olympic_results
where participant_type = 'Athlete'
GROUP BY discipline_title, participant_type
ORDER BY number_of_games DESC, total_participations DESC


To see the medal type distribution with a breakdown of disciplines, the code below helps us. You might notice something in the table: none of the rows have the same values for all medal types. This occurs because competitors who achieve the same time or score can share medals. The most recent example of this happened in Tokyo 2020. You can check what happened from this video.
select discipline_title,
SUM(CASE WHEN olympic_results.medal_type = 'Gold' THEN 1 ELSE 0 END) AS gold_medals,
SUM(CASE WHEN olympic_results.medal_type = 'Silver' THEN 1 ELSE 0 END) AS silver_medals,
SUM(CASE WHEN olympic_results.medal_type = 'Bronze' THEN 1 ELSE 0 END) AS bronze_medals
from olympic_results
where game_season = 'Summer'
group by discipline_title
order by gold_medals desc
Is there a correlation between winning more medals when you are the host country? To investigate this, we considered the hosts of the last four Olympics. The results are sufficient for us to conclude with a ‘yes.’
We can say that countries participate with more athletes in the years they are hosts and this brings more medals. The participant-medal correlation values largely confirm this. When I researched how hosts send more athletes compared to previous years, I came across some concrete answers. Host countries are granted privileges such as automatic qualification in team sports, increased participation quotas and wild card entries.
SELECT country_name, game_year, game_season, COUNT(medal_type) AS total_medals, COUNT(*) AS participant
FROM olympic_results
WHERE country_name IN ('Great Britain', 'Japan', 'Brazil', 'Republic of China')
GROUP BY country_name, game_year, game_season

df_sorted = df_updated[(df_updated['game_year'] > 1940) & (df_updated['game_season'] == 'Summer')]
fig, ax = plt.subplots(figsize=(18, 10))
colors = ['red', 'blue', 'green', 'orange', 'purple']
for color, country in zip(colors, countries):
country_data = df_sorted[df_sorted['country_name'] == country]
ax.scatter(x=country_data['game_year'], y=country_data['participant'], s=country_data['total_medals']*20, alpha=0.5, color=color)
sns.lineplot(x='game_year', y='participant', data=country_data, marker='o', linestyle='--', label=f'{country} Total Medals', color=color)
plt.show()

With this article, I wanted to provide a different perspective on the Olympics using both SQL and Python. I hope those who are not interested in data also enjoyed it. Happy querying!