Exploratory Data Analysis on Olympic Games Data

8 min readJul 30, 2024

Welcome, data enthusiasts and sports fanatics! Paris 2024 just started last weekend and the next two weeks all we see on social media includes such things about sports. You might already know that Paris 2024 is the one and only Olympics that made its ceremony outdoors through the 6 km part of the Seine. While I was watching the ceremony — I do not make any criticism about it — , I was curious about what happened previously and the outliers of the whole Olympics history.

Today, we’re diving into the Olympics data. We’ll be exploring the history of the Olympics through the lens of SQL queries. From the ancient Greeks to modern athletes breaking records, the Olympics have come a long way since 1896. As we approach Paris 2024, where baguettes might be used as batons in the relay races — just kidding — , let’s take a look into some of the interesting moments that have made the Olympics a global spectacle.

To point out the wild things about the Olympics, I did a quick search then I was faced with really weird things to write here. For example, I didn’t know till now that a pigeon-shooting event was actually part of the 1900 Paris Games. Or how about the 1904 marathon in St. Louis, where the gold medalist hitched a ride in a car for part of the race? Or do you know that during the 1976 Montreal Olympics, an official relit the Olympic flame with his cigarette lighter after the flame for the Olympic torch went out due to a rainstorm? This caused quite a stir, as the flame is supposed to be a continuous link from Olympia, Greece. Don’t worry, we’ll be querying our way through these quirky facts and more, faster than Usain Bolt’s 100-meter.

So, grab your coffee, and let’s embark on this data-driven journey through Olympic history. If you find the SQL part boring, I have also included the results with table visualizations below.

Note: Since the number of participants in the sports specified as GameTeam is not given, the number of participants does not reflect the actual data.

Let’s get to know our tables. You can find them through this link, along with other tables that you can check out. I only worked with olympic_results and olympic_hosts.

First, I started with a SELECT * query to explore the tables and identify some columns to drop. These columns contain athlete information, which is too lengthy and detailed in text format.

SELECT * FROM olympic_results

ALTER TABLE olympic_results
DROP COLUMN athlete, athlete_url, athlete_full_name

While executing simple queries, such as grouping the main columns, I noticed something interesting. There are event and discipline titles in different columns. This caused some confusion during exploratory data analysis due to the same event names appearing in different disciplines. To resolve this, I decided to CONCATENATE them.

ALTER TABLE olympic_results
ADD discipline_event_title VARCHAR(255) NULL
GO

UPDATE olympic_results
SET discipline_event_title = CONCAT(discipline_title, ' ' , event_title)

There is one last thing to do before starting the data analysis. In the main dataset, we have a game_slug column with values like tokyo-2020 and rio-2016, but it does not represent the season itself. While we can extract the location or year from the game_slug, distinguishing the season is crucial for our analysis. To address this, I used a LEFT JOIN on game_slug with the olympic_hosts table to incorporate the game_season.

ALTER TABLE olympic_results
ADD game_season VARCHAR(255) NULL
GO

UPDATE olympic_results
SET game_season = (
  SELECT h.game_season
  FROM olympic_hosts h
  WHERE h.game_slug = olympic_results.slug_game
)

Okay, let’s see how many different games took place in each Olympics. It appears that the number of games is increasing each year. We could also investigate further with queries like “Which games were added each year?” but I will leave this insight here to move forward with the data analysis.

SELECT COUNT(DISTINCT discipline_event_title) as event, 
       slug_game 
FROM olympic_results
GROUP BY slug_game
ORDER BY event DESC

The visualization extension of SQL server does not work well so, I decided to move them from SQL to Python with the CSV format. Below code shows the scatterplot with event counts for each year and season. It is seen that there is a huge jump in Summer events on 2020, however the Winter games shows a linear increase.

trend = sns.lineplot(x='game_year', y='event', hue='game_season', data=df)
sns.scatterplot(x='game_year', y='event', hue='game_season', data=df, legend=False)
trend.grid(True)
plt.title('Event Count with Season and Years')
plt.show()

An increasing event count also means an increasing medal count. Now, let’s look at the medal counts for each country. The code below allows us to analyze different breakdowns, such as by game_season, and I have added a ranking to limit each season to the top 10 countries. As a result of thisd analysis, the USA consistently ranks among the top countries.

WITH ranked_countries AS (
  SELECT country_name,
         game_season,
         COUNT(medal_type) AS total_medals,
         ROW_NUMBER() OVER (PARTITION BY game_season ORDER BY COUNT(medal_type) DESC) AS rank
  FROM olympic_results
  GROUP BY country_name, game_season
)
SELECT country_name, game_season, total_medals
FROM ranked_countries
WHERE rank <= 3
ORDER BY game_season, total_medals DESC, country_name;

After identifying which countries won the most medals, we can delve deeper by analyzing the participant_type. This will show us whether team games or individual games generate more medals.

SELECT participant_type, game_season, 'Gold' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Gold'
GROUP BY participant_type, game_season

UNION ALL

SELECT participant_type, game_season, 'Silver' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Silver'
GROUP BY participant_type, game_season

UNION ALL

SELECT participant_type, game_season, 'Bronze' AS medal_type, COUNT(*) AS medal_count
FROM olympic_results
WHERE medal_type = 'Bronze'
GROUP BY participant_type, game_season
ORDER BY participant_type, game_season, medal_type

**Breakdown of Medal Type by Participant Type**

sns.catplot(x='medal_type', 
            y='medal_count', 
            hue='participant_type', 
            col='game_season', data=df, 
            kind='bar',height=4, aspect=.8)
plt.show()

Alternatively, we can execute another query to delve even deeper into the data by year. This allows you to query any specific year to see the types of medals won by the teams.

SELECT game_year, 
       country_name,
       SUM(CASE WHEN medal_type = 'Gold' THEN 1 ELSE 0 END) AS gold_medals,
       SUM(CASE WHEN medal_type = 'Silver' THEN 1 ELSE 0 END) AS silver_medals,
       SUM(CASE WHEN medal_type = 'Bronze' THEN 1 ELSE 0 END) AS bronze_medals,
       SUM(CASE WHEN medal_type IN ('Gold', 'Silver', 'Bronze') THEN 1 ELSE 0 END) AS total_medals
FROM olympic_results
WHERE game_season = 'Summer'
GROUP BY game_year, country_name
HAVING SUM(CASE WHEN medal_type IN ('Gold', 'Silver', 'Bronze') THEN 1 ELSE 0 END) > 10
ORDER BY game_year, country_name

So far, we’ve mostly discussed years and countries. Now, it’s time to take a look at the disciplines. First, we will execute the query below to see “how many different games have been held over the years and how many participants have competed?” Even though Paris 2024 is the 33rd Summer Olympics, three Olympics were canceled due to World War I and II (1916, 1940, and 1944). The results show that Athletics, Swimming, and Fencing have taken place in all Olympics up to today.

SELECT TOP 20 discipline_title, participant_type,
       COUNT(DISTINCT slug_game) AS number_of_games,
       COUNT(*) AS total_participations
FROM olympic_results
where participant_type = 'Athlete'
GROUP BY discipline_title, participant_type
ORDER BY number_of_games DESC, total_participations DESC

To see the medal type distribution with a breakdown of disciplines, the code below helps us. You might notice something in the table: none of the rows have the same values for all medal types. This occurs because competitors who achieve the same time or score can share medals. The most recent example of this happened in Tokyo 2020. You can check what happened from this video.

select discipline_title,
       SUM(CASE WHEN olympic_results.medal_type = 'Gold' THEN 1 ELSE 0 END) AS gold_medals,
       SUM(CASE WHEN olympic_results.medal_type = 'Silver' THEN 1 ELSE 0 END) AS silver_medals,
       SUM(CASE WHEN olympic_results.medal_type = 'Bronze' THEN 1 ELSE 0 END) AS bronze_medals
from olympic_results
where game_season = 'Summer'
group by discipline_title
order by gold_medals desc

Is there a correlation between winning more medals when you are the host country? To investigate this, we considered the hosts of the last four Olympics. The results are sufficient for us to conclude with a ‘yes.’

We can say that countries participate with more athletes in the years they are hosts and this brings more medals. The participant-medal correlation values largely confirm this. When I researched how hosts send more athletes compared to previous years, I came across some concrete answers. Host countries are granted privileges such as automatic qualification in team sports, increased participation quotas and wild card entries.

SELECT country_name, game_year, game_season, COUNT(medal_type) AS total_medals, COUNT(*) AS participant
FROM olympic_results
WHERE country_name IN ('Great Britain', 'Japan', 'Brazil', 'Republic of China')
GROUP BY country_name, game_year, game_season

df_sorted = df_updated[(df_updated['game_year'] > 1940) & (df_updated['game_season'] == 'Summer')]
fig, ax = plt.subplots(figsize=(18, 10))
colors = ['red', 'blue', 'green', 'orange', 'purple']  

for color, country in zip(colors, countries):
    country_data = df_sorted[df_sorted['country_name'] == country]
    ax.scatter(x=country_data['game_year'], y=country_data['participant'], s=country_data['total_medals']*20, alpha=0.5, color=color)
    sns.lineplot(x='game_year', y='participant', data=country_data, marker='o', linestyle='--', label=f'{country} Total Medals', color=color)

plt.show()

With this article, I wanted to provide a different perspective on the Olympics using both SQL and Python. I hope those who are not interested in data also enjoyed it. Happy querying!

Exploratory Data Analysis on Olympic Games Data

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Deniz Mogulkoc

No responses yet

More from Deniz Mogulkoc

Time Series Forecasting with Machine Learning

We see many different artificial intelligence models and improvements have also been made to lower the human effort in coding and…

Do Not Use Elbow Method (Seriously)

The fast-paced growth in data knowledge places more responsibility on data analysts, who need to get the most from data-driven insights…

The Neural Network Club: Deep Learning

While we are coming to the final thoughts on this post series, it is time to talk about deep learning. Let me remind you what we have…

How Timestamps Can Revolutionize Your Data-Driven Decisions

In this article, we are going to talk about one of the critical variables in our datasets which is timestamp. It’s indispensable for the…

Recommended from Medium

20 Cutting-Edge Statistical Techniques Every Data Scientist Should Master in 2025

In today’s fast-paced data world, traditional methods are evolving rapidly. In 2025, the fusion of classical statistics, AI, and modern…

Data Analytics Methods for Marketing

Cracking the Case: How Data Analytics Solves Marketing’s Biggest Mysteries

Lists

Coding & Development

Predictive Modeling w/ Python

Practical Guides to Machine Learning

ChatGPT prompts

Data Cleaning in Python

Hello Folks🙂,

Advanced SQL: Data Manipulation and Query Techniques

Chapter 4: SubQuery and CTE -Modularity in Database Queries

Data Analyst Project using Python: Price optimization

Price optimization is a strategic approach that uses data analytics and modelling to determine the optimal pricing of products or services…

Creating a Venn Diagram Style Sales KPI in Power BI

Creating insightful and visually appealing sales KPIs (Key Performance Indicators) is crucial for data-driven decision-making in any…