This is my main repo for showcasing my SQL competency. The project is entirely self-contained and can be cloned and run as is to use both the docker SQL database and the queries that I have written. I have specifically chosen a subset of the Ergast F1 dataset that I found to be the most interesting in my initial inspection of the data, used SQL to define and populate the database tables and then used SQL to combine individual tables into two separate output tables that each highlights different aspects of the dataset, using different SQL approaches.
The file schema.sql defines the tables, loads and filters the data and populates the database to prepare it for queries.
The queries to create the output are written in the analysis.sql files, which includes the two queries along with a few setup commands for temporary tables and commentary explaining what my approach was for the dataset and what I found interesting about the output from each query. Each query also includes multiple SELECT FROM / ORDER BY statements at the bottom, which are meant to be used one at a time, which is why all but one are commented out by default. Each of those SELECT FROM / ORDER BY statements highlight an interesting finding from the rather large output tables, that would otherwise be harder to see without sorting it in specific ways.
Two queries with different purposes:
-
Query #1 looks at individual racers performance over time and compares metrics like avg. points per win, historical standard deviation for points etc., with the purpose of looking at interesting differences between the different drivers' performance.
-
Query #2 looks at unique combinations of teams and drivers and specifically highlight potential strategic choices made by team when choosing what drivers to hire. As in, did a team choose two high volatility drivers or a more balanced pairing of drivers for their new team? How did the drivers standard deviation in points scored develop over while while in that unique pairing of teams and drivers? Amongst other questions. This query also tackles the classic gaps and islands problem, in order to be able to correctly split unique combinations of teams and drivers into stint. As a unique combination of a team and two drivers can potentially be repeated later in time, with a gap in between, the gaps and islands approach allows for those two stints to be correctly identified.
I created a basic PowerBi visualization to show that the database imports cleanly into PowerBi and that the data output of my SQL queries is sensible. The visualization can be seen here:
The graphs is just a simple scatterplot showing the difference in drivers' total career points between using the legacy scoring models in place when a given race took place and adjusting all of a driver's points to match the newest 25-point scoring model used in modern F1 races.
Legacy career points are shown on the x-axis. Modern career points are shown on the y-axis and the dots a colored depending on how large the difference is between the modern and the legacy career points totals:
- Green: Less than 50% difference
- Yellow: Between 50% and 100% difference
- Red: More than 100% difference
- VS Code for displaying the code and running the queries
- The Docker / Container Tools extension for VS Code
- The SQLTools & SQLTools PostgreSQL extensions for VS Code
- Docker
Spin up the Docker database using the command docker compose up -d in PowerShell.
- F1 Ergast dataset from Kaggle: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020?resource=download
- SQL
- Docker