  • Copenhagen

Cazuchi/README.md

Hi, I'm Mike

This is my GitHub profile. It is meant to showcase skills I have developed that were not part of my master's degree from Copenhagen Business School. My master's degree focused on a combination of data collection methodologies, statistical analysis of data and the interpretation of results for use in business. Since graduating, I have worked continuously to develop my skillset, which now also includes, among others:

  • Setting up cloud pipelines from A to Z for data collection, analysis and visualization
  • Coding in Python, SQL, DAX
  • Setting up and interacting with SQL databases in Docker (PostgreSQL with and without dbt on top)
  • GCP: BigQuery, IAM, Secret Manager, SSH, Firewall restrictions, Compute Engine (Debian usually)
  • Azure: Fabric notebooks, Spark, lakehouse table management

These skills are not obvious from my educational credentials, however, and because of confidentiality requirements I cannot directly show projects that I have worked on professionally. The projects in this GitHub profile have therefore been created to highlight these skills and to show off projects that I find interesting.

Over time there will be many projects in this profile, so the following is a highlight of my main projects and the skills they showcase. Each project also has a dedicated README that further explains its contents and purpose.

Note

Skills used in this project:

  • Python
  • DAX / PowerBi
  • Google Compute Engine / Secret Manager / IAM / BigQuery

TourMIS is a tourism database with bed nights, arrivals and population figures for 135+ destinations in Europe, along with calculated estimates for destinations that do not upload their own data. This project uses Python to pull bed nights, arrivals and population statistics from the TourMIS API, runs them through extensive formatting functions and uploads them to two Google BigQuery tables. These BigQuery tables are then loaded into PowerBi, where I used DAX to create a series of measures that calculate performance statistics for the travel destinations and allow performance to be compared across them.
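As an illustration of the kind of formatting step described above, here is a minimal sketch of how gaps in a monthly series could be made explicit before upload. It is not taken from the repo: the record layout (`month`, `bed_nights`), the function names and the sample figures are all hypothetical, and the actual TourMIS API response format is not shown here.

```python
from datetime import date


def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def fill_monthly_gaps(records):
    """Return one row per month, with bed_nights=None where a month is missing.

    `records` is a hypothetical list of {"month": date, "bed_nights": int}
    dicts; it stands in for whatever the real formatting functions receive.
    """
    if not records:
        return []
    by_month = {r["month"]: r["bed_nights"] for r in records}
    months = sorted(by_month)
    return [
        {"month": m, "bed_nights": by_month.get(m), "reported": m in by_month}
        for m in month_range(months[0], months[-1])
    ]


# Made-up sample input: March 2024 is missing from the source data.
raw = [
    {"month": date(2024, 1, 1), "bed_nights": 1200},
    {"month": date(2024, 2, 1), "bed_nights": 1350},
    {"month": date(2024, 4, 1), "bed_nights": 1800},
]
filled = fill_monthly_gaps(raw)
```

Keeping a `reported` flag next to the value is one way to let a later step (for example a DAX measure) tell a genuine zero apart from a month that was never uploaded.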

Tourism data is notoriously fractured and inconsistent, which required a lot of data wrangling in both Python and DAX. I used Python code to correct for missing values and ensure continuity across the different data series within the dataset, and then used DAX to create performance metrics in PowerBi that likewise account for the peculiarities of tourism data. Included in this repo is:

Note

Skills used in this project:

  • SQL
  • Docker

This project builds the Kaggle F1 Ergast dataset into a local Docker SQL database using SQL and then presents two extensive SQL queries that output transformed, calculated tables highlighting several interesting aspects of the F1 Ergast dataset. The first query uses a series of CTEs to calculate a table highlighting interesting findings about individual drivers' performance over time. The second query uses a CTE chain, window functions and the Gaps & Islands approach to calculate a table showcasing interesting findings about the different teams' strategic choices when pairing drivers for a stint (a unique combination of team and drivers). Included in this repo is:
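For readers unfamiliar with the Gaps & Islands approach mentioned above, the following is a minimal, self-contained sketch. It is not one of the repo's queries: it uses Python's built-in `sqlite3` module instead of the Docker database, and the `seasons` table, its columns and its rows are made up for illustration. The idea is that `year - ROW_NUMBER()` stays constant within a run of consecutive years, so grouping by that value yields one row per stint.

```python
import sqlite3

# In-memory SQLite database (window functions need SQLite 3.25 or newer).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seasons (driver TEXT, team TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO seasons VALUES (?, ?, ?)",
    [
        # Made-up rows: Driver A leaves Team X for one year, then returns.
        ("Driver A", "Team X", 2010),
        ("Driver A", "Team X", 2011),
        ("Driver A", "Team Y", 2012),
        ("Driver A", "Team X", 2013),
        ("Driver A", "Team X", 2014),
        ("Driver A", "Team X", 2015),
    ],
)

query = """
WITH numbered AS (
    SELECT driver,
           team,
           year,
           -- Constant within each run of consecutive years ("island").
           year - ROW_NUMBER() OVER (
               PARTITION BY driver, team ORDER BY year
           ) AS island_id
    FROM seasons
)
SELECT driver,
       team,
       MIN(year) AS first_year,
       MAX(year) AS last_year,
       COUNT(*)  AS seasons
FROM numbered
GROUP BY driver, team, island_id
ORDER BY first_year
"""
stints = conn.execute(query).fetchall()
for row in stints:
    print(row)
```

With the sample rows, the two separate spells at Team X come out as two distinct stints rather than one, which is the point of the technique.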

While this project is primarily meant as a SQL exercise, I've also made a PowerBi dashboard that shows some of the output of my CTE chain queries. It is available here:
https://app.powerbi.com/view?r=eyJrIjoiZDFhOGMyMTMtYzBjMS00Mjc1LTgzN2UtMGJjNjEzMDA3N2ZlIiwidCI6IjcwZjRhY2NiLTM3N2UtNDg5ZS04YjhiLTI4NjllYjQwYmQ3MSJ9

The graph is a simple scatterplot showing the difference in drivers' total career points between two scoring approaches: the legacy scoring model that was in place when a given race took place, and all of a driver's points adjusted to match the newest 25-point scoring model used in modern F1 races.

Legacy career points are shown on the x-axis and modern career points on the y-axis. The dots are colored depending on how large the difference is between the modern and the legacy career points totals:

  • Green: Less than 50% difference
  • Yellow: Between 50% and 100% difference
  • Red: More than 100% difference
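The comparison behind the scatterplot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dashboard's actual logic: it uses only the 1991 to 2002 legacy table (10-6-4-3-2-1) as an example of a legacy model, ignores fastest-lap and sprint points, assumes the percentage difference is measured relative to the legacy total, and assigns exact 50% and 100% differences to yellow. The function names and the sample finishing positions are hypothetical.

```python
# Points per finishing position, best first.
MODERN_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]  # top ten score
LEGACY_POINTS_1991_2002 = [10, 6, 4, 3, 2, 1]        # top six score


def points_for(position, table):
    """Points for a 1-based finishing position; 0 outside the scoring places."""
    return table[position - 1] if 1 <= position <= len(table) else 0


def career_totals(positions, legacy_table):
    """Return (legacy_total, modern_total) for a list of finishing positions."""
    legacy = sum(points_for(p, legacy_table) for p in positions)
    modern = sum(points_for(p, MODERN_POINTS) for p in positions)
    return legacy, modern


def dot_color(legacy, modern):
    """Classify the difference, assumed here to be relative to the legacy total."""
    if legacy == 0:
        # Assumption: any modern points on top of zero legacy points counts as red.
        return "red" if modern > 0 else "green"
    difference = abs(modern - legacy) / legacy
    if difference < 0.5:
        return "green"
    if difference <= 1.0:
        return "yellow"
    return "red"


# Made-up results: a win, a second place, a seventh place and a third place.
legacy_total, modern_total = career_totals([1, 2, 7, 3], LEGACY_POINTS_1991_2002)
print(legacy_total, modern_total, dot_color(legacy_total, modern_total))
```

The seventh place shows why totals can diverge so much: it scores nothing under the 1991 to 2002 table but six points under the modern one.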

Note

Skills used in this project:

  • SQL
  • dbt
  • Docker
  • Python

This project uses the same dataset and the same queries, except that they have been modified to follow the structure of a dbt project. To show that it works and outputs correct data, I've included a notebook with inline visualization of the project's output.

Since it's just a proof of functionality, I've created a simple array of spider charts comparing six selected drivers' performance on chosen metrics to the average driver, which resulted in this plot:
[Image: array of spider charts]

Note

Skills used in this project:

  • Python
  • Playwright / Browser automation
  • PowerShell interaction

A simple script I made to automate a browser in order to:

  • Navigate to a website
  • Log in
  • Navigate the website's sub-menus
  • Download the desired dataset with hotel statistics and save it in the project folder
  • Load the data from the downloaded .csv file and format it into a compressed, easy-to-read visual format for use in presentations

Since the script requires login info, but also needs to be usable by more than one person, it simply asks for login information from the user in the PowerShell terminal window.
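The flow described above could look roughly like the sketch below. It is not the script from the repo: the URL, the CSS selectors, the link text and the function names are placeholders, since the real website is not named here. Credentials are read from the terminal at run time, so nothing is stored in the script.

```python
from getpass import getpass


def prompt_credentials(ask=input, ask_secret=getpass):
    """Ask for login details in the terminal instead of hard-coding them.

    The two callables are parameters only so the prompt can be tested
    without a real terminal; getpass hides the password while typing.
    """
    username = ask("Username: ").strip()
    password = ask_secret("Password: ")
    return username, password


def download_dataset(username, password, target_path):
    """Log in and save a CSV export. URL and selectors are placeholders."""
    # Imported here so the prompt helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/login")   # placeholder URL
        page.fill("#username", username)         # placeholder selectors
        page.fill("#password", password)
        page.click("button[type=submit]")
        # Wait for the download triggered by the click, then save it.
        with page.expect_download() as download_info:
            page.click("text=Export CSV")        # placeholder link text
        download_info.value.save_as(target_path)
        browser.close()


# Typical use from a terminal session:
#   user, secret = prompt_credentials()
#   download_dataset(user, secret, "hotel_statistics.csv")
```

Prompting at run time is what lets several people use the same script with their own accounts, at the cost of not being able to run it unattended.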

The output looks like this:

Important

The data shown in the image is MOCK data. It broadly mimics real trends, but is randomized and varies significantly from real data.

[Image: script output]
The table shows the occupancy rates for hotels in a select geographic area per day, per month, with highlights in yellow for days with lower occupancy and highlights in green for days with higher occupancy. It is used to figure out and showcase which times of a given year have the most room for added tourism activity.
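A minimal sketch of how such a month-by-day grid with highlights could be built is shown below. It is an illustration only: the thresholds, the function names and the three sample values are invented, and the real script's cut-offs and rendering are not shown here.

```python
from datetime import date

# Hypothetical cut-offs; the real script's thresholds are not documented here.
LOW_THRESHOLD = 0.60
HIGH_THRESHOLD = 0.85


def highlight(rate):
    """Return the highlight color for an occupancy rate, or None for no highlight."""
    if rate < LOW_THRESHOLD:
        return "yellow"
    if rate > HIGH_THRESHOLD:
        return "green"
    return None


def occupancy_grid(daily_rates):
    """Arrange {date: rate} as {month: {day_of_month: (rate, highlight)}}."""
    grid = {}
    for day, rate in sorted(daily_rates.items()):
        grid.setdefault(day.month, {})[day.day] = (rate, highlight(rate))
    return grid


# Invented sample values, not real occupancy data.
mock_rates = {
    date(2024, 1, 1): 0.42,
    date(2024, 1, 2): 0.71,
    date(2024, 7, 15): 0.93,
}
grid = occupancy_grid(mock_rates)
for month, days in grid.items():
    print(month, days)
```

Each month becomes one row and each day one cell, which matches the compact per-day, per-month layout described above.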

Note

Skills used in this project:

  • Crontabs
  • Debian
  • Docker
  • Fabric notebooks
  • Git
  • GCP (Bigquery, IAM, Compute Engine)
  • PowerShell
  • Python
  • SSH
  • SQL
  • ... and more.

This is my cheatsheet, which I continually add to as I troubleshoot issues that I encounter. It is purposefully written to match the way I think when problem-solving and is meant as a personal reference when working on projects, but it might be interesting or useful to some of you reading this.

More projects coming soon. I am actively developing this profile.

Popular repositories

  1. Obsidian-settings (Public)

    This is just a repo to save my Obsidian settings. You can check it out if you want, but there isn't much interesting here.

  2. Cazuchi (Public)

    Profile README

  3. F1-ergast-data-SQL-project (Public)

    Fully ready-to-use SQL project exploring the F1 Ergast data using advanced SQL functions. You can clone the repo, spin up the docker instance with the database and explore the data using my queries…

    SQL

  4. Cheatsheet (Public)

    Just useful commands

  5. Transforming-TourMIS-data-into-a-performance-dashboard (Public)

    A project I created to show off my Python, PowerBi and cloud pipeline skills that aren't directly discernable from my master's degree. The project collect tourism statistics from an API using Pytho…

    Python

  6. Automating-browser-for-data-collection-and-visualization (Public)

    A small script that I wrote to simulate a browser, navigating a website to download a .csv file and convert the data into a compressed, easy to read visual.

    Python