Fintech Data Engineering Project

Overview

This project involves building a comprehensive data processing and analysis pipeline for fintech data. The tasks are divided into four milestones, covering data cleaning, feature engineering, real-time processing, and visualization using ETL pipelines and dashboards.

Milestone 1: Data Ingestion and Initial Cleaning

Objectives

Data Ingestion
- Load raw fintech data from CSV files.
- Preview and understand data structure.
EDA
- Perform exploratory data analysis to understand the data.
- Ask at least 5 questions and visualize the answer to these questions
Data Cleaning
- Remove duplicates.
- Handle incorrect data types.
- Observe and handle missing values.
- Observe and handle outliers
Data Transformation and Feature Engineering
- Adding new features
- Encoding categorical columns
- Applying normalization techniques
Data Storage
- Save the cleaned dataset in CSV/Parquet format for further processing.

Deliverables

Notebook: M1 MAJOR GroupNo ID.ipynb
Cleaned Data: fintech_data_{MAJOR}_{GROUP}_{ID}_clean.csv/parquet
Lookup Table: lookup_table_{MAJOR}_{GROUP}_{ID}_clean.csv/parquet

Milestone 2: Real-time Data Streaming with Kafka

Objectives

Kafka Integration
- Set up Kafka producers and consumers for streaming data.
Streaming Data Processing
- Process incoming data in real-time using Python.
Data Storage
- Store processed data in PostgreSQL for analysis.

Deliverables

Kafka Producer/Consumer Scripts
PostgreSQL Database with Streamed Data

Milestone 3: Data Cleaning and Feature Engineering

Objectives

Loading the Dataset (5%)
- Load the provided Parquet dataset.
- Preview the first 20 rows.
- Adjust partitions to match the number of logical cores.
Data Cleaning (30%)
- Column Renaming (10%): Replace spaces with underscores and convert to lowercase.
- Detect Missing Values (35%): Identify and report missing values.
- Handle Missing Values (35%): Replace missing numerical values with 0 and categorical values with mode.
- Verify Cleaning (20%): Confirm no missing values remain.
Feature Engineering (15%)
- Add features for previous loan amounts and dates by grade and state.
Categorical Encoding (10%)
- Encode specified categorical columns using label and one-hot encoding.
Lookup Table (5%)
- Create and save a lookup table for encodings.
Saving Outputs (5%)
- Save the cleaned dataset and lookup table.
Bonus (5%)
- Load cleaned data into PostgreSQL with PGAdmin screenshots.

Deliverables

Notebook: m3_spark_52_XXXX.ipynb
Cleaned Data: fintech_spark_52_XXXX_clean.parquet
Lookup Table: lookup_spark_52_XXXX.parquet
(Bonus) PGAdmin Screenshots

Milestone 4: ETL Pipeline and Dashboard with Airflow

Objectives

ETL Pipeline with Airflow
- extract_clean: Clean raw data.
- transform: Apply data transformations.
- load_to_db: Load data into PostgreSQL.
Interactive Dashboard
- Visualize key insights using Plotly Dash/Streamlit.
Airflow DAG
- Define ETL workflow with extract_clean → transform → load_to_db → run_dashboard.
Submission Video
- Record a 5–10 minute demo of Airflow execution and dashboard.
Bonus (5%)
- Exceptional dashboard UI/UX.

Deliverables

dags/
- fintech_dag.py
- functions.py
- fintech_dashboard.py
Video: fintech_dashboard_showcase_{your_id}.mp4

Tools & Technologies

Python
Pandas
Matplotlib
scikit-learn
SQLAlchemy
Apache Spark (PySpark)
Apache Kafka
Apache Airflow
PostgreSQL
Plotly Dash
Docker

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Milestone 1		Milestone 1
Milestone 2		Milestone 2
Milestone 3		Milestone 3
Milestone 4		Milestone 4
.gitignore		.gitignore
Dataset Description.pdf		Dataset Description.pdf
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fintech Data Engineering Project

Overview

Milestone 1: Data Ingestion and Initial Cleaning

Objectives

Deliverables

Milestone 2: Real-time Data Streaming with Kafka

Objectives

Deliverables

Milestone 3: Data Cleaning and Feature Engineering

Objectives

Deliverables

Milestone 4: ETL Pipeline and Dashboard with Airflow

Objectives

Deliverables

Tools & Technologies

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fintech Data Engineering Project

Overview

Milestone 1: Data Ingestion and Initial Cleaning

Objectives

Deliverables

Milestone 2: Real-time Data Streaming with Kafka

Objectives

Deliverables

Milestone 3: Data Cleaning and Feature Engineering

Objectives

Deliverables

Milestone 4: ETL Pipeline and Dashboard with Airflow

Objectives

Deliverables

Tools & Technologies

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages