This project involves building a comprehensive data processing and analysis pipeline for fintech data. The tasks are divided into four milestones, covering data cleaning, feature engineering, real-time processing, and visualization using ETL pipelines and dashboards.
-
Data Ingestion
- Load raw fintech data from CSV files.
- Preview and understand data structure.
-
EDA
- Perform exploratory data analysis to understand the data.
- Ask at least 5 questions and visualize the answer to these questions
-
Data Cleaning
- Remove duplicates.
- Handle incorrect data types.
- Observe and handle missing values.
- Observe and handle outliers
-
Data Transformation and Feature Engineering
- Adding new features
- Encoding categorical columns
- Applying normalization techniques
-
Data Storage
- Save the cleaned dataset in CSV/Parquet format for further processing.
- Notebook:
M1 MAJOR GroupNo ID.ipynb - Cleaned Data:
fintech_data_{MAJOR}_{GROUP}_{ID}_clean.csv/parquet - Lookup Table:
lookup_table_{MAJOR}_{GROUP}_{ID}_clean.csv/parquet
-
Kafka Integration
- Set up Kafka producers and consumers for streaming data.
-
Streaming Data Processing
- Process incoming data in real-time using Python.
-
Data Storage
- Store processed data in PostgreSQL for analysis.
- Kafka Producer/Consumer Scripts
- PostgreSQL Database with Streamed Data
-
Loading the Dataset (5%)
- Load the provided Parquet dataset.
- Preview the first 20 rows.
- Adjust partitions to match the number of logical cores.
-
Data Cleaning (30%)
- Column Renaming (10%): Replace spaces with underscores and convert to lowercase.
- Detect Missing Values (35%): Identify and report missing values.
- Handle Missing Values (35%): Replace missing numerical values with 0 and categorical values with mode.
- Verify Cleaning (20%): Confirm no missing values remain.
-
Feature Engineering (15%)
- Add features for previous loan amounts and dates by grade and state.
-
Categorical Encoding (10%)
- Encode specified categorical columns using label and one-hot encoding.
-
Lookup Table (5%)
- Create and save a lookup table for encodings.
-
Saving Outputs (5%)
- Save the cleaned dataset and lookup table.
-
Bonus (5%)
- Load cleaned data into PostgreSQL with PGAdmin screenshots.
- Notebook:
m3_spark_52_XXXX.ipynb - Cleaned Data:
fintech_spark_52_XXXX_clean.parquet - Lookup Table:
lookup_spark_52_XXXX.parquet - (Bonus) PGAdmin Screenshots
-
ETL Pipeline with Airflow
- extract_clean: Clean raw data.
- transform: Apply data transformations.
- load_to_db: Load data into PostgreSQL.
-
Interactive Dashboard
- Visualize key insights using Plotly Dash/Streamlit.
-
Airflow DAG
- Define ETL workflow with
extract_clean→transform→load_to_db→run_dashboard.
- Define ETL workflow with
-
Submission Video
- Record a 5–10 minute demo of Airflow execution and dashboard.
-
Bonus (5%)
- Exceptional dashboard UI/UX.
dags/fintech_dag.pyfunctions.pyfintech_dashboard.py
- Video:
fintech_dashboard_showcase_{your_id}.mp4
- Python
- Pandas
- Matplotlib
- scikit-learn
- SQLAlchemy
- Apache Spark (PySpark)
- Apache Kafka
- Apache Airflow
- PostgreSQL
- Plotly Dash
- Docker