163 changes: 150 additions & 13 deletions README.md
# 🚗 Insurance Risk Modeling & Dynamic Pricing System

This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.

> ✅ Completed as part of the Week 3 Challenge at **10Academy**.

---

## 🧭 Project Goals

- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - 💰 **Claim Severity** — how much we might pay per claim.
  - 📈 **Claim Probability** — how likely a customer is to claim.
- Construct a **dynamic pricing formula** that incorporates business margins.

---

## 🔧 Technologies & Tools

| Area | Tools Used |
|--------------------|----------------------------------------|
| Programming | Python, Jupyter |
| Data Handling | Pandas, NumPy, DVC |
| Visualization | Matplotlib, Seaborn, Plotly |
| Modeling | Scikit-learn, XGBoost, SHAP, LIME |
| Version Control | Git, GitHub |
| CI/CD | GitHub Actions |
| Environment | `venv` + `requirements.txt` |

---

## 📂 Repository Structure
```text
.
├── data/               # Raw and processed data (tracked via DVC)
├── models/             # Saved models
├── notebooks/          # Jupyter notebooks for EDA, testing, modeling
├── src/                # Core source code
│   ├── preprocessing/  # Cleaning, transformation, encoding
│   ├── task_3/         # Hypothesis testing modules
│   └── task_4/         # Modeling pipeline and interpretation
├── tests/              # Unit tests
├── .dvc/               # DVC metadata
├── .github/workflows/  # GitHub Actions CI pipeline
├── dvc.yaml            # DVC pipeline definition
├── requirements.txt    # Python dependencies
└── README.md           # Project overview (this file)
```

---

## 📊 Task Breakdown

### 🔍 Task 1: EDA & Git Setup

- Configured Git and GitHub, created `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk
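The loss-ratio analysis in Task 1 reduces to a grouped aggregation. A minimal sketch with pandas, using hypothetical column names (`Province`, `TotalPremium`, `TotalClaims`) and made-up numbers, not values from the project data:

```python
import pandas as pd

# Hypothetical schema and figures -- adjust to the actual dataset.
df = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [1000.0, 1500.0, 800.0, 1200.0],
    "TotalClaims": [400.0, 900.0, 200.0, 300.0],
})

# Loss ratio = total claims paid / total premium collected, per segment.
loss_ratio = (
    df.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
      .assign(loss_ratio=lambda g: g["TotalClaims"] / g["TotalPremium"])
)
print(loss_ratio["loss_ratio"])
```

Segments whose loss ratio stands out from the portfolio average are the candidates for the surcharges and discounts discussed later.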

### 💾 Task 2: Data Version Control (DVC)

- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up a **local remote storage** and pushed data
- Ensured reproducibility and auditability of datasets

### 📊 Task 3: Hypothesis Testing

- Formulated and tested statistical hypotheses:
  - 📍 Risk varies across **provinces** and **zip codes**
  - 👥 Gender differences in **claim frequency** and **severity**
  - 💸 Profitability margins vary by region
- Applied t-tests, z-tests, and chi-squared tests where applicable
- Business interpretations provided for each result
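As an illustration of the testing approach, a chi-squared test of independence checks whether claim frequency is associated with gender. The counts below are invented for the example, not taken from the project data:

```python
import numpy as np
from scipy import stats

# Illustrative contingency table (made-up counts):
# rows = gender, columns = [no claim, claim]
observed = np.array([
    [480, 20],   # female
    [450, 50],   # male
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")

# H0: claim frequency is independent of gender.
if p_value < 0.05:
    print("Reject H0: claim frequency differs by gender")
```

The same pattern applies to the province and zip-code hypotheses, with rows swapped for the segment being tested.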

### 🧠 Task 4: Predictive Modeling

- Built severity regression models: Linear, Random Forest, XGBoost
- Evaluated using RMSE, R²
- Used **SHAP** and **LIME** for feature importance
- Modeled claim probability (classification) for pricing
- Combined both models into a final risk-based pricing formula with business margins
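A common risk-based formulation multiplies claim probability by expected severity and adds loadings for expenses and profit. The sketch below uses placeholder loading values, not figures from the project:

```python
def risk_based_premium(p_claim: float,
                       expected_severity: float,
                       expense_loading: float = 0.10,
                       profit_margin: float = 0.05) -> float:
    """Pure risk premium plus business loadings.

    p_claim           -- predicted claim probability (classification model)
    expected_severity -- predicted claim amount given a claim (regression model)
    The loading defaults are illustrative assumptions.
    """
    pure_premium = p_claim * expected_severity
    return pure_premium * (1 + expense_loading + profit_margin)

# e.g. 8% claim probability and an expected severity of 25,000:
premium = risk_based_premium(0.08, 25_000)
print(round(premium, 2))  # 2300.0
```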

---

## 📈 Key Insights & Recommendations

| Insight | Impact on Pricing Strategy |
|-------------------------------------------|---------------------------------------------------|
| ≤4 Cylinder vehicles → ↑ Severity Risk | Apply loading to small-engine vehicles |
| Non-VAT Registered → ↑ Risk | Raise base rate for unregistered customers |
| Converted/Modified Vehicles = ↑ Risk | Apply higher risk surcharge |
| Alarm/Immobilizer → ↓ Risk | Provide discount for security features |
| New Vehicles → ↓ Risk | Discount for newer vehicles |
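The surcharges and discounts in the table above translate naturally into multiplicative rating factors. The multipliers below are placeholders for illustration, not fitted values:

```python
# Illustrative rating factors derived from the insight table;
# the numeric values are assumptions, not project results.
RISK_MULTIPLIERS = {
    "small_engine": 1.10,          # <=4 cylinders -> loading
    "not_vat_registered": 1.08,    # higher base rate
    "converted_vehicle": 1.15,     # modification surcharge
    "has_alarm_immobiliser": 0.95, # security discount
    "new_vehicle": 0.93,           # newer-vehicle discount
}

def adjusted_premium(base_premium: float, flags: set[str]) -> float:
    """Multiply the base premium by every applicable risk factor."""
    premium = base_premium
    for flag in flags:
        premium *= RISK_MULTIPLIERS.get(flag, 1.0)
    return premium

print(round(adjusted_premium(2300.0, {"small_engine", "has_alarm_immobiliser"}), 2))
```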

---

## 📦 Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
   cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
   ```

2. **Create a virtual environment and install dependencies:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Run the notebooks:**

   ```bash
   jupyter notebook
   ```
## ▶️ Run Tests
```bash
pytest
```
## 🧪 Data Versioning with DVC

```bash
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
```

To reproduce the data pipeline:

```bash
dvc pull
```
## ✅ CI/CD

GitHub Actions is configured for:

- Code linting
- Unit tests
- Model validation (optional step)

The workflow is defined in `.github/workflows/deploy.yml`.

## 📌 Results Summary

- 🧮 **Best severity model:** XGBoost
  - RMSE improvement: +Δ% vs. baseline
  - Top features: Engine Size, Vehicle Age, Province, Conversion Status
- 🧠 **Classification accuracy:** ~X%
  - Enables dynamic, fair, and risk-adjusted premium pricing

## 👥 Contributors

**👤 Ayana Samuel**
- Role: Full Data Science Workflow
- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
- Repository: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git

## 📜 License

This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.
79 changes: 75 additions & 4 deletions notebooks/README.md
# 📒 Notebooks Overview — Insurance Pricing Project

This folder contains Jupyter Notebooks used to explore, analyze, and model insurance claim data as part of the **10Academy Week 3 Challenge**.

Each notebook is modular and corresponds to a specific task in the data science workflow — from EDA and hypothesis testing to modeling and interpretation.

---

## 📁 Folder Purpose

The `notebooks/` directory serves as the primary space for:

- Experimenting with data pipelines
- Visualizing insights
- Testing hypotheses
- Training and evaluating models
- Documenting key results

---

## 🧭 Execution Order

| Notebook Path | Description |
|-------------------------------------------------------------|--------------------------------------------------------------------------------------|
| `task_1/01_Data_understanding.ipynb` | 🔍 **Initial Data Exploration** — Overview of dataset structure, basic distributions, and types of variables. |
| `task_1/02.eda_univariate.ipynb` | 📊 **Univariate Analysis** — Examines single-variable distributions and statistics, including missing data handling. |
| `task_1/03_eda_bivariate.ipynb` | 🔗 **Bivariate Analysis** — Explores relationships between key variables (e.g., claims vs. gender, province). |
| `task_1/04_visualization.ipynb` | 📈 **Visual Summary** — Aggregated plots and advanced visuals to communicate key trends and risk factors. |
| `task_3/05_hypthesis_testing.ipynb` | 📐 **Statistical Hypothesis Testing** — Validates assumptions across provinces, genders, and customer segments. |
| `task_4/06_model_training_and_interpretability.ipynb` | 🧠 **Modeling & Interpretability** — Trains severity and claim probability models; interprets them using SHAP/LIME. |


> **Note:** Run these notebooks in order for best results. Dependencies between notebooks are minimal but intentional (e.g., modeling uses cleaned data from Notebook 5).

---

## 🔍 Highlights by Notebook

### 📁 `task_1/01_Data_understanding.ipynb`
- Overview of dataset structure and key variables
- Initial univariate statistics and class distributions
- Identified outliers and null-value patterns

### 📁 `task_1/02.eda_univariate.ipynb`
- Dealt with missing values and variable types
- Performed univariate analysis: claims, premiums, risk flags
- Created new features: vehicle age, risk class, etc.

### 📁 `task_1/03_eda_bivariate.ipynb`
- Bivariate relationships: claim vs gender, province, zip, cylinders
- Used cross-tabulations and grouped summaries
- Inferred possible risk drivers from visual trends

### 📁 `task_1/04_visualization.ipynb`
- Visual storytelling using bar plots, heatmaps, and boxplots
- Focused on loss ratio patterns by region and vehicle attributes
- Illustrated skewness, imbalance, and outliers effectively

### 📁 `task_3/05_hypthesis_testing.ipynb`
- Statistically tested hypotheses on claim risk factors
- Methods: z-test, t-test, chi-squared
- Provided business insights on regional and demographic effects

### 📁 `task_4/06_model_training_and_interpretability.ipynb`
- Modeled both claim severity (regression) and probability (classification)
- Trained Linear, RF, XGBoost models with performance metrics
- Used SHAP and LIME to interpret model decisions and explain pricing
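SHAP and LIME require their own packages; as a lightweight, model-agnostic stand-in, scikit-learn's permutation importance ranks features by how much the test score drops when each one is shuffled. The sketch below uses synthetic data in place of the project's severity dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the severity data (the notebook itself uses SHAP/LIME).
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: drop in test score when a feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=5,
                                random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Most important feature index:", ranking[0])
```

The same ranking idea underlies the SHAP summary plots in the notebook, which additionally attribute each individual prediction to its features.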


---

## 📦 How to Use

To run the notebooks:

```bash
cd notebooks/
jupyter notebook
```
23 changes: 7 additions & 16 deletions notebooks/task_1/04_visualizations.ipynb

Large diffs are not rendered by default.

Loading