
Commit 5baf43b

Author: Ayana Samuel
Message: update: readme updated
Parent: ac5b95c

5 files changed

Lines changed: 390 additions & 43 deletions

File tree

README.md

Lines changed: 150 additions & 13 deletions

# 🚗 Insurance Risk Modeling & Dynamic Pricing System

This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.

> ✅ Completed as part of the Week 3 Challenge at **10Academy**.

---

## 🧭 Project Goals

- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - 💰 **Claim Severity** — how much we might pay.
  - 📈 **Claim Probability** — how likely a customer is to claim.
- Construct a **dynamic pricing formula** that incorporates business margins.

---

## 🔧 Technologies & Tools

| Area            | Tools Used                        |
|-----------------|-----------------------------------|
| Programming     | Python, Jupyter                   |
| Data Handling   | Pandas, NumPy, DVC                |
| Visualization   | Matplotlib, Seaborn, Plotly       |
| Modeling        | Scikit-learn, XGBoost, SHAP, LIME |
| Version Control | Git, GitHub                       |
| CI/CD           | GitHub Actions                    |
| Environment     | `venv` + `requirements.txt`       |

---

## 📂 Repository Structure

```text
.
├── data/               # Raw and processed data (tracked via DVC)
├── models/             # Saved models
├── notebooks/          # Jupyter notebooks for EDA, testing, modeling
├── src/                # Core source code
│   ├── preprocessing/  # Cleaning, transformation, encoding
│   ├── task_3/         # Hypothesis testing modules
│   └── task_4/         # Modeling pipeline and interpretation
├── tests/              # Unit tests
├── .dvc/               # DVC metadata
├── .github/workflows/  # GitHub Actions CI pipeline
├── dvc.yaml            # DVC pipeline definition
├── requirements.txt    # Python dependencies
└── README.md           # Project overview (this file)
```

---

## 📊 Task Breakdown

### 🔍 Task 1: EDA & Git Setup

- Configured Git and GitHub; created the `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk (see the sketch below)
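
A minimal sketch of the loss-ratio computation (the `Province`, `TotalClaims`, and `TotalPremium` column names are assumptions for illustration, not confirmed dataset fields):

```python
# Illustrative per-province loss ratio; column names are assumed.
import pandas as pd

df = pd.read_csv("data/raw/insurance_data.csv")

loss_ratio = (
    df.groupby("Province")[["TotalClaims", "TotalPremium"]]
      .sum()
      .assign(LossRatio=lambda g: g["TotalClaims"] / g["TotalPremium"])
      .sort_values("LossRatio", ascending=False)
)
print(loss_ratio)
```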

### 💾 Task 2: Data Version Control (DVC)

- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up **local remote storage** and pushed data
- Ensured reproducibility and auditability of datasets

### 📊 Task 3: Hypothesis Testing

- Formulated and tested statistical hypotheses:
  - 📍 Risk varies across **provinces** and **zip codes**
  - 👥 Gender differences in **claim frequency** and **severity**
  - 💸 Profitability margins vary by region
- Applied t-tests, z-tests, and chi-squared tests where applicable (a chi-squared sketch follows)
- Provided business interpretations for each result
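
As an example of the testing approach, a chi-squared test of claim frequency by gender might look like this (a sketch; the `Gender` and `HasClaim` columns are assumptions):

```python
# Illustrative chi-squared independence test; column names are assumed.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("data/processed/cleaned_data.csv")

table = pd.crosstab(df["Gender"], df["HasClaim"])   # contingency table
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")      # reject H0 at p < 0.05
```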

### 🧠 Task 4: Predictive Modeling

- Built severity regression models: Linear Regression, Random Forest, XGBoost
- Evaluated models using RMSE and R²
- Used **SHAP** and **LIME** for feature importance
- Modeled claim probability (classification) as an input to pricing
- Derived the final pricing formula (an illustrative sketch follows)
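
The exact formula is not reproduced in this README; one plausible risk-based shape, shown purely for illustration, is:

```python
# Hypothetical premium formula: expected loss plus loadings (illustrative only).
def risk_based_premium(p_claim: float, expected_severity: float,
                       expense_loading: float = 50.0,
                       profit_margin: float = 0.10) -> float:
    """Premium = (P(claim) * E[severity] + expenses) * (1 + margin)."""
    expected_loss = p_claim * expected_severity
    return (expected_loss + expense_loading) * (1 + profit_margin)

print(risk_based_premium(p_claim=0.05, expected_severity=12_000))  # 715.0
```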

---

## 📈 Key Insights & Recommendations

| Insight                                 | Impact on Pricing Strategy                     |
|-----------------------------------------|------------------------------------------------|
| ≤4-cylinder vehicles → ↑ severity risk  | Apply a loading to small-engine vehicles       |
| Non-VAT-registered → ↑ risk             | Raise the base rate for unregistered customers |
| Converted/modified vehicles → ↑ risk    | Apply a higher risk surcharge                  |
| Alarm/immobilizer fitted → ↓ risk       | Provide a discount for security features       |
| Newer vehicles → ↓ risk                 | Discount for newer vehicles                    |

---

## 📦 Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
   cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
   ```

2. **Install dependencies:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Run the notebooks:**

   ```bash
   jupyter notebook
   ```

4. **Run the tests:**

   ```bash
   pytest
   ```

## 🧪 Data Versioning with DVC

```bash
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
```

To reproduce the data pipeline:

```bash
dvc pull     # fetch the DVC-tracked data
dvc repro    # re-run the stages defined in dvc.yaml
```

## ✅ CI/CD

GitHub Actions is configured for:

- Code linting
- Unit tests
- Model validation (optional step)

The workflow is defined in `.github/workflows/deploy.yml`.

## 📌 Results Summary

- 🧮 **Best severity model:** XGBoost
  - RMSE improvement: +Δ% vs. baseline (the evaluation approach is sketched below)
  - Top features: Engine Size, Vehicle Age, Province, Conversion Status
- 🧠 **Classification accuracy:** ~X%
  - Enables dynamic, fair, and risk-adjusted premium pricing
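
The comparison follows the standard scikit-learn pattern; a self-contained sketch on synthetic data (not the project's dataset or results):

```python
# Illustrative severity-model comparison on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(random_state=0),
              XGBRegressor(n_estimators=100, random_state=0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{type(model).__name__}: RMSE={rmse:.3f}, R²={r2_score(y_te, pred):.3f}")
```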

---

## 👥 Contributors

**👤 Ayana Samuel**

- Role: Full Data Science Workflow
- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
- GitHub: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git

---

## 📜 License

This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.

notebooks/README.md

Lines changed: 75 additions & 4 deletions

# 📒 Notebooks Overview — Insurance Pricing Project

This folder contains the Jupyter notebooks used to explore, analyze, and model insurance claim data as part of the **10Academy Week 3 Challenge**.

Each notebook is modular and corresponds to a specific task in the data science workflow — from EDA and hypothesis testing to modeling and interpretation.

---

## 📁 Folder Purpose

The `notebooks/` directory serves as the primary space for:

- Experimenting with data pipelines
- Visualizing insights
- Testing hypotheses
- Training and evaluating models
- Documenting key results

---

## 🧭 Execution Order

| Notebook Path | Description |
|---------------|-------------|
| `task_1/01_Data_understanding.ipynb` | 🔍 **Initial Data Exploration** — overview of dataset structure, basic distributions, and variable types. |
| `task_1/02.eda_univariate.ipynb` | 📊 **Univariate Analysis** — single-variable distributions and statistics, including missing-data handling. |
| `task_1/03_eda_bivariate.ipynb` | 🔗 **Bivariate Analysis** — relationships between key variables (e.g., claims vs. gender, province). |
| `task_1/04_visualization.ipynb` | 📈 **Visual Summary** — aggregated plots and advanced visuals communicating key trends and risk factors. |
| `task_3/05_hypthesis_testing.ipynb` | 📐 **Statistical Hypothesis Testing** — validates assumptions across provinces, genders, and customer segments. |
| `task_4/06_model_training_and_interpretability.ipynb` | 🧠 **Modeling & Interpretability** — trains severity and claim-probability models; interprets them using SHAP/LIME. |

> **Note:** Run these notebooks in order for best results. Dependencies between notebooks are minimal but intentional (e.g., modeling uses cleaned data from Notebook 5).

---

## 🔍 Highlights by Notebook

### 📁 `task_1/01_Data_understanding.ipynb`

- Overview of dataset structure and key variables
- Initial univariate statistics and class distributions
- Identified outliers and null-value patterns

### 📁 `task_1/02.eda_univariate.ipynb`

- Dealt with missing values and variable types
- Performed univariate analysis: claims, premiums, risk flags
- Created new features: vehicle age, risk class, etc.

### 📁 `task_1/03_eda_bivariate.ipynb`

- Bivariate relationships: claims vs. gender, province, zip code, cylinders
- Used cross-tabulations and grouped summaries
- Inferred possible risk drivers from visual trends

### 📁 `task_1/04_visualization.ipynb`

- Visual storytelling using bar plots, heatmaps, and boxplots
- Focused on loss-ratio patterns by region and vehicle attributes
- Illustrated skewness, imbalance, and outliers effectively

### 📁 `task_3/05_hypthesis_testing.ipynb`

- Statistically tested hypotheses on claim risk factors
- Methods: z-test, t-test, chi-squared
- Provided business insights on regional and demographic effects

### 📁 `task_4/06_model_training_and_interpretability.ipynb`

- Modeled both claim severity (regression) and claim probability (classification)
- Trained Linear, Random Forest, and XGBoost models with performance metrics
- Used SHAP and LIME to interpret model decisions and explain pricing (see the sketch below)
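
A minimal sketch of the SHAP workflow used there (synthetic data; not the project's features or model configuration):

```python
# Illustrative SHAP explanation of a tree-based severity model.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + rng.normal(size=500)   # severity-like target

model = xgb.XGBRegressor(n_estimators=50).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)        # global feature importance
```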
---

## 📦 How to Use

To run the notebooks:

```bash
cd notebooks/
jupyter notebook
```

notebooks/task_1/04_visualizations.ipynb

Lines changed: 7 additions & 16 deletions
Large diffs are not rendered by default.

src/README.md

Lines changed: 80 additions & 6 deletions

# 📦 `src/` — Source Code Overview

This directory contains all core logic and modular components used in data processing, hypothesis testing, modeling, and interpretability for the **Insurance Risk Modeling** project.

---

## 📁 Top-Level Modules

| File | Description |
|------|-------------|
| `__init__.py` | Marks the directory as a Python package. |
| `config.py` | Central location for global configuration variables (paths, constants, toggles). |
| `data_loader.py` | Utilities for loading raw and processed datasets. |
| `preprocessing.py` | High-level preprocessing functions for cleaning and transforming data. |
| `README.md` | This file — documentation of the codebase structure. |

---

## 🔍 `task_1/` — Exploratory Data Analysis & Visualization

### 📁 `eda/` — EDA Logic

| File | Description |
|------|-------------|
| `bivariate.py` | Analyzes relationships between pairs of variables (e.g., claims vs. gender). |
| `univariate.py` | Summarizes distributions of individual features. |
| `summary_stats.py` | Provides descriptive statistics and summary tables. |
| `outlier_detection.py` | Detects and flags unusual or extreme values using statistical methods. |

### 📁 `viz/` — Visualization Utilities

| File | Description |
|------|-------------|
| `plot_utils.py` | Helper functions for generating consistent plots (bar charts, heatmaps, boxplots, etc.). |

---

## 📈 `task_3/` — Hypothesis Testing & Risk Segmentation

| File | Description |
|------|-------------|
| `hypothesis_tests.py` | Implements z-tests, t-tests, and chi-squared tests for hypothesis validation. |
| `data_segmentation.py` | Splits the data by demographic and regional segments for focused analysis. |
| `business_analysis.py` | Converts statistical results into actionable business interpretations. |
| `segmentation_utils.py` | Utilities to group and label segmented datasets. |
| `stats_helpers.py` | Reusable statistical functions (p-value calculators, assumption checks, etc.). |

---

## 🤖 `task_4/` — Modeling & Interpretability

| File | Description |
|------|-------------|
| `data_processing.py` | Handles encoding, normalization, and splitting for modeling. |
| `feature_engineering.py` | Creates new predictive features such as vehicle age, conversion flags, etc. |
| `model_training.py` | Core training logic for regression and classification models (XGBoost, Random Forest, etc.). |
| `interpretability.py` | Generates SHAP & LIME explanations to interpret model decisions. |

---

## ✅ Usage Example

```python
from src.data_loader import load_clean_data
from src.task_1.eda.univariate import summarize_numerics
from src.task_4.model_training import train_xgboost_model
from src.task_4.interpretability import explain_with_shap

# Load and analyze data
df = load_clean_data("data/processed/cleaned_data.csv")
summarize_numerics(df)

# Train and explain model
model, X_test = train_xgboost_model(df)
explain_with_shap(model, X_test)
```

## 🧩 Design Philosophy

- **Modular:** Code is separated into logically coherent, reusable units.
- **Interpretable:** Business-facing logic (e.g., hypothesis results) is separated from statistical code.
- **Scalable:** Easy to extend for future tasks like time-series modeling or real-time pricing engines.

## 📝 Note

Ensure all module imports use relative paths (`from .module import ...`) if running as a package, or adjust `PYTHONPATH` accordingly for standalone script runs.
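
For example, a standalone script or notebook might put the repository root on the path before importing (an illustrative approach; the path assumption is that you launched from the repo root):

```python
# Make `src` importable outside the package context; path is assumed.
import sys
from pathlib import Path

repo_root = Path.cwd()              # assumes the repository root as cwd
sys.path.insert(0, str(repo_root))

from src.data_loader import load_clean_data  # now resolves against repo root
```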
