
Commit 5baf43b

Author: Ayana Samuel
Message: update: readme updated
Parent: ac5b95c

5 files changed

Lines changed: 390 additions & 43 deletions

File tree

README.md

Lines changed: 150 additions & 13 deletions

# 🚗 Insurance Risk Modeling & Dynamic Pricing System

This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.

> ✅ Completed as part of the Week 3 Challenge at **10Academy**.

---

## 🧭 Project Goals

- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - 💰 **Claim Severity** — how much we might pay.
  - 📈 **Claim Probability** — how likely a customer is to claim.
- Construct a **dynamic pricing formula** that incorporates business margins.

---

## 🔧 Technologies & Tools

| Area            | Tools Used                        |
|-----------------|-----------------------------------|
| Programming     | Python, Jupyter                   |
| Data Handling   | Pandas, NumPy, DVC                |
| Visualization   | Matplotlib, Seaborn, Plotly       |
| Modeling        | Scikit-learn, XGBoost, SHAP, LIME |
| Version Control | Git, GitHub                       |
| CI/CD           | GitHub Actions                    |
| Environment     | `venv` + `requirements.txt`       |

---

## 📂 Repository Structure

```text
.
├── data/               # Raw and processed data (tracked via DVC)
├── models/             # Saved models
├── notebooks/          # Jupyter notebooks for EDA, testing, modeling
├── src/                # Core source code
│   ├── preprocessing/  # Cleaning, transformation, encoding
│   ├── task_3/         # Hypothesis testing modules
│   └── task_4/         # Modeling pipeline and interpretation
├── tests/              # Unit tests
├── .dvc/               # DVC metadata
├── .github/workflows/  # GitHub Actions CI pipeline
├── dvc.yaml            # DVC pipeline definition
├── requirements.txt    # Python dependencies
└── README.md           # Project overview (this file)
```

---

## 📊 Task Breakdown

### 🔍 Task 1: EDA & Git Setup

- Configured Git and GitHub; created the `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk (see the sketch below)
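
A minimal sketch of the loss-ratio computation (the `Province`, `TotalClaims`, and `TotalPremium` column names are assumptions for illustration, not confirmed dataset fields):

```python
# Illustrative per-province loss ratio; column names are assumed.
import pandas as pd

df = pd.read_csv("data/raw/insurance_data.csv")

loss_ratio = (
    df.groupby("Province")[["TotalClaims", "TotalPremium"]]
      .sum()
      .assign(LossRatio=lambda g: g["TotalClaims"] / g["TotalPremium"])
      .sort_values("LossRatio", ascending=False)
)
print(loss_ratio)
```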

### 💾 Task 2: Data Version Control (DVC)

- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up **local remote storage** and pushed data
- Ensured reproducibility and auditability of datasets

### 📊 Task 3: Hypothesis Testing

- Formulated and tested statistical hypotheses:
  - 📍 Risk varies across **provinces** and **zip codes**
  - 👥 Gender differences in **claim frequency** and **severity**
  - 💸 Profitability margins vary by region
- Applied t-tests, z-tests, and chi-squared tests where applicable (a chi-squared sketch follows)
- Provided business interpretations for each result
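
As an example of the testing approach, a chi-squared test of claim frequency by gender might look like this (a sketch; the `Gender` and `HasClaim` columns are assumptions):

```python
# Illustrative chi-squared independence test; column names are assumed.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("data/processed/cleaned_data.csv")

table = pd.crosstab(df["Gender"], df["HasClaim"])   # contingency table
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")      # reject H0 at p < 0.05
```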

### 🧠 Task 4: Predictive Modeling

- Built severity regression models: Linear Regression, Random Forest, XGBoost
- Evaluated models using RMSE and R²
- Used **SHAP** and **LIME** for feature importance
- Modeled claim probability (classification) as an input to pricing
- Derived the final pricing formula (an illustrative sketch follows)
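
The exact formula is not reproduced in this README; one plausible risk-based shape, shown purely for illustration, is:

```python
# Hypothetical premium formula: expected loss plus loadings (illustrative only).
def risk_based_premium(p_claim: float, expected_severity: float,
                       expense_loading: float = 50.0,
                       profit_margin: float = 0.10) -> float:
    """Premium = (P(claim) * E[severity] + expenses) * (1 + margin)."""
    expected_loss = p_claim * expected_severity
    return (expected_loss + expense_loading) * (1 + profit_margin)

print(risk_based_premium(p_claim=0.05, expected_severity=12_000))  # 715.0
```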

---

## 📈 Key Insights & Recommendations

| Insight                                 | Impact on Pricing Strategy                     |
|-----------------------------------------|------------------------------------------------|
| ≤4-cylinder vehicles → ↑ severity risk  | Apply a loading to small-engine vehicles       |
| Non-VAT-registered → ↑ risk             | Raise the base rate for unregistered customers |
| Converted/modified vehicles → ↑ risk    | Apply a higher risk surcharge                  |
| Alarm/immobilizer fitted → ↓ risk       | Provide a discount for security features       |
| Newer vehicles → ↓ risk                 | Discount for newer vehicles                    |

---

## 📦 Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
   cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
   ```

2. **Install dependencies:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Run the notebooks:**

   ```bash
   jupyter notebook
   ```

4. **Run the tests:**

   ```bash
   pytest
   ```

## 🧪 Data Versioning with DVC

```bash
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
```

To reproduce the data pipeline:

```bash
dvc pull     # fetch the DVC-tracked data
dvc repro    # re-run the stages defined in dvc.yaml
```

## ✅ CI/CD

GitHub Actions is configured for:

- Code linting
- Unit tests
- Model validation (optional step)

The workflow is defined in `.github/workflows/deploy.yml`.

## 📌 Results Summary

- 🧮 **Best severity model:** XGBoost
  - RMSE improvement: +Δ% vs. baseline (the evaluation approach is sketched below)
  - Top features: Engine Size, Vehicle Age, Province, Conversion Status
- 🧠 **Classification accuracy:** ~X%
  - Enables dynamic, fair, and risk-adjusted premium pricing
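
The comparison follows the standard scikit-learn pattern; a self-contained sketch on synthetic data (not the project's dataset or results):

```python
# Illustrative severity-model comparison on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(random_state=0),
              XGBRegressor(n_estimators=100, random_state=0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{type(model).__name__}: RMSE={rmse:.3f}, R²={r2_score(y_te, pred):.3f}")
```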

---

## 👥 Contributors

**👤 Ayana Samuel**

- Role: Full Data Science Workflow
- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
- GitHub: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git

---

## 📜 License

This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.

notebooks/README.md

Lines changed: 75 additions & 4 deletions

# 📒 Notebooks Overview — Insurance Pricing Project

This folder contains the Jupyter notebooks used to explore, analyze, and model insurance claim data as part of the **10Academy Week 3 Challenge**.

Each notebook is modular and corresponds to a specific task in the data science workflow — from EDA and hypothesis testing to modeling and interpretation.

---

## 📁 Folder Purpose

The `notebooks/` directory serves as the primary space for:

- Experimenting with data pipelines
- Visualizing insights
- Testing hypotheses
- Training and evaluating models
- Documenting key results

---

## 🧭 Execution Order

| Notebook Path | Description |
|---------------|-------------|
| `task_1/01_Data_understanding.ipynb` | 🔍 **Initial Data Exploration** — overview of dataset structure, basic distributions, and variable types. |
| `task_1/02.eda_univariate.ipynb` | 📊 **Univariate Analysis** — single-variable distributions and statistics, including missing-data handling. |
| `task_1/03_eda_bivariate.ipynb` | 🔗 **Bivariate Analysis** — relationships between key variables (e.g., claims vs. gender, province). |
| `task_1/04_visualization.ipynb` | 📈 **Visual Summary** — aggregated plots and advanced visuals communicating key trends and risk factors. |
| `task_3/05_hypthesis_testing.ipynb` | 📐 **Statistical Hypothesis Testing** — validates assumptions across provinces, genders, and customer segments. |
| `task_4/06_model_training_and_interpretability.ipynb` | 🧠 **Modeling & Interpretability** — trains severity and claim-probability models; interprets them using SHAP/LIME. |

> **Note:** Run these notebooks in order for best results. Dependencies between notebooks are minimal but intentional (e.g., modeling uses cleaned data from Notebook 5).

---

## 🔍 Highlights by Notebook

### 📁 `task_1/01_Data_understanding.ipynb`

- Overview of dataset structure and key variables
- Initial univariate statistics and class distributions
- Identified outliers and null-value patterns

### 📁 `task_1/02.eda_univariate.ipynb`

- Dealt with missing values and variable types
- Performed univariate analysis: claims, premiums, risk flags
- Created new features: vehicle age, risk class, etc.

### 📁 `task_1/03_eda_bivariate.ipynb`

- Bivariate relationships: claims vs. gender, province, zip code, cylinders
- Used cross-tabulations and grouped summaries
- Inferred possible risk drivers from visual trends

### 📁 `task_1/04_visualization.ipynb`

- Visual storytelling using bar plots, heatmaps, and boxplots
- Focused on loss-ratio patterns by region and vehicle attributes
- Illustrated skewness, imbalance, and outliers effectively

### 📁 `task_3/05_hypthesis_testing.ipynb`

- Statistically tested hypotheses on claim risk factors
- Methods: z-test, t-test, chi-squared
- Provided business insights on regional and demographic effects

### 📁 `task_4/06_model_training_and_interpretability.ipynb`

- Modeled both claim severity (regression) and claim probability (classification)
- Trained Linear, Random Forest, and XGBoost models with performance metrics
- Used SHAP and LIME to interpret model decisions and explain pricing (see the sketch below)
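
A minimal sketch of the SHAP workflow used there (synthetic data; not the project's features or model configuration):

```python
# Illustrative SHAP explanation of a tree-based severity model.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + rng.normal(size=500)   # severity-like target

model = xgb.XGBRegressor(n_estimators=50).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)        # global feature importance
```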
---

## 📦 How to Use

To run the notebooks:

```bash
cd notebooks/
jupyter notebook
```

notebooks/task_1/04_visualizations.ipynb

Lines changed: 7 additions & 16 deletions
Large diffs are not rendered by default.

src/README.md

Lines changed: 80 additions & 6 deletions

# 📦 `src/` — Source Code Overview

This directory contains all core logic and modular components used in data processing, hypothesis testing, modeling, and interpretability for the **Insurance Risk Modeling** project.

---

## 📁 Top-Level Modules

| File | Description |
|------|-------------|
| `__init__.py` | Marks the directory as a Python package. |
| `config.py` | Central location for global configuration variables (paths, constants, toggles). |
| `data_loader.py` | Utilities for loading raw and processed datasets. |
| `preprocessing.py` | High-level preprocessing functions for cleaning and transforming data. |
| `README.md` | This file — documentation of the codebase structure. |

---

## 🔍 `task_1/` — Exploratory Data Analysis & Visualization

### 📁 `eda/` — EDA Logic

| File | Description |
|------|-------------|
| `bivariate.py` | Analyzes relationships between pairs of variables (e.g., claims vs. gender). |
| `univariate.py` | Summarizes distributions of individual features. |
| `summary_stats.py` | Provides descriptive statistics and summary tables. |
| `outlier_detection.py` | Detects and flags unusual or extreme values using statistical methods. |

### 📁 `viz/` — Visualization Utilities

| File | Description |
|------|-------------|
| `plot_utils.py` | Helper functions for generating consistent plots (bar charts, heatmaps, boxplots, etc.). |

---

## 📈 `task_3/` — Hypothesis Testing & Risk Segmentation

| File | Description |
|------|-------------|
| `hypothesis_tests.py` | Implements z-tests, t-tests, and chi-squared tests for hypothesis validation. |
| `data_segmentation.py` | Splits the data by demographic and regional segments for focused analysis. |
| `business_analysis.py` | Converts statistical results into actionable business interpretations. |
| `segmentation_utils.py` | Utilities to group and label segmented datasets. |
| `stats_helpers.py` | Reusable statistical functions (p-value calculators, assumption checks, etc.). |

---

## 🤖 `task_4/` — Modeling & Interpretability

| File | Description |
|------|-------------|
| `data_processing.py` | Handles encoding, normalization, and splitting for modeling. |
| `feature_engineering.py` | Creates new predictive features such as vehicle age, conversion flags, etc. |
| `model_training.py` | Core training logic for regression and classification models (XGBoost, Random Forest, etc.). |
| `interpretability.py` | Generates SHAP & LIME explanations to interpret model decisions. |

---

## ✅ Usage Example

```python
from src.data_loader import load_clean_data
from src.task_1.eda.univariate import summarize_numerics
from src.task_4.model_training import train_xgboost_model
from src.task_4.interpretability import explain_with_shap

# Load and analyze data
df = load_clean_data("data/processed/cleaned_data.csv")
summarize_numerics(df)

# Train and explain model
model, X_test = train_xgboost_model(df)
explain_with_shap(model, X_test)
```

## 🧩 Design Philosophy

- **Modular:** Code is separated into logically coherent, reusable units.
- **Interpretable:** Business-facing logic (e.g., hypothesis results) is separated from statistical code.
- **Scalable:** Easy to extend for future tasks like time-series modeling or real-time pricing engines.

## 📝 Note

Ensure all module imports use relative paths (`from .module import ...`) if running as a package, or adjust `PYTHONPATH` accordingly for standalone script runs.
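
For example, a standalone script or notebook might put the repository root on the path before importing (an illustrative approach; the path assumption is that you launched from the repo root):

```python
# Make `src` importable outside the package context; path is assumed.
import sys
from pathlib import Path

repo_root = Path.cwd()              # assumes the repository root as cwd
sys.path.insert(0, str(repo_root))

from src.data_loader import load_clean_data  # now resolves against repo root
```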
