Commit a59f422

task-4 from ayanasamuel8/task-4
Update: task-4 - model training and interpretability
2 parents b397446 + 5baf43b commit a59f422

15 files changed

Lines changed: 2225 additions & 43 deletions

README.md

Lines changed: 150 additions & 13 deletions
# 🚗 Insurance Risk Modeling & Dynamic Pricing System

This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.

> ✅ Completed as part of the Week 3 Challenge at **10Academy**.

---

## 🧭 Project Goals

- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - 💰 **Claim Severity** — how much we might pay.
  - 📈 **Claim Probability** — how likely a customer is to claim.
- Construct a **dynamic pricing formula** that incorporates business margins.

---

## 🔧 Technologies & Tools

| Area            | Tools Used                        |
|-----------------|-----------------------------------|
| Programming     | Python, Jupyter                   |
| Data Handling   | Pandas, NumPy, DVC                |
| Visualization   | Matplotlib, Seaborn, Plotly       |
| Modeling        | Scikit-learn, XGBoost, SHAP, LIME |
| Version Control | Git, GitHub                       |
| CI/CD           | GitHub Actions                    |
| Environment     | `venv` + `requirements.txt`       |

---

## 📂 Repository Structure

```text
.
├── data/              # Raw and processed data (tracked via DVC)
├── models/            # Saved models
├── notebooks/         # Jupyter notebooks for EDA, testing, modeling
├── src/               # Core source code
│   ├── preprocessing/ # Cleaning, transformation, encoding
│   ├── task_3/        # Hypothesis testing modules
│   └── task_4/        # Modeling pipeline and interpretation
├── tests/             # Unit tests
├── .dvc/              # DVC metadata
├── .github/workflows/ # GitHub Actions CI pipeline
├── dvc.yaml           # DVC pipeline definition
├── requirements.txt   # Python dependencies
└── README.md          # Project overview (this file)
```

---

## 📊 Task Breakdown

### 🔍 Task 1: EDA & Git Setup

- Configured Git and GitHub, created the `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk
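The loss ratio behind these findings is simply total claims paid over total premium collected, aggregated per segment. A minimal pandas sketch (the column and province names are illustrative assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Hypothetical sample rows; real data comes from the DVC-tracked dataset.
df = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [1000.0, 1200.0, 900.0, 1100.0],
    "TotalClaims": [800.0, 300.0, 200.0, 100.0],
})

# Loss ratio = sum of claims / sum of premiums, per province.
by_province = df.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
by_province["LossRatio"] = by_province["TotalClaims"] / by_province["TotalPremium"]
print(by_province["LossRatio"])
```

A loss ratio well above the portfolio average flags a segment as underpriced relative to its risk.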
### 💾 Task 2: Data Version Control (DVC)

- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up **local remote storage** and pushed the data
- Ensured reproducibility and auditability of datasets
### 📊 Task 3: Hypothesis Testing

- Formulated and tested statistical hypotheses:
  - 📍 Risk varies across **provinces** and **zip codes**
  - 👥 Gender differences in **claim frequency** and **severity**
  - 💸 Profitability margins vary by region
- Used t-tests, z-tests, and chi-squared tests where applicable
- Provided business interpretations for each result
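As a sketch of the chi-squared approach to comparing claim frequency across groups, a contingency table can be tested with `scipy.stats.chi2_contingency` (the counts below are made up for illustration):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = gender, columns = (claimed, did not claim).
observed = [
    [120, 880],  # e.g. female policyholders
    [150, 850],  # e.g. male policyholders
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")

# Reject H0 (equal claim frequency across groups) at the 5% level.
if p_value < 0.05:
    print("Evidence of a difference in claim frequency")
else:
    print("No significant difference detected")
```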
### 🧠 Task 4: Predictive Modeling

- Built severity regression models: Linear, Random Forest, XGBoost
- Evaluated using RMSE and R²
- Used **SHAP** and **LIME** for feature importance
- Modeled claim probability (classification) for pricing
- Combined both models into the final dynamic pricing formula
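A common form for such a risk-based premium, shown here as a sketch rather than the project's exact formula, multiplies claim probability by expected severity and adds expense and margin loadings (the loading values are illustrative assumptions):

```python
def risk_based_premium(p_claim: float, expected_severity: float,
                       expense_load: float = 50.0, profit_margin: float = 0.10) -> float:
    """Sketch of a dynamic pricing formula (illustrative loadings, not the project's).

    Premium = expected loss (claim probability x expected severity)
              plus a fixed expense load, grossed up by a profit margin.
    """
    expected_loss = p_claim * expected_severity
    return (expected_loss + expense_load) * (1.0 + profit_margin)

# Example: 8% claim probability, 5,000 expected severity.
premium = risk_based_premium(p_claim=0.08, expected_severity=5000.0)
print(round(premium, 2))  # (0.08*5000 + 50) * 1.10 = 495.0
```

The classification model supplies `p_claim` and the severity regression supplies `expected_severity`, which is what makes the premium dynamic per customer.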
---

## 📈 Key Insights & Recommendations

| Insight                                 | Impact on Pricing Strategy                     |
|-----------------------------------------|------------------------------------------------|
| ≤4-cylinder vehicles → ↑ severity risk  | Apply a loading to small-engine vehicles       |
| Non-VAT-registered → ↑ risk             | Raise the base rate for unregistered customers |
| Converted/modified vehicles → ↑ risk    | Apply a higher risk surcharge                  |
| Alarm/immobilizer → ↓ risk              | Offer a discount for security features         |
| New vehicles → ↓ risk                   | Offer a discount for newer vehicles            |

---

## 📦 Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
   cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
   ```

2. **Install dependencies:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Run the notebooks:**

   ```bash
   jupyter notebook
   ```

4. **Run the tests:**

   ```bash
   pytest
   ```

---

## 🧪 Data Versioning with DVC

```bash
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
```

To fetch the versioned data and reproduce the pipeline defined in `dvc.yaml`:

```bash
dvc pull
dvc repro
```

---

## ✅ CI/CD

GitHub Actions is configured for:

- Code linting
- Unit tests
- Model validation (optional step)

The workflow is defined in `.github/workflows/deploy.yml`.

---

## 📌 Results Summary

- 🧮 **Best severity model:** XGBoost
  - RMSE improvement: +Δ% vs. baseline
  - Top features: Engine Size, Vehicle Age, Province, Conversion Status
- 🧠 **Classification accuracy:** ~X%
- Enables dynamic, fair, and risk-adjusted premium pricing

---

## 👥 Contributors

**👤 Ayana Samuel**

- Role: Full Data Science Workflow
- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
- GitHub: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git

---

## 📜 License

This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.

notebooks/README.md

Lines changed: 75 additions & 4 deletions
# 📒 Notebooks Overview — Insurance Pricing Project

This folder contains Jupyter notebooks used to explore, analyze, and model insurance claim data as part of the **10Academy Week 3 Challenge**.

Each notebook is modular and corresponds to a specific task in the data science workflow — from EDA and hypothesis testing to modeling and interpretation.

---

## 📁 Folder Purpose

The `notebooks/` directory serves as the primary space for:

- Experimenting with data pipelines
- Visualizing insights
- Testing hypotheses
- Training and evaluating models
- Documenting key results

---

## 🧭 Execution Order

| Notebook Path | Description |
|---------------|-------------|
| `task_1/01_Data_understanding.ipynb` | 🔍 **Initial Data Exploration** — dataset structure, basic distributions, and variable types. |
| `task_1/02.eda_univariate.ipynb` | 📊 **Univariate Analysis** — single-variable distributions and statistics, including missing-data handling. |
| `task_1/03_eda_bivariate.ipynb` | 🔗 **Bivariate Analysis** — relationships between key variables (e.g., claims vs. gender, province). |
| `task_1/04_visualization.ipynb` | 📈 **Visual Summary** — aggregated plots and advanced visuals communicating key trends and risk factors. |
| `task_3/05_hypthesis_testing.ipynb` | 📐 **Statistical Hypothesis Testing** — validates assumptions across provinces, genders, and customer segments. |
| `task_4/06_model_training_and_interpretability.ipynb` | 🧠 **Modeling & Interpretability** — trains severity and claim-probability models; interprets them with SHAP/LIME. |

> **Note:** Run these notebooks in order for best results. Dependencies between notebooks are minimal but intentional (e.g., modeling uses cleaned data from Notebook 5).

---

## 🔍 Highlights by Notebook

### 📁 `task_1/01_Data_understanding.ipynb`

- Overview of dataset structure and key variables
- Initial univariate statistics and class distributions
- Identified outliers and null-value patterns

### 📁 `task_1/02.eda_univariate.ipynb`

- Dealt with missing values and variable types
- Performed univariate analysis: claims, premiums, risk flags
- Created new features: vehicle age, risk class, etc.
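Feature derivation of the kind listed above (vehicle age, risk class) might look like the following pandas sketch; the column name `RegistrationYear`, the snapshot year, and the age buckets are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical policy records; `RegistrationYear` is an assumed column name.
df = pd.DataFrame({"RegistrationYear": [2015, 2020, 2008]})

# Derive vehicle age relative to an assumed snapshot year of the analysis.
SNAPSHOT_YEAR = 2015
df["VehicleAge"] = SNAPSHOT_YEAR - df["RegistrationYear"]

# Bucket ages into coarse risk classes for pricing (illustrative cut points).
df["RiskClass"] = pd.cut(
    df["VehicleAge"],
    bins=[-float("inf"), 3, 10, float("inf")],
    labels=["new", "mid", "old"],
)
print(df)
```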

### 📁 `task_1/03_eda_bivariate.ipynb`

- Bivariate relationships: claims vs. gender, province, zip, cylinders
- Used cross-tabulations and grouped summaries
- Inferred possible risk drivers from visual trends

### 📁 `task_1/04_visualization.ipynb`

- Visual storytelling using bar plots, heatmaps, and boxplots
- Focused on loss-ratio patterns by region and vehicle attributes
- Illustrated skewness, imbalance, and outliers effectively

### 📁 `task_3/05_hypthesis_testing.ipynb`

- Statistically tested hypotheses on claim risk factors
- Methods: z-test, t-test, chi-squared
- Provided business insights on regional and demographic effects

### 📁 `task_4/06_model_training_and_interpretability.ipynb`

- Modeled both claim severity (regression) and claim probability (classification)
- Trained Linear, Random Forest, and XGBoost models with performance metrics
- Used SHAP and LIME to interpret model decisions and explain pricing

---
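A minimal sketch of the severity-model training and evaluation step, using synthetic data and a RandomForest stand-in (the real notebook trains on the insurance dataset and also compares Linear and XGBoost models):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for claim-severity data: two numeric risk features.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE and R², the two metrics reported for the severity models.
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
print(f"RMSE={rmse:.3f}, R2={r2:.3f}")
```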

## 📦 How to Use

To run the notebooks:

```bash
cd notebooks/
jupyter notebook
```

notebooks/task_1/04_visualizations.ipynb

Lines changed: 7 additions & 16 deletions
Large diffs are not rendered by default.

0 commit comments
