163 changes: 150 additions & 13 deletions README.md
# 🚗 Insurance Risk Modeling & Dynamic Pricing System

This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.

> ✅ Completed as part of the Week 3 Challenge at **10Academy**.

---

## 🧭 Project Goals

- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - 💰 **Claim Severity** — how much we might pay per claim.
  - 📈 **Claim Probability** — how likely a customer is to claim.
- Construct a **dynamic pricing formula** that incorporates business margins.

---

## 🔧 Technologies & Tools

| Area | Tools Used |
|--------------------|----------------------------------------|
| Programming | Python, Jupyter |
| Data Handling | Pandas, NumPy, DVC |
| Visualization | Matplotlib, Seaborn, Plotly |
| Modeling | Scikit-learn, XGBoost, SHAP, LIME |
| Version Control | Git, GitHub |
| CI/CD | GitHub Actions |
| Environment | `venv` + `requirements.txt` |

---

## 📂 Repository Structure
```text
.
├── data/               # Raw and processed data (tracked via DVC)
├── models/             # Saved models
├── notebooks/          # Jupyter notebooks for EDA, testing, modeling
├── src/                # Core source code
│   ├── preprocessing/  # Cleaning, transformation, encoding
│   ├── task_3/         # Hypothesis testing modules
│   └── task_4/         # Modeling pipeline and interpretation
├── tests/              # Unit tests
├── .dvc/               # DVC metadata
├── .github/workflows/  # GitHub Actions CI pipeline
├── dvc.yaml            # DVC pipeline definition
├── requirements.txt    # Python dependencies
└── README.md           # Project overview (this file)
```

---

## 📊 Task Breakdown

### 🔍 Task 1: EDA & Git Setup

- Configured Git and GitHub, created `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk
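The loss-ratio analysis in Task 1 reduces to a grouped aggregation. A minimal sketch with pandas, using hypothetical column names (`Province`, `TotalPremium`, `TotalClaims`) and made-up numbers, not values from the project data:

```python
import pandas as pd

# Hypothetical schema and figures -- adjust to the actual dataset.
df = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [1000.0, 1500.0, 800.0, 1200.0],
    "TotalClaims": [400.0, 900.0, 200.0, 300.0],
})

# Loss ratio = total claims paid / total premium collected, per segment.
loss_ratio = (
    df.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
      .assign(loss_ratio=lambda g: g["TotalClaims"] / g["TotalPremium"])
)
print(loss_ratio["loss_ratio"])
```

Segments whose loss ratio stands out from the portfolio average are the candidates for the surcharges and discounts discussed later.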

### 💾 Task 2: Data Version Control (DVC)

- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up a **local remote storage** and pushed data
- Ensured reproducibility and auditability of datasets

### 📊 Task 3: Hypothesis Testing

- Formulated and tested statistical hypotheses:
  - 📍 Risk varies across **provinces** and **zip codes**
  - 👥 Gender differences in **claim frequency** and **severity**
  - 💸 Profitability margins vary by region
- Applied t-tests, z-tests, and chi-squared tests where applicable
- Business interpretations provided for each result
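As an illustration of the testing approach, a chi-squared test of independence checks whether claim frequency is associated with gender. The counts below are invented for the example, not taken from the project data:

```python
import numpy as np
from scipy import stats

# Illustrative contingency table (made-up counts):
# rows = gender, columns = [no claim, claim]
observed = np.array([
    [480, 20],   # female
    [450, 50],   # male
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")

# H0: claim frequency is independent of gender.
if p_value < 0.05:
    print("Reject H0: claim frequency differs by gender")
```

The same pattern applies to the province and zip-code hypotheses, with rows swapped for the segment being tested.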

### 🧠 Task 4: Predictive Modeling

- Built severity regression models: Linear, Random Forest, XGBoost
- Evaluated using RMSE, R²
- Used **SHAP** and **LIME** for feature importance
- Modeled claim probability (classification) for pricing
- Combined both models into a final risk-based pricing formula with business margins
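A common risk-based formulation multiplies claim probability by expected severity and adds loadings for expenses and profit. The sketch below uses placeholder loading values, not figures from the project:

```python
def risk_based_premium(p_claim: float,
                       expected_severity: float,
                       expense_loading: float = 0.10,
                       profit_margin: float = 0.05) -> float:
    """Pure risk premium plus business loadings.

    p_claim           -- predicted claim probability (classification model)
    expected_severity -- predicted claim amount given a claim (regression model)
    The loading defaults are illustrative assumptions.
    """
    pure_premium = p_claim * expected_severity
    return pure_premium * (1 + expense_loading + profit_margin)

# e.g. 8% claim probability and an expected severity of 25,000:
premium = risk_based_premium(0.08, 25_000)
print(round(premium, 2))  # 2300.0
```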

---

## 📈 Key Insights & Recommendations

| Insight | Impact on Pricing Strategy |
|-------------------------------------------|---------------------------------------------------|
| ≤4 Cylinder vehicles → ↑ Severity Risk | Apply loading to small-engine vehicles |
| Non-VAT Registered → ↑ Risk | Raise base rate for unregistered customers |
| Converted/Modified Vehicles = ↑ Risk | Apply higher risk surcharge |
| Alarm/Immobilizer → ↓ Risk | Provide discount for security features |
| New Vehicles → ↓ Risk | Discount for newer vehicles |
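The surcharges and discounts in the table above translate naturally into multiplicative rating factors. The multipliers below are placeholders for illustration, not fitted values:

```python
# Illustrative rating factors derived from the insight table;
# the numeric values are assumptions, not project results.
RISK_MULTIPLIERS = {
    "small_engine": 1.10,          # <=4 cylinders -> loading
    "not_vat_registered": 1.08,    # higher base rate
    "converted_vehicle": 1.15,     # modification surcharge
    "has_alarm_immobiliser": 0.95, # security discount
    "new_vehicle": 0.93,           # newer-vehicle discount
}

def adjusted_premium(base_premium: float, flags: set[str]) -> float:
    """Multiply the base premium by every applicable risk factor."""
    premium = base_premium
    for flag in flags:
        premium *= RISK_MULTIPLIERS.get(flag, 1.0)
    return premium

print(round(adjusted_premium(2300.0, {"small_engine", "has_alarm_immobiliser"}), 2))
```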

---

## 📦 Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
   cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
   ```

2. **Create a virtual environment and install dependencies:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Run the notebooks:**

   ```bash
   jupyter notebook
   ```
## ▶️ Run Tests
```bash
pytest
```
## 🧪 Data Versioning with DVC

```bash
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
```

To reproduce the data pipeline:

```bash
dvc pull
```
## ✅ CI/CD

GitHub Actions is configured for:

- Code linting
- Unit tests
- Model validation (optional step)

The workflow is defined in `.github/workflows/deploy.yml`.

## 📌 Results Summary

- 🧮 **Best severity model:** XGBoost
  - RMSE improvement: +Δ% vs. baseline
  - Top features: Engine Size, Vehicle Age, Province, Conversion Status
- 🧠 **Classification accuracy:** ~X%
  - Enables dynamic, fair, and risk-adjusted premium pricing

## 👥 Contributors

**👤 Ayana Samuel**
- Role: Full Data Science Workflow
- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
- Repository: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git

## 📜 License

This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.
79 changes: 75 additions & 4 deletions notebooks/README.md
# 📒 Notebooks Overview — Insurance Pricing Project

This folder contains Jupyter Notebooks used to explore, analyze, and model insurance claim data as part of the **10Academy Week 3 Challenge**.

Each notebook is modular and corresponds to a specific task in the data science workflow — from EDA and hypothesis testing to modeling and interpretation.

---

## 📁 Folder Purpose

The `notebooks/` directory serves as the primary space for:

- Experimenting with data pipelines
- Visualizing insights
- Testing hypotheses
- Training and evaluating models
- Documenting key results

---

## 🧭 Execution Order

| Notebook Path | Description |
|-------------------------------------------------------------|--------------------------------------------------------------------------------------|
| `task_1/01_Data_understanding.ipynb` | 🔍 **Initial Data Exploration** — Overview of dataset structure, basic distributions, and types of variables. |
| `task_1/02.eda_univariate.ipynb` | 📊 **Univariate Analysis** — Examines single-variable distributions and statistics, including missing data handling. |
| `task_1/03_eda_bivariate.ipynb` | 🔗 **Bivariate Analysis** — Explores relationships between key variables (e.g., claims vs. gender, province). |
| `task_1/04_visualization.ipynb` | 📈 **Visual Summary** — Aggregated plots and advanced visuals to communicate key trends and risk factors. |
| `task_3/05_hypthesis_testing.ipynb` | 📐 **Statistical Hypothesis Testing** — Validates assumptions across provinces, genders, and customer segments. |
| `task_4/06_model_training_and_interpretability.ipynb` | 🧠 **Modeling & Interpretability** — Trains severity and claim probability models; interprets them using SHAP/LIME. |


> **Note:** Run these notebooks in order for best results. Dependencies between notebooks are minimal but intentional (e.g., modeling uses cleaned data from Notebook 5).

---

## 🔍 Highlights by Notebook

### 📁 `task_1/01_Data_understanding.ipynb`
- Overview of dataset structure and key variables
- Initial univariate statistics and class distributions
- Identified outliers and null-value patterns

### 📁 `task_1/02.eda_univariate.ipynb`
- Dealt with missing values and variable types
- Performed univariate analysis: claims, premiums, risk flags
- Created new features: vehicle age, risk class, etc.

### 📁 `task_1/03_eda_bivariate.ipynb`
- Bivariate relationships: claim vs gender, province, zip, cylinders
- Used cross-tabulations and grouped summaries
- Inferred possible risk drivers from visual trends

### 📁 `task_1/04_visualization.ipynb`
- Visual storytelling using bar plots, heatmaps, and boxplots
- Focused on loss ratio patterns by region and vehicle attributes
- Illustrated skewness, imbalance, and outliers effectively

### 📁 `task_3/05_hypthesis_testing.ipynb`
- Statistically tested hypotheses on claim risk factors
- Methods: z-test, t-test, chi-squared
- Provided business insights on regional and demographic effects

### 📁 `task_4/06_model_training_and_interpretability.ipynb`
- Modeled both claim severity (regression) and probability (classification)
- Trained Linear, RF, XGBoost models with performance metrics
- Used SHAP and LIME to interpret model decisions and explain pricing
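SHAP and LIME require their own packages; as a lightweight, model-agnostic stand-in, scikit-learn's permutation importance ranks features by how much the test score drops when each one is shuffled. The sketch below uses synthetic data in place of the project's severity dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the severity data (the notebook itself uses SHAP/LIME).
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: drop in test score when a feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=5,
                                random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Most important feature index:", ranking[0])
```

The same ranking idea underlies the SHAP summary plots in the notebook, which additionally attribute each individual prediction to its features.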


---

## 📦 How to Use

To run the notebooks:

```bash
cd notebooks/
jupyter notebook
```
23 changes: 7 additions & 16 deletions notebooks/task_1/04_visualizations.ipynb

Large diffs are not rendered by default.

Loading