|
1 | | -# Insurance Risk Analytics Project |
| 1 | +# 🚗 Insurance Risk Modeling & Dynamic Pricing System |
2 | 2 |
|
3 | | -This project analyzes insurance data to extract actionable insights on risk, profitability, and optimal pricing strategies. |
| 3 | +This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing. |
4 | 4 |
|
5 | | -## 📂 Structure |
6 | | -- `data/`: Raw and processed data tracked via DVC |
7 | | -- `notebooks/`: EDA and modeling notebooks |
8 | | -- `src/`: Reusable Python scripts |
9 | | -- `tests/`: Unit tests for core logic |
10 | | -- `.github/workflows/`: CI/CD pipelines |
| 5 | +> ✅ Completed as part of the Week 3 Challenge at **10Academy**. |
| 6 | +
|
| 7 | +--- |
| 8 | + |
| 9 | +## 🧭 Project Goals |
| 10 | + |
| 11 | +- Understand and explore insurance data to uncover actionable insights. |
| 12 | +- Establish a reproducible data pipeline using Git, GitHub, and DVC. |
| 13 | +- Statistically validate hypotheses related to insurance risk. |
| 14 | +- Build predictive models to estimate: |
| 15 | + - 💰 **Claim Severity** — How much we might pay. |
| 16 | + - 📈 **Claim Probability** — How likely a customer is to claim. |
| 17 | +- Construct a **dynamic pricing formula** that incorporates business margins. |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## 🔧 Technologies & Tools |
| 22 | + |
| 23 | +| Area | Tools Used | |
| 24 | +|--------------------|----------------------------------------| |
| 25 | +| Programming | Python, Jupyter | |
| 26 | +| Data Handling | Pandas, NumPy, DVC | |
| 27 | +| Visualization | Matplotlib, Seaborn, Plotly | |
| 28 | +| Modeling | Scikit-learn, XGBoost, SHAP, LIME | |
| 29 | +| Version Control | Git, GitHub | |
| 30 | +| CI/CD | GitHub Actions | |
| 31 | +| Environment | `venv` + `requirements.txt` | |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## 📂 Repository Structure |
| 36 | +```text |
| 37 | +. |
| 38 | +├── data/ # Raw and processed data (tracked via DVC) |
| 39 | +├── models/ # Saved models |
| 40 | +├── notebooks/ # Jupyter notebooks for EDA, testing, modeling |
| 41 | +├── src/ # Core source code |
| 42 | +│ ├── preprocessing/ # Cleaning, transformation, encoding |
| 43 | +│ ├── task_3/ # Hypothesis testing modules |
| 44 | +│ └── task_4/ # Modeling pipeline and interpretation |
| 45 | +├── tests/ # Unit tests |
| 46 | +├── .dvc/ # DVC metadata |
| 47 | +├── .github/workflows/ # GitHub Actions CI pipeline |
| 48 | +├── dvc.yaml # DVC pipeline definition |
| 49 | +├── requirements.txt # Python dependencies |
| 50 | +└── README.md # Project overview (this file) |
| 51 | +``` |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## 📊 Task Breakdown |
| 56 | + |
| 57 | +### 🔍 Task 1: EDA & Git Setup |
| 58 | + |
| 59 | +- Configured Git and GitHub, created `task-1` branch |
| 60 | +- Performed EDA on claims, premiums, and customer demographics |
| 61 | +- Visualized insights across provinces, genders, and vehicle types |
| 62 | +- Identified key drivers of loss ratio and risk |
| 63 | + |
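The loss ratio named in the last bullet is total claims divided by total premium. The sketch below shows one way to compute it per group. The field names (`province`, `total_claims`, `total_premium`) and the sample records are assumptions made for illustration, not the project's actual schema or data.

```python
from collections import defaultdict


def loss_ratio_by_group(rows, group_key,
                        claims_key="total_claims",
                        premium_key="total_premium"):
    """Return {group: total claims / total premium}.

    Groups whose total premium is zero are skipped to avoid dividing by zero.
    """
    claims = defaultdict(float)
    premiums = defaultdict(float)
    for row in rows:
        claims[row[group_key]] += row[claims_key]
        premiums[row[group_key]] += row[premium_key]
    return {g: claims[g] / premiums[g] for g in premiums if premiums[g] > 0}


# Illustrative records only (hypothetical field names and values).
policies = [
    {"province": "A", "total_claims": 0.0, "total_premium": 100.0},
    {"province": "A", "total_claims": 150.0, "total_premium": 100.0},
    {"province": "B", "total_claims": 20.0, "total_premium": 200.0},
]
ratios = loss_ratio_by_group(policies, "province")
```

A ratio above 1.0 means a group paid out more in claims than it collected in premium, which is why the metric is a natural first cut for spotting risky segments.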
| 64 | +### 💾 Task 2: Data Version Control (DVC) |
| 65 | + |
| 66 | +- Installed DVC and initialized version control |
| 67 | +- Added data files to DVC tracking |
| 68 | +- Set up a **local DVC remote** for storage and pushed the data to it |
| 69 | +- Ensured reproducibility and auditability of datasets |
| 70 | + |
| 71 | +### 📊 Task 3: Hypothesis Testing |
| 72 | + |
| 73 | +- Formulated and tested statistical hypotheses: |
| 74 | + - 📍 Risk varies across **provinces** and **zip codes** |
| 75 | + - 👥 Gender differences in **claim frequency** and **severity** |
| 76 | + - 💸 Profitability margins vary by region |
| 77 | +- Used t-tests, z-tests, and chi-squared tests where applicable |
| 78 | +- Provided a business interpretation for each result |
| 79 | + |
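As an illustration of the z-tests mentioned above, here is a minimal sketch of a two-proportion z-test for comparing claim frequency between two groups. The function name and the counts are hypothetical. This is not necessarily the test implementation used in `src/task_3/`.

```python
import math


def two_proportion_z_test(claims_a, n_a, claims_b, n_b):
    """Two-sided z-test of H0: both groups have the same claim frequency.

    claims_*: number of policies with at least one claim in each group.
    n_*: number of policies in each group.
    Returns (z statistic, two-sided p-value).
    """
    p_a = claims_a / n_a
    p_b = claims_b / n_b
    # Pooled claim rate under the null hypothesis of equal frequencies.
    pooled = (claims_a + claims_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60 of 1000 policies claimed in group A, 40 of 1000 in B.
z, p = two_proportion_z_test(60, 1000, 40, 1000)
reject_null = p < 0.05
```

The normal approximation behind this test is only reasonable when each group has enough claims and non-claims; for sparse groups, an exact test is the safer choice.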
| 80 | +### 🧠 Task 4: Predictive Modeling |
| 81 | + |
| 82 | +- Built severity regression models: Linear, Random Forest, XGBoost |
| 83 | +- Evaluated using RMSE and R² |
| 84 | +- Used **SHAP** and **LIME** for feature importance |
| 85 | +- Modeled claim probability (classification) for pricing |
| 86 | +- Final pricing formula: |
| 87 | + |
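The formula itself is not shown at this point. A common risk-based form, assumed here for illustration and not confirmed as the project's final formula, combines the two model outputs as premium = P(claim) × expected severity + expense loading + profit margin. The sketch below uses hypothetical parameter names and treats the loading and margin as flat currency amounts.

```python
def risk_based_premium(claim_probability, expected_severity,
                       expense_loading=0.0, profit_margin=0.0):
    """Premium = P(claim) * expected severity + expense loading + profit margin.

    claim_probability: classifier output in [0, 1].
    expected_severity: predicted claim amount, given that a claim occurs.
    expense_loading, profit_margin: flat amounts (an assumption of this sketch).
    """
    if not 0.0 <= claim_probability <= 1.0:
        raise ValueError("claim_probability must be between 0 and 1")
    if expected_severity < 0:
        raise ValueError("expected_severity must be non-negative")
    expected_loss = claim_probability * expected_severity
    return expected_loss + expense_loading + profit_margin


# Illustrative numbers only: a 5% claim probability and a 20,000 predicted
# severity give an expected loss of about 1,000 before the two add-ons.
premium = risk_based_premium(0.05, 20_000.0,
                             expense_loading=150.0, profit_margin=100.0)
```

Keeping probability and severity as separate inputs mirrors the two models in this task, so either one can be retrained without changing the pricing step.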
| 88 | +--- |
| 89 | + |
| 90 | +## 📈 Key Insights & Recommendations |
| 91 | + |
| 92 | +| Insight | Impact on Pricing Strategy | |
| 93 | +|-------------------------------------------|---------------------------------------------------| |
| 94 | +| ≤4 Cylinder vehicles → ↑ Severity Risk | Apply loading to small-engine vehicles | |
| 95 | +| Non-VAT Registered → ↑ Risk | Raise base rate for unregistered customers | |
| 96 | +| Converted/Modified Vehicles → ↑ Risk | Apply higher risk surcharge | |
| 97 | +| Alarm/Immobilizer → ↓ Risk | Provide discount for security features | |
| 98 | +| New Vehicles → ↓ Risk | Discount for newer vehicles | |
| 99 | + |
| 100 | +--- |
| 101 | + |
| 102 | +## 📦 Setup Instructions |
| 103 | + |
| 104 | +1. **Clone the repository:** |
| 105 | + ```bash |
| 106 | + git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git |
| 107 | + cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling |
| | + ``` |
| 108 | +Install dependencies: |
11 | 109 |
|
12 | | -## 📦 Setup |
13 | 110 | ```bash |
14 | 111 | python -m venv venv |
15 | | -source venv/bin/activate # or venv\Scripts\activate on Windows |
| 112 | +source venv/bin/activate # Windows: venv\Scripts\activate |
16 | 113 | pip install -r requirements.txt |
17 | 114 | ``` |
| 115 | +Run notebooks: |
| 116 | + |
| 117 | +```bash |
| 118 | +jupyter notebook |
| 119 | +``` |
| 120 | +Run tests: |
18 | 121 |
|
19 | | -## ▶️ Run Tests |
20 | 122 | ```bash |
21 | 123 | pytest |
22 | 124 | ``` |
23 | | -## 📊 Tools |
24 | | -- Python, DVC, GitHub Actions, SHAP, XGBoost |
| 125 | +## 🧪 Data Versioning with DVC |
| 126 | +```bash |
| 127 | +dvc init |
| 128 | +dvc add data/raw/insurance_data.csv |
| 129 | +dvc remote add -d localstorage /path/to/your/storage |
| 130 | +dvc push |
| | +``` |
| 131 | +To retrieve the tracked data on a fresh clone: |
| 132 | + |
| 133 | +```bash |
| 134 | +dvc pull |
| | +``` |
| 135 | +## ✅ CI/CD |
| 136 | +GitHub Actions is configured for: |
| 137 | + |
| 138 | +- Code linting |
| 139 | + |
| 140 | +- Unit tests |
| 141 | + |
| 142 | +- Model validation (optional step) |
| 143 | + |
| 144 | +The workflow is defined in `.github/workflows/deploy.yml`. |
| 145 | + |
| 146 | +## 📌 Results Summary |
| 147 | +- 🧮 **Best Severity Model:** XGBoost |
| 148 | + - RMSE improvement: +Δ% vs. baseline |
| 149 | + - Top features: Engine Size, Vehicle Age, Province, Conversion Status |
| 150 | + |
| 151 | +- 🧠 **Classification Accuracy:** ~X% |
| 152 | + - Enables dynamic, fair, and risk-adjusted premium pricing |
| 153 | + |
| 154 | +## 👥 Contributors |
| 155 | +**👤 Ayana Samuel** |
| 156 | +- Role: Full Data Science Workflow |
| 157 | +- Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps |
| 158 | +- GitHub: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git |
| 159 | + |
| 160 | +## 📜 License |
| 161 | +This project is licensed for academic and demonstration use. Contact the author for commercial usage rights. |