This repository contains a machine learning project that analyzes the "Default of Credit Card Clients" dataset to predict whether a bank customer will fail to pay their debt next month.
In credit risk modeling, prioritizing overall accuracy can leave a financial institution vulnerable to massive losses. This project focuses on a business-first data approach: minimizing financial risk by catching as many potential defaulters as possible (maximizing Recall/minimizing Type II errors), while maintaining an optimal balance with overall model accuracy.
The project analyzes data from 30,000 credit card clients.
- Features Used: Demographic data (Age, Education, Marriage), Credit Limits (`LIMIT_BAL`), Billing Amounts (BILL_AMT), and historical repayment statuses (PAY_X).
- Target Classification: `Class 0` (Safe/No Default) and `Class 1` (Default). Statistical modeling treats the significant risk event, the Default (1), as the positive class.
A key experiment compared how scale-sensitive algorithms (distance-based KNN and gradient-trained Logistic Regression) and tree-based algorithms perform on unscaled versus scaled data:
- Scale-Sensitive Models (Distance- and Gradient-Based):
  - Logistic Regression: With unscaled data, it failed entirely to detect defaults (F1-Score: 0.0) because the large scale of `LIMIT_BAL` overwhelmed behavioral features like `PAY_0`. Standardizing the data allowed the model to converge, achieving an F1-Score of 0.36.
  - K-Nearest Neighbors (KNN): The F1-Score rose from 0.24 to 0.42 after scaling, a relative improvement of 75%. Scaling prevented the distance calculations from being dominated by the credit limit at the expense of age and payment history.
- Tree-Based Models (Scale-Invariant): Random Forest performance remained virtually identical across unscaled (F1: 0.4714) and scaled data (F1: 0.4717), which is consistent with tree splits depending on the ordering of feature values rather than their absolute scale.
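The effect of scale on distance calculations can be illustrated with a minimal sketch. The clients and their values below are invented for illustration and are not rows from the dataset; `standardize` and `nearest_neighbor` are hypothetical helper names, and the z-scoring here is a hand-rolled stand-in for whatever scaler the project actually used.

```python
import math
import statistics

# Hypothetical clients as (LIMIT_BAL, PAY_0). Values are invented for illustration.
clients = {
    "query": (200_000, 2),   # two months behind on payments
    "A": (201_000, -1),      # almost the same credit limit, pays on time
    "B": (230_000, 2),       # different credit limit, same repayment status
    "C": (50_000, 0),
    "D": (500_000, 0),
    "E": (120_000, 1),
    "F": (350_000, -1),
}


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_neighbor(points, target):
    """Return the name of the point closest to `target` by Euclidean distance."""
    others = {name: p for name, p in points.items() if name != target}
    return min(others, key=lambda name: math.dist(points[target], others[name]))


names = list(clients)
scaled = dict(zip(names, standardize(list(clients.values()))))

nearest_raw = nearest_neighbor(clients, "query")
nearest_scaled = nearest_neighbor(scaled, "query")
print("Nearest neighbor on raw features:   ", nearest_raw)
print("Nearest neighbor on scaled features:", nearest_scaled)
```

On the raw features the nearest neighbor is decided almost entirely by the credit limit, so the on-time payer "A" is picked. After standardization the repayment status carries comparable weight and the delinquent client "B" becomes the nearest neighbor.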
An analysis of the feature relationships revealed significant multicollinearity among the financial billing variables (BILL_AMT1 to BILL_AMT6), creating a high-redundancy block in the correlation matrix. This violates the conditional-independence assumption of simpler models like Naive Bayes. To stabilize modeling, the feature space was narrowed to the strongest behavioral predictors: the historical repayment tracking variables (PAY_X).
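A redundancy check of this kind can be sketched as follows. The data is synthetic (each month's bill is the previous month's bill plus noise), the 0.9 cutoff is an assumed value rather than one taken from the project, and `pearson` is a hand-written helper.

```python
import math
import random
from itertools import combinations

random.seed(0)

# Synthetic stand-ins for three billing columns: each month's bill is the
# previous month's bill plus a small random change, which makes them redundant.
n_clients = 500
bill_1 = [random.uniform(0, 100_000) for _ in range(n_clients)]
bill_2 = [b + random.gauss(0, 5_000) for b in bill_1]
bill_3 = [b + random.gauss(0, 5_000) for b in bill_2]
# An unrelated synthetic feature for contrast.
age = [random.randint(21, 70) for _ in range(n_clients)]

features = {"BILL_AMT1": bill_1, "BILL_AMT2": bill_2, "BILL_AMT3": bill_3, "AGE": age}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Flag feature pairs whose absolute correlation exceeds an assumed cutoff of 0.9.
redundant_pairs = []
for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
    r = pearson(col_a, col_b)
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
    if abs(r) > 0.9:
        redundant_pairs.append((name_a, name_b))
```

In this synthetic setup only the billing columns end up flagged as redundant with each other, which mirrors the high-redundancy block described above.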
Optimization focused heavily on two contrasting models: Gaussian Naive Bayes (for its high baseline sensitivity) and Random Forest (for its structural robustness).
- Why Naive Bayes was Rejected: While it achieved a high baseline recall of 0.63, it suffered from a lower overall accuracy of 71% (falsely flagging too many safe clients). Furthermore, threshold testing showed it was too inflexible for this data structure; dropping the decision boundary to 0.3 left the recall stagnant around 41–44%, suggesting it lacked the complexity to capture subtle risk profiles.
- Why Random Forest was Selected: Random Forest demonstrated stronger predictive power with an initial ROC-AUC score of 0.786 (compared to Naive Bayes' 0.727). Its probability estimates also offered a better basis for threshold tuning, which deliberately trades some overall accuracy for bank safety.
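For readers less familiar with ROC-AUC, the metric can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it that way on a tiny made-up example; `roc_auc` is a hand-written helper and the labels and scores are invented, not model outputs from this project.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC-AUC.")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Invented example: 1 = default, 0 = no default.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")  # 7 of the 9 defaulter/non-defaulter pairs are ordered correctly
```

Because AUC depends only on how the scores rank clients, it is unaffected by the choice of decision threshold, which is why it is a convenient way to compare models before any threshold is tuned.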
Standard machine learning models default to a classification threshold of 0.5, which tends to favor overall accuracy but can be dangerous in risk management. To align with the business goal of minimizing costly Type II errors (predicting a client is safe when they actually default), several decision thresholds were tested:
- Default Threshold (0.50): Good overall accuracy (78.5%), but risky because it missed 43.5% of all defaults (Recall: 56.5%).
- Aggressive Safety Threshold (0.30): Caught most defaults (Recall: 87.5%), but generated excessive false alarms, cutting overall accuracy to 55%.
- Optimal Business Threshold (0.40): Selected as the operating point. It captured more than two-thirds of risky customers (Recall: 69.0%) while keeping overall accuracy at a workable level (Accuracy: 72.1%).
- Total Samples Evaluated: 6,000 clients
- True Negatives (Correctly identified as Safe): 3,403
- True Positives (Correctly caught Defaulting): 924
- False Negatives (Type II Errors - Missed Defaults): 415
- False Positives (Type I Errors - False Alarms): 1,258
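The four confusion-matrix counts above can be checked against the reported metrics with a few lines of arithmetic. The sketch below also shows how a custom threshold turns predicted probabilities into class labels. The function names are hypothetical, the toy probabilities are invented, and treating a probability equal to the threshold as a default is an assumption.

```python
def apply_threshold(probabilities, threshold):
    """Label a client as default (1) when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) with default (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Derive headline metrics from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),       # share of actual defaults that were caught
        "precision": tp / (tp + fp),    # share of flagged clients who actually defaulted
    }


# Counts reported in the list above.
reported = summarize(tn=3403, fp=1258, fn=415, tp=924)
print(reported)

# Toy illustration (invented values): lowering the threshold raises recall.
toy_probs = [0.10, 0.35, 0.45, 0.55, 0.80]
toy_truth = [0, 1, 1, 0, 1]
recall_by_threshold = {}
for threshold in (0.50, 0.40, 0.30):
    preds = apply_threshold(toy_probs, threshold)
    recall_by_threshold[threshold] = summarize(*confusion_counts(toy_truth, preds))["recall"]
print(recall_by_threshold)
```

The reported counts sum to 6,000 clients and reproduce the 72.1% accuracy and 69.0% recall quoted for the 0.40 threshold. They also imply a precision of roughly 42%, meaning that more than half of the flagged clients are false alarms, which is the cost accepted in exchange for fewer missed defaults.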
To check that these results were stable and not a byproduct of a fortunate train/test split, 5-fold cross-validation was performed.
- Average Cross-Validated AUC: 0.7779
- Standard Deviation: 0.0046
The low standard deviation across folds indicates that the tuned Random Forest classifier performs consistently across different subsets of this dataset.
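The mechanics of k-fold cross-validation can be sketched in plain Python. This is not the project's pipeline: the data is synthetic, a toy nearest-centroid scorer stands in for the Random Forest, and all function names are hypothetical. It only shows how per-fold AUC values are produced and then summarized by their mean and standard deviation.

```python
import random
import statistics


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(xs, ys):
    """Toy stand-in model: the mean feature value of each class."""
    mean_0 = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    mean_1 = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    return mean_0, mean_1


def risk_score(model, x):
    """Higher when x is closer to the default-class centroid than to the safe-class one."""
    mean_0, mean_1 = model
    return (x - mean_0) ** 2 - (x - mean_1) ** 2


# Synthetic data: one feature, with about 22% of samples in the positive class (assumed rate).
rng = random.Random(42)
ys = [1 if rng.random() < 0.22 else 0 for _ in range(1000)]
xs = [rng.gauss(1.5 if y == 1 else 0.0, 1.0) for y in ys]

folds = k_fold_indices(len(xs), k=5)
fold_aucs = []
for i, test_idx in enumerate(folds):
    # Train on the other four folds, evaluate on the held-out fold.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    model = fit_centroids([xs[j] for j in train_idx], [ys[j] for j in train_idx])
    test_scores = [risk_score(model, xs[j]) for j in test_idx]
    fold_aucs.append(roc_auc([ys[j] for j in test_idx], test_scores))

mean_auc = statistics.fmean(fold_aucs)
std_auc = statistics.pstdev(fold_aucs)
print(f"Mean AUC: {mean_auc:.4f}, standard deviation: {std_auc:.4f}")
```

A small standard deviation relative to the mean indicates that no single fold is driving the result. The sketch uses the population standard deviation; whether the figure reported above is a population or sample standard deviation is not stated in this README.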