This project investigates customer behavior and predicts responses to marketing campaigns for a retail client. The primary goal is to provide an efficient method to prioritize customers for future campaigns and customize marketing messages based on customer segments.
- What customer segments exist based on value, recency, purchase frequency, and channel behavior?
- Which features correlate most strongly with campaign response?
- Who are the best customers to target, based on their predicted likelihood of responding?
The project uses the "Customer Personality Analysis" dataset from Kaggle.com, comprising 2,240 customer records and 29 variables, including:
- Demographics: Age, Income, Education, Marital Status
- Spending Behavior: Amount spent on various products (wines, meat, fruits, etc.)
- Customer Activity: Number of purchases by channel, recency (days since last purchase)
- Marketing Response: Binary variables for acceptance of previous campaigns and the most recent campaign.
Source (Kaggle): https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
- Handled missing `Income` values by imputation with the mean.
- Created an `Age` feature from `Year_Birth` and capped unrealistic ages.
- Dropped identifier (`ID`) and constant columns (`Z_CostContact`, `Z_Revenue`).
- Addressed negative values in spending and purchase-related columns.
- Clipped extreme outliers in `Income` and spending features.
- Engineered new features: `Total_Spending`, `Total_Purchases`, and `Spending_per_Purchase` to capture overall customer value and engagement.
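
The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the reference year, the age cap of 90, the 1st/99th percentile clipping bounds, the choice to set negative values to zero, and the two-column subsets in `SPEND_COLS` and `PURCHASE_COLS` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumption: only a subset of the spending/purchase columns, for brevity.
SPEND_COLS = ["MntWines", "MntMeatProducts"]
PURCHASE_COLS = ["NumWebPurchases", "NumStorePurchases"]


def clean_customers(df: pd.DataFrame, reference_year: int = 2021,
                    max_age: int = 90) -> pd.DataFrame:
    """Apply the cleaning and feature-engineering steps to a copy of df."""
    out = df.copy()

    # Mean-impute missing Income values.
    out["Income"] = out["Income"].fillna(out["Income"].mean())

    # Derive Age from Year_Birth and cap unrealistic ages (cap is an assumption).
    out["Age"] = (reference_year - out["Year_Birth"]).clip(upper=max_age)

    # Drop the identifier and the constant columns.
    out = out.drop(columns=["ID", "Z_CostContact", "Z_Revenue"], errors="ignore")

    # Treat negative spending/purchase counts as zero (assumed handling).
    count_cols = SPEND_COLS + PURCHASE_COLS
    out[count_cols] = out[count_cols].clip(lower=0)

    # Clip extreme values to the 1st/99th percentiles (assumed bounds).
    for col in ["Income"] + SPEND_COLS:
        low, high = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=low, upper=high)

    # Engineered features; customers with zero purchases get 0 spend per purchase.
    out["Total_Spending"] = out[SPEND_COLS].sum(axis=1)
    out["Total_Purchases"] = out[PURCHASE_COLS].sum(axis=1)
    per_purchase = out["Total_Spending"] / out["Total_Purchases"].replace(0, np.nan)
    out["Spending_per_Purchase"] = per_purchase.fillna(0.0)
    return out


# Tiny made-up frame to exercise the function.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Year_Birth": [1980, 1893, 1975, 1990],
    "Income": [50000.0, np.nan, 62000.0, 48000.0],
    "MntWines": [100, 20, -5, 300],
    "MntMeatProducts": [50, 10, 30, 200],
    "NumWebPurchases": [2, 0, 3, 5],
    "NumStorePurchases": [3, 0, 4, 6],
    "Z_CostContact": [3, 3, 3, 3],
    "Z_Revenue": [11, 11, 11, 11],
})
cleaned = clean_customers(demo)
```

Replacing zero purchase counts with `NaN` before dividing avoids infinite values in `Spending_per_Purchase`; filling them with 0 afterwards is one reasonable convention among several.
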
- Boxplot Analysis: Visualized feature distributions, confirming the need for scaling due to varying scales and outliers.
- Income Distribution: Showed a slight right-skew, justifying log-transformation.
- Spending Correlation Heatmap: Identified strong positive correlations between spending categories, particularly wine and meat, indicating high-value customers spend across multiple areas.
- Target Variable Distribution: Revealed a class imbalance (14.9% response rate), highlighting the need for appropriate evaluation metrics.
- Feature–Response Correlation (Point-Biserial): Identified key features like `Total_Spending`, `MntMeatProducts`, `NumWebPurchases`, and `Recency` as strong predictors of campaign response.
- Log-Odds Linearity Check: Confirmed near-linear trends for several features, supporting Logistic Regression.
- Non-Linear Patterns and Interaction Effects: Illustrated interactions between Age, Income, Recency, and Web Purchases, suggesting the value of Decision Trees.
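
As a rough illustration of the point-biserial check, the coefficient between a continuous feature and a 0/1 response is numerically the same as the Pearson correlation, so NumPy alone is enough. The data below is synthetic and the variable names only mirror the dataset's columns; none of the numbers come from the project.

```python
import numpy as np


def point_biserial(feature, response) -> float:
    """Point-biserial correlation between a continuous feature and a 0/1 response.

    For a binary response this equals the Pearson correlation coefficient.
    """
    x = np.asarray(feature, dtype=float)
    y = np.asarray(response, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic data: responders spend more and purchased more recently.
rng = np.random.default_rng(0)
response = rng.binomial(1, 0.15, size=2000)
total_spending = rng.gamma(2.0, 300.0, size=2000) + 400.0 * response
recency = rng.integers(0, 100, size=2000) - 15 * response

r_spending = point_biserial(total_spending, response)  # positive by construction
r_recency = point_biserial(recency, response)          # negative by construction
```

Ranking features by the absolute value of this coefficient gives a quick first screen, though it only captures linear association with the response.
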
Figure 1: EDA Boxplot Analysis of Min-Max
Figure 2: EDA Boxplot Analysis of Standard Deviation
Figure 3: Target Variable Distribution
- Objective: Segment customers into distinct behavioral groups.
- Features: `Income` (log-transformed and outlier-treated), `Age`, `Total_Spending`, `Total_Purchases`, `Spending_per_Purchase`.
- Preprocessing: Outlier handling, log transformation of `Income`, and `StandardScaler` for normalization.
- Dimensionality Reduction: PCA was applied to reduce feature redundancy and facilitate 2D visualization.
- Optimal Clusters: The Elbow Method, Silhouette Score, and Davies–Bouldin Index consistently suggested 3 optimal clusters.
- Cluster Profiles: Identified three distinct segments:
- Cluster 0: High-income, high-spending, high-frequency purchasers.
- Cluster 1: Lower-income, low-spending, low-engagement customers.
- Cluster 2: Middle-income, moderate-spending, and oldest on average.
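
The cluster-count selection can be sketched with scikit-learn as below. The data is three synthetic blobs rather than the customer table, the range of `k` is arbitrary, and clustering on the scaled features while using PCA only for plotting coordinates is an assumption about the workflow, not something the write-up confirms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five clustering features: three separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0, 0],
                    [6, 6, 6, 6, 6],
                    [-6, 6, -6, 6, -6]], dtype=float)
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

X_scaled = StandardScaler().fit_transform(X)

# Collect the three diagnostics for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = {
        "inertia": km.inertia_,                                  # for the elbow plot
        "silhouette": silhouette_score(X_scaled, labels),        # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels) # lower is better
    }

# Pick k by silhouette here; in practice all three diagnostics are compared.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
final_labels = KMeans(n_clusters=best_k, n_init=10,
                      random_state=42).fit_predict(X_scaled)

# Two principal components, used only as plotting coordinates.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
```

Cluster profiles such as the three described above would then come from grouping the original (unscaled) features by `final_labels` and comparing the group means.
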
Figure 4: K-Means Clustering
- Objective: Predict customer response to marketing campaigns.
- Features: All relevant demographic, spending, and purchase behavior features, one-hot encoded for categorical variables.
- Preprocessing: `StandardScaler` applied to numerical features.
- Class Imbalance: Addressed using SMOTE during training.
- Interpretation: Coefficients and odds ratios provided insights into feature importance:
- Positive Indicators: Prior campaign acceptance, web visits, meat spending, higher education.
- Negative Indicators: High recency (inactive), presence of teenagers, married/cohabiting status, in-store purchases.
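
To show how coefficients translate into odds ratios, here is a small sketch on synthetic data. It departs from the project in two labeled ways: it uses `class_weight="balanced"` as a stand-in for SMOTE so that only scikit-learn is required, and the two features (`Recency`, `PriorAccept`) are hypothetical simplifications of the real feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: response is less likely with high recency and
# more likely after a prior campaign acceptance.
rng = np.random.default_rng(0)
n = 3000
recency = rng.uniform(0, 100, n)
prior_accept = rng.binomial(1, 0.2, n)
true_logit = -2.0 - 0.03 * (recency - 50) + 1.5 * prior_accept
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

feature_names = ["Recency", "PriorAccept"]  # hypothetical names
X_scaled = StandardScaler().fit_transform(np.column_stack([recency, prior_accept]))

# class_weight="balanced" replaces SMOTE here (assumption, see lead-in).
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_scaled, y)

# exp(coefficient) is the multiplicative change in the odds of responding
# for a one-standard-deviation increase in the (scaled) feature.
odds_ratios = dict(zip(feature_names, np.exp(model.coef_[0])))
```

An odds ratio above 1 marks a positive indicator and one below 1 a negative indicator; because the inputs are standardized, the ratios are per standard deviation rather than per raw unit.
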
Figure 5: Coefficients in Logistic Regression
- Features: Similar to Logistic Regression, handling mixed data types intrinsically.
- Training & Tuning: Used `GridSearchCV` with `average_precision` scoring to optimize hyperparameters, including `class_weight` to manage imbalance.
- Threshold Tuning: Optimized the classification threshold on a validation set to maximize F1-score and Recall, selecting 0.7.
- Interpretation: The tree structure and feature importances revealed decision rules for response prediction. Key features included `Total_Spending`, `NumWebPurchases`, `Recency`, `MntMeatProducts`, and `Income`.
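
The tuning and threshold-selection steps might look roughly like this. The parameter grid, the synthetic data from `make_classification`, and the threshold range are assumptions, and the sketch maximizes F1 alone, which is a simplification of the F1-and-Recall criterion described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 15% positives) standing in for the customers.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Assumed grid; the project's actual grid is not given.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 30],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring="average_precision", cv=5)
search.fit(X_train, y_train)

# Scan thresholds on the validation set and keep the one with the best F1.
val_proba = search.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.1, 0.95, 0.05)
f1_by_threshold = [
    f1_score(y_val, (val_proba >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = float(thresholds[int(np.argmax(f1_by_threshold))])
```

Tuning the threshold on a held-out validation split rather than the training data keeps the chosen cut-off from being fitted to the same samples the tree has already memorized.
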
Figure 6: Decision Tree Visualization
- Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC, and Confusion Matrices were used.
- Performance:
- Logistic Regression: Achieved an ROC-AUC of 0.827, showing slightly better overall discrimination.
- Decision Tree: Achieved an ROC-AUC of 0.802. Crucially, it provided a lift of 2.97 in response rate (44.4% vs 15.0% overall) when targeting the top 20% of customers.
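
Lift at the top 20% is the response rate among the highest-scored fifth of customers divided by the overall response rate. With the rounded rates quoted above, 44.4 / 15.0 ≈ 2.96, which is consistent with the reported 2.97 if the underlying rates were unrounded. A minimal sketch of the calculation, using made-up labels and scores:

```python
import numpy as np


def lift_at_top(y_true, scores, fraction: float = 0.2) -> float:
    """Response rate in the top `fraction` by score, divided by the base rate."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_top = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(-scores, kind="stable")[:n_top]  # highest scores first
    return float(y_true[top_idx].mean() / y_true.mean())


# Made-up example: 10 customers, 2 responders, top 20% = 2 customers.
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.3, 0.05, 0.15, 0.25, 0.35])
example_lift = lift_at_top(y_true, scores)  # top-2 rate 0.5 / base rate 0.2 = 2.5
```
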
Figure 7: Model Comparison 1
Figure 8: Model Comparison 2
Figure 9: ROC Curve Comparison
- Decision Tree for Targeting: Use the Decision Tree to prioritize and target customers. Its lift in the top 20% is high, and its if-then rules are interpretable and easy to implement in CRM systems. Focusing on the highest-probability responders should improve campaign efficiency.
- Logistic Regression for Explanation: Use Logistic Regression to explain and justify marketing decisions. Its interpretable coefficients and stable probability scores show why certain customer attributes influence response, which supports strategic planning.
The two models are complementary, with the Decision Tree driving operational targeting and Logistic Regression providing strategic understanding.
- Ethical: Guard against unfair targeting. Regularly review selected customer lists to ensure no groups are consistently excluded based on proxies for social advantage.
- Privacy: Treat customer-level scores as sensitive data. Use minimal fields, anonymize identifiers where possible, and restrict access to campaign teams only.
- Security: Implement strict access controls for customer score lists. Ensure secure storage, audit trails, and limited reuse of scores to mitigate data breach risks.









