Dexory Technical Task - Warehouse Intelligence System

Applied AI Engineer Technical Assessment
Author: Alessandro Alati
Date: November 2025

📋 Project Overview

This project analyzes 10 days of warehouse scan data to extract actionable intelligence about inventory accuracy, error patterns, and operational issues. It includes a complete data pipeline, exploratory analysis, anomaly detection, and a containerized REST API.

🎯 Completed Tasks

✅ Point 1: Data Engineering Pipeline

Ingests 10 days of warehouse scan data (350K+ records)
Merges with warehouse layout (33K+ locations)
Outputs clean Parquet dataset with spatial features
Includes comprehensive unit tests
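
The merge step can be sketched as follows. This is a minimal illustration, not the code in data_pipeline.py: the column names (`location`, `status`, `aisle`, `shelf_level`), the `in_layout` flag, and the sample rows are all assumptions made for the example.

```python
import pandas as pd


def merge_scans_with_layout(scans: pd.DataFrame, layout: pd.DataFrame) -> pd.DataFrame:
    """Left-join scan records onto the layout, keeping every scan row."""
    merged = scans.merge(
        layout,
        on="location",           # hypothetical join key
        how="left",              # never drop a scan record
        validate="many_to_one",  # raises if the layout has duplicate locations
        indicator=True,          # adds a _merge column: "both" or "left_only"
    )
    # Flag scans whose location is missing from the layout instead of dropping them.
    merged["in_layout"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Illustrative data only, not taken from the real dataset.
scans = pd.DataFrame(
    {
        "location": ["A-01", "A-01", "B-07", "Z-99"],
        "status": ["ok", "error", "ok", "ok"],
    }
)
layout = pd.DataFrame(
    {
        "location": ["A-01", "B-07"],
        "aisle": ["A", "B"],
        "shelf_level": [0, 3],
    }
)

merged = merge_scans_with_layout(scans, layout)
```

Comparing `len(merged)` with `len(scans)` after a left join is one simple way to check that no scan records were lost.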

✅ Point 2: Exploratory Data Analysis (EDA)

WHAT: Daily accuracy trends and error type breakdown
WHERE: Spatial hotspots (shelf levels, aisles, height correlations)
WHEN: Velocity analysis (fast-moving locations)
Generates 8 publication-quality visualizations
Statistical validation (chi-square tests, correlations)

✅ Point 3: Anomaly Detection

Composite risk scoring model (error severity + operational impact)
Identifies Top 20 most problematic locations
Transparent, explainable scoring system
Output: Ranked CSV with actionable metrics

✅ Point 4: Scalable & Containerized API

FastAPI application with 4 endpoints
Docker containerization with docker-compose
Interactive Swagger documentation
Health checks and error handling

✅ Point 5: Error Prediction Model

Random Forest classifier (zero-to-one model)
Predicts high/low error risk from static features only
Works on new warehouses with no scan history
65% accuracy (15 percentage points above the 50% baseline, a 30% relative improvement)

📁 Project Structure

dexory-technical-task/
├── core_scripts/               # Main analysis scripts
│   ├── data_pipeline.py        # Data ingestion and cleaning
│   ├── eda.py                  # Exploratory data analysis
│   ├── anomaly_detection.py    # Top 20 problematic locations
│   ├── error_prediction.py     # ML model for error prediction
│   └── test_pipeline.py        # Unit tests
│
├── API/
│   └── warehouse-api/          # FastAPI application
│       ├── app/main.py         # API endpoints
│       ├── Dockerfile          # Container definition
│       ├── docker-compose.yml  # Docker orchestration
│       └── requirements.txt    # API dependencies
│
├── data_models/
│   ├── technical-task-data/    # Raw input data (10 days)
│   └── output/                 # Processed data & models
│       ├── warehouse_data.parquet
│       ├── top_20_problematic.csv
│       ├── error_predictor.pkl
│       └── eda_plots/          # Analysis visualizations
│
└── README.md                   # This file

🚀 Quick Start

Prerequisites

Python 3.11+
Docker Desktop (for API)

1. Install Dependencies

pip install -r requirements.txt

2. Run the Complete Pipeline

# Step 1: Process data (Point 1)
cd core_scripts
python data_pipeline.py

# Step 2: Run EDA (Point 2)
python eda.py

# Step 3: Detect anomalies (Point 3)
python anomaly_detection.py

# Step 4: Train prediction model (Point 5)
python error_prediction.py

3. Launch the API (Point 4)

# Navigate to API folder
cd ../API/warehouse-api

# Run with Docker
docker-compose up --build

# Access API at:
# http://localhost:8000/docs

📊 Key Results

Inventory Accuracy

Mean Accuracy: 75.07%
Range: 3.09% - 99.30%
Most Common Error: Unknown item found (3.83%)

Spatial Insights

Ground shelves: 6.24% error rate (highest)
High shelves: 1.37% error rate (lowest)
Most problematic aisle: AZ 1 (10.64% error rate)
Significant correlation: Shelf level affects error rate (p < 0.001)
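
A chi-square test of independence is one way to check a shelf-level effect like this. The sketch below uses `scipy.stats.chi2_contingency` on illustrative counts: the 10,000 scans per shelf level are hypothetical, chosen only so the rates match the 6.24% and 1.37% quoted above. It is not the analysis in eda.py.

```python
from scipy.stats import chi2_contingency

# Illustrative counts only (NOT the real data): rows are shelf levels,
# columns are [scans with an error, scans without an error].
observed = [
    [624, 9376],  # ground shelves: 6.24% of 10,000 hypothetical scans
    [137, 9863],  # high shelves: 1.37% of 10,000 hypothetical scans
]

# Returns the test statistic, p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(observed)

significant = p_value < 0.001
print(f"chi2={chi2:.1f}, dof={dof}, significant at 0.001: {significant}")
```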

Velocity Analysis

High-velocity locations: ~650 (2% of total)
Static locations: 25,234 (75% of total)
Finding: High-velocity locations have significantly more errors
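
This README does not define velocity. One plausible definition, assumed here purely for illustration, is the number of day-to-day changes in what is scanned at a location. The function names, the `high_threshold` cut-off, and the sample history below are hypothetical and not taken from eda.py.

```python
from typing import Dict, List, Optional


def count_changes(daily_items: List[Optional[str]]) -> int:
    """Count how often a location's scanned item differs from the previous day."""
    return sum(
        1 for prev, curr in zip(daily_items, daily_items[1:]) if prev != curr
    )


def classify_velocity(
    history: Dict[str, List[Optional[str]]], high_threshold: int = 3
) -> Dict[str, str]:
    """Label each location as 'static', 'moving', or 'high' velocity.

    high_threshold is a hypothetical cut-off, not the value used in eda.py.
    """
    labels = {}
    for location, items in history.items():
        changes = count_changes(items)
        if changes == 0:
            labels[location] = "static"
        elif changes >= high_threshold:
            labels[location] = "high"
        else:
            labels[location] = "moving"
    return labels


# Made-up example: item IDs seen at each location on consecutive days
# (None means the location was scanned empty).
history = {
    "loc-1": ["sku-a", "sku-a", "sku-a", "sku-a"],
    "loc-2": ["sku-a", "sku-b", None, "sku-c"],
    "loc-3": ["sku-a", "sku-a", "sku-b", "sku-b"],
}
labels = classify_velocity(history)
```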

Anomaly Detection

Top 20 problematic locations identified
Scoring factors: Error severity, i.e. error rate (40%), Operational impact (30%), Error type severity (20%), Spatial context (10%)
Highest risk score: 0.52 (Location with 18.5% error rate + high velocity)

Prediction Model

Algorithm: Random Forest (balanced class weights)
Accuracy: 65% (vs 50% baseline)
High Error Recall: 73% (catches 73% of problematic locations)
Key features: Shelf height, position, aisle location

🔌 API Endpoints

The API serves analysis results and predictions:

| Endpoint | Description |
|---|---|
| `GET /health` | System health check |
| `GET /warehouse/anomalies` | Top 20 problematic locations |
| `GET /warehouse/stats` | Daily accuracy trends & error breakdown |
| `GET /location/{name}` | Detailed location analysis + prediction |

Interactive docs: http://localhost:8000/docs

See API/warehouse-api/README.md for detailed API documentation.
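
A small standard-library client for these endpoints might look like the sketch below. The response schemas are not documented in this README, so the code only decodes whatever JSON comes back. The helper names (`location_url`, `get_json`, `demo`) and the sample location name are assumptions for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address from the Quick Start section


def location_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /location/{name}, escaping spaces and slashes."""
    return f"{base_url}/location/{quote(name, safe='')}"


def get_json(url: str, timeout: float = 5.0):
    """Fetch a URL and decode the JSON body (requires the API to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def demo() -> None:
    """Call each endpoint once; only works while the container is up."""
    print(get_json(f"{BASE_URL}/health"))
    print(get_json(f"{BASE_URL}/warehouse/anomalies"))
    print(get_json(f"{BASE_URL}/warehouse/stats"))
    # "AZ 1-001" is a made-up location name used to show URL escaping.
    print(get_json(location_url("AZ 1-001")))
```

Call `demo()` after `docker-compose up --build` has started the service. Location names are percent-encoded because they may contain spaces.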

🧪 Testing

Run unit tests:

cd core_scripts
pytest test_pipeline.py -v

Test coverage includes:

Data loading and validation
Merge operations (no data loss)
Feature extraction
Edge case handling

📈 Visualizations

The analysis scripts write their visualizations to data_models/output/eda_plots/ (8 are reported; the 7 files named here are):

daily_accuracy.png - Accuracy trends over 10 days
status_breakdown.png - Overall status distribution
substatus_breakdown.png - Top 15 error types
spatial_hotspots.png - Error rates by shelf level and aisle
fast_moving_locations.png - Top 20 highest velocity locations
problematic_locations.png - Top 20 risk scores
error_prediction_model.png - Model performance metrics

💡 Key Insights

Operational Recommendations

1. Ground-Level Shelves Need Attention
   - Despite easy access, ground shelves have the highest error rate (6.24%)
   - Hypothesis: Rushing, picking interference, or label damage
   - Action: Investigate workflows for ground-level operations
2. Aisle AZ 1 Requires Investigation
   - 10.64% error rate (2.7x warehouse average)
   - May indicate: lighting issues, layout problems, or label quality
   - Action: On-site audit of physical conditions
3. Fast-Moving Locations = Higher Risk
   - Positive correlation between velocity and error rate
   - More handling = more opportunities for errors
   - Action: Implement more frequent audits for high-velocity locations
4. Predictive Model Enables Proactive Management
   - Can identify high-risk locations before they accumulate errors
   - Works on new warehouses (zero-to-one capability)
   - Action: Deploy for ongoing monitoring and early intervention

🛠️ Technologies Used

Data Processing: pandas, numpy, pyarrow
Validation: pydantic
Machine Learning: scikit-learn (Random Forest)
Visualization: matplotlib, seaborn
Statistical Analysis: scipy
API: FastAPI, uvicorn
Containerization: Docker, docker-compose
Testing: pytest

📝 Model Justification

Why Composite Risk Scoring (Point 3)?

Chosen over unsupervised methods (Isolation Forest, DBSCAN) because:

✅ Transparent and explainable to stakeholders
✅ Incorporates domain knowledge (error severity weights)
✅ Tunable based on business priorities
✅ Every component can be validated independently
✅ Produces actionable insights

Formula:

Risk Score = 0.40 × Error_Severity + 
             0.30 × Operational_Impact + 
             0.20 × Error_Type_Severity + 
             0.10 × Spatial_Context
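
The formula above can be expressed in a few lines of Python. This is a minimal sketch, not the code in anomaly_detection.py. It assumes each component has already been normalized to [0, 1]. The README does not say how the components are computed, so the sample values below are made up.

```python
# Weights taken from the formula above.
WEIGHTS = {
    "error_severity": 0.40,
    "operational_impact": 0.30,
    "error_type_severity": 0.20,
    "spatial_context": 0.10,
}


def risk_score(components: dict) -> float:
    """Weighted sum of components, each assumed to be normalized to [0, 1]."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def top_n(scores: dict, n: int = 20) -> list:
    """Return the n locations with the highest risk score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up component values for two hypothetical locations.
scores = {
    "loc-1": risk_score({"error_severity": 0.9, "operational_impact": 0.5,
                         "error_type_severity": 0.2, "spatial_context": 0.1}),
    "loc-2": risk_score({"error_severity": 0.1, "operational_impact": 0.2,
                         "error_type_severity": 0.0, "spatial_context": 0.5}),
}
ranking = top_n(scores)
```

Because the weights sum to 1 and every component lies in [0, 1], the score itself stays in [0, 1], which keeps it comparable across locations.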

Why Random Forest (Point 5)?

Chosen for zero-to-one prediction because:

✅ Handles mixed feature types (numeric + categorical)
✅ Robust to class imbalance (with class_weight='balanced')
✅ Provides feature importances (interpretability)
✅ No feature scaling required
✅ Proven performance on tabular data

Alternatives considered: Logistic Regression (too simple), XGBoost (overkill for this data size)
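
For reference, a balanced-class-weight Random Forest on static features can be set up as sketched below. The data is synthetic and the feature names (`shelf_height`, `position`, `aisle`) are stand-ins for the key features listed under Key Results. The hyperparameters and the integer encoding of the aisle are assumptions, not the settings used in error_prediction.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic static features (hypothetical names): shelf height in metres,
# position along the aisle, and an integer-encoded aisle ID.
shelf_height = rng.uniform(0.0, 10.0, n)
position = rng.integers(0, 50, n)
aisle = rng.integers(0, 20, n)
X = np.column_stack([shelf_height, position, aisle])

# Synthetic label: low shelves are more likely to be "high error" (1).
y = (rng.uniform(0.0, 1.0, n) < np.where(shelf_height < 2.0, 0.7, 0.2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=0,
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
high_error_recall = recall_score(y_test, y_pred, pos_label=1)
importances = dict(
    zip(["shelf_height", "position", "aisle"], model.feature_importances_)
)
```

The scores this produces reflect only the synthetic labels and say nothing about the 65% accuracy or 73% recall reported for the real model.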

📄 Requirements

See requirements.txt for complete dependencies.

Core packages:

pandas>=2.0.0
scikit-learn>=1.3.0
fastapi>=0.104.0
uvicorn>=0.24.0

Built with FastAPI • Docker • Python 3.11 🚀
