numericalmachinelearning/dexory-technical-task

Dexory Technical Task - Warehouse Intelligence System

Applied AI Engineer Technical Assessment
Author: Alessandro Alati
Date: November 2025

📋 Project Overview

This project analyzes 10 days of warehouse scan data to extract actionable intelligence about inventory accuracy, error patterns, and operational issues. It includes a complete data pipeline, exploratory analysis, anomaly detection, and a containerized REST API.


🎯 Completed Tasks

✅ Point 1: Data Engineering Pipeline

  • Ingests 10 days of warehouse scan data (350K+ records)
  • Merges with warehouse layout (33K+ locations)
  • Outputs clean Parquet dataset with spatial features
  • Includes comprehensive unit tests
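
The merge step can be illustrated with a minimal sketch. It is not the code in data_pipeline.py: the project lists pandas, whereas this version uses plain Python so it runs without dependencies, and the field names (`name`, `shelf_level`, `status`) and the helper `merge_scans_with_layout` are hypothetical. It joins each scan to its layout row and reports scans with no matching location, so no record disappears silently.

```python
from collections import Counter


def merge_scans_with_layout(scans, layout, key="name"):
    """Attach layout attributes to each scan record.

    Returns (merged, unmatched). Every input scan ends up in exactly one
    of the two lists, so the row count can be checked after the merge.
    """
    # Duplicate layout keys would make the join ambiguous, so reject them.
    counts = Counter(row[key] for row in layout)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate layout keys: {duplicates}")

    layout_by_key = {row[key]: row for row in layout}
    merged, unmatched = [], []
    for scan in scans:
        location = layout_by_key.get(scan[key])
        if location is None:
            unmatched.append(scan)
        else:
            # Scan fields win if both records share a column name.
            merged.append({**location, **scan})
    return merged, unmatched


# Hypothetical sample records, for illustration only.
layout = [
    {"name": "LOC-001", "shelf_level": 0},
    {"name": "LOC-002", "shelf_level": 3},
]
scans = [
    {"name": "LOC-001", "day": 1, "status": "ok"},
    {"name": "LOC-002", "day": 1, "status": "error"},
    {"name": "LOC-999", "day": 1, "status": "ok"},
]
merged, unmatched = merge_scans_with_layout(scans, layout)
```

Checking `len(merged) + len(unmatched) == len(scans)` after the call is one simple way to confirm that the merge lost no rows.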

✅ Point 2: Exploratory Data Analysis (EDA)

  • WHAT: Daily accuracy trends and error type breakdown
  • WHERE: Spatial hotspots (shelf levels, aisles, height correlations)
  • WHEN: Velocity analysis (fast-moving locations)
  • Generates 8 publication-quality visualizations
  • Statistical validation (chi-square tests, correlations)

✅ Point 3: Anomaly Detection

  • Composite risk scoring model (error severity + operational impact)
  • Identifies Top 20 most problematic locations
  • Transparent, explainable scoring system
  • Output: Ranked CSV with actionable metrics

✅ Point 4: Scalable & Containerized API

  • FastAPI application with 4 endpoints
  • Docker containerization with docker-compose
  • Interactive Swagger documentation
  • Health checks and error handling

✅ Point 5: Error Prediction Model

  • Random Forest classifier (zero-to-one model)
  • Predicts high/low error risk from static features only
  • Works on new warehouses with no scan history
  • 65% accuracy (15 percentage points, or 30% in relative terms, above the 50% baseline)

📁 Project Structure

dexory-technical-task/
├── core_scripts/               # Main analysis scripts
│   ├── data_pipeline.py        # Data ingestion and cleaning
│   ├── eda.py                  # Exploratory data analysis
│   ├── anomaly_detection.py    # Top 20 problematic locations
│   ├── error_prediction.py     # ML model for error prediction
│   └── test_pipeline.py        # Unit tests
│
├── API/
│   └── warehouse-api/          # FastAPI application
│       ├── app/main.py         # API endpoints
│       ├── Dockerfile          # Container definition
│       ├── docker-compose.yml  # Docker orchestration
│       └── requirements.txt    # API dependencies
│
├── data_models/
│   ├── technical-task-data/    # Raw input data (10 days)
│   └── output/                 # Processed data & models
│       ├── warehouse_data.parquet
│       ├── top_20_problematic.csv
│       ├── error_predictor.pkl
│       └── eda_plots/          # Analysis visualizations
│
└── README.md                   # This file

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Docker Desktop (for API)

1. Install Dependencies

pip install -r requirements.txt

2. Run the Complete Pipeline

# Step 1: Process data (Point 1)
cd core_scripts
python data_pipeline.py

# Step 2: Run EDA (Point 2)
python eda.py

# Step 3: Detect anomalies (Point 3)
python anomaly_detection.py

# Step 4: Train prediction model (Point 5)
python error_prediction.py

3. Launch the API (Point 4)

# Navigate to API folder
cd ../API/warehouse-api

# Run with Docker
docker-compose up --build

# Access API at:
# http://localhost:8000/docs

📊 Key Results

Inventory Accuracy

  • Mean Accuracy: 75.07%
  • Range: 3.09% - 99.30%
  • Most Common Error: Unknown item found (3.83%)

Spatial Insights

  • Ground shelves: 6.24% error rate (highest)
  • High shelves: 1.37% error rate (lowest)
  • Most problematic aisle: AZ 1 (10.64% error rate)
  • Significant association: shelf level and error rate are not independent (chi-square test, p < 0.001)
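
For readers unfamiliar with the test behind the p-value, here is a minimal sketch of the chi-square statistic for a shelf-level by outcome contingency table. The project lists scipy for this; the sketch uses plain Python to stay dependency-free, computes only the statistic and degrees of freedom (not the p-value), and the counts are made up for illustration rather than taken from the dataset.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows of observed counts, for example one row per
    shelf-level group and one column per outcome (error, no error).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if shelf level and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Illustrative counts only: rows are (ground, high), columns are (error, ok).
example_table = [[62, 938], [14, 986]]
statistic, dof = chi_square_statistic(example_table)
```

With one degree of freedom, a statistic above roughly 10.83 corresponds to p < 0.001.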

Velocity Analysis

  • High-velocity locations: ~650 (2% of total)
  • Static locations: 25,234 (75% of total)
  • Finding: High-velocity locations have significantly more errors
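
The README does not state how velocity is measured. The sketch below assumes one plausible definition: the number of times a location's observed contents change between consecutive daily scans. The function names and the `high_threshold` value are hypothetical and not taken from eda.py.

```python
def count_changes(daily_contents):
    """Count how often the contents differ between consecutive scans."""
    return sum(
        1 for previous, current in zip(daily_contents, daily_contents[1:])
        if previous != current
    )


def classify_velocity(daily_contents, high_threshold=5):
    """Label a location as 'static', 'normal', or 'high' velocity.

    `high_threshold` is an assumed cut-off, not a value from the project.
    """
    changes = count_changes(daily_contents)
    if changes == 0:
        return "static"
    if changes >= high_threshold:
        return "high"
    return "normal"


# Hypothetical 10-day histories of what was seen at three locations.
static_location = ["pallet-A"] * 10
busy_location = ["pallet-A", "pallet-B"] * 5
quiet_location = ["pallet-A"] * 5 + ["pallet-B"] * 5
labels = [
    classify_velocity(history)
    for history in (static_location, busy_location, quiet_location)
]
```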

Anomaly Detection

  • Top 20 problematic locations identified
  • Scoring factors: Error rate (40%), Operational impact (30%), Error-type severity (20%), Spatial context (10%)
  • Highest risk score: 0.52 (Location with 18.5% error rate + high velocity)

Prediction Model

  • Algorithm: Random Forest (balanced class weights)
  • Accuracy: 65% (vs 50% baseline)
  • High Error Recall: 73% (catches 73% of problematic locations)
  • Key features: Shelf height, position, aisle location
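
The reported figures relate to each other as in the sketch below. The confusion-matrix counts are illustrative values chosen to reproduce 65% accuracy and 73% recall on a balanced set of 200 locations. They are not the model's actual counts, and the function names are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of truly high-error locations that the model flags."""
    return tp / (tp + fn)


def lift_over_baseline(model_accuracy, baseline_accuracy):
    """Return (absolute, relative) improvement over the baseline."""
    absolute = model_accuracy - baseline_accuracy
    relative = absolute / baseline_accuracy
    return absolute, relative


# Illustrative counts: 100 high-error and 100 low-error locations.
tp, fn = 73, 27   # high-error locations caught / missed
tn, fp = 57, 43   # low-error locations cleared / falsely flagged

model_accuracy = accuracy(tp, fp, tn, fn)   # 130 correct out of 200
high_error_recall = recall(tp, fn)          # 73 caught out of 100
absolute_lift, relative_lift = lift_over_baseline(model_accuracy, 0.50)
```

The absolute lift is 15 percentage points and the relative lift is 30%, which is how both descriptions of the baseline comparison in this README can hold at once.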

🔌 API Endpoints

The API serves analysis results and predictions:

Endpoint                   Description
GET /health                System health check
GET /warehouse/anomalies   Top 20 problematic locations
GET /warehouse/stats       Daily accuracy trends & error breakdown
GET /location/{name}       Detailed location analysis + prediction

Interactive docs: http://localhost:8000/docs

See API/warehouse-api/README.md for detailed API documentation.
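
A small client sketch for calling the endpoints from Python, using only the standard library. It assumes the API is running on localhost:8000 as in the Quick Start; the helper names are hypothetical and the response schema is not shown here, so the JSON is returned as-is. Location names are percent-encoded in case they contain spaces or slashes. "AZ 1" is an aisle label from this README and is used only as an example of such a string.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"


def location_url(name, base_url=BASE_URL):
    """Build the URL for GET /location/{name}, encoding the name safely."""
    return f"{base_url}/location/{quote(name, safe='')}"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body (requires the API to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building a URL needs no running server.
example_url = location_url("AZ 1")  # http://localhost:8000/location/AZ%201

# With the container up, the endpoints could be queried like this:
# health = fetch_json(f"{BASE_URL}/health")
# anomalies = fetch_json(f"{BASE_URL}/warehouse/anomalies")
```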


🧪 Testing

Run unit tests:

cd core_scripts
pytest test_pipeline.py -v

Test coverage includes:

  • Data loading and validation
  • Merge operations (no data loss)
  • Feature extraction
  • Edge case handling

📈 Visualizations

The analysis scripts write their plots to data_models/output/eda_plots/. The README states that 8 visualizations are generated; seven are listed here:

  1. daily_accuracy.png - Accuracy trends over 10 days
  2. status_breakdown.png - Overall status distribution
  3. substatus_breakdown.png - Top 15 error types
  4. spatial_hotspots.png - Error rates by shelf level and aisle
  5. fast_moving_locations.png - Top 20 highest velocity locations
  6. problematic_locations.png - Top 20 risk scores
  7. error_prediction_model.png - Model performance metrics

💡 Key Insights

Operational Recommendations

  1. Ground-Level Shelves Need Attention

    • Despite easy access, ground shelves have the highest error rate (6.24%)
    • Hypothesis: Rushing, picking interference, or label damage
    • Action: Investigate workflows for ground-level operations
  2. Aisle AZ 1 Requires Investigation

    • 10.64% error rate (2.7x warehouse average)
    • May indicate: lighting issues, layout problems, or label quality
    • Action: On-site audit of physical conditions
  3. Fast-Moving Locations = Higher Risk

    • Positive correlation between velocity and error rate
    • More handling = more opportunities for errors
    • Action: Implement more frequent audits for high-velocity locations
  4. Predictive Model Enables Proactive Management

    • Can identify high-risk locations before they accumulate errors
    • Works on new warehouses (zero-to-one capability)
    • Action: Deploy for ongoing monitoring and early intervention

🛠️ Technologies Used

  • Data Processing: pandas, numpy, pyarrow
  • Validation: pydantic
  • Machine Learning: scikit-learn (Random Forest)
  • Visualization: matplotlib, seaborn
  • Statistical Analysis: scipy
  • API: FastAPI, uvicorn
  • Containerization: Docker, docker-compose
  • Testing: pytest

📝 Model Justification

Why Composite Risk Scoring (Point 3)?

Chosen over unsupervised methods (Isolation Forest, DBSCAN) because:

  • ✅ Transparent and explainable to stakeholders
  • ✅ Incorporates domain knowledge (error severity weights)
  • ✅ Tunable based on business priorities
  • ✅ Every component can be validated independently
  • ✅ Produces actionable insights

Formula:

Risk Score = 0.40 × Error_Severity + 
             0.30 × Operational_Impact + 
             0.20 × Error_Type_Severity + 
             0.10 × Spatial_Context
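
The formula can be sketched in Python as follows. This is not the code in anomaly_detection.py: it assumes each component has already been normalized to the range [0, 1], and the dictionary keys and helper names are hypothetical. The Key Results section describes the 0.40 term as the error rate.

```python
WEIGHTS = {
    "error_severity": 0.40,
    "operational_impact": 0.30,
    "error_type_severity": 0.20,
    "spatial_context": 0.10,
}


def risk_score(components):
    """Weighted sum of the four components, each assumed to lie in [0, 1]."""
    missing = sorted(set(WEIGHTS) - set(components))
    if missing:
        raise KeyError(f"missing components: {missing}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {components[name]}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def top_n(scored_locations, n=20):
    """Return the n (location, score) pairs with the highest risk score."""
    return sorted(scored_locations, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical component values for two locations.
locations = {
    "LOC-001": {"error_severity": 0.8, "operational_impact": 0.5,
                "error_type_severity": 0.4, "spatial_context": 0.2},
    "LOC-002": {"error_severity": 0.1, "operational_impact": 0.2,
                "error_type_severity": 0.0, "spatial_context": 0.5},
}
scored = [(name, risk_score(parts)) for name, parts in locations.items()]
ranking = top_n(scored, n=20)
```

Because the weights sum to 1, the score stays in [0, 1] whenever the components do, which keeps scores comparable across locations.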

Why Random Forest (Point 5)?

Chosen for zero-to-one prediction because:

  • ✅ Handles mixed feature types (numeric + categorical)
  • ✅ Robust to class imbalance (with class_weight='balanced')
  • ✅ Provides feature importances (interpretability)
  • ✅ No feature scaling required
  • ✅ Proven performance on tabular data

Alternatives considered: Logistic Regression (too simple) and XGBoost (overkill for this data size)
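
To make the `class_weight='balanced'` option concrete, the sketch below reproduces the weighting rule documented by scikit-learn, n_samples / (n_classes × class count), in plain Python. The label counts are hypothetical and the helper name is not part of any library.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and common classes below 1, so each
    class contributes equally in total during training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution: 75 low-error and 25 high-error locations.
labels = ["low"] * 75 + ["high"] * 25
weights = balanced_class_weights(labels)
```

In this example the minority class is weighted three times as heavily as the majority class, which is what lets the classifier trade some overall accuracy for higher recall on high-error locations.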


📄 Requirements

See requirements.txt for complete dependencies.

Core packages:

pandas>=2.0.0
scikit-learn>=1.3.0
fastapi>=0.104.0
uvicorn>=0.24.0

Built with FastAPI • Docker • Python 3.11 🚀

