A comprehensive Python-based data analyst agent API that uses Large Language Models (LLMs) to automatically source, prepare, analyze, and visualize data.
- LLM-Powered Analysis: Uses OpenAI GPT-4 or Anthropic Claude for intelligent data analysis planning
- Multiple Data Sources: Supports file uploads (CSV, JSON, Excel, Parquet) and URL-based data fetching
- Automated Processing: Intelligent data cleaning, missing value handling, and preprocessing
- Comprehensive Analysis: Statistical analysis, correlation analysis, regression, clustering, and time series
- Rich Visualizations: Generates histograms, heatmaps, scatter plots, box plots, and interactive dashboards
- RESTful API: FastAPI-based with automatic documentation and async support
- 3-Minute Timeout: Analysis requests are designed to return within 3 minutes (see ANALYSIS_TIMEOUT in the configuration)
- Containerized: Docker support for easy deployment
POST /api/
Parameters:
- file (optional): Data file upload (CSV, JSON, Excel, etc.)
- question: Natural language analysis request
- format (optional): Response format (default: "json")
- data_source (optional): URL to fetch data from
Example Request:
curl -X POST "http://localhost:8000/api/" \
-F "file=@data.csv" \
-F "question=Analyze the correlation between sales and marketing spend"
- GET / - Root endpoint with API information
- GET /health - Health check
- GET /docs - Interactive API documentation
- GET /api/visualization/{analysis_id}/{filename} - Serve visualization files
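The POST /api/ parameters can also be sent from Python. The sketch below is one possible client, not part of the project: the helper names `build_form_fields` and `submit_analysis` are hypothetical, the 200-second client timeout is an assumption chosen to sit above the 3-minute server limit, and the code assumes the response body is JSON (the documented default format). Sending requires the third-party `requests` package.

```python
from pathlib import Path

API_URL = "http://localhost:8000/api/"  # local address used in the curl examples


def build_form_fields(question, data_source=None, response_format="json"):
    """Build the text form fields documented for POST /api/."""
    if not question or not question.strip():
        raise ValueError("'question' is the only required parameter")
    fields = {"question": question.strip(), "format": response_format}
    if data_source:
        fields["data_source"] = data_source  # URL to fetch data from
    return fields


def submit_analysis(question, file_path=None, data_source=None, timeout=200):
    """Send the request and return the decoded JSON body (assumed format)."""
    import requests  # imported lazily so build_form_fields works without it

    fields = build_form_fields(question, data_source)
    if file_path is None:
        # Without a file, requests sends the fields URL-encoded, not multipart.
        response = requests.post(API_URL, data=fields, timeout=timeout)
    else:
        with Path(file_path).open("rb") as handle:
            response = requests.post(
                API_URL, data=fields, files={"file": handle}, timeout=timeout
            )
    response.raise_for_status()
    return response.json()
```

With the server running, `submit_analysis("Analyze the correlation between sales and marketing spend", file_path="data.csv")` would mirror the curl request shown earlier.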
- Clone and Setup:
git clone <repository-url>
cd data-analyst-agent
- Install Dependencies:
pip install -r requirements.txt
- Environment Configuration:
cp .env.example .env
# Edit .env with your API keys and configuration
- Run the Application:
python main.py
The API will be available at http://localhost:8000
- Using Docker Compose (Recommended):
docker-compose up -d
- Using Docker directly:
docker build -t data-analyst-agent .
docker run -p 8000:8000 -v $(pwd)/data:/app/data data-analyst-agent

curl -X POST "http://localhost:8000/api/" \
-F "file=@sales_data.csv" \
-F "question=Provide summary statistics and identify trends"

curl -X POST "http://localhost:8000/api/" \
-F "file=@customer_data.csv" \
-F "question=Find correlations between customer demographics and purchase behavior"

curl -X POST "http://localhost:8000/api/" \
-F "file=@stock_prices.csv" \
-F "question=Analyze the time series trend and seasonality patterns"

curl -X POST "http://localhost:8000/api/" \
-F "data_source=https://api.example.com/data.json" \
-F "question=Analyze the distribution of values and detect outliers"

data-analyst-agent/
├── main.py                   # FastAPI application entry point
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── LICENSE                   # MIT License
├── Dockerfile                # Container configuration
├── docker-compose.yml        # Multi-container setup
├── .env.example              # Environment configuration template
├── app/
│   ├── __init__.py
│   ├── api/
│   │   ├── __init__.py
│   │   ├── routes.py         # API endpoints
│   │   └── models.py         # Pydantic models
│   ├── core/
│   │   ├── __init__.py
│   │   ├── config.py         # Application configuration
│   │   └── llm_client.py     # LLM integration
│   ├── services/
│   │   ├── __init__.py
│   │   ├── data_sourcer.py   # Data loading service
│   │   ├── data_processor.py # Data cleaning service
│   │   ├── analyzer.py       # Analysis service
│   │   └── visualizer.py     # Visualization service
│   └── utils/
│       ├── __init__.py
│       └── file_handler.py   # File operations
├── data/
│   ├── uploads/              # Uploaded files
│   └── outputs/              # Generated visualizations
└── tests/                    # Test files (to be implemented)
Create a .env file based on .env.example:
# LLM API Keys (at least one required for advanced features)
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=False
# Processing Limits
MAX_FILE_SIZE=104857600 # 100MB
ANALYSIS_TIMEOUT=180 # 3 minutes

The system supports multiple analysis types:
- Descriptive Analysis: Summary statistics, data quality assessment
- Diagnostic Analysis: Correlation analysis, distribution analysis
- Predictive Analysis: Regression models, feature importance
- Prescriptive Analysis: Clustering, segmentation
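As a rough illustration of the descriptive and diagnostic analysis types, the sketch below computes summary statistics and a Pearson correlation coefficient using only the standard library. It is not the project's implementation (the README mentions pandas for the real pipeline); the function names `describe` and `pearson` and the sample numbers are invented for this example.

```python
import math
import statistics


def describe(values):
    """Summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, not project data
marketing_spend = [10.0, 20.0, 30.0, 40.0]
sales = [25.0, 45.0, 65.0, 85.0]
summary = describe(sales)
r = pearson(marketing_spend, sales)
```

Because `sales` here is an exact linear function of `marketing_spend`, `r` comes out at the maximum value of 1; real data would land somewhere between -1 and 1.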
Automatically generates appropriate visualizations:
- Histograms: Distribution of numeric variables
- Correlation Heatmaps: Relationships between variables
- Scatter Plots: Bivariate relationships with trend lines
- Box Plots: Outlier detection and quartile analysis
- Bar Charts: Categorical data distribution
- Line Charts: Time series and trends
- Interactive Dashboards: HTML summaries with insights
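The outlier detection that a box plot conveys is commonly based on the 1.5 × IQR rule. The sketch below shows that rule with the standard library; whether the project's visualizer uses this exact rule or quartile method is an assumption, and `iqr_outliers` is a hypothetical name.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot fences."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(sample)
```

Quartile conventions differ between libraries, so the fences computed here may not match a plotting library's whiskers exactly on small samples.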
- Input validation for file types and sizes
- Rate limiting (implement as needed)
- API key management through environment variables
- CORS configuration for production deployment
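Two of these points, environment-based key management and upload validation, can be sketched together. This is an illustrative outline rather than the contents of `app/core/config.py`: the `Settings` class, `load_settings`, `validate_upload`, and the extension set are all assumptions, with defaults taken from the `.env` template.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Assumed extensions for the documented CSV, JSON, Excel, and Parquet support
ALLOWED_EXTENSIONS = {".csv", ".json", ".xlsx", ".xls", ".parquet"}


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    anthropic_api_key: Optional[str]
    max_file_size: int      # bytes
    analysis_timeout: int   # seconds


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ), never from code."""
    env = os.environ if env is None else env
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        anthropic_api_key=env.get("ANTHROPIC_API_KEY") or None,
        max_file_size=int(env.get("MAX_FILE_SIZE", 104857600)),  # 100MB
        analysis_timeout=int(env.get("ANALYSIS_TIMEOUT", 180)),  # 3 minutes
    )


def validate_upload(filename, size_bytes, settings):
    """Raise ValueError if the file type or size is not acceptable."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > settings.max_file_size:
        raise ValueError("file exceeds MAX_FILE_SIZE")
```

Passing the environment in as a mapping keeps the loader easy to test and keeps API keys out of source control.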
- AWS ECS/Fargate
- Google Cloud Run
- Azure Container Instances
- Heroku
- DigitalOcean App Platform
# Build and push to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
docker build -t data-analyst-agent .
docker tag data-analyst-agent:latest <account>.dkr.ecr.us-east-1.amazonaws.com/data-analyst-agent:latest
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/data-analyst-agent:latest

# Test the health endpoint
curl http://localhost:8000/health
# Test with sample data
curl -X POST "http://localhost:8000/api/test"

- Timeout: 3-minute maximum processing time
- File Size: 100MB maximum upload
- Concurrent: Async processing for multiple requests
- Memory: Optimized pandas operations
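One common way to enforce a processing limit like this in async Python is `asyncio.wait_for`. The sketch below shows the pattern under that assumption; `run_with_timeout` and `fake_analysis` are hypothetical names, and the project may enforce its timeout differently.

```python
import asyncio

ANALYSIS_TIMEOUT = 180  # seconds, matching the documented default


async def run_with_timeout(coro, timeout=ANALYSIS_TIMEOUT):
    """Await `coro`, returning a structured result or a timeout error."""
    try:
        result = await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return {"status": "timeout", "detail": f"analysis exceeded {timeout} seconds"}
    return {"status": "ok", "result": result}


async def fake_analysis(delay):
    """Hypothetical stand-in for the real analysis pipeline."""
    await asyncio.sleep(delay)
    return {"rows": 3}
```

Note that `asyncio.wait_for` can only cancel a task at an `await` point. CPU-bound work such as a long pandas computation blocks the event loop, so it would need to run in a worker thread or process for the limit to take effect.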
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Visit /docs for interactive API documentation
- Issues: Report bugs and feature requests via GitHub issues
- API Status: Check the /health endpoint for service status
- WebSocket support for real-time analysis updates
- Additional LLM providers and models (Gemini, Claude 3)
- Database integration (PostgreSQL, MongoDB)
- Advanced ML models (XGBoost, Neural Networks)
- Caching layer for repeated analyses
- User authentication and rate limiting
- Batch processing for large datasets