A machine learning project that classifies source code as either AI-generated or human-written. It analyzes language patterns and stylistic signals across Python, Java, and JavaScript code.
The project includes:
- A training pipeline for multiple classifiers
- Saved model artifacts under model/
- A Streamlit web app for interactive prediction
- Dataset folders for Python, Java, and JavaScript samples
- Binary classification: ai vs human
- Multiple models trained and evaluated:
  - Logistic Regression
  - Random Forest
  - Gradient Boosting
  - XGBoost
- Character-level TF-IDF feature extraction
- Ensemble voting in the app for final prediction
- Confidence display per model and overall result
MLProject/
    app.py
    train.py
    requirements.txt
    README.md
    data/
        python/
            ai/
            human/
        java/
            AI/
            Human/
        js/
            AI/
            Human/
    model/
        logistic.pkl
        randomforest.pkl
        gradientboost.pkl
        xgboost.pkl
        vectorizer.pkl
        labelencoder.pkl
        python/
        java/
        js/
    tableau_data/
- Python 3.9+
- pip
- Create and activate a virtual environment.
    python -m venv .venv
  Windows (Command Prompt):
    .venv\Scripts\activate
- Install dependencies.
    pip install -r requirements.txt
Run:
    python train.py
This will:
- Load code samples from data/
- Build language-specific TF-IDF features
- Train models
- Print classification reports
- Save artifacts into model/ (and language subfolders)
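The training steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of train.py: it assumes scikit-learn is the underlying library, uses four made-up inline samples instead of reading data/, trains only one of the listed classifiers, and picks the n-gram range and other hyperparameters arbitrarily. It writes to a temporary directory so it does not touch a real model/ folder.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for samples that would normally be loaded from data/.
samples = [
    ("def add(a, b):\n    return a + b\n", "human"),
    ("x=1;y=2;print(x+y)\n", "human"),
    ("def add_numbers(first: int, second: int) -> int:\n    return first + second\n", "ai"),
    ("def main() -> None:\n    print('done')\n", "ai"),
]
texts = [code for code, _ in samples]
labels = [label for _, label in samples]

# Encode the string labels ("ai", "human") as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Character-level TF-IDF; the (1, 3) n-gram range is an assumption.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

# One of the four classifiers named in this README.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Evaluated on the training data only to keep the sketch short;
# a real run would hold out a test split.
report = classification_report(y, model.predict(X), target_names=encoder.classes_)
print(report)

# Save artifacts under a language subfolder (temporary directory here).
out_dir = Path(tempfile.mkdtemp()) / "model" / "python"
out_dir.mkdir(parents=True)
artifacts = {
    "logistic.pkl": model,
    "vectorizer.pkl": vectorizer,
    "labelencoder.pkl": encoder,
}
for name, obj in artifacts.items():
    with open(out_dir / name, "wb") as fh:
        pickle.dump(obj, fh)
```

Fitting the vectorizer and label encoder once and saving them alongside the model matters because the app has to transform pasted code with exactly the same vocabulary the classifier was trained on.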
Start Streamlit:
    streamlit run app.py
Then open the local URL shown in the terminal (usually http://localhost:8501).
- Paste code into the text area.
- Click Analyze Code.
- Review:
  - Overall prediction (AI-Generated or Human-Written)
  - Ensemble confidence
  - Per-model predictions and confidence values
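One plausible way to combine per-model outputs into an overall result is a majority vote. The sketch below assumes that rule, with ties broken by total confidence and the ensemble confidence taken as the mean confidence of the winning side. The actual rule in app.py may differ, the function name `ensemble_vote` is hypothetical, and the numbers are illustrative rather than real model output.

```python
def ensemble_vote(model_outputs):
    """Combine per-model (label, confidence) pairs by majority vote.

    Returns (winning_label, ensemble_confidence), where the ensemble
    confidence is the mean confidence of the models that voted for
    the winning label.
    """
    votes = {}
    for label, confidence in model_outputs.values():
        votes.setdefault(label, []).append(confidence)
    # Most votes wins; ties are broken by higher total confidence.
    winner = max(votes, key=lambda lbl: (len(votes[lbl]), sum(votes[lbl])))
    confidences = votes[winner]
    return winner, sum(confidences) / len(confidences)


# Illustrative values only, not produced by the trained models.
outputs = {
    "Logistic Regression": ("AI-Generated", 0.81),
    "Random Forest": ("AI-Generated", 0.66),
    "Gradient Boosting": ("Human-Written", 0.58),
    "XGBoost": ("AI-Generated", 0.73),
}

label, confidence = ensemble_vote(outputs)
for name, (pred, conf) in outputs.items():
    print(f"{name}: {pred} ({conf:.0%})")
print(f"Overall: {label} ({confidence:.0%})")
```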
Each language folder should contain two classes:
- AI-generated code files
- Human-written code files
Expected extensions:
- Python: .py
- Java: .java
- JavaScript: .js
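A loader for this layout could look like the sketch below. It is an assumption about how samples might be read, not the code in train.py: the `load_samples` name and the `EXTENSIONS` mapping are hypothetical, and class folder names are lowercased so that both the `ai/` and `AI/` spellings in the project tree map to the same label.

```python
from pathlib import Path

# Folder name under data/ -> expected file extension.
EXTENSIONS = {"python": ".py", "java": ".java", "js": ".js"}


def load_samples(data_root, language):
    """Return a list of (code, label) pairs for one language folder."""
    ext = EXTENSIONS[language]
    lang_dir = Path(data_root) / language
    samples = []
    for class_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
        label = class_dir.name.lower()  # "AI" and "ai" both become "ai"
        for path in sorted(class_dir.glob(f"*{ext}")):
            # Ignore undecodable bytes rather than failing on one bad file.
            code = path.read_text(encoding="utf-8", errors="ignore")
            samples.append((code, label))
    return samples
```

Calling `load_samples("data", "java")` would then return every `.java` file under `data/java/AI/` and `data/java/Human/`, skipping files with other extensions.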
- Prediction quality depends on dataset quality, size, and balance.
- Results are probabilistic and should be treated as supportive signals, not absolute truth.
- If model files are missing, run python train.py before launching the app.
- ModuleNotFoundError: install dependencies with pip install -r requirements.txt.
- Model not found warning in app: ensure training completed and artifacts exist in model/.
- Streamlit command not found: use python -m streamlit run app.py.
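To check for missing artifacts before launching the app, a small script like the one below can help. It is a sketch that assumes the six file names shown in the project tree sit directly under model/. The `missing_artifacts` helper is hypothetical and not part of the project.

```python
from pathlib import Path

# File names taken from the project tree in this README.
EXPECTED_ARTIFACTS = [
    "logistic.pkl",
    "randomforest.pkl",
    "gradientboost.pkl",
    "xgboost.pkl",
    "vectorizer.pkl",
    "labelencoder.pkl",
]


def missing_artifacts(model_dir="model"):
    """Return the expected artifact names that are not present in model_dir."""
    model_path = Path(model_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (model_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing model files:", ", ".join(missing))
        print("Run: python train.py")
    else:
        print("All model artifacts found.")
```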
- Add more languages and larger datasets
- Improve feature engineering and model calibration
- Add explainability signals for why a prediction was made
- Export prediction logs and metrics dashboard