Classic data science and machine learning projects covering tabular modeling, time-series forecasting, computer vision, statistical analysis and applied NLP.
This repository is a general portfolio archive. It focuses on supervised ML workflows, exploratory analysis, feature engineering, validation, and clear project writeups. Source datasets, model checkpoints, private credentials, generated outputs, and large artifacts are intentionally excluded from version control.
The LLM, retrieval, information extraction, and vision-language projects now live in the `ai-research-projects` repository:
- Russian LLM Pretraining and SFT
- Semantic Retrieval for arXiv Papers
- Multi-Task Information Extraction on NEREL
- Text-to-Image Product Search with Fine-Tuned CLIP
| Project | Summary | Main tools |
|---|---|---|
| Toxic Comment Classification | Built a text classification pipeline to detect toxic comments for moderation. Tested BERT-based embeddings and several classifiers. | Python, pandas, BERT, NLTK, CatBoost, scikit-learn |
| Used Car Price Prediction | Built a model to estimate the market value of used cars from vehicle characteristics. Compared Random Forest, CatBoost, and LightGBM models. | Python, pandas, scikit-learn, CatBoost, LightGBM |
| Customer Age Prediction from Images | Trained a neural network to estimate customer age from facial images using a ResNet-based computer vision pipeline. | Python, TensorFlow, Keras |
| Taxi Order Forecasting | Forecasted the number of taxi orders for the next hour using time-series feature engineering and regression models. | Python, pandas, scikit-learn, statsmodels, CatBoost |
| Video Game Success Pattern Analysis | Analyzed global video game sales, platforms, genres, critic scores, and user scores to identify patterns associated with commercial success. | Python, pandas, NumPy, SciPy, Matplotlib |
| Gold Recovery Process Modeling | Predicted gold recovery efficiency from mining and purification process parameters using custom sMAPE evaluation. | Python, pandas, NumPy, Matplotlib, scikit-learn |
| Real Estate Market Analysis in Saint Petersburg | Analyzed real estate listings to estimate market value drivers and typical apartment characteristics. | Python, pandas, Matplotlib |
| Telecom Tariff Revenue Analysis | Compared customer behavior and revenue between telecom tariff plans and tested statistical hypotheses. | Python, pandas, NumPy, SciPy, Matplotlib |
| Telecom Tariff Recommendation | Built a classification model that recommends one of two telecom tariffs based on monthly customer behavior. | Python, pandas, Matplotlib, scikit-learn |
| Bank Customer Churn Prediction | Predicted whether a bank customer is likely to leave, with special attention to class imbalance. | Python, pandas, Matplotlib, scikit-learn |
| Oil Well Location Selection | Used regression modeling and bootstrap simulation to select the most profitable oil extraction region under risk constraints. | Python, pandas, NumPy, scikit-learn |
| Insurance Client Data Protection | Developed and justified a linear-algebra-based data transformation that protects personal data without degrading model quality. | Python, pandas, NumPy, scikit-learn |
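
The Gold Recovery row above mentions a custom sMAPE evaluation. As a rough illustration, here is a minimal sketch of one common definition of symmetric mean absolute percentage error. The function name `smape` and the zero-denominator convention are assumptions made for this example; the project notebook may use a different variant, for instance a weighted combination of stage-level scores.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric MAPE in percent, using one common definition.

    Hypothetical helper for illustration; not taken from the project notebook.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    abs_error = np.abs(y_true - y_pred)
    # Denominator is the mean of the absolute true and predicted values.
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0

    # Assumed convention: when both values are zero, the term counts as 0
    # instead of producing a division-by-zero NaN.
    ratio = np.divide(abs_error, denom, out=np.zeros_like(abs_error), where=denom != 0)
    return float(100.0 * ratio.mean())
```

Under this definition the score ranges from 0 (perfect predictions) to 200 (for example, predicting 0 for a nonzero target), which is worth keeping in mind when comparing it with plain MAPE.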
Each project folder contains a README and notebook with the analysis workflow, modeling choices, validation results, limitations and reproduction notes.
Metrics are project-local results from the documented notebooks and should not be treated as broad benchmarks. Source datasets, model checkpoints, embedding caches, private credentials and generated artifacts are not committed.