OlivierBinette / er-evaluation

An End-to-End Evaluation Framework for Entity Resolution Systems

data-science statistics matching record-linkage entity-resolution evaluation fuzzy-matching disambiguation deduplication duplicate-detection author-name-disambiguation ml-testing ml-evaluation inventor-name-disambiguation

Updated Dec 3, 2023
Python

greynewell / mcpbr

Benchmark your MCP server.

python benchmarking machine-learning mcp ml-evaluation llm-evaluation model-context-protocol swe-bench

Updated Apr 28, 2026
Python

studio-11-co / falsify

A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.

Updated May 23, 2026
HTML
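
The idea of locking a threshold before seeing the data can be illustrated with a minimal sketch. This is not falsify's actual file format or CLI; the `lock_claim` and `verify_claim` names and the claim fields are hypothetical, and the only assumption is that a claim is serialized canonically and hashed with SHA-256.

```python
import hashlib
import hmac
import json


def lock_claim(claim: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of the claim.

    Hypothetical helper: sorting keys and fixing separators makes the
    digest independent of key order and whitespace.
    """
    canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_claim(claim: dict, digest: str) -> bool:
    """Check that a claim still matches the digest recorded earlier."""
    return hmac.compare_digest(lock_claim(claim), digest)


# Record the digest before the evaluation data is examined.
claim = {"metric": "accuracy", "threshold": 0.9, "dataset": "holdout-v1"}
digest = lock_claim(claim)

# Later, any edit to the threshold no longer matches the recorded digest.
moved_goalposts = dict(claim, threshold=0.85)
print(verify_claim(claim, digest))            # unchanged claim
print(verify_claim(moved_goalposts, digest))  # altered claim
```

Publishing the digest (for example in a commit) before running the evaluation is what makes the later check meaningful.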

AmeyaWagh / robometric-frame

A metrics library for evaluating vision-language models within the PyTorch ecosystem.

robotics policy-evaluation evaluation-metrics ml-evaluation torchmetrics diffusion-policy lerobot vision-language-action-model policy-lear

Updated Apr 4, 2026
Python

Giskard-AI / community-content

✍️ Collaborate on writing technical content for the Giskard Community

testing content machine-learning ai ml tutorials artificial-intelligence tutorial-code ml-testing giskard ml-evaluation

Updated Nov 11, 2022

pareshrnayak / confusion-matrix-generator

An open-source Streamlit web app to generate beautiful confusion matrices for multi-class machine learning models. Supports numeric and string labels, CSV upload, manual label entry, custom color maps, and displays evaluation metrics like Accuracy, Precision, Recall, and F1-score. Users can download the confusion matrix as an image.

python open-source data-science machine-learning data-visualization confusion-matrix model-evaluation multiclass-classification streamlit streamlit-webapp classification-metrics ml-evaluation confusion-matrix-generator

Updated Jan 18, 2026
Python

Comrade-1729 / lex-brief-ai

Safety-first legal NLP system with hierarchical long-document processing, deterministic inference, clause extraction, and rule-based risk engine — built for traceability and deployment constraints.

nlp django transformers pytorch production-ml ml-evaluation rule-based-systems deterministic-inference

Updated Feb 10, 2026
Python

taneltaluri / evolve-skill

Measurement-disciplined optimizer for Claude Agent Skills — three-gate system (stability, effect size, function preservation), anchored rubric, train/holdout split. Inspired by Karpathy's autoresearch and alchaincyf's darwin-skill.

ml-evaluation ai-tooling claude-skills claude-agent-skills skill-optimization autoresearch

Updated Apr 18, 2026
Python

rodrigoguedes09 / model-observability-system

Enterprise-grade machine learning observability platform that detects data drift, concept drift, and performance degradation in production models. Features statistical drift detection (KS test, PSI), real-time alerting, Redis caching, and FastAPI backend.

python machine-learning machine-learning-algorithms ml observability ml-observability ml-evaluation

Updated Jan 15, 2026
Python
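
As a rough illustration of the PSI drift statistic mentioned in the description above, here is a standard-library sketch. It is not this repository's implementation; the `psi` function name, the equal-width binning over the reference sample, and the `eps` floor for empty bins are all assumptions chosen for the example.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    Assumption: bins are equal-width over the range of `expected`; values
    of `actual` outside that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins
    # Interior bin edges; bisect_right maps each value to a bin index.
    edges = [lo + i * width for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
print(psi(reference, reference))  # identical samples
print(psi(reference, shifted))    # shifted sample gives a larger value
```

A common rule of thumb treats values above roughly 0.2 as notable drift, though the right threshold depends on the feature and the binning.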

johnsonhk88 / Data-Science-Challenge-Coursera-Project-Loan-Default-Prediction

Data Science Challenge from a Coursera project: Loan Default Prediction

data-science machine-learning ai deep-learning random-forest exploratory-data-analysis coursera data-cleaning loan-default-prediction xgboost-classifier ml-evaluation

Updated Oct 16, 2024
Jupyter Notebook

kmock930 / Drug-Consumption-Machine-Learning-analysis

This project contains code and coursework for the course CSI5155 at the University of Ottawa (taught by Professor Dr. Herna Viktor).

machine-learning random-forest svm supervised-learning semi-supervised-learning mlp unsupervised-learning knn decision-tree ensemble-model gradient-boosting boosting receiver-operating-characteristic bagging xai area-under-curve ml-pipeline shap-analysis ml-evaluation

Updated Dec 9, 2024
Jupyter Notebook

SvetLuna-Lab / Mini-rag-eval-demo

A small educational project that shows how to build a minimal RAG pipeline with a simple evaluation loop.

python nlp machine-learning information-retrieval text-mining evaluation tfidf educational-project rag qa-system ml-evaluation retrieval-augmented-generation

Updated Nov 10, 2025
Python

greynewell / swe-bench-pro-action

GitHub Action for SWE-bench Pro evaluation powered by mcpbr

python benchmarking mcp ai-agents github-actions ml-evaluation llm-evaluation swe-bench

Updated Feb 26, 2026
Shell

djohnson68 / handlabel

Local-first stratified-sample audit UI for ML classifiers and labeled datasets. Wilson CIs, keyboard-first, no cloud.

machine-learning typescript annotation audit labeling ground-truth human-in-the-loop bun label-noise ml-evaluation stratified-sample wilson-confidence-interval

Updated May 12, 2026
TypeScript
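
For readers unfamiliar with the Wilson confidence interval named in the description above, the following sketch shows the standard Wilson score interval for a binomial proportion, such as classifier accuracy on an audited sample. The `wilson_interval` name and its parameters are hypothetical and not taken from handlabel; Python is used here only for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Example: 45 of 50 audited labels were judged correct.
low, high = wilson_interval(45, 50)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the plain normal approximation, this interval stays within [0, 1] and behaves sensibly for small samples or proportions near 0 or 1.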

ivy-mainaa / System-Risk-in-Policy-Driven-AI-Systems

Evaluation of system-level risks in content moderation models using policy-driven metrics, identity-based analysis, and governance-aligned datasets.

fairness content-moderation responsible-ai ai-governance ml-evaluation

Updated Jan 4, 2026
Jupyter Notebook

kirtis111 / e-commerce-recommendation-system

End-to-end E-Commerce Recommendation System using implicit feedback, featuring Popularity, Item-Item CF, ALS (Matrix Factorization), and a Hybrid model, with offline evaluation and online serving via FastAPI + Streamlit.

data-science ecommerce personalization ranking recall recommendation-engine als recommender-systems hybrid-model implicit-feedback ndcg model-serving fastapi ml-pipeline streamlit ml-evaluation

Updated Jan 26, 2026
Jupyter Notebook

victoropp / naive-bayes-spam-detection

A MATLAB-based machine learning project that implements a Naive Bayes spam email classifier using the UCI Spambase dataset. Includes feature selection, model tuning, performance evaluation, and deployment-ready model export.