An End-to-End Evaluation Framework for Entity Resolution Systems
Updated Dec 3, 2023 - Python
Benchmark your MCP server.
A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.
A metrics library for evaluating vision-language models within the PyTorch ecosystem.
✍️ Collaborate on writing technical content for the Giskard Community
An open-source Streamlit web app to generate beautiful confusion matrices for multi-class machine learning models. Supports numeric and string labels, CSV upload, manual label entry, custom color maps, and displays evaluation metrics like Accuracy, Precision, Recall, and F1-score. Users can download the confusion matrix as an image.
Safety-first legal NLP system with hierarchical long-document processing, deterministic inference, clause extraction, and rule-based risk engine — built for traceability and deployment constraints.
Measurement-disciplined optimizer for Claude Agent Skills — three-gate system (stability, effect size, function preservation), anchored rubric, train/holdout split. Inspired by Karpathy's autoresearch and alchaincyf's darwin-skill.
Enterprise-grade machine learning observability platform that detects data drift, concept drift, and performance degradation in production models. Features statistical drift detection (KS test, PSI), real-time alerting, Redis caching, and FastAPI backend.
Data Science Challenge from a Coursera project: Loan Default Prediction
This project contains code and written work for the course CSI5155 at the University of Ottawa (taught by Professor Dr. Herna Viktor).
Small, educational project that shows how to build a **minimal RAG pipeline** with a **simple evaluation loop**
GitHub Action for SWE-bench Pro evaluation powered by mcpbr
Local-first stratified-sample audit UI for ML classifiers and labeled datasets. Wilson CIs, keyboard-first, no cloud.
Evaluation of system-level risks in content moderation models using policy-driven metrics, identity-based analysis, and governance-aligned datasets.
End-to-end E-Commerce Recommendation System using implicit feedback, featuring Popularity, Item-Item CF, ALS (Matrix Factorization), and a Hybrid model, with offline evaluation and online serving via FastAPI + Streamlit.
A MATLAB-based machine learning project that implements a Naive Bayes spam email classifier using the UCI Spambase dataset. Includes feature selection, model tuning, performance evaluation, and deployment-ready model export.
Collection of Machine Learning (ML) and Natural Language Processing (NLP) projects showcasing a range of applications, algorithms, and techniques.
Static HTML backtest reports from eval JSON (calibration, Brier, CLV, optional bet ledger)
A decision-oriented benchmark framework for evaluating action-conditioned world models beyond static AI benchmarks.