Scientific Data Significance Rankings with Shapley Explanations
DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.
- Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously, or selectively)
- Shapley Explanations: Feature-level attributions for why samples are significant
- Formative Discovery: Distinguish samples that ARE significant from those that CREATE structure
- Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
- Multi-Modal Support: Tabular data, text, and graph networks through unified API
- Performance Optimized: Fast exploration mode and efficient Shapley computation
pip install datatypical

from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd
# Load your data
data = pd.read_csv('your_data.csv')
# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)
# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])
# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')
# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', top_n=20)
# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal')

| Lens | Finds | Use Cases |
|---|---|---|
| Archetypal | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
| Prototypical | Representative, central samples | Dataset summarization, cluster centers, typical examples |
| Stereotypical | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |
The Power: All three computed simultaneously—different perspectives reveal different insights.
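
To make the three lenses concrete, here is a deliberately simplified toy. It is not DataTypical's algorithm: it assumes "archetypal" means farthest from the centroid, "prototypical" means nearest to it, and "stereotypical" means the highest value of a target. The data and all variable names are invented for illustration.

```python
from math import dist

# Hypothetical toy data: (feature_1, feature_2, target)
rows = [
    (0.0, 0.0, 0.2),
    (4.0, 0.0, 0.9),
    (0.0, 4.0, 0.1),
    (1.0, 1.0, 0.5),
    (1.5, 1.0, 0.4),
]

# Centroid of the two feature columns (the target is excluded)
features = [row[:2] for row in rows]
centroid = tuple(sum(column) / len(column) for column in zip(*features))
distance_to_centre = [dist(point, centroid) for point in features]

indices = range(len(rows))
# Simplified stand-ins for the three lenses (assumptions, see lead-in)
most_extreme = max(indices, key=lambda i: distance_to_centre[i])
most_central = min(indices, key=lambda i: distance_to_centre[i])
most_target_like = max(indices, key=lambda i: rows[i][2])

print(most_extreme, most_central, most_target_like)
```

Each lens picks a different row of the same data, which is the point of computing all three.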
When shapley_mode=True, DataTypical reveals two views:
Actual Significance (*_rank): Samples that ARE significant
Formative Significance (*_shapley_rank): Samples that CREATE the structure
Four Quadrants:
           Formative High
                  │
      Gap         │   Critical
      Fillers     │   (irreplaceable)
──────────────────┼──────────────── Actual High
      Redundant   │   Replaceable
                  │   (keep one)
           Formative Low
This distinction between samples that ARE significant and samples that CREATE structure is DataTypical's main new contribution.
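
The quadrant logic can be written as a small helper. This is a sketch rather than part of DataTypical: the function name quadrant and the 0.5 cut-off are assumptions, so choose thresholds that suit your data.

```python
def quadrant(actual_rank, formative_rank, threshold=0.5):
    """Map a pair of 0-1 ranks onto one of the four quadrant labels.

    The threshold is a hypothetical cut-off, not a DataTypical default.
    """
    high_actual = actual_rank >= threshold
    high_formative = formative_rank >= threshold
    if high_actual and high_formative:
        return "critical"      # significant and structure-creating
    if high_actual:
        return "replaceable"   # significant, but others could stand in
    if high_formative:
        return "gap filler"    # not extreme, yet structurally important
    return "redundant"         # neither significant nor formative


print(quadrant(0.9, 0.9))
print(quadrant(0.9, 0.1))
```

With a results table from fit_transform, the two arguments would be a row's *_rank and *_shapley_rank values for the same significance type.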
# Analyze compound library
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)
# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")
# Find replaceable compounds (high actual + low formative)
replaceable = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(replaceable)} replaceable compounds")
# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism

Discovery: Multiple structural pathways to high activity!
| Dataset Size | Without Shapley | With Shapley |
|---|---|---|
| 1,000 samples | ~5 seconds | ~5 minutes |
| 10,000 samples | ~30 seconds | ~60 minutes |
Phase 1: Fast exploration (fast_mode=True, no Shapley)
↓ Identify interesting samples
Phase 2: Detailed analysis (shapley_mode=True, subset to interesting samples)
↓ Generate explanations and publication figures
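
How you pick the interesting samples between the two phases is up to you. One simple option, assumed here, is to keep the top N samples by a rank from the fast pass. The helper name shortlist and the sample labels below are hypothetical.

```python
import heapq


def shortlist(ranks, top_n):
    """Return the labels of the top_n highest-ranked samples, best first."""
    return heapq.nlargest(top_n, ranks, key=ranks.get)


# Hypothetical ranks from a fast exploration pass (label -> 0-1 rank)
fast_pass_ranks = {"s1": 0.12, "s2": 0.95, "s3": 0.40, "s4": 0.88}

interesting = shortlist(fast_pass_ranks, top_n=2)
print(interesting)
```

The shortlisted labels can then be used to subset the data before the second, Shapley-enabled fit.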
DataTypical(
    # Enable explanations and formative analysis
    shapley_mode=False,            # True for explanations
    # Speed vs accuracy
    fast_mode=True,                # False for publication quality
    # Significance types
    n_archetypes=8,                # Number of extreme corners
    n_prototypes=8,                # Number of representatives
    stereotype_column=None,        # Target column for stereotypical
    stereotype_target='max',       # 'max', 'min', or numeric value
    # Selective computation
    selected_significance=None,    # 'archetypal', 'prototypical', 'stereotypical', or None (all)
    # Shapley optimization
    shapley_top_n=500,             # Limit explanations to top N
    shapley_n_permutations=100,    # Number of permutations (30 in fast_mode)
    # Reproducibility
    random_state=None,             # Set for reproducible results
    # Memory management
    max_memory_mb=8000             # Memory limit for operations
)

When you only need one significance type, set selected_significance to skip the others entirely, which saves substantial compute time:
# Only compute archetypal (skip prototypical and stereotypical)
dt = DataTypical(selected_significance='archetypal', shapley_mode=True)
results = dt.fit_transform(data)
# → archetypal_rank computed; prototypical_rank and stereotypical_rank are NaN

from datatypical_viz import significance_plot, heatmap, profile_plot
# 1. Overview: Actual vs Formative scatter
significance_plot(results, significance='archetypal')
# 2. Feature patterns: Which features matter?
heatmap(dt, results,
        significance='archetypal',
        order='actual',  # or 'formative'
        top_n=20)
# 3. Individual explanation: Why is this sample significant?
profile_plot(dt, sample_idx,
             significance='archetypal',
             order='local')  # or 'global'

See docs/VISUALIZATION_GUIDE.md for detailed interpretation.
# Tabular data
df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

# Text
texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

# Graph networks
node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

- Alternative mechanisms: Formative instances reveal different pathways
- Boundary definition: Which samples define system limits
- Quality control: Distinguish novel variation from known patterns
- Coverage analysis: Identify sampling gaps
- Size reduction: Remove redundant samples while preserving diversity
- Representative selection: Choose samples spanning full space
- Redundancy detection: Find clusters of similar samples
- Gap identification: Locate undersampled regions
- Feature importance: Global and local significance patterns
- Individual explanations: Why specific samples matter
- Pattern recognition: Discover multiple pathways to outcomes
- Interpretability: Explanations in original feature space
New Users:
- docs/START_HERE.md — Friendly introduction and first steps
- docs/QUICK_REFERENCE.md — Daily reference for parameters and workflows
- docs/EXAMPLES.md — Complete worked examples across domains
Visualization:
- docs/VISUALIZATION_GUIDE.md — Comprehensive guide to plots and interpretation
Advanced:
- docs/INTERPRETATION_GUIDE.md — Interpreting complex patterns
- docs/COMPUTATION_GUIDE.md — Implementation details and algorithms
- Python ≥ 3.8
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- scikit-learn ≥ 1.0
- Matplotlib ≥ 3.3
- Seaborn ≥ 0.11
- Numba ≥ 0.55 (for performance)
If you use DataTypical in your research, please cite:
@software{datatypical2026,
author = {Barnard, Amanda S.},
title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
year = {2026},
url = {https://github.com/amaxiom/DataTypical},
version = {0.7.6}
}

Outlier Detection: Only finds extremes → DataTypical finds extremes AND explains why
Clustering: Groups samples, picks centroids → DataTypical finds representatives maximizing coverage
Feature Selection: Ranks features → DataTypical explains which features matter for which samples
PCA/t-SNE: Projects to low dimensions → DataTypical maintains interpretability in original space
Formative instances are genuinely new. The distinction between samples that ARE significant vs samples that CREATE structure emerges from the Shapley mechanism and enables:
- Redundancy detection even among significant samples
- Finding structurally important but non-extreme samples
- Understanding irreplaceable vs interchangeable samples
- Quality control based on structural contribution
This dual perspective moves instance significance beyond pure ranking, toward an account of which samples actually shape the dataset's structure.
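
To see why a sample can be significant yet contribute little to structure, consider a toy Shapley calculation. It is an illustration under stated assumptions, not DataTypical's implementation: the value function (the range covered by a set of 1-D samples), the data, and all names are invented here. Each sample's Shapley value is its marginal contribution to that range, averaged over orderings. The sketch computes it exactly over every ordering and then approximately from 100 random orderings.

```python
import random
from itertools import permutations


def extent(values):
    """Toy value function: the range covered by a set of 1-D samples."""
    return max(values) - min(values) if values else 0.0


def shapley_from_orderings(samples, value_fn, orderings):
    """Average each sample's marginal contribution over the given orderings."""
    totals = [0.0] * len(samples)
    count = 0
    for order in orderings:
        chosen = []
        previous = value_fn(chosen)
        for i in order:
            chosen.append(samples[i])
            current = value_fn(chosen)
            totals[i] += current - previous  # marginal gain from adding sample i
            previous = current
        count += 1
    return [total / count for total in totals]


# One lone extreme (0), one interior point (5), one duplicated extreme (10, 10)
samples = [0.0, 5.0, 10.0, 10.0]
indices = range(len(samples))

# Exact: enumerate all orderings (only feasible for tiny sets)
exact = shapley_from_orderings(samples, extent, permutations(indices))

# Approximate: a fixed number of random orderings
rng = random.Random(0)
sampled_orders = [rng.sample(indices, len(samples)) for _ in range(100)]
approx = shapley_from_orderings(samples, extent, sampled_orders)

print(exact)
print(approx)
```

In the exact result, the lone extreme at 0 receives a Shapley value of 5, while each of the two duplicated extremes at 10 receives about 1.67. All three are equally extreme, but each duplicate is replaceable by the other, so it earns less credit for creating the range.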
Current Version: 0.7.6
Recent Updates (v0.7.6):
- Added selected_significance parameter for selective computation of one significance type
- Fixed prototype feature storage so transform() on new data uses correct prototype vectors
- Full Shapley analysis (formative + explanations) now runs correctly on text data paths
- Fixed iterator exhaustion in all text fit/transform methods
- Fixed local/global index mismatch in stereotypical Shapley explanations when subsampling
- Improved error messages when a significance type was not fitted
Stability: Production-ready for research use
MIT License — See LICENSE for details.
Copyright (c) 2026 Amanda S. Barnard
- Documentation: See docs/ folder or links above
- Issues: Report bugs via GitHub Issues
- Questions: Open a GitHub Discussion
DataTypical builds on foundational work in:
- Archetypal analysis (Cutler & Breiman, 1994)
- Facility location optimization (Nemhauser et al., 1978)
- Shapley value theory (Shapley, 1953)
- PCHA optimization (Mørup & Hansen, 2012)
Special thanks to the scientific Python community.
Ready to explore your data?
pip install datatypical

Then see docs/START_HERE.md for your first analysis!