
Add patient-level leakage checks for dataset splits #1159

Open
tippered1-debug wants to merge 1 commit into sunlabuiuc:master from tippered1-debug:reliability/patient-disjoint-checks


@tippered1-debug

Summary

This PR adds helper utilities to audit patient-level leakage across dataset splits.

Healthcare datasets often contain multiple samples per patient. If the same patient appears in both train and validation/test splits, evaluation metrics may be inflated because the model is partially evaluated on patients already seen during training.
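To make the leakage concrete, here is a minimal sketch on toy data. The sample layout (dicts with a `patient_id` key), the `visit` field, and the helper name `ids` are assumptions made for illustration, not code from this PR. It contrasts a sample-level split, which holds out one visit per patient, with a patient-level split, which keeps each patient on one side.

```python
# Hypothetical toy data: 4 patients, 3 samples (visits) each.
samples = [
    {"patient_id": f"p{p}", "visit": v}
    for p in range(4)
    for v in range(3)
]


def ids(split):
    """Return the set of patient IDs present in a list of sample dicts."""
    return {s["patient_id"] for s in split}


# Sample-level split: hold out the last visit of every patient.
# Every patient then contributes samples to both train and test.
train_s = [s for s in samples if s["visit"] < 2]
test_s = [s for s in samples if s["visit"] == 2]

# Patient-level split: assign whole patients to exactly one side.
train_patients = {"p0", "p1", "p2"}
train_p = [s for s in samples if s["patient_id"] in train_patients]
test_p = [s for s in samples if s["patient_id"] not in train_patients]

sample_level_overlap = ids(train_s) & ids(test_s)
patient_level_overlap = ids(train_p) & ids(test_p)

print(sorted(sample_level_overlap))   # all four patients leak
print(sorted(patient_level_overlap))  # no shared patients
```

In the sample-level split, every test sample belongs to a patient the model has already seen, which is the situation the new helpers are meant to flag.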

The new helpers inspect the actual samples in each split and verify that patient IDs are disjoint.

Changes

  • Add get_patient_ids(...) to extract patient IDs from datasets, subsets, sample collections, or ID collections.
  • Add check_patient_disjoint(...) to return a report with split counts and patient overlaps.
  • Add assert_patient_disjoint(...) to raise a clear error when patient overlap is detected.
  • Export the helpers from pyhealth.datasets.
  • Add focused tests for patient-level splits, sample-level leakage detection, conformal splits, and missing patient_id errors.
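As a rough illustration of how the three responsibilities listed above (extract IDs, report overlaps, raise on overlap) could fit together, here is a standalone sketch. It is not the implementation in this PR: the function names `collect_patient_ids`, `overlap_report`, and `require_disjoint`, the report layout, and the assumption that each sample is a mapping with a `patient_id` key are all hypothetical.

```python
from typing import Any, Dict, Iterable, Mapping, Set


def collect_patient_ids(samples: Iterable[Mapping[str, Any]]) -> Set[Any]:
    """Collect patient IDs, failing loudly if a sample has none."""
    found = set()
    for index, sample in enumerate(samples):
        if "patient_id" not in sample:
            raise KeyError(f"sample {index} has no 'patient_id' field")
        found.add(sample["patient_id"])
    return found


def overlap_report(
    splits: Mapping[str, Iterable[Mapping[str, Any]]]
) -> Dict[str, Any]:
    """Return per-split patient counts and pairwise patient overlaps."""
    ids_by_split = {name: collect_patient_ids(s) for name, s in splits.items()}
    names = list(ids_by_split)
    overlaps = {}
    # Compare each unordered pair of splits exactly once.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = ids_by_split[first] & ids_by_split[second]
            if shared:
                overlaps[(first, second)] = shared
    counts = {name: len(ids) for name, ids in ids_by_split.items()}
    return {"patient_counts": counts, "overlaps": overlaps}


def require_disjoint(
    splits: Mapping[str, Iterable[Mapping[str, Any]]]
) -> Dict[str, Any]:
    """Raise ValueError naming the shared patients, else return the report."""
    report = overlap_report(splits)
    if report["overlaps"]:
        details = "; ".join(
            f"{a}/{b}: {sorted(shared)}"
            for (a, b), shared in report["overlaps"].items()
        )
        raise ValueError(f"patient overlap between splits: {details}")
    return report


# Usage with hypothetical splits: patient "b" appears in train and test.
train = [{"patient_id": "a"}, {"patient_id": "b"}, {"patient_id": "b"}]
val = [{"patient_id": "c"}]
test = [{"patient_id": "b"}]
report = overlap_report({"train": train, "val": val, "test": test})
print(report["patient_counts"])
print(report["overlaps"])
```

Separating the report from the assertion lets callers either log overlaps for inspection or fail fast in a training script, which mirrors the split between a checking helper and an asserting helper described in the list above.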

Tests

  • python -m unittest tests.core.test_patient_disjoint -v
