Add configurable checksum types (imohash-64k, xxhash, md5)#23

Open
siligam wants to merge 2 commits into main from checksum-options

siligam commented May 13, 2026

Summary

  • Upgrades the imohash default sample size from 16 KB → 64 KB (new prefix imohash-64k) to meaningfully reduce false-negative risk while keeping snapshot generation fast
  • Adds --checksum-type option to ptool checksums with three choices: imohash-64k (default), xxhash, and md5
  • Guards compare and summary against mixing CSVs generated with different checksum types — raises a clear error rather than silently producing wrong results
  • Adds xxhash as a package dependency
  • Adds tests covering prefix correctness, collision detection, and checksum type mismatch guard
  • Bug fix: corrects a pre-existing invalid dtype string in read_csv ("str[pyarrow]" → "string[pyarrow]"), which caused a TypeError on pandas 2.x and was discovered via the new tests
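
The mismatch guard described in the summary can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes each checksum is stored as `<type>:<digest>` (as the prefixes above suggest), and the names `checksum_type`, `ensure_same_type`, and `ChecksumTypeMismatch` are hypothetical. The PR itself raises a UsageError; a plain exception subclass stands in for it here so the sketch has no dependencies.

```python
from collections.abc import Iterable


class ChecksumTypeMismatch(ValueError):
    """Hypothetical stand-in for the UsageError raised by the PR."""


def checksum_type(value: str) -> str:
    """Return the prefix before the first colon, e.g. 'md5' from 'md5:abc'."""
    prefix, sep, _digest = value.partition(":")
    if not sep or not prefix:
        raise ValueError(f"checksum has no type prefix: {value!r}")
    return prefix


def ensure_same_type(left: Iterable[str], right: Iterable[str]) -> str:
    """Return the shared checksum type, or raise if the columns disagree.

    Empty or internally mixed columns are rejected as well, because no
    single type can be established for them.
    """
    left_types = {checksum_type(v) for v in left}
    right_types = {checksum_type(v) for v in right}
    if len(left_types) != 1 or left_types != right_types:
        raise ChecksumTypeMismatch(
            f"cannot compare checksum types {sorted(left_types)} "
            f"with {sorted(right_types)}; regenerate both snapshots "
            "with the same --checksum-type"
        )
    return next(iter(left_types))
```

Checking the prefixes before comparing any digests matters because two digests of different types never match, so without the guard every file would be reported as changed.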

Checksum type trade-offs

| Type | Speed on large files | False negatives possible? |
|---|---|---|
| imohash-64k | Very fast (samples 3 × 64 KB) | Yes: differences in unsampled regions are missed |
| xxhash | Fast (full file, CPU-efficient) | No |
| md5 | Moderate (full file) | No |

For multi-GB scientific files the bottleneck is I/O, not the algorithm, so xxhash and md5 have similar wall-clock cost. xxhash is preferred when correctness matters and MD5 interoperability is not required.
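
To make the false-negative row concrete, here is a small sketch of why a sampled hash can miss a change that a full-file hash catches. `sampled_digest` is a simplified stand-in written for this illustration (first, middle, and last 64 KB plus the length, fed to SHA-256); it is not imohash's actual algorithm, and none of these names come from the PR.

```python
import hashlib

SAMPLE = 64 * 1024  # 64 KB per sampled region, matching the imohash-64k description


def sampled_digest(data: bytes, sample: int = SAMPLE) -> str:
    """Hash only the first, middle, and last `sample` bytes plus the length.

    Simplified stand-in for a sampling hash, not the real imohash.
    """
    if len(data) <= 3 * sample:
        chunks = [data]  # small inputs are hashed in full
    else:
        mid = len(data) // 2 - sample // 2
        chunks = [data[:sample], data[mid:mid + sample], data[-sample:]]
    h = hashlib.sha256(len(data).to_bytes(8, "little"))
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def full_digest(data: bytes) -> str:
    """Hash every byte, as the md5 and xxhash options do."""
    return hashlib.md5(data).hexdigest()


original = bytes(1024 * 1024)      # 1 MiB of zeros
changed = bytearray(original)
changed[100_000] = 0xFF            # lies between the first and middle samples
changed = bytes(changed)

# The sampled digests agree (a false negative); the full digests differ.
same_sampled = sampled_digest(original) == sampled_digest(changed)
same_full = full_digest(original) == full_digest(changed)
```

With these inputs `same_sampled` is true and `same_full` is false, which is the behaviour the new tests document for imohash-64k versus md5 and xxhash.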

Test plan

  • conda run -n ptool pytest tests/ -v — all 12 tests pass
  • ptool checksums --help: --checksum-type option visible with correct choices and default
  • ptool checksums --checksum-type xxhash <path> — output prefixed with xxhash:
  • ptool checksums --checksum-type md5 <path> — output prefixed with md5:
  • ptool compare on two CSVs with mismatched types — raises UsageError with clear message
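
For readers who want to see what the CLI checks above exercise, the following is a rough sketch of an option with three choices and a prefixed md5 digest. It uses argparse and hashlib purely for illustration; ptool's real command wiring is not shown in this PR description (the UsageError mentioned above suggests a different CLI library), and `build_parser` and `md5_checksum` are hypothetical names.

```python
import argparse
import hashlib
from pathlib import Path

# Choices and default as listed in the PR summary.
CHOICES = ("imohash-64k", "xxhash", "md5")


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; the real ptool command is defined elsewhere."""
    parser = argparse.ArgumentParser(prog="ptool checksums")
    parser.add_argument("path", type=Path)
    parser.add_argument("--checksum-type", choices=CHOICES, default="imohash-64k")
    return parser


def md5_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return the digest with its type prefix."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return f"md5:{h.hexdigest()}"
```

Reading in fixed-size chunks keeps memory use flat on multi-GB files, and embedding the type in the stored value is what lets compare and summary detect mixed snapshots later.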

Closes #4

🤖 Generated with Claude Code

siligam and others added 2 commits May 13, 2026 02:00
- Upgrade imohash default sample size from 16 KB to 64 KB (prefix: imohash-64k)
  to reduce false-negative risk in large file comparisons
- Add --checksum-type option to checksums command supporting imohash-64k
  (default), md5, and xxhash
- Guard compare/summary against mixing CSVs with different checksum types,
  raising a clear error to prevent silent mismatches
- Add xxhash to dependencies (setup.py, environment.yaml)
- Add tests documenting imohash false-negative behaviour

Closes #4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add test_checksum_types.py covering:
  - correct prefix embedding for imohash-64k, md5, xxhash
  - md5 and xxhash catching differences that imohash-64k misses
  - all hashers agreeing on truly identical files
  - compare() raising UsageError on checksum type mismatch
  - compare() passing when both CSVs use the same type
- Fix pre-existing bug in read_csv: "str[pyarrow]" → "string[pyarrow]"
  (invalid dtype string in pandas 2.x, exposed by the new tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Allow other checksum options