Add configurable checksum types (imohash-64k, xxhash, md5)#23
Open
siligam wants to merge 2 commits into
Open
Conversation
- Upgrade imohash default sample size from 16 KB to 64 KB (prefix: imohash-64k) to reduce false-negative risk in large file comparisons - Add --checksum-type option to checksums command supporting imohash-64k (default), md5, and xxhash - Guard compare/summary against mixing CSVs with different checksum types, raising a clear error to prevent silent mismatches - Add xxhash to dependencies (setup.py, environment.yaml) - Add tests documenting imohash false-negative behaviour Closes #4 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add test_checksum_types.py covering: - correct prefix embedding for imohash-64k, md5, xxhash - md5 and xxhash catching differences that imohash-64k misses - all hashers agreeing on truly identical files - compare() raising UsageError on checksum type mismatch - compare() passing when both CSVs use the same type - Fix pre-existing bug in read_csv: "str[pyarrow]" → "string[pyarrow]" (invalid dtype string in pandas 2.x, exposed by the new tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
imohash-64k) to meaningfully reduce false-negative risk while keeping snapshot generation fast--checksum-typeoption toptool checksumswith three choices:imohash-64k(default),xxhash, andmd5compareandsummaryagainst mixing CSVs generated with different checksum types — raises a clear error rather than silently producing wrong resultsxxhashas a package dependencyread_csv("str[pyarrow]"→"string[pyarrow]") which caused aTypeErroron pandas 2.x — discovered via the new testsChecksum type trade-offs
imohash-64kxxhashmd5For multi-GB scientific files the bottleneck is I/O, not the algorithm, so
xxhashandmd5have similar wall-clock cost.xxhashis preferred when correctness matters and MD5 interoperability is not required.Test plan
conda run -n ptool pytest tests/ -v— all 12 tests passptool checksums --help—--checksum-typeoption visible with correct choices and defaultptool checksums --checksum-type xxhash <path>— output prefixed withxxhash:ptool checksums --checksum-type md5 <path>— output prefixed withmd5:ptool compareon two CSVs with mismatched types — raisesUsageErrorwith clear messageCloses #4
🤖 Generated with Claude Code