Modernise repository to open-source standards#24
Open
siligam wants to merge 7 commits into
Open
Conversation
- Replace setup.py with pyproject.toml (PEP 517/518 compliant) - Fix license metadata: MIT → GPL-3.0 (matches LICENSE file) - Fix PyPI classifier: invalid version string → Development Status 4 - Beta - Fix url: GitLab → GitHub - Add CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md, CHANGELOG.md - Add GitHub issue templates (bug report, feature request) - Add GitHub pull request template - Add GitHub Actions CI workflow (Python 3.9–3.12) - Expand .gitignore with standard Python, editor, and OS entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tests/test_checksums.py: split, ignore_re, Results, hasher, stats, scanner, get_files — covers all utility functions and file scanning logic - tests/test_analyse.py: read_csv, compare, compare_compact, merge, directory_map — covers identical/renamed/modified/unique classification - tests/test_cli.py: checksums, compare, summary, prepare-rsync commands via click CliRunner — covers all CLI entry points and flags - Fix pre-existing bug in read_csv: "str[pyarrow]" → "string[pyarrow]" (TypeError on pandas 2.x, exposed by the new tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add CI, license, and Python version badges - Add table of contents and command summary table - Fix filename: environment.yml → environment.yaml - Fix typos: comapre, minumin, invloved, Denpending - Add dev installation instructions (pip install -e ".[dev]") - Add note on checksum algorithm trade-offs (imohash sampling behaviour) - Add note that both CSVs must use the same algorithm before comparing - Clarify prepare-rsync examples (works/fails sections) - Add Contributing and License sections with links - Update license statement: MIT → GPL v3 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Delete ptool/utils.py: dead code that referenced a non-existent ptool_config.yaml and imported yaml (not a declared dependency) - Remove standalone cli() entry point from checksums.py: the registered entry point is ptool.cli:cli; the duplicate was never invoked - Fix prepare_rsync double-open bug: replace open(outfile, "w") (which passes a file object to open()) with outfile.write() on the Click-provided handle; fix hardcoded "sync_cmd.sh" in success message - Clarify sanitise() docstring: document the AWI shared-filesystem rationale for the awi.de special case - Fix DataFrame site/filename attributes: replace monkey-patching with a _SnapshotFrame subclass using pandas _metadata + _constructor so those attributes survive merges, filters, and copies - Fix docstring escape sequences in split(): \, -> \\, (DeprecationWarning) - Replace groupby.apply in merge() with groupby.size() + idxmax(): eliminates FutureWarning about grouping columns being excluded from apply; same max-association-wins algorithm, no apply() needed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace raw file-count scoring in merge() with TF-IDF weighting: - Filenames appearing in many left directories (generic sequential names like my_list00001.out, com_info00001.out) receive low weight IDF = log(total_dirs / dirs_containing_fname) - Filenames unique to one directory carry full weight, correctly anchoring the folder pairing to content-specific files rather than structural noise - Also accept 'md5' as a column name alias for 'checksum' in read_csv(), enabling compatibility with older CSV snapshots Add TestTFIDFFolderMapping with two tests: - test_tfidf_prefers_specific_over_generic: constructs a case where raw count maps dir_a to the wrong directory (3 generic matches > 2 specific matches); TF-IDF correctly maps via the unique anchor file - test_unique_files_still_map_correctly: confirms TF-IDF degrades gracefully to correct behaviour when all filenames are unique Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a folder pair shares filenames but all matched files differ in content (0 identical, 100% modified), the mapping is likely spurious — e.g. MPI partition files generated for different core counts. Adds _suspicious_pairs() to detect this and marks affected rows with [!] in the summary directory map table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imohash samples only 3×16 KB windows (start, middle, end) for files ≥ 128 KB, so two files that differ only in the unsampled region produce the same digest. Tests demonstrate: (1) the false-negative exists for large files, (2) small files are fully read and immune, (3) different file sizes never collide because size is encoded in the digest. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Brings the repository up to open-source standards, fixes several structural issues identified in an architecture review, and adds two analysis improvements for real-world HPC pool comparisons.
Changes
Open-source standards
pyproject.toml— replaces legacysetup.py(PEP 517/518 compliant, resolves pip deprecation warning)licenseMIT → GPL-3.0,Development Statusclassifier,urlGitLab → GitHubCONTRIBUTING.md,CODE_OF_CONDUCT.md,SECURITY.md,CHANGELOG.md.github/ISSUE_TEMPLATE/— bug report and feature request templates.github/PULL_REQUEST_TEMPLATE.md— PR template.github/workflows/ci.yml— GitHub Actions CI running tests on Python 3.9–3.12.gitignore— expanded with standard Python, editor, and OS entriesTest suite (59 tests)
tests/test_checksums.py— split, ignore_re, Results, hasher, stats, scanner, get_filestests/test_analyse.py— read_csv, compare, compare_compact, merge, directory_map, TF-IDF folder mapping, suspicious pair detectiontests/test_cli.py— all four CLI commands via CliRunnertests/test_imohash_false_negatives.py— documents the known imohash sampling limitation (see below)Documentation
--thresholdArchitecture fixes
ptool/utils.py— dead code; referenced a non-existentptool_config.yamland importedyaml(undeclared dependency)cli()fromchecksums.py— the registered entry point isptool.cli:cli; the standalone command was never reachableprepare_rsyncdouble-open bug — replacedopen(outfile, "w")withoutfile.write()on the Click-provided handle; fixed hardcoded"sync_cmd.sh"in the success messagesanitise()docstring — documents the AWI shared-filesystem rationale for theawi.despecial casedf.site = ...) with a_SnapshotFramesubclass using pandas_metadata+_constructorsositeandfilenamesurvive merges, filters, and copiessplit()—\,→\\,(Python 3.12 DeprecationWarning)groupby.applyinmerge()— eliminated FutureWarning by rewriting the max-association algorithm usinggroupby.size()+idxmax()Analysis improvements
TF-IDF weighted folder association
my_list00001.out) over directories sharing a single rare, specific filelog(n_dirs / n_dirs_containing_fname)), so rare filenames anchor directory mappings and ubiquitous ones are downweightedSpurious folder mapping detection (
_suspicious_pairs)_suspicious_pairs()detects these pairs andsummary()marks them with[!]in the directory map table with an explanatory noteOld CSV format compatibility
read_csv()now handles CSVs with the legacymd5column name (renamed tochecksumon load) for backward compatibility with older snapshotsimohash false-negative awareness tests
Test plan
conda run -n ptool python -m pytest tests/ -v)pip install -e ".[dev]"works without deprecation warnings🤖 Generated with Claude Code