Skip to content

Modernise repository to open-source standards#24

Open
siligam wants to merge 7 commits into
mainfrom
open-source-standards
Open

Modernise repository to open-source standards#24
siligam wants to merge 7 commits into
mainfrom
open-source-standards

Conversation

@siligam
Copy link
Copy Markdown
Collaborator

@siligam siligam commented May 13, 2026

Summary

Brings the repository up to open-source standards, fixes several structural issues identified in an architecture review, and adds two analysis improvements for real-world HPC pool comparisons.

Changes

Open-source standards

  • pyproject.toml — replaces legacy setup.py (PEP 517/518 compliant, resolves pip deprecation warning)
  • Metadata fixes: license MIT → GPL-3.0, Development Status classifier, url GitLab → GitHub
  • CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md, CHANGELOG.md
  • .github/ISSUE_TEMPLATE/ — bug report and feature request templates
  • .github/PULL_REQUEST_TEMPLATE.md — PR template
  • .github/workflows/ci.yml — GitHub Actions CI running tests on Python 3.9–3.12
  • .gitignore — expanded with standard Python, editor, and OS entries

Test suite (59 tests)

  • tests/test_checksums.py — split, ignore_re, Results, hasher, stats, scanner, get_files
  • tests/test_analyse.py — read_csv, compare, compare_compact, merge, directory_map, TF-IDF folder mapping, suspicious pair detection
  • tests/test_cli.py — all four CLI commands via CliRunner
  • tests/test_imohash_false_negatives.py — documents the known imohash sampling limitation (see below)

Documentation

  • Added "How it works" section to README with the file association classification table (identical / modified / renamed / unique) and an explanation of the folder mapping algorithm and --threshold

Architecture fixes

  • Deleted ptool/utils.py — dead code; referenced a non-existent ptool_config.yaml and imported yaml (undeclared dependency)
  • Removed duplicate cli() from checksums.py — the registered entry point is ptool.cli:cli; the standalone command was never reachable
  • Fixed prepare_rsync double-open bug — replaced open(outfile, "w") with outfile.write() on the Click-provided handle; fixed hardcoded "sync_cmd.sh" in the success message
  • Clarified sanitise() docstring — documents the AWI shared-filesystem rationale for the awi.de special case
  • Fixed DataFrame metadata attributes — replaced monkey-patching (df.site = ...) with a _SnapshotFrame subclass using pandas _metadata + _constructor so site and filename survive merges, filters, and copies
  • Fixed docstring escape sequences in split()\,\\, (Python 3.12 DeprecationWarning)
  • Replaced groupby.apply in merge() — eliminated FutureWarning by rewriting the max-association algorithm using groupby.size() + idxmax()

Analysis improvements

TF-IDF weighted folder association

  • Folder mapping previously used raw file-match counts, which favoured directories full of generic sequential filenames (e.g. my_list00001.out) over directories sharing a single rare, specific file
  • Now weights each filename match by its IDF score (log(n_dirs / n_dirs_containing_fname)), so rare filenames anchor directory mappings and ubiquitous ones are downweighted
  • Verified with a dedicated test: a directory sharing one specific file + one generic correctly maps to its true counterpart rather than a directory with three generic matches

Spurious folder mapping detection (_suspicious_pairs)

  • When a folder pair shares filenames but every matched file differs in content (0 identical, 100% modified rate), the mapping is likely spurious — e.g. MPI partition files generated for different core counts at different sites
  • _suspicious_pairs() detects these pairs and summary() marks them with [!] in the directory map table with an explanatory note
  • Three tests cover: all-modified pairs flagged, mixed pairs (some identical) not flagged, unique-only directories not flagged

Old CSV format compatibility

  • read_csv() now handles CSVs with the legacy md5 column name (renamed to checksum on load) for backward compatibility with older snapshots

imohash false-negative awareness tests

  • imohash samples only 3×16 KB windows (start, middle, end) for files ≥ 128 KB, so two files that differ only in the unsampled region produce the same digest — a known limitation
  • Three deterministic tests document this: (1) the false-negative exists and is reproducible for large files, (2) small files are fully read and immune, (3) different file sizes never collide because size is encoded in the digest

Test plan

  • 59 tests pass with zero warnings (conda run -n ptool python -m pytest tests/ -v)
  • CI workflow triggers on push/PR and all matrix jobs pass
  • pip install -e ".[dev]" works without deprecation warnings
  • Issue templates appear correctly on GitHub when creating a new issue

🤖 Generated with Claude Code

siligam and others added 7 commits May 13, 2026 02:17
- Replace setup.py with pyproject.toml (PEP 517/518 compliant)
- Fix license metadata: MIT → GPL-3.0 (matches LICENSE file)
- Fix PyPI classifier: invalid version string → Development Status 4 - Beta
- Fix url: GitLab → GitHub
- Add CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md, CHANGELOG.md
- Add GitHub issue templates (bug report, feature request)
- Add GitHub pull request template
- Add GitHub Actions CI workflow (Python 3.9–3.12)
- Expand .gitignore with standard Python, editor, and OS entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tests/test_checksums.py: split, ignore_re, Results, hasher, stats,
  scanner, get_files — covers all utility functions and file scanning logic
- tests/test_analyse.py: read_csv, compare, compare_compact, merge,
  directory_map — covers identical/renamed/modified/unique classification
- tests/test_cli.py: checksums, compare, summary, prepare-rsync commands
  via click CliRunner — covers all CLI entry points and flags
- Fix pre-existing bug in read_csv: "str[pyarrow]" → "string[pyarrow]"
  (TypeError on pandas 2.x, exposed by the new tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add CI, license, and Python version badges
- Add table of contents and command summary table
- Fix filename: environment.yml → environment.yaml
- Fix typos: comapre, minumin, invloved, Denpending
- Add dev installation instructions (pip install -e ".[dev]")
- Add note on checksum algorithm trade-offs (imohash sampling behaviour)
- Add note that both CSVs must use the same algorithm before comparing
- Clarify prepare-rsync examples (works/fails sections)
- Add Contributing and License sections with links
- Update license statement: MIT → GPL v3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Delete ptool/utils.py: dead code that referenced a non-existent
  ptool_config.yaml and imported yaml (not a declared dependency)
- Remove standalone cli() entry point from checksums.py: the registered
  entry point is ptool.cli:cli; the duplicate was never invoked
- Fix prepare_rsync double-open bug: replace open(outfile, "w") (which
  passes a file object to open()) with outfile.write() on the
  Click-provided handle; fix hardcoded "sync_cmd.sh" in success message
- Clarify sanitise() docstring: document the AWI shared-filesystem
  rationale for the awi.de special case
- Fix DataFrame site/filename attributes: replace monkey-patching with a
  _SnapshotFrame subclass using pandas _metadata + _constructor so those
  attributes survive merges, filters, and copies
- Fix docstring escape sequences in split(): \, -> \\, (DeprecationWarning)
- Replace groupby.apply in merge() with groupby.size() + idxmax():
  eliminates FutureWarning about grouping columns being excluded from
  apply; same max-association-wins algorithm, no apply() needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace raw file-count scoring in merge() with TF-IDF weighting:
- Filenames appearing in many left directories (generic sequential names
  like my_list00001.out, com_info00001.out) receive low weight
  IDF = log(total_dirs / dirs_containing_fname)
- Filenames unique to one directory carry full weight, correctly anchoring
  the folder pairing to content-specific files rather than structural noise
- Also accept 'md5' as a column name alias for 'checksum' in read_csv(),
  enabling compatibility with older CSV snapshots

Add TestTFIDFFolderMapping with two tests:
- test_tfidf_prefers_specific_over_generic: constructs a case where raw
  count maps dir_a to the wrong directory (3 generic matches > 2 specific
  matches); TF-IDF correctly maps via the unique anchor file
- test_unique_files_still_map_correctly: confirms TF-IDF degrades
  gracefully to correct behaviour when all filenames are unique

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a folder pair shares filenames but all matched files differ in
content (0 identical, 100% modified), the mapping is likely spurious —
e.g. MPI partition files generated for different core counts. Adds
_suspicious_pairs() to detect this and marks affected rows with [!] in
the summary directory map table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imohash samples only 3×16 KB windows (start, middle, end) for files
≥ 128 KB, so two files that differ only in the unsampled region produce
the same digest. Tests demonstrate: (1) the false-negative exists for
large files, (2) small files are fully read and immune, (3) different
file sizes never collide because size is encoded in the digest.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant