Skip to content

Skip crawling of excluded directories during path resolution#356

Open
mdrum wants to merge 3 commits into
NASA-PDS:mainfrom
sbn-psi:sbnpsi/exclusions-efficiency
Open

Skip crawling of excluded directories during path resolution#356
mdrum wants to merge 3 commits into
NASA-PDS:mainfrom
sbn-psi:sbnpsi/exclusions-efficiency

Conversation

@mdrum

@mdrum mdrum commented May 29, 2026

Copy link
Copy Markdown

🗒️ Summary

Updates ingress path resolution so excluded directory trees are pruned before traversal instead of being walked and filtered only after child files are discovered. This keeps excluded files out of path resolution, checksum manifest preparation, backend verification, and upload work.

Also updates path-resolution progress totals to use the same include/exclude/symlink-aware traversal logic as resolution, documents directory-pruning semantics, and adds regression coverage for directory excludes, deep file globs, include/exclude interaction, and symlink skipping.

🤖 AI Assistance Disclosure

  • No AI assistance used
  • AI used for light assistance (e.g., suggestions, refactoring, documentation help, minor edits)
  • AI used for moderate content generation (AI generated some code or logic, but the developer authored or heavily revised the majority)
  • AI generated substantial portions of this code

Estimated % of code influenced by AI: 100 %

⚙️ Test Data and/or Report

Code

Automated tests added/updated in tests/pds/ingress/util/test_path_util.py.

Local verification:

python -m py_compile src/pds/ingress/util/path_util.py src/pds/ingress/util/progress_util.py src/pds/ingress/client/pds_ingress_client.py tests/pds/ingress/util/test_path_util.py
git diff --check -- src/pds/ingress/util/path_util.py src/pds/ingress/util/progress_util.py src/pds/ingress/client/pds_ingress_client.py tests/pds/ingress/util/test_path_util.py
flake8 src
python -m pytest tests/pds/ingress/util/test_path_util.py
python -m pytest tests

Results:

  • Path utility tests: 9 passed
  • Full test suite: 36 passed
  • Flake8: passed

Live Run

Used this updated version of DUM on live data to upload collection inventories and labels in a directory while excluding any data in subdirectories:

/opt/dum-dummer/bin/pds-ingress-client --log-level warn --prefix /dsk8/catalina/ -c /home/dum/conf.default.ini -n sbn /dsk8/catalina/gbo.ast.catalina.survey/data_raw --num-threads 12 --report-path /var/log/dum_pipeline/reports/data_raw.json --manifest-path manifests/dsk8_catalina/data_raw/manifest.json --exclude '/dsk8/catalina/gbo.ast.catalina.survey/data_raw/*/*'

It had the expected behavior change: Previously, "Resolving ingress paths" would have included all files in excluded directories (in our case, tens of millions of files). Now, it only resolved paths for the expected number of files (2826, due to incrementing collection version every day).

Resolving ingress paths: 100%|████████████████████████████████████████████████████████████| 2826/2826 Files

♻️ Related Issues

No GitHub issue was provided for this change.

🤓 Reviewer Checklist

Reviewers: Please verify the following before approving this pull request.

Documentation and PR Content

  • Documentation: README, Wiki, or inline documentation (Sphinx, Javadoc, Docstrings) have been updated to reflect these changes.
  • Issue Traceability: The PR is linked to a valid GitHub Issue
  • PR Title: The PR title is "user-friendly" clearly identifying what is being fixed or the new feature being added, that if you saw it in the Release Notes for a tool, you would be able to get the gist of what was done.

Security & Quality

  • SonarCloud: Confirmed no new High or Critical security findings.
  • Secrets Detection: Verified that the Secrets Detection scan passed and no sensitive information (keys, tokens, PII) is exposed.
  • Code Quality: Code follows organization style guidelines and best practices for the specific language (e.g., PEP 8, Google Java Style).

Testing & Validation

  • Test Accuracy: Verified that test data is accurate, representative of real-world PDS4 scenarios, and sufficient for the logic being tested.
  • Coverage: Automated tests cover new logic and edge cases.
  • Local Verification: (If applicable) Successfully built and ran the changes in a local or staging environment.

Maintenance

  • Backward Compatibility: Confirmed that these changes do not break existing downstream dependencies or API contracts (or that breaking changes are clearly documented).

@mdrum mdrum requested a review from a team as a code owner May 29, 2026 17:56

@nutjob4life nutjob4life left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a nice refactor and the shared iterator is definitely the right move. Feel free to incorporate or ignore my other comments.

Update: one other quick note, this PR doesn't seem compatible with tox, which what our CI/CD will use to test the code, check formatting, create the documentation, etc.:

$ python3.13 -m venv .venv
$ . .venv/bin/activate
(.venv) $ pip install --editable '.[dev]'
…
(.venv) $ tox
.pkg: _optional_hooks> python /Users/kelly/Documents/Clients/JPL/PDS/Development/nasa-pds/data-upload-manager/.venv/lib/python3.13/site-packages/pyproject_api/_backend.py True setuptools.build_meta
…
Fix End of Files.........................................................Passed
Check that executables have shebangs.....................................Passed
Check for merge conflicts................................................Passed
Debug Statements (Python)................................................Passed
Check Yaml...............................................................Passed
Reorder python imports...................................................Failed
- hook id: reorder-python-imports
- exit code: 1
- files were modified by this hook

Reordering imports in src/pds/ingress/util/progress_util.py

flake8...................................................................Passed
Detect secrets...........................................................Passed
lint: exit 1 (2.56 seconds) /Users/kelly/Documents/Clients/JPL/PDS/Development/nasa-pds/data-upload-manager> python -m pre_commit run --color=always --all pid=34937
  py313: OK (23.97=setup[1.28]+cmd[22.70] seconds)
  docs: OK (1.63=setup[0.60]+cmd[1.03] seconds)
  lint: FAIL code 1 (2.57=setup[0.01]+cmd[2.56] seconds)
  evaluation failed :( (28.31 seconds)

But the following should fix it:

--- a/src/pds/ingress/util/progress_util.py
+++ b/src/pds/ingress/util/progress_util.py
@@ -9,11 +9,10 @@ states of an upload request.
 """
 import multiprocessing
 
+from pds.ingress.util.path_util import PathUtil
 from tqdm import tqdm
 from tqdm.asyncio import tqdm_asyncio
 
-from pds.ingress.util.path_util import PathUtil
-
 PATH_BAR = None
 MANIFEST_BAR = None
 TOTAL_INGRESS_BAR = None

Comment thread src/pds/ingress/util/path_util.py
Comment thread src/pds/ingress/util/path_util.py
@mdrum

mdrum commented May 30, 2026

Copy link
Copy Markdown
Author

Fixed the linter issue and exclusions edge case. The assumption that passed exclusions are relative is consistent with the file filter, and prob worth a new issue/PR if we wanna change that. @nutjob4life could you have another look?

@nutjob4life nutjob4life self-requested a review May 30, 2026 00:52

@nutjob4life nutjob4life left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding @mdrum! Thanks for taking the time with this.

Tests: ✓
Delta: ✓
Approval: ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants