Skip to content

Fix #624: Refine ExtraFieldWarning handling in validation#1443

Open
sejalpunwatkar wants to merge 20 commits into
hdmf-dev:devfrom
sejalpunwatkar:feature/624-validation-warnings-v2
Open

Fix #624: Refine ExtraFieldWarning handling in validation#1443
sejalpunwatkar wants to merge 20 commits into
hdmf-dev:devfrom
sejalpunwatkar:feature/624-validation-warnings-v2

Conversation

@sejalpunwatkar
Copy link
Copy Markdown
Contributor

Motivation

Fixes #624

This PR refines how ExtraFieldWarning is generated during validation.

Currently, some attributes that are valid in extended or inherited specifications may still be incorrectly flagged as unexpected. Additionally, unexpected child elements are not always consistently reported depending on matching conditions.

This PR aims to:

  • Improve handling of ExtraFieldWarning for attributes across related specs
  • Ensure unexpected elements are consistently reported during validation

How to test the behavior?

pytest tests/unit/validator_tests/test_validate.py
pytest tests/unit/common/test_table.py

Focus on:

  • TestExtraFieldIntersection
  • validation behavior for unexpected attributes and elements

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@sejalpunwatkar
Copy link
Copy Markdown
Contributor Author

@rly I opened this new PR after closing the previous one because it had a large diff (-283) due to merge conflicts, which made it hard to review.

In this version, I kept the changes minimal and focused on aligning with the intended behavior from #542 and #624 (warnings instead of errors).

Right now I’m seeing two edge cases:

An attribute like resolution (defined in a parent/extended spec) is still being flagged as an extra field
An unmatched dataset is not always producing an "Unexpected element" warning

Before I refine further, I wanted to confirm if these cases should be handled at the validator level or if this is expected behavior from the spec/matcher side. Thank You!

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 18, 2026

Codecov Report

❌ Patch coverage is 4.03226% with 119 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.48%. Comparing base (0d61982) to head (47699ef).

Files with missing lines Patch % Lines
src/hdmf/validate/validator.py 0.83% 119 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (0d61982) and HEAD (47699ef). Click for more details.

HEAD has 9 uploads less than BASE
Flag BASE (0d61982) HEAD (47699ef)
10 1
Additional details and impacted files
@@             Coverage Diff             @@
##              dev    #1443       +/-   ##
===========================================
- Coverage   93.18%   37.48%   -55.71%     
===========================================
  Files          41       41               
  Lines       10176    10291      +115     
  Branches     2103     2142       +39     
===========================================
- Hits         9483     3858     -5625     
- Misses        415     6071     +5656     
- Partials      278      362       +84     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sejalpunwatkar sejalpunwatkar force-pushed the feature/624-validation-warnings-v2 branch from fd470a8 to 274161a Compare April 23, 2026 14:52
sejalpunwatkar and others added 9 commits April 24, 2026 05:15
…f-dev#1429)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
…dmf-dev#1439)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: rly <310197+rly@users.noreply.github.com>
…umns and extended type attributes

- Skip unmatched typed children whose type is expected by parent spec hierarchy (MissingDataType covers them)
- Skip unmatched untyped children in DynamicTable subtypes (columns like id, foo, VectorData etc.)
- Skip unmatched untyped children when spec defines untyped dataset children (wildcard datasets)
- Check subtypes when looking up attributes to avoid false ExtraFieldWarning for extended types
- Return early from AttributeValidator after ExpectedArrayError to avoid duplicate errors
@sejalpunwatkar sejalpunwatkar force-pushed the feature/624-validation-warnings-v2 branch from f94d6b1 to 27aa910 Compare April 24, 2026 05:17
Comment thread src/hdmf/validate/errors.py Outdated
Comment thread src/hdmf/validate/errors.py Outdated
@rly
Copy link
Copy Markdown
Contributor

rly commented Apr 24, 2026

An attribute like resolution (defined in a parent/extended spec) is still being flagged as an extra field

This should not be flagged as an extra field. Validation should happen against the full, extended spec, not the base VectorData spec.

An unmatched dataset is not always producing an "Unexpected element" warning

Could you provide concrete examples of these? It should.

Comment thread tests/unit/validator_tests/test_validate.py
@sejalpunwatkar sejalpunwatkar force-pushed the feature/624-validation-warnings-v2 branch from 31938db to 0115dc1 Compare April 25, 2026 04:42
@sejalpunwatkar
Copy link
Copy Markdown
Contributor Author

Thank you for the feedback! I've moved resolution into the inline dataset spec within GroupWithBase and removed the subtype lookup. I'll look at #542 for the ValidationWarning/UserWarning approach and update accordingly.

@rly
Copy link
Copy Markdown
Contributor

rly commented Apr 30, 2026

Thanks for the changes @sejalpunwatkar . I'll finish reviewing later this week. In the meantime, could you address the ruff errors? Also there is a significant amount of new code that is not covered by tests. Could you review that code and ensure that it is necessary to implement the fix here? Some of the new code looks like it is hard-coded against some of the attributes that we have in existing tests. Instead of excluding specially named fields during validation, I think the tests should be updated. Feel free to wait on that until I have the bandwidth to do a more complete review.

@sejalpunwatkar
Copy link
Copy Markdown
Contributor Author

Thank you for the feedback @rly ! I'll fix the ruff errors and remove the hardcoded field names; those were added as a workaround and I agree the tests should be updated instead. I'll hold off on further changes until your complete review so I can address everything at once.

sejalpunwatkar and others added 8 commits May 9, 2026 04:55
…f-dev#1429)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
…#1440)

* Set reasonable default chunk sizes (~4 MB) when chunks=True

h5py's default auto-chunking targets ~32 KB per chunk, which is too
small for cloud access where each chunk requires a separate HTTP range
request. Add _compute_chunk_shape() that targets ~4 MB chunks (in the
recommended 2-16 MB range for cloud-hosted files).

The method is applied in __list_fill__, __setup_chunked_dset__, and
__setup_empty_dset__ whenever chunks=True (boolean). Also ensures
chunks=True is explicitly set when maxshape forces chunking, so the
compute logic always triggers.

Closes hdmf-dev#1438

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update src/hdmf/backends/hdf5/h5tools.py

Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>

* Address review: public API, mesoscale-safe chunking, CHANGELOG

- Rename _compute_chunk_shape → compute_default_chunk_shape so users can
  inspect or override the shape that would be used when chunks=True.
- Add optional neurodata_type hook (unused in the default implementation)
  so subclasses can specialize chunking per type without signature churn.
- When a single first-axis slice already exceeds the target (e.g. a
  (1000, 4096, 4096) uint16 mesoscale volume at 32 MB/slice), halve the
  largest trailing axis until the chunk fits within the target. Prevents
  auto-chunks from ballooning well past 16 MB on large-frame datasets.
- Add tests for the new trailing-dim split path and the bytes_per_row==0
  fallback branch flagged in review.
- CHANGELOG entry for the ~4 MB default and the new public helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
…-dev#1445)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ryan Ly <310197+rly@users.noreply.github.com>
@sejalpunwatkar sejalpunwatkar force-pushed the feature/624-validation-warnings-v2 branch from 6578f41 to acfb1ab Compare May 9, 2026 04:55
@sejalpunwatkar
Copy link
Copy Markdown
Contributor Author

Hi @rly! Just checking in on PR #1443 whenever you get a chance. Happy to address all the feedback together once you've had time for the complete review!

The shape of the dataset.
dtype : numpy.dtype or type
The data type, used to determine bytes per element.
target_chunk_bytes : int, optional
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these changes made in this file?

Comment thread .pre-commit-config.yaml
# hooks:
# - id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.11
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo this change as well

Comment thread CHANGELOG.md
@@ -1,5 +1,13 @@
# HDMF Changelog

## Unreleased
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo this change as well. These were released in HDMF 6.0.0. It seems like these came from resolving conflicts from a git merge incorrectly.

@rly
Copy link
Copy Markdown
Contributor

rly commented May 23, 2026

@sejalpunwatkar Thanks for continuing your work on this! As I review the PR, I'm realizing this is a significant change to the validator, and we should think through the implementation plan carefully and break it up into smaller pieces.

I owe you a clarification on my earlier suggestion. When I suggested making ValidationWarning a subclass of UserWarning, I didn't think through how these objects would actually be wired into the validator's existing model. The HDMF validator does not raise Error objects at the point of detection. It collects them and returns a list, leaving it up to the caller to decide what to do with them (display, log, fail the run, etc.). Downstream applications like PyNWB and NWB Inspector treat a non-empty return list as "this file is invalid", so we cannot simply mix ValidationWarning objects into that list. Doing so would silently flip previously-valid files to invalid the moment any warning fires.

We also don't want to emit warnings through Python's warnings.warn() machinery. That would break the validator's "collect and return" model and force downstream tools into warnings.catch_warnings() gymnastics just to render warnings alongside errors in the same report.

I've opened #1479 to track the refactor that needs to land before the warning logic from this PR can go in. The short version: introduce a small ValidationResult wrapper that holds both an errors list and a warnings list, and have ValidatorMap.validate() return it instead of a bare list. The wrapper implements __iter__ / __len__ / __bool__ / __getitem__ over the errors list, so every common pattern downstream uses today keeps working unchanged, and new callers opt in to warnings via result.warnings. No coordinated migration with PyNWB / NWB Inspector required. See #1479 for the full rationale and alternatives considered.

Suggested PR breakdown

Given that, I'd suggest splitting this work into two PRs:

PR 1 (addresses #1479): the plumbing only, no new warnings yet.

  • Add ValidationWarning as a sibling of Error (same .name / .reason / .location shape, not a UserWarning subclass).
  • Add the ValidationResult wrapper.
  • Change ValidatorMap.validate() (and the per-type validators it delegates to) to return ValidationResult instead of a bare list, with warnings always empty for now.
  • Verify the existing validator test suite passes unchanged. The whole point of the wrapper is that it should.

PR 2 (addresses #624): the actual feature.

  • Add ExtraFieldWarning(ValidationWarning).
  • Add the logic to detect attributes and children that no applicable spec recognizes.
  • Add focused tests that assert against result.warnings for the specific cases from Create validation warnings on extra fields #624.

Splitting it this way keeps each PR small enough to review carefully, decouples the API change from the harder semantic questions about when a field is really "extra", and gives us a clean rollback point if either piece turns out to need more thought.

Does this plan make sense? Happy to discuss further before you start cutting it up.

@sejalpunwatkar
Copy link
Copy Markdown
Contributor Author

Hi @rly! Thank you for the detailed explanation and for opening #1479 , this really clarifies the path forward. The ValidationResult wrapper approach makes a lot of sense to me. Keeping the iter/len/bool/getitem interface over the errors list ensures all existing downstream behavior stays intact, while new callers can opt into result.warnings when needed; no coordinated migration required with PyNWB or NWB Inspector.
Yes, the two-PR split sounds like the right approach. PR 1 as pure plumbing keeps the API change isolated and reviewable on its own, and gives a clean rollback point if anything needs more thought. PR 2 can then focus purely on the detection logic for #624 without mixing concerns.
I'm happy to start on PR 1 following the structure outlined in #1479. Does that work for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create validation warnings on extra fields

3 participants