Canonical RDF ontology, graph-native v2 bundles & SHACL validation by jfrench9 · Pull Request #707 · RoboFinSystems/robosystems

jfrench9 · 2026-05-29T22:18:15Z

Summary

This PR introduces a canonical RDF ontology for the taxonomy, migrates bundles to a graph-native v2 serialization format, and adds opt-in SHACL validation at publish time. It also consolidates validation tooling across all demo examples and adds statement-level reconciliation capabilities.

Key Accomplishments

Canonical RDF Ontology (`frameworks/ontology/v1/`)

ontology.ttl — Formal OWL/RDF ontology that reifies XBRL arcs and xbrli concept attributes as first-class RDF properties and classes.
shapes.ttl — SHACL shapes graph for validating bundle conformance against the ontology.
context.jsonld — Shared JSON-LD context enabling compact, human-readable serialization aligned with the ontology.

Graph-Native v2 Bundle Serialization

XBRL dimensional context (entity, period, units, explicit/typed dimensions) is collapsed directly onto fact nodes, eliminating the separate context-to-fact indirection and producing a cleaner RDF graph.
Bundle ontology version stamp corrected to v1 (was incorrectly set to v2).
All taxonomy packages (FAC, RS-GAAP, bridges) updated to use reified arc representations and the new JSON-LD context, resulting in large but mechanical diffs across ~20 taxonomy files.
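As a rough illustration of the reshaping described above, the sketch below collapses context fields onto fact nodes using plain dicts. The function `inline_contexts`, the key names, and the `ex:` identifiers are assumptions made for this example; they are not the repository's bundle classes or its actual JSON-LD keys.

```python
from typing import Any


def inline_contexts(
    facts: list[dict[str, Any]], contexts: dict[str, dict[str, Any]]
) -> list[dict[str, Any]]:
    """Copy each fact's context fields onto the fact and drop the reference."""
    flattened = []
    for fact in facts:
        context = contexts[fact["contextRef"]]
        # Keep every fact field except the indirection itself.
        node = {key: value for key, value in fact.items() if key != "contextRef"}
        # Entity and period now live directly on the fact node.
        node["entity"] = context["entity"]
        node["period"] = context["period"]
        flattened.append(node)
    return flattened


# Hypothetical old-style input: one shared context, one fact pointing at it.
contexts = {"c1": {"entity": "ex:acme", "period": "ex:period-2023-12-31"}}
facts = [{"element": "ex:Assets", "contextRef": "c1", "numericValue": "100.00"}]
flat_facts = inline_contexts(facts, contexts)
```

A consumer written against the old shape would look up `contextRef` and fail on the flattened node, which is the breakage the description warns about.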

SHACL Validation at Publish

New opt-in SHACL validation step integrated into the bundle publish pipeline; results are captured on the Report object.
Dedicated test suites validate SHACL conformance of sample bundles and verify the publish-time validation hook.

Consolidated Validation Tooling

New shared examples/_common/validate.py module provides container-free SHACL and Arelle (XBRL 2.1) validation, replacing the per-demo xbrl_validate.py script.
All three demos (Roboledger, Seattle Method, World Online) updated to use the common validator and now emit both SHACL and XBRL validation reports as sample outputs.

Statement-Level Reconciliation (World Online)

New statement_reconcile.py performs line-item reconciliation of generated financial statements against Charlie's reference data.
Reconciliation output captured as a new sample artifact (world-online-statement-reconciliation.md).

Breaking Changes

Bundle JSON-LD structure changed: Consumers parsing the .jsonld output will see a different graph shape — dimensional context fields now appear inline on fact nodes rather than as separate context objects. Any downstream tooling that relies on the v1 context/fact indirection will need to be updated.
Taxonomy JSON-LD format changed: Arc relationships are now reified RDF resources with explicit source/target/role properties instead of nested shorthand. Taxonomy loaders or external consumers reading these files directly will require updates.
@context prefix xbrl → link renamed: CANONICAL_CONTEXT now binds the XBRL linkbase namespace (http://www.xbrl.org/2003/linkbase#) to the prefix link (was xbrl). All in-repo seeds and bundles are regenerated against the new context, so the repo is internally consistent — but any external consumer, cached snapshot, or tool that references the old xbrl: prefix in JSON-LD will silently produce unresolvable IRIs and must update to link:.
xbrl_validate.py removed: The per-demo validation script in seattle_method_demo has been deleted in favor of the shared common module.
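The prefix rename is easy to miss because nothing fails loudly. The sketch below is a deliberately simplified stand-in for JSON-LD compact-IRI expansion (the real algorithm has more rules); `expand_compact_iri` and the two context dicts are hypothetical names for this example. It shows that a term written with the old prefix is left unexpanded under the new context instead of raising.

```python
LINKBASE_NS = "http://www.xbrl.org/2003/linkbase#"

OLD_CONTEXT = {"xbrl": LINKBASE_NS}  # prefix binding before this PR
NEW_CONTEXT = {"link": LINKBASE_NS}  # prefix binding after this PR


def expand_compact_iri(value: str, context: dict[str, str]) -> str:
    """Expand ``prefix:suffix`` when the prefix is bound; otherwise return it unchanged."""
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    # An unbound prefix is not an error: the value is kept as-is and reads
    # as an IRI with that scheme, which is why the failure is silent.
    return value


old_doc_term = "xbrl:weight"
under_old = expand_compact_iri(old_doc_term, OLD_CONTEXT)
under_new = expand_compact_iri(old_doc_term, NEW_CONTEXT)
migrated = expand_compact_iri("link:weight", NEW_CONTEXT)
```

Here `under_new` stays `"xbrl:weight"`, while `migrated` resolves to the same linkbase IRI that `under_old` produced.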

Testing

New tests added:
- test_publish_validation.py — Verifies opt-in SHACL validation fires during publish and results propagate to the Report.
- test_sample_bundles_shacl.py — Validates all sample bundle outputs against the SHACL shapes graph.
Updated tests:
- Bundle producer, JSON-LD encoder, XBRL emitter, and cross-encoder equivalence tests updated to reflect the v2 graph structure.
- Taxonomy fallback-bucket wiring tests adjusted for the reified arc format.
All three demo pipelines regenerated with updated sample outputs confirming end-to-end correctness.

Infrastructure Considerations

New dependency on a SHACL validation library (reflected in uv.lock / pyproject.toml changes).
The shared validation module supports running SHACL validation without requiring a container runtime, simplifying CI and local development workflows.
Justfile updated with convenience targets for the new validation and reconciliation workflows.

🤖 Generated with Claude Code

Branch Info:

Source: feature/ontology-refactor
Target: main
Type: feature

Co-Authored-By: Claude noreply@anthropic.com

@context

Collapse three drifted RDF dialects for the same concept into one canonical vocabulary (RS topology + XBRL vocabulary), per local/docs/specs/rdf-ontology.md.

Ontology (frameworks/ontology/v1/):
- context.jsonld: published canonical @context (superset of every seed term)
- ontology.ttl: RDFS/OWL class + property declarations
- shapes.ttl: SHACL, with positive shapes + negative shapes banning the retired dialects (xbrli:contextRef, arcFrom, summationOf, …)

Vocabulary (arelle/context.py):
- balance/periodType -> xbrli:; bind link/xlink/xbrldi/iso4217
- structural arcs reified: from/to (xlink), arcrole/role (xlink), weight/order (link), associationType (rs); direct summationOf/parent/generalOf/dimensionOf/hypercubeOf RETIRED
- equivalence stays direct owl:equivalentClass (symmetric, no arc metadata)
- absorb all domain terms (drules, rules, traits, style) so it is the superset

extractor.py: emit reified rs:Association (weight/order/preferredLabel from Arelle) + xbrli concept attrs; deterministic content-hashed association IRIs.

serializer.py: compact predicate keys via the context (readable seeds).

loader.py: read the single canonical reified form + xbrli attrs; structural direct-predicates dropped (equivalence + drules kept).

Seeds: all 18 frameworks/**/taxonomy.jsonld regenerated to canonical form (semantics-preserving, with identical element/association counts).

Deps: + pyshacl.

Verified: tests/arelle + tests/taxonomy green (211); all 18 seeds SHACL-conform; ruff + format + basedpyright clean. Runtime reseed + demo round-trip next.
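The commit mentions deterministic content-hashed association IRIs without spelling out the scheme. One plausible way to get that property is sketched below: hash the fields that identify an arc and append the digest to a base IRI. The base IRI, the choice of hashed fields, the 16-character truncation, and the `ex:` identifiers are all assumptions for illustration, not the extractor's actual behaviour.

```python
import hashlib

ASSOCIATION_BASE = "https://example.org/association/"  # hypothetical IRI base


def association_iri(source: str, target: str, arcrole: str, role: str) -> str:
    """Derive a stable IRI from the fields that identify an association."""
    # A newline separator cannot appear inside an IRI, so the joined key
    # is unambiguous about where one field ends and the next begins.
    key = "\n".join([source, target, arcrole, role])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return ASSOCIATION_BASE + digest[:16]


first = association_iri("ex:Assets", "ex:Cash", "ex:summation-item", "ex:BalanceSheet")
again = association_iri("ex:Assets", "ex:Cash", "ex:summation-item", "ex:BalanceSheet")
other = association_iri("ex:Assets", "ex:Receivables", "ex:summation-item", "ex:BalanceSheet")
```

Because the IRI depends only on content, regenerating a seed from the same taxonomy yields the same identifiers, which keeps diffs between regenerated seeds small.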

…nto facts

Phase B of the canonical RDF ontology migration: the export StatementBundle becomes graph-native (RS topology + XBRL vocabulary), mirroring the LadybugDB reporting graph instead of re-encoding an XBRL instance.

bundle.py: BundlePeriod nodes replace BundleContext; facts carry period_ref / unit_ref / entity_ref directly (the FACT_HAS_* edges); _mint_periods replaces _mint_contexts. The XBRL context is no longer stored on the bundle.

rdf/jsonld.py: rewritten to emit rs:Fact with direct element/entity/period/unit edges, rs:Element (xbrli:balance/periodType), reified rs:Association under rs:Structure, and rs:Period/rs:Unit aspect nodes, using the canonical CANONICAL_CONTEXT. serializationVersion → 2.0. validate_graph now runs SHACL (frameworks/ontology/v1/shapes.ttl), so the same shapes that gate the seeds gate the export, including the negative shapes that ban xbrli:contextRef.

xbrl/xbrl_21.py: _derive_contexts reconstructs <xbrli:context> from the period nodes + entity at emit time (XBRL 2.1 requires shared contexts), so the emitted instance.xml is unchanged and stays Arelle-valid.

Verified: 95 serialization tests green (incl. cross-encoder fact-set equivalence and a negative-shape rejection of re-introduced contextRef); 10,306-test unit suite green. Both Seattle Method demos + the RoboLedger demo re-run on fresh graphs emit serializationVersion 2.0 JSON-LD (no contextRef) with Arelle-valid XBRL and unchanged reconcile figures; sample_output refreshed accordingly.
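The emit-time reconstruction of shared contexts can be pictured as grouping facts by their aspects. The sketch below mints one context id per distinct (entity, period) pair; `derive_contexts`, the dict keys, the `c1`/`c2` id scheme, and the `ex:` identifiers are hypothetical and stand in for whatever `_derive_contexts` actually does.

```python
from typing import Any


def derive_contexts(
    facts: list[dict[str, Any]],
) -> tuple[dict[tuple[str, str], str], list[dict[str, Any]]]:
    """Mint one context id per distinct (entity, period) pair, in first-seen order."""
    context_ids: dict[tuple[str, str], str] = {}
    emitted = []
    for fact in facts:
        key = (fact["entity"], fact["period"])
        # Facts that share entity and period share a single context.
        if key not in context_ids:
            context_ids[key] = f"c{len(context_ids) + 1}"
        emitted.append({"element": fact["element"], "contextRef": context_ids[key]})
    return context_ids, emitted


facts = [
    {"element": "ex:Assets", "entity": "ex:acme", "period": "2023-12-31"},
    {"element": "ex:Liabilities", "entity": "ex:acme", "period": "2023-12-31"},
    {"element": "ex:Assets", "entity": "ex:acme", "period": "2022-12-31"},
]
context_ids, emitted = derive_contexts(facts)
```

The first two facts end up pointing at the same context, so the XML side keeps its shared-context structure even though the bundle no longer stores contexts.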

…rence Adds the rendered-statement reconcile discussed early on: diffs the four-statement Report's seven anchor totals against Charlie Hoffman's published XBRL reference instance (mini/ref-num/instance.xml — the source of his index2.html), complementing the GL-pivot reconcile (which validates ingestion vs SummaryOfTransactions.csv). Our values are read straight from the v2 graph-native bundle (rs:Fact nodes) — the export artifact is the reconciliation source, which is the payoff of the ontology reshape. A mini→rs-gaap anchor map bridges the vocabularies; matching is by period position (Charlie's reference is labelled FY2022/EUR, ours spans 2023→2028, amounts tie regardless). Result: 7/7 anchors tie to the penny, current + prior — Assets, Liabilities & Equity, Net Income, Receivables, PP&E, Long-term Debt, and the −€648K Cash (which is in Charlie's reference report too, not an ingestion error). Wired as demo step 11 + `just demo-world-online-statement-reconcile`.
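Matching by period position instead of by period label can be sketched as follows. The function `reconcile`, the anchor names, and every figure in the example are invented for illustration; they are not the demo's anchor map or the reference instance's values.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def reconcile(
    ours: dict[str, list[Decimal]],
    reference: dict[str, list[Decimal]],
    anchor_map: dict[str, str],
) -> dict[str, bool]:
    """Return, per reference anchor, whether every period ties to the cent."""
    results = {}
    for ref_name, our_name in anchor_map.items():
        ref_values = reference[ref_name]
        our_values = ours[our_name]
        # Values are ordered current-first, so index 0 is compared with
        # index 0 regardless of how each side labels its periods.
        results[ref_name] = len(ref_values) == len(our_values) and all(
            a.quantize(CENT) == b.quantize(CENT)
            for a, b in zip(our_values, ref_values)
        )
    return results


# Toy data: [current, prior] per anchor.
ours = {
    "rs-gaap:Assets": [Decimal("1000.00"), Decimal("900.00")],
    "rs-gaap:Cash": [Decimal("-50.00"), Decimal("25.00")],
}
reference = {
    "mini:Assets": [Decimal("1000.00"), Decimal("900.00")],
    "mini:Cash": [Decimal("-50.00"), Decimal("25.01")],
}
anchor_map = {"mini:Assets": "rs-gaap:Assets", "mini:Cash": "rs-gaap:Cash"}
ties = reconcile(ours, reference, anchor_map)
```

Using `Decimal` avoids binary floating-point noise, so a one-cent difference in the prior period is reported as a miss and not rounded away.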

The graph-native bundle is the first *published* bundle ontology: the XBRL-aligned draft never shipped beyond a one-day demo, so there is no released predecessor to supersede. Stamp it accordingly:
- SERIALIZATION_VERSION "2.0" → "1.0" (the value on every bundle's root)
- IB-envelope datatype IRI /datatype/v2/ → /datatype/v1/
- docstrings/comments describing the artifact as "v2.0" → "v1.0"
- refreshed sample bundles carry serializationVersion "1.0"

The design history (XBRL-aligned → graph-native) lives in the specs; the published artifact + the ontology dir (frameworks/ontology/v1/) are both v1.

…utputs

Each demo now validates the artifacts it just downloaded: on the host, against the on-disk output/ files, with the stack down (no API, no DB, no container):
- examples/_common/validate.py: one shared validator. JSON-LD → pyshacl vs frameworks/ontology/v1/shapes.ttl (semantic conformance) and XBRL zip → Arelle vs the XBRL 2.1 spec (structural conformance). Writes a markdown evidence report per projection.
- Wired as a single `validate` step in all three demos (Seattle Method, World Online, RoboLedger), reading the downloaded .jsonld/.zip. Replaces the old seattle xbrl_validate.py, which re-fetched from the API + queried the DB (a container dependency); it is removed. World Online and RoboLedger gain XBRL/Arelle validation they didn't have.
- tests/operations/serialization/test_sample_bundles_shacl.py: pytest that SHACL-validates every committed demo sample bundle against the ontology, so a non-conformant sample can't land.

Evidence committed: all three demos' sample_output now carries both a *-shacl-validation.md (conforms, 0 violations) and a *-xbrl-validation.md (valid XBRL 2.1, 0 errors).

…the Report

Resolves the publish-path latency concern: SHACL validation of the report bundle is now opt-in, and when it runs its result is persisted with the Report.
- jsonld.py: decouple validation from serialization. serialize_to_jsonld no longer auto-validates (serialization shouldn't block). Add shacl_report(graph) -> ShaclResult (non-raising: ran/conforms/violations/shapes_checked/report + as_dict() for storage); validate_graph stays as the raising/strict wrapper.
- config: REPORT_BUNDLE_SHACL_VALIDATION = off | warn | strict (default off, so the publish path stays fast; the standalone validator + the SHACL regression test cover demos/CI).
- reports.py: _record_bundle_validation in the publish hook. When not off, it SHACL-checks the bundle and records the structured outcome on Report.metadata['bundle_validation'] (audit trail); strict also raises on non-conformance to block the publish. Uses the existing JSONB metadata column, so no migration is needed.

Tests: shacl_report on conforming + violating graphs, as_dict bounding, and the off/warn/strict hook behaviors (7 cases). Full serialization + roboledger suites green (649).
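The three modes described in this commit can be summarised in a few lines. The sketch below is a simplified stand-in, not the code in reports.py: `record_bundle_validation` takes the validator as a callable and a plain dict in place of `Report.metadata`, and the result dict's shape is assumed. It does not handle errors raised by the validator itself.

```python
from typing import Any, Callable


class BundleValidationError(Exception):
    """Raised in strict mode when the bundle does not conform."""


def record_bundle_validation(
    mode: str,
    validate: Callable[[], dict[str, Any]],
    metadata: dict[str, Any],
) -> None:
    """Run validation unless mode is off; store the outcome; block only in strict."""
    mode = (mode or "off").strip().lower()
    if mode == "off":
        return  # Fast path: the validator is never called.
    result = validate()
    metadata["bundle_validation"] = result  # Audit trail in both warn and strict.
    if mode == "strict" and not result["conforms"]:
        raise BundleValidationError(f"{len(result['violations'])} violation(s)")


def failing() -> dict[str, Any]:
    """Hypothetical validator result for a non-conforming bundle."""
    return {"conforms": False, "violations": ["missing rs:period"]}
```

With `failing` as the validator, `off` leaves the metadata untouched, `warn` records the violation and returns, and `strict` records it and then raises.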

jfrench9 · 2026-05-29T22:18:17Z

@claude please review this PR

claude · 2026-05-29T22:18:39Z

Claude finished @jfrench9's task in 4m 1s

PR #707 Review — Canonical RDF Ontology, v2 Bundles & SHACL Validation

Tasks

- Understand the PR scope and breaking changes
- Review ontology files (ontology.ttl, shapes.ttl, context.jsonld)
- Review core serialization changes (jsonld.py, bundle.py)
- Review SHACL validation integration (reports.py, env.py)
- Review new test suites (test_publish_validation.py, test_sample_bundles_shacl.py)
- Review shared validation module (examples/_common/validate.py)
- Review Arelle extractor changes
- Review statement reconciliation (statement_reconcile.py)
- Synthesize findings and post full review

Overview

This is a large, well-motivated PR with a coherent architecture. The core ideas — collapsing XBRL contexts onto Fact nodes, reifying arcs as rs:Association nodes, and enforcing that shape with SHACL — are sound. The ontology is clean, the test coverage for the new validation hook is good, and the negative SHACL shapes as a regression guard are a nice idea. The findings below are mostly correctness/robustness concerns rather than design objections.

Findings

1. `_record_bundle_validation` doesn't guard against exceptions from `pyshacl` in `warn` mode — moderate

robosystems/operations/roboledger/commands/reports.py:358

result = shacl_report(build_graph(bundle))

shacl_report imports and calls pyshacl.validate at call time. If pyshacl raises (parse error, OOM, or any unexpected internal error), the exception propagates out of _record_bundle_validation, through _stamp_report_bundle, and unwinds create_report — causing the entire publish to fail even in warn mode. The env var docs say warn should "capture" the result without blocking. Only strict should ever block.

Suggested fix:

def _record_bundle_validation(bundle: StatementBundle, report_def: Report) -> None:
    mode = (env.REPORT_BUNDLE_SHACL_VALIDATION or "off").strip().lower()
    if mode == "off":
        return
    from robosystems.operations.serialization.rdf.jsonld import (
        BundleValidationError, build_graph, shacl_report,
    )
    try:
        result = shacl_report(build_graph(bundle))
    except Exception:
        logger.exception("SHACL validation error for report %s (mode=%s)", report_def.id, mode)
        if mode == "strict":
            raise
        return
    ...


2. `_load_actual` in `statement_reconcile.py` assumes `@graph` key is present — moderate

examples/seattle_method_world_online/statement_reconcile.py:144

graph = json.loads(path.read_text())["@graph"]

With rdflib's auto_compact=True, a graph with a single named root may be serialized as a bare compacted object (no @graph wrapper) rather than a {"@context": ..., "@graph": [...]} envelope. If the bundle is ever re-serialized and the root structure changes, this line raises an unhandled KeyError and aborts the demo step.

More robust: use rdflib.Graph().parse(str(path), format="json-ld") and iterate subjects, or do:

doc = json.loads(path.read_text())
raw_graph = doc.get("@graph", [doc])  # fall back to single-root bare object


3. `_record_bundle_validation` parameter typed `Any` — minor

robosystems/operations/roboledger/commands/reports.py:340

def _record_bundle_validation(bundle: Any, report_def: Report) -> None:

bundle is always a StatementBundle (the only caller is _stamp_report_bundle). Typing it Any loses basedpyright coverage for the interior build_graph(bundle) call. It should be bundle: StatementBundle, with the import at the top of the file.

4. `ShaclResult.as_dict()` hard-codes the shapes path string — minor

robosystems/operations/serialization/rdf/jsonld.py:476

"shapes_version": "frameworks/ontology/v1/shapes.ttl",

_SHAPES_PATH is already defined at module level as a Path. Using str(_SHAPES_PATH.relative_to(_REPO_ROOT)) here would keep the two in sync automatically if the path is ever moved.

5. Module-level `_SHAPES_CACHE` has a benign but noteworthy write race — low

robosystems/operations/serialization/rdf/jsonld.py:480-491

_SHAPES_CACHE: Graph | None = None

def _shapes_graph() -> Graph | None:
    global _SHAPES_CACHE
    if _SHAPES_CACHE is None:
        ...
        _SHAPES_CACHE = Graph().parse(...)
    return _SHAPES_CACHE

Under CPython with the GIL and async I/O this is unlikely to cause corruption (two tasks seeing None simultaneously would just parse the file twice, then one assignment wins). In a thread-pool executor or under a true-threaded WSGI server, the double-parse is wasteful but still correct since rdflib Graph is replaced atomically. Flagging it for awareness; using a threading.Lock or functools.lru_cache would close it cleanly.
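A double-checked lock is one way to close the race. The sketch below uses a stand-in loader instead of parsing a shapes file with rdflib; `get_cached`, `load_shapes`, and the module-level names are hypothetical.

```python
import threading
from typing import Callable, Optional

_cache: Optional[dict] = None
_cache_lock = threading.Lock()


def get_cached(load: Callable[[], dict]) -> dict:
    """Return the cached value, calling ``load`` at most once across threads."""
    global _cache
    if _cache is None:  # First check: no lock on the common path.
        with _cache_lock:
            if _cache is None:  # Second check: another thread may have won.
                _cache = load()
    return _cache


calls = []


def load_shapes() -> dict:
    """Stand-in for an expensive parse; records each invocation."""
    calls.append(1)
    return {"shapes": "parsed"}


threads = [threading.Thread(target=get_cached, args=(load_shapes,)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Once the cache is populated, callers never touch the lock, so the steady-state cost is a single `is None` comparison.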

6. `validate_arelle` in `validate.py` hard-codes `"instance.xml"` as the root of the XBRL zip — low

examples/_common/validate.py:115

model = client.controller(str(Path(tmp) / "instance.xml"))

This is fine as long as the XBRL 2.1 report package always uses instance.xml as its root (consistent with the existing demos). If the producer ever changes the entry-point filename this will fail silently with a bad model. Consider reading the first .xml from the zip namelist or documenting the contract.
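Resolving the entry point from the archive listing could look like the sketch below. It is stricter than "take the first .xml": it raises when the choice is ambiguous, which is an assumption about the desired policy. `find_entry_point` is a hypothetical helper, and the in-memory zip exists only to exercise it.

```python
import io
import zipfile


def find_entry_point(archive: zipfile.ZipFile) -> str:
    """Return the single .xml member of the archive, or raise if ambiguous."""
    candidates = [name for name in archive.namelist() if name.lower().endswith(".xml")]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one .xml entry, found {candidates}")
    return candidates[0]


# Build a small archive in memory whose instance is not named instance.xml.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.xml", "<xbrl/>")
    archive.writestr("README.txt", "not an instance")

with zipfile.ZipFile(buffer) as archive:
    entry_point = find_entry_point(archive)
```

Raising on zero or multiple candidates turns the silent bad-model case into an explicit error at the point where the archive is opened.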

7. Taxonomy context rename `"xbrl"` → `"link"` is a breaking change not highlighted in the changeset — informational

robosystems/arelle/context.py:36

The CANONICAL_CONTEXT key "xbrl": "http://www.xbrl.org/2003/linkbase#" is renamed to "link". All taxonomy seeds in this PR are regenerated against the new context, so the repo is consistent. But any external consumer, snapshot, or cached seed that references the xbrl: prefix in JSON-LD output will silently produce unresolvable IRIs. The PR description calls out breaking changes to bundle and taxonomy JSON-LD formats, but this specific prefix rename deserves a migration note or a deprecation shim if external consumers are expected.

Ontology / Shapes

The ontology (ontology.ttl) and shapes (shapes.ttl) are well-structured. A few observations:

Positive shapes correctly require rs:element, rs:period, max-1 rs:unit, and xsd:decimal on numericValue. The rs:entity required property is not in FactShape — this is intentional (some Facts may not carry entity, e.g. summary nodes), but worth confirming against the instance layer where _add_facts always adds rs:entity (jsonld.py:381).
Negative shapes are a clean enforcement pattern. Banning xbrli:contextRef, rs:arcFrom, rs:summationOf, rs:parent, rs:generalOf, rs:dimensionOf makes regression from the old dialect a hard failure — good. Verify the SHACL engine (inference="none") correctly fires sh:targetSubjectsOf shapes; pyshacl does support this per SHACL spec without inference.
The link:weight / link:order SHACL comment about not pinning sh:datatype (shapes.ttl:49-51) is well-explained and pragmatic. The comment also documents why — JSON-LD's coercion of floats to xsd:decimal-typed literals.
context.jsonld is large (338 lines) but all terms map to the correct standard IRIs. The double-declaration of "hasAssociation" with "@type": "@id" in both CANONICAL_CONTEXT and _BUNDLE_CONTEXT_EXTRA (as "frameworkPins", etc.) should be fine since they cover different terms, but worth verifying there are no key collisions between the two dicts in _build_context().

Test Coverage

TestShaclReport covers conforming, violation-without-raise, and ShaclResult.as_dict() truncation — solid.
TestRecordBundleValidation covers all three modes (off, warn with conforming, warn with violation, strict raising). Missing: strict + conforming (should not raise) and warn with pyshacl exception (see finding #1 above).
test_sample_bundles_shacl.py is a good regression guard. Parametrized over all examples/*/sample_output/*.jsonld — consider whether this glob should also match the roboledger_demo sample (it does, via *).

Summary

Approve with the following recommended fixes before merge:

Moderate: Guard shacl_report() calls in warn mode against unexpected pyshacl exceptions (finding #1).
Moderate: Fix KeyError risk in statement_reconcile._load_actual (finding #2).

The remaining items (#3–#7) are minor polish. The architecture is sound, the SHACL shapes design is good, the test coverage is meaningful, and the new validate.py consolidation is a genuine improvement.

@graph

- reports.py: guard _record_bundle_validation against validation-infra exceptions so a pyshacl/build_graph failure can't break a warn-mode publish (only strict re-raises); type bundle as StatementBundle.
- jsonld.py: derive shapes_version from _SHAPES_PATH instead of a duplicated literal; close the _SHAPES_CACHE write race with a double-checked lock.
- statement_reconcile.py: tolerate a bare (non-@graph) compacted bundle root in _load_actual.
- _common/validate.py: resolve the XBRL entry point from the zip namelist instead of hard-coding instance.xml.
- Make implicit string concatenations explicit (CodeQL) in validate.py and statement_reconcile.py.
- Add tests: strict+conforming no-raise, warn swallows validation exception, strict re-raises it.

jfrench9 added 6 commits May 29, 2026 14:58

github-code-quality Bot found potential problems May 29, 2026


jfrench9 merged commit 68c2c61 into main May 30, 2026
7 checks passed

jfrench9 deleted the feature/ontology-refactor branch May 30, 2026 00:05

jfrench9 mentioned this pull request May 30, 2026

Clean up XBRL serialization output and add bundle support #708

Merged

jfrench9 merged 7 commits into main from feature/ontology-refactor
