
Canonical RDF ontology, graph-native v2 bundles & SHACL validation #707

Merged
jfrench9 merged 7 commits into main from feature/ontology-refactor
May 30, 2026

@jfrench9 commented May 29, 2026

Summary

This PR introduces a canonical RDF ontology for the taxonomy, migrates bundles to a graph-native v2 serialization format, and adds opt-in SHACL validation at publish time. It also consolidates validation tooling across all demo examples and adds statement-level reconciliation capabilities.

Key Accomplishments

Canonical RDF Ontology (frameworks/ontology/v1/)

  • ontology.ttl — Formal OWL/RDF ontology that reifies XBRL arcs and xbrli concept attributes as first-class RDF properties and classes.
  • shapes.ttl — SHACL shapes graph for validating bundle conformance against the ontology.
  • context.jsonld — Shared JSON-LD context enabling compact, human-readable serialization aligned with the ontology.

Graph-Native v2 Bundle Serialization

  • XBRL dimensional context (entity, period, units, explicit/typed dimensions) is collapsed directly onto fact nodes, eliminating the separate context-to-fact indirection and producing a cleaner RDF graph.
  • Bundle ontology version stamp corrected to v1 (was incorrectly set to v2).
  • All taxonomy packages (FAC, RS-GAAP, bridges) updated to use reified arc representations and the new JSON-LD context, resulting in large but mechanical diffs across ~20 taxonomy files.

SHACL Validation at Publish

  • New opt-in SHACL validation step integrated into the bundle publish pipeline; results are captured on the Report object.
  • Dedicated test suites validate SHACL conformance of sample bundles and verify the publish-time validation hook.

Consolidated Validation Tooling

  • New shared examples/_common/validate.py module provides container-free SHACL and Arelle (XBRL 2.1) validation, replacing the per-demo xbrl_validate.py script.
  • All three demos (Roboledger, Seattle Method, World Online) updated to use the common validator and now emit both SHACL and XBRL validation reports as sample outputs.

Statement-Level Reconciliation (World Online)

  • New statement_reconcile.py performs line-item reconciliation of generated financial statements against Charlie's reference data.
  • Reconciliation output captured as a new sample artifact (world-online-statement-reconciliation.md).

Breaking Changes

  • Bundle JSON-LD structure changed: Consumers parsing the .jsonld output will see a different graph shape — dimensional context fields now appear inline on fact nodes rather than as separate context objects. Any downstream tooling that relies on the v1 context/fact indirection will need to be updated.
  • Taxonomy JSON-LD format changed: Arc relationships are now reified RDF resources with explicit source/target/role properties instead of nested shorthand. Taxonomy loaders or external consumers reading these files directly will require updates.
  • @context prefix xbrl → link renamed: CANONICAL_CONTEXT now binds the XBRL linkbase namespace (http://www.xbrl.org/2003/linkbase#) to the prefix link (was xbrl). All in-repo seeds and bundles are regenerated against the new context, so the repo is internally consistent, but any external consumer, cached snapshot, or tool that references the old xbrl: prefix in JSON-LD will silently produce unresolvable IRIs and must update to link:.
  • xbrl_validate.py removed: The per-demo validation script in seattle_method_demo has been deleted in favor of the shared common module.
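
For external consumers affected by the prefix rename, a one-off migration of cached JSON-LD could look roughly like the sketch below. It is not part of this PR: `rename_prefix`, `migrate_document`, and the sample node `ex:assoc-1` are hypothetical, and it assumes the document carries an inline dict `@context`. It also rewrites any string value that happens to start with `xbrl:`, so it should be reviewed before being run on literal text.

```python
LINKBASE_NS = "http://www.xbrl.org/2003/linkbase#"


def rename_prefix(node, old="xbrl", new="link"):
    """Rewrite compact IRIs `old:term` to `new:term` in keys and string values."""

    def fix(text):
        # "xbrli:balance" is untouched: it does not start with "xbrl:".
        return new + text[len(old):] if text.startswith(old + ":") else text

    if isinstance(node, dict):
        return {fix(key): rename_prefix(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_prefix(item, old, new) for item in node]
    if isinstance(node, str):
        return fix(node)
    return node


def migrate_document(doc):
    """Assumes an inline dict @context; remote or list contexts are out of scope."""
    context = dict(doc.get("@context", {}))
    if context.get("xbrl") == LINKBASE_NS:
        # Rebind the linkbase namespace under the new prefix.
        context["link"] = context.pop("xbrl")
    body = {key: value for key, value in doc.items() if key != "@context"}
    return {"@context": context, **rename_prefix(body)}


# Hypothetical cached snapshot written against the old context.
old_doc = {
    "@context": {"xbrl": LINKBASE_NS},
    "@graph": [{"@id": "ex:assoc-1", "xbrl:weight": 1.0, "xbrli:balance": "debit"}],
}
new_doc = migrate_document(old_doc)
```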

Testing

  • New tests added:
    • test_publish_validation.py — Verifies opt-in SHACL validation fires during publish and results propagate to the Report.
    • test_sample_bundles_shacl.py — Validates all sample bundle outputs against the SHACL shapes graph.
  • Updated tests:
    • Bundle producer, JSON-LD encoder, XBRL emitter, and cross-encoder equivalence tests updated to reflect the v2 graph structure.
    • Taxonomy fallback-bucket wiring tests adjusted for the reified arc format.
  • All three demo pipelines regenerated with updated sample outputs confirming end-to-end correctness.

Infrastructure Considerations

  • New dependency on a SHACL validation library (reflected in uv.lock / pyproject.toml changes).
  • The shared validation module supports running SHACL validation without requiring a container runtime, simplifying CI and local development workflows.
  • Justfile updated with convenience targets for the new validation and reconciliation workflows.

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/ontology-refactor
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

jfrench9 added 6 commits May 29, 2026 14:58
Collapse three drifted RDF dialects for the same concept into one canonical
vocabulary (RS topology + XBRL vocabulary), per local/docs/specs/rdf-ontology.md.

Ontology (frameworks/ontology/v1/):
- context.jsonld: published canonical @context (superset of every seed term)
- ontology.ttl: RDFS/OWL class + property declarations
- shapes.ttl: SHACL — positive shapes + negative shapes banning the retired
  dialects (xbrli:contextRef, arcFrom, summationOf, …)

Vocabulary (arelle/context.py):
- balance/periodType -> xbrli:; bind link/xlink/xbrldi/iso4217
- structural arcs reified: from/to (xlink), arcrole/role (xlink),
  weight/order (link), associationType (rs); direct summationOf/parent/
  generalOf/dimensionOf/hypercubeOf RETIRED
- equivalence stays direct owl:equivalentClass (symmetric, no arc metadata)
- absorb all domain terms (drules, rules, traits, style) so it is the superset

extractor.py: emit reified rs:Association (weight/order/preferredLabel from
Arelle) + xbrli concept attrs; deterministic content-hashed association IRIs.
serializer.py: compact predicate keys via the context (readable seeds).
loader.py: read the single canonical reified form + xbrli attrs; structural
direct-predicates dropped (equivalence + drules kept).
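
The "deterministic content-hashed association IRIs" idea can be illustrated with a minimal sketch. This is an assumption about the general technique, not the extractor's actual code: the function name, the hashed fields, the separator, the digest length, and the `https://example.org/` base are all hypothetical.

```python
import hashlib


def association_iri(base, source, target, arcrole, role):
    """Derive a stable IRI from the content of an arc (hypothetical field set)."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = "\x1f".join([source, target, arcrole, role])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{base}association/{digest}"


# The same arc always yields the same IRI, so regenerated seeds diff cleanly.
first = association_iri("https://example.org/", "ex:Assets", "ex:Cash", "summation-item", "balance-sheet")
again = association_iri("https://example.org/", "ex:Assets", "ex:Cash", "summation-item", "balance-sheet")
other = association_iri("https://example.org/", "ex:Assets", "ex:Debt", "summation-item", "balance-sheet")
```

Hashing the arc's content rather than using a counter or blank node means the identifier does not depend on extraction order.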

Seeds: all 18 frameworks/**/taxonomy.jsonld regenerated to canonical form
(semantics-preserving — identical element/association counts). Deps: + pyshacl.

Verified: tests/arelle + tests/taxonomy green (211); all 18 seeds SHACL-conform;
ruff + format + basedpyright clean. Runtime reseed + demo round-trip next.
…nto facts

Phase B of the canonical RDF ontology migration: the export StatementBundle
becomes graph-native (RS topology + XBRL vocabulary), mirroring the LadybugDB
reporting graph instead of re-encoding an XBRL instance.

bundle.py: BundlePeriod nodes replace BundleContext; facts carry period_ref /
unit_ref / entity_ref directly (the FACT_HAS_* edges); _mint_periods replaces
_mint_contexts. The XBRL context is no longer stored on the bundle.

rdf/jsonld.py: rewritten to emit rs:Fact with direct element/entity/period/unit
edges, rs:Element (xbrli:balance/periodType), reified rs:Association under
rs:Structure, and rs:Period/rs:Unit aspect nodes — using the canonical
CANONICAL_CONTEXT. serializationVersion → 2.0. validate_graph now runs SHACL
(frameworks/ontology/v1/shapes.ttl), so the same shapes that gate the seeds gate
the export, including the negative shapes that ban xbrli:contextRef.

xbrl/xbrl_21.py: _derive_contexts reconstructs <xbrli:context> from the period
nodes + entity at emit time (XBRL 2.1 requires shared contexts), so the emitted
instance.xml is unchanged and stays Arelle-valid.
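
A rough sketch of the emit-time reconstruction described above: facts that share an entity and period are grouped under one synthesized context id. The `Fact` class, `derive_contexts`, and the `c1`, `c2` id scheme are hypothetical stand-ins, and dimensions are ignored for brevity; the real `_derive_contexts` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    element: str
    entity_ref: str
    period_ref: str
    value: str


def derive_contexts(facts):
    """Return ({(entity, period): context_id}, [context_id per fact])."""
    contexts = {}
    fact_context = []
    for fact in facts:
        key = (fact.entity_ref, fact.period_ref)
        if key not in contexts:
            # Ids are assigned in first-seen order, so output is deterministic.
            contexts[key] = f"c{len(contexts) + 1}"
        fact_context.append(contexts[key])
    return contexts, fact_context


facts = [
    Fact("Assets", "entity-1", "fy-current", "100"),
    Fact("Liabilities", "entity-1", "fy-current", "60"),
    Fact("Assets", "entity-1", "fy-prior", "90"),
]
contexts, fact_context = derive_contexts(facts)
```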

Verified: 95 serialization tests green (incl. cross-encoder fact-set equivalence
and a negative-shape rejection of re-introduced contextRef); 10,306-test unit
suite green. Both Seattle Method demos + the RoboLedger demo re-run on fresh
graphs emit serializationVersion 2.0 JSON-LD (no contextRef) with Arelle-valid
XBRL and unchanged reconcile figures; sample_output refreshed accordingly.
…rence

Adds the rendered-statement reconcile discussed early on: diffs the four-statement
Report's seven anchor totals against Charlie Hoffman's published XBRL reference
instance (mini/ref-num/instance.xml — the source of his index2.html), complementing
the GL-pivot reconcile (which validates ingestion vs SummaryOfTransactions.csv).

Our values are read straight from the v2 graph-native bundle (rs:Fact nodes) — the
export artifact is the reconciliation source, which is the payoff of the ontology
reshape. A mini→rs-gaap anchor map bridges the vocabularies; matching is by period
position (Charlie's reference is labelled FY2022/EUR, ours spans 2023→2028, amounts
tie regardless).

Result: 7/7 anchors tie to the penny, current + prior — Assets, Liabilities &
Equity, Net Income, Receivables, PP&E, Long-term Debt, and the −€648K Cash (which
is in Charlie's reference report too, not an ingestion error).

Wired as demo step 11 + `just demo-world-online-statement-reconcile`.
The graph-native bundle is the first *published* bundle ontology — the
XBRL-aligned draft never shipped beyond a one-day demo, so there is no released
predecessor to supersede. Stamp it accordingly:

- SERIALIZATION_VERSION "2.0" → "1.0" (the value on every bundle's root)
- IB-envelope datatype IRI /datatype/v2/ → /datatype/v1/
- docstrings/comments describing the artifact as "v2.0" → "v1.0"
- refreshed sample bundles carry serializationVersion "1.0"

The design history (XBRL-aligned → graph-native) lives in the specs; the
published artifact + the ontology dir (frameworks/ontology/v1/) are both v1.
…utputs

Each demo now validates the artifacts it just downloaded — on the host, against
the on-disk output/ files, with the stack down (no API, no DB, no container):

- examples/_common/validate.py — one shared validator: JSON-LD → pyshacl vs
  frameworks/ontology/v1/shapes.ttl (semantic conformance) and XBRL zip → Arelle
  vs the XBRL 2.1 spec (structural conformance). Writes a markdown evidence
  report per projection.
- Wired as a single `validate` step in all three demos (Seattle Method, World
  Online, RoboLedger), reading the downloaded .jsonld/.zip. Replaces the old
  seattle xbrl_validate.py, which re-fetched from the API + queried the DB (a
  container dependency) — removed. World Online and RoboLedger gain XBRL/Arelle
  validation they didn't have.
- tests/operations/serialization/test_sample_bundles_shacl.py — pytest that
  SHACL-validates every committed demo sample bundle against the ontology, so a
  non-conformant sample can't land.

Evidence committed: all three demos' sample_output now carries both a
*-shacl-validation.md (conforms, 0 violations) and a *-xbrl-validation.md (valid
XBRL 2.1, 0 errors).
…the Report

Resolves the publish-path latency concern: SHACL validation of the report
bundle is now opt-in, and when it runs its result is persisted with the Report.

- jsonld.py: decouple validation from serialization. serialize_to_jsonld no
  longer auto-validates (serialization shouldn't block). Add shacl_report(graph)
  -> ShaclResult (non-raising: ran/conforms/violations/shapes_checked/report +
  as_dict() for storage); validate_graph stays as the raising/strict wrapper.
- config: REPORT_BUNDLE_SHACL_VALIDATION = off | warn | strict (default off, so
  the publish path stays fast — the standalone validator + the SHACL regression
  test cover demos/CI).
- reports.py: _record_bundle_validation in the publish hook — when not off,
  SHACL-checks the bundle and records the structured outcome on
  Report.metadata['bundle_validation'] (audit trail); strict also raises on
  non-conformance to block the publish. Uses the existing JSONB metadata column
  — no migration.
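
The off | warn | strict behavior described in this commit can be sketched in isolation as follows. This is a simplified stand-in, not the code in reports.py: `record_bundle_validation` takes a `check` callable in place of `shacl_report(build_graph(bundle))`, `metadata` is a plain dict standing in for Report.metadata, and the `max_violations` bound and `violation_count` key are assumptions.

```python
from dataclasses import dataclass, field


class BundleValidationError(Exception):
    """Raised in strict mode when the bundle does not conform."""


@dataclass
class ShaclResult:
    ran: bool
    conforms: bool
    violations: list = field(default_factory=list)

    def as_dict(self, max_violations=20):
        # Bound the stored payload so a badly broken bundle cannot bloat metadata.
        return {
            "ran": self.ran,
            "conforms": self.conforms,
            "violation_count": len(self.violations),
            "violations": self.violations[:max_violations],
        }


def record_bundle_validation(mode, metadata, check):
    mode = (mode or "off").strip().lower()
    if mode == "off":
        return  # publish path stays fast; nothing is recorded
    result = check()
    metadata["bundle_validation"] = result.as_dict()  # audit trail
    if mode == "strict" and not result.conforms:
        raise BundleValidationError(f"{len(result.violations)} SHACL violation(s)")


metadata = {}
record_bundle_validation("warn", metadata, lambda: ShaclResult(True, False, ["missing rs:period"]))
```

In warn mode the violation is recorded but the publish continues; only strict turns non-conformance into an exception.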

Tests: shacl_report on conforming + violating graphs, as_dict bounding, and the
off/warn/strict hook behaviors (7 cases). Full serialization + roboledger suites
green (649).
@jfrench9

@claude please review this PR


claude Bot commented May 29, 2026

Claude finished @jfrench9's task in 4m 1s.


PR #707 Review — Canonical RDF Ontology, v2 Bundles & SHACL Validation

Tasks

  • Understand the PR scope and breaking changes
  • Review ontology files (ontology.ttl, shapes.ttl, context.jsonld)
  • Review core serialization changes (jsonld.py, bundle.py)
  • Review SHACL validation integration (reports.py, env.py)
  • Review new test suites (test_publish_validation.py, test_sample_bundles_shacl.py)
  • Review shared validation module (examples/_common/validate.py)
  • Review Arelle extractor changes
  • Review statement reconciliation (statement_reconcile.py)
  • Synthesize findings and post full review

Overview

This is a large, well-motivated PR with a coherent architecture. The core ideas — collapsing XBRL contexts onto Fact nodes, reifying arcs as rs:Association nodes, and enforcing that shape with SHACL — are sound. The ontology is clean, the test coverage for the new validation hook is good, and the negative SHACL shapes as a regression guard are a nice idea. The findings below are mostly correctness/robustness concerns rather than design objections.


Findings

1. _record_bundle_validation doesn't guard against exceptions from pyshacl in warn mode — moderate

robosystems/operations/roboledger/commands/reports.py:358

result = shacl_report(build_graph(bundle))

shacl_report imports and calls pyshacl.validate at call time. If pyshacl raises (parse error, OOM, or any unexpected internal error), the exception propagates out of _record_bundle_validation, through _stamp_report_bundle, and unwinds create_report — causing the entire publish to fail even in warn mode. The env var docs say warn should "capture" the result without blocking. Only strict should ever block.

Suggested fix:

def _record_bundle_validation(bundle: StatementBundle, report_def: Report) -> None:
    mode = (env.REPORT_BUNDLE_SHACL_VALIDATION or "off").strip().lower()
    if mode == "off":
        return
    from robosystems.operations.serialization.rdf.jsonld import (
        BundleValidationError, build_graph, shacl_report,
    )
    try:
        result = shacl_report(build_graph(bundle))
    except Exception:
        logger.exception("SHACL validation error for report %s (mode=%s)", report_def.id, mode)
        if mode == "strict":
            raise
        return
    ...


2. _load_actual in statement_reconcile.py assumes @graph key is present — moderate

examples/seattle_method_world_online/statement_reconcile.py:144

graph = json.loads(path.read_text())["@graph"]

With rdflib's auto_compact=True, a graph with a single named root may be serialized as a bare compacted object (no @graph wrapper) rather than a {"@context": ..., "@graph": [...]} envelope. If the bundle is ever re-serialized and the root structure changes, this raises an unhandled KeyError during the demo.

More robust: use rdflib.Graph().parse(str(path), format="json-ld") and iterate subjects, or do:

doc = json.loads(path.read_text())
raw_graph = doc.get("@graph", [doc])  # fall back to single-root bare object
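
A slightly fuller version of that fallback, as a sketch: `nodes_from_doc` and `load_nodes` are hypothetical helper names, and beyond the two shapes named above it also accepts a top-level JSON array, which JSON-LD permits.

```python
import json
from pathlib import Path


def nodes_from_doc(doc):
    """Normalize the common JSON-LD root shapes to a list of node objects."""
    if isinstance(doc, list):
        return doc  # top-level array of nodes
    if "@graph" in doc:
        graph = doc["@graph"]
        return graph if isinstance(graph, list) else [graph]
    # Bare compacted root: the document itself is the single node.
    return [{key: value for key, value in doc.items() if key != "@context"}]


def load_nodes(path):
    return nodes_from_doc(json.loads(Path(path).read_text(encoding="utf-8")))


wrapped = nodes_from_doc({"@context": {}, "@graph": [{"@id": "ex:f1"}, {"@id": "ex:f2"}]})
bare = nodes_from_doc({"@context": {}, "@id": "ex:f1", "@type": "rs:Fact"})
```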


3. _record_bundle_validation parameter typed Any — minor

robosystems/operations/roboledger/commands/reports.py:340

def _record_bundle_validation(bundle: Any, report_def: Report) -> None:

bundle is always a StatementBundle (the only caller is _stamp_report_bundle). Typing it Any loses basedpyright coverage for the interior build_graph(bundle) call. Should be bundle: StatementBundle with the import at the top of the file.

4. ShaclResult.as_dict() hard-codes the shapes path string — minor

robosystems/operations/serialization/rdf/jsonld.py:476

"shapes_version": "frameworks/ontology/v1/shapes.ttl",

_SHAPES_PATH is already defined at module level as a Path. Using str(_SHAPES_PATH.relative_to(_REPO_ROOT)) here would keep the two in sync automatically if the path is ever moved.

5. Module-level _SHAPES_CACHE has a benign but noteworthy write race — low

robosystems/operations/serialization/rdf/jsonld.py:480-491

_SHAPES_CACHE: Graph | None = None

def _shapes_graph() -> Graph | None:
    global _SHAPES_CACHE
    if _SHAPES_CACHE is None:
        ...
        _SHAPES_CACHE = Graph().parse(...)
    return _SHAPES_CACHE

Under CPython with the GIL and async I/O this is unlikely to cause corruption (two tasks seeing None simultaneously would just parse the file twice, then one assignment wins). In a thread-pool executor or under a true-threaded WSGI server, the double-parse is wasteful but still correct since rdflib Graph is replaced atomically. Flagging it for awareness; using a threading.Lock or functools.lru_cache would close it cleanly.

6. validate_arelle in validate.py hard-codes "instance.xml" as the root of the XBRL zip — low

examples/_common/validate.py:115

model = client.controller(str(Path(tmp) / "instance.xml"))

This is fine as long as the XBRL 2.1 report package always uses instance.xml as its root (consistent with the existing demos). If the producer ever changes the entry-point filename this will fail silently with a bad model. Consider reading the first .xml from the zip namelist or documenting the contract.
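
One way to resolve the entry point from the zip namelist, as a sketch: `pick_entry_point` and `find_entry_point` are hypothetical names, and the selection policy (prefer a member named instance.xml, otherwise require exactly one .xml member) is an assumption rather than a documented contract.

```python
import zipfile
from pathlib import PurePosixPath


def pick_entry_point(names):
    """Choose the XBRL instance member from a list of zip member names."""
    xml_members = [name for name in names if name.lower().endswith(".xml")]
    preferred = [name for name in xml_members if PurePosixPath(name).name == "instance.xml"]
    pool = preferred or xml_members
    if len(pool) != 1:
        # Fail loudly instead of handing the validator a wrong or missing file.
        raise ValueError(f"expected exactly one XBRL entry point, found {pool!r}")
    return pool[0]


def find_entry_point(zip_path):
    with zipfile.ZipFile(zip_path) as archive:
        return pick_entry_point(archive.namelist())
```

Raising on an ambiguous or empty archive turns the silent bad-model failure into an explicit error.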

7. Taxonomy context rename "xbrl" → "link" is a breaking change not highlighted in the changeset — informational

robosystems/arelle/context.py:36

The CANONICAL_CONTEXT key "xbrl": "http://www.xbrl.org/2003/linkbase#" is renamed to "link". All taxonomy seeds in this PR are regenerated against the new context, so the repo is consistent. But any external consumer, snapshot, or cached seed that references the xbrl: prefix in JSON-LD output will silently produce unresolvable IRIs. The PR description calls out breaking changes to bundle and taxonomy JSON-LD formats, but this specific prefix rename deserves a migration note or a deprecation shim if external consumers are expected.


Ontology / Shapes

The ontology (ontology.ttl) and shapes (shapes.ttl) are well-structured. A few observations:

  • Positive shapes correctly require rs:element, rs:period, max-1 rs:unit, and xsd:decimal on numericValue. The rs:entity required property is not in FactShape — this is intentional (some Facts may not carry entity, e.g. summary nodes), but worth confirming against the instance layer where _add_facts always adds rs:entity (jsonld.py:381).

  • Negative shapes are a clean enforcement pattern. Banning xbrli:contextRef, rs:arcFrom, rs:summationOf, rs:parent, rs:generalOf, rs:dimensionOf makes regression from the old dialect a hard failure — good. Verify the SHACL engine (inference="none") correctly fires sh:targetSubjectsOf shapes; pyshacl does support this per SHACL spec without inference.

  • The link:weight / link:order SHACL comment about not pinning sh:datatype (shapes.ttl:49-51) is well-explained and pragmatic. The comment also documents why — JSON-LD's coercion of floats to xsd:decimal-typed literals.

  • context.jsonld is large (338 lines) but all terms map to the correct standard IRIs. The double-declaration of "hasAssociation" with "@type": "@id" in both CANONICAL_CONTEXT and _BUNDLE_CONTEXT_EXTRA (as "frameworkPins", etc.) should be fine since they cover different terms, but worth verifying there are no key collisions between the two dicts in _build_context().


Test Coverage

  • TestShaclReport covers conforming, violation-without-raise, and ShaclResult.as_dict() truncation — solid.
  • TestRecordBundleValidation covers all three modes (off, warn with conforming, warn with violation, strict raising). Missing: strict + conforming (should not raise) and warn with pyshacl exception (see finding #1 above).
  • test_sample_bundles_shacl.py is a good regression guard. Parametrized over all examples/*/sample_output/*.jsonld — consider whether this glob should also match the roboledger_demo sample (it does, via *).

Summary

Approve with the following recommended fixes before merge:

  1. Moderate: Guard shacl_report() calls in warn mode against unexpected pyshacl exceptions (finding #1).
  2. Moderate: Fix KeyError risk in statement_reconcile._load_actual (finding #2).

The remaining items (#3–#7) are minor polish. The architecture is sound, the SHACL shapes design is good, the test coverage is meaningful, and the new validate.py consolidation is a genuine improvement.

Comment thread examples/seattle_method_world_online/statement_reconcile.py Fixed
Comment thread examples/seattle_method_world_online/statement_reconcile.py Fixed
Comment thread examples/_common/validate.py Fixed
Comment thread examples/_common/validate.py Fixed
Comment thread examples/_common/validate.py Fixed
- reports.py: guard _record_bundle_validation against validation-infra
  exceptions so a pyshacl/build_graph failure can't break a warn-mode
  publish (only strict re-raises); type bundle as StatementBundle.
- jsonld.py: derive shapes_version from _SHAPES_PATH instead of a
  duplicated literal; close the _SHAPES_CACHE write race with a
  double-checked lock.
- statement_reconcile.py: tolerate a bare (non-@graph) compacted bundle
  root in _load_actual.
- _common/validate.py: resolve the XBRL entry point from the zip
  namelist instead of hard-coding instance.xml.
- Make implicit string concatenations explicit (CodeQL) in validate.py
  and statement_reconcile.py.
- Add tests: strict+conforming no-raise, warn swallows validation
  exception, strict re-raises it.
@jfrench9 merged commit 68c2c61 into main May 30, 2026
7 checks passed
@jfrench9 deleted the feature/ontology-refactor branch May 30, 2026 00:05