FACTSlab · aaronstevenwhite · May 29, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -180,6 +180,8 @@ jobs:
     needs: [ts-build]  # Python tests need TypeScript compiled
     steps:
       - uses: actions/checkout@v4
+        with:
+          submodules: true  # vendor/layers lexicons for interop validation
 
       - name: Install pnpm
         uses: pnpm/action-setup@v4

diff --git a/.gitignore b/.gitignore
@@ -34,4 +34,6 @@ tests/fixtures/cli_work/
 /exports/
 /trial_config_*.json
 /*.jzip
-.claude/
+.claude/
+# Hypothesis example database
+.hypothesis/
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "vendor/layers"]
+	path = vendor/layers
+	url = https://github.com/layers-pub/layers.git
+	branch = main
+	shallow = true
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,94 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+## [0.6.0] - 2026-05-29
+
+### Added
+
+#### `bead.corpus` — streaming corpus ingestion and structural sampling
+
+- New subpackage `bead.corpus` for turning raw text corpora into experimental
+  `Item`s. `CorpusRecord` carries text plus flat provenance; `CorpusSource` is
+  a streaming-source protocol.
+- Sources: `JsonlCorpusSource` (JSON Lines, transparently decompressing
+  Zstandard `.zst` files), `CsvCorpusSource` (CSV/TSV), and
+  `CompletionCorpusSource` (a language model as a corpus source, via the new
+  `TextGenerator` protocol on the OpenAI and Anthropic adapters).
+- Lazy pipeline: `parse_records`, `filter_by_structure`, `sample_corpus`, and
+  `record_to_item` stream records through a dependency parser and keep only
+  those whose parse satisfies a structural DSL constraint, producing `Item`s
+  with standoff parse annotations and source provenance. The pipeline never
+  loads the full corpus into memory.
+- New `corpus` optional-dependency extra (`zstandard`).
+
+#### Dependency parsing in `bead.tokenization`
+
+- New `bead.tokenization.parsers`: `SpacyParser`, `StanzaParser`, and
+  `create_parser` produce a per-sentence `ParsedSentence` of `ParsedToken`
+  records (token, lemma, upos, xpos, head, deprel, morphology, offsets).
+- `parse_to_spans` projects a dependency parse onto the standoff `Span` +
+  `SpanRelation` models: one single-token span per token (with its governor as
+  `head_index` and its features in `span_metadata`) and one directed
+  head-to-dependent relation per syntactic arc.
+
+#### Structural-query builtins in the constraint DSL
+
+- New `bead.dsl` standard-library functions query a dependency parse stored on
+  an `Item`: `upos`, `xpos`, `lemma_of`, `form_of`, `deprel`, `morph`, `head`,
+  `dependents`, `has_relation`, `root`, `subtree`, `path_to_root`,
+  `tokens_with_upos`, `tokens_with_deprel`, `any_deprel`, and `filter_upos`.
+  Constraints can now match syntactic structure, e.g.
+  `upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0`.
+
+#### Text transforms for corpus cleanup
+
+- New transforms in `bead.transforms.text`: `MarkdownStripTransform`,
+  `RedditCleanupTransform`, and the `split_sentences` helper (parser-backed or
+  regex fallback). The first two are registered in the default transform
+  registry.
+
+#### `bead.corpus` buffering graph tier
+
+- New `bead.corpus.graph`: `CorpusGraph`, a typed multidigraph of
+  `CorpusNode`s and `CorpusEdge`s (parallel typed edges allowed; trees are a
+  special case), with traversal helpers (`children`, `parents`, `roots`,
+  `out_edges`, `in_edges`, `subtree`, `node_by_id`).
+- New `bead.corpus.assemble`: `assemble_graph` buffers a record stream into a
+  `CorpusGraph`, building edges from declarative `EdgeSpec`s or a runtime edge
+  function. Reconstructs thread structure such as Reddit reply trees from
+  `parent_id`/`link_id`. This tier is opt-in and layered on top of the
+  streaming pipeline, which is untouched.
+
+#### `bead.interop.layers` — lossless layers interop
+
+- New subpackage mapping bead data to and from the
+  [layers](https://github.com/layers-pub/layers) linguistic-annotation schema
+  as law-verified didactic lenses (`dx.Iso` for bijections, `dx.Lens` with a
+  complement for projections), so every round-trip is exact.
+- Faithful mirror models for the layers shared defs and record types, each with
+  a generic lossless `MirrorIso` to and from layers-shaped JSON (snake/camel
+  case, feature maps, slug+uri enums, integer confidence, `$type` unions).
+- Bridge lenses map bead-native models onto layers constructs: `CorpusRecord`
+  to an `expression`, `CorpusGraph` to a property graph (`expression`s,
+  `graphNode`s, and a `graphEdgeSet`), and a dependency-parsed `ParsedSentence`
+  to a `tokenization` plus part-of-speech and dependency `annotationLayer`s. The
+  lens complement holds the bead-only remainder (framework identity and fields
+  layers has no slot for). Resource-overlap lenses map lexical items, lexicons,
+  and templates to the layers resource constructs.
+- Mappings are validated against the layers lexicons, vendored as the
+  `vendor/layers` git submodule, using the ATProto lexicon validator
+  (`@atproto/lexicon`), confirming each mapping emits schema-valid records.
+
+### Changed
+
+- Minimum `didactic` raised to `>=0.7.2` and `panproto` to `>=0.51.0`.
+- Streaming corpus ingestion is now lossless by default: `JsonlCorpusSource`
+  and `CsvCorpusSource` retain every field (not just a configured subset), and
+  non-scalar values round-trip through JSON rather than being stringified, so
+  no source information is dropped at ingestion.
+
 ## [0.5.0] - 2026-05-12
 
 ### Added
@@ -440,6 +528,10 @@ guards as type-checkers.
 - CI/CD: GitHub Actions for testing, docs, PyPI publishing
 - Read the Docs integration
 
-[Unreleased]: https://github.com/FACTSlab/bead/compare/v0.2.0...HEAD
+[Unreleased]: https://github.com/FACTSlab/bead/compare/v0.6.0...HEAD
+[0.6.0]: https://github.com/FACTSlab/bead/compare/v0.5.0...v0.6.0
+[0.5.0]: https://github.com/FACTSlab/bead/compare/v0.4.0...v0.5.0
+[0.4.0]: https://github.com/FACTSlab/bead/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/FACTSlab/bead/compare/v0.2.0...v0.3.0
 [0.2.0]: https://github.com/FACTSlab/bead/compare/v0.1.0...v0.2.0
 [0.1.0]: https://github.com/FACTSlab/bead/releases/tag/v0.1.0
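Reviewer note on the structural-query builtins in the 0.6.0 entry: the example constraint reads as "the root is a verb with at least one `obj` dependent". The plain-Python analogue below may make that concrete. Everything in it (`Token`, `root`, `dependents`, `verbal_root_with_object`, and the convention that `head=None` marks the root) is a hypothetical stand-in and not the `bead.dsl` API, whose signatures are not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one parsed token."""

    form: str
    upos: str
    head: int | None  # index of the governor; None marks the root (assumed)
    deprel: str


def root(tokens: list[Token]) -> int:
    """Return the index of the first token that has no governor."""
    return next(i for i, tok in enumerate(tokens) if tok.head is None)


def dependents(tokens: list[Token], head: int, deprel: str | None = None) -> list[int]:
    """Return indices of tokens governed by *head*, optionally filtered by relation."""
    return [
        i
        for i, tok in enumerate(tokens)
        if tok.head == head and (deprel is None or tok.deprel == deprel)
    ]


def verbal_root_with_object(tokens: list[Token]) -> bool:
    """Analogue of the changelog constraint: verbal root with an ``obj`` dependent."""
    r = root(tokens)
    return tokens[r].upos == "VERB" and len(dependents(tokens, r, "obj")) > 0


transitive = [
    Token("She", "PRON", 1, "nsubj"),
    Token("reads", "VERB", None, "root"),
    Token("books", "NOUN", 1, "obj"),
]
intransitive = [
    Token("She", "PRON", 1, "nsubj"),
    Token("sleeps", "VERB", None, "root"),
]
```

Under these assumptions the first sentence passes the filter and the second does not, which is the kind of selection `filter_by_structure` is described as performing on a record stream.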
diff --git a/README.md b/README.md
@@ -30,12 +30,18 @@ uv pip install bead[training]  # PyTorch Lightning, TensorBoard
 ### Development
 
 ```bash
-git clone https://github.com/FACTSlab/bead.git
+git clone --recurse-submodules https://github.com/FACTSlab/bead.git
 cd bead
 uv sync --all-extras
 uv run pytest tests/
 ```
 
+The `vendor/layers` submodule holds the layers lexicons that the interop tests
+validate against. If you cloned without `--recurse-submodules`, fetch them with
+`git submodule update --init vendor/layers`, and refresh to the latest published
+lexicons with `git submodule update --remote vendor/layers`. The lexicon
+validation tests skip automatically when the submodule is absent.
+
 Always use `uv run` to execute commands.
 
 ## Quick Start

diff --git a/bead/__init__.py b/bead/__init__.py
@@ -6,6 +6,6 @@
 
 from __future__ import annotations
 
-__version__ = "0.5.0"
+__version__ = "0.6.0"
 __author__ = "Aaron Steven White"
 __email__ = "aaron.white@rochester.edu"
diff --git a/bead/corpus/__init__.py b/bead/corpus/__init__.py
@@ -0,0 +1,43 @@
+"""Streaming corpus ingestion and structural sampling.
+
+Turns raw external text (JSONL, optionally Zstandard-compressed; CSV/TSV) into
+structurally filtered experimental ``Item``s: stream ``CorpusRecord``s from a
+``CorpusSource``, dependency-parse them, and keep only those whose parse
+satisfies a structural DSL constraint.
+"""
+
+from __future__ import annotations
+
+from bead.corpus.assemble import EdgeSpec, assemble_graph
+from bead.corpus.base import CorpusSource
+from bead.corpus.graph import CorpusEdge, CorpusGraph, CorpusNode
+from bead.corpus.pipeline import (
+    filter_by_structure,
+    parse_records,
+    record_to_item,
+    sample_corpus,
+)
+from bead.corpus.records import CorpusRecord, ProvenanceValue
+from bead.corpus.sources import (
+    CompletionCorpusSource,
+    CsvCorpusSource,
+    JsonlCorpusSource,
+)
+
+__all__ = [
+    "CompletionCorpusSource",
+    "CorpusEdge",
+    "CorpusGraph",
+    "CorpusNode",
+    "CorpusRecord",
+    "CorpusSource",
+    "CsvCorpusSource",
+    "EdgeSpec",
+    "JsonlCorpusSource",
+    "ProvenanceValue",
+    "assemble_graph",
+    "filter_by_structure",
+    "parse_records",
+    "record_to_item",
+    "sample_corpus",
+]
diff --git a/bead/corpus/assemble.py b/bead/corpus/assemble.py
@@ -0,0 +1,118 @@
+"""Buffer a record stream into a typed multidigraph.
+
+``assemble_graph`` is the opt-in buffering tier that sits on top of the lazy
+streaming sources: it consumes ``CorpusRecord``s and reconstructs the structure
+between them (e.g. a Reddit reply tree from ``parent_id``, or an arbitrary typed
+graph) as a :class:`~bead.corpus.graph.CorpusGraph`. It holds the records in
+memory, so it is a deliberate, explicit step distinct from streaming.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Callable, Iterable, Sequence
+
+import didactic.api as dx
+
+from bead.corpus.graph import CorpusEdge, CorpusGraph, CorpusNode
+from bead.corpus.records import CorpusRecord
+from bead.data.base import BeadBaseModel
+
+
+class EdgeSpec(BeadBaseModel):
+    """Declarative rule for deriving one typed edge per record from a field.
+
+    For each record, if ``target_field`` is present in the record's provenance,
+    an edge ``record_node -> target`` is created with type ``edge_type``. The
+    target id is the field value with the first matching ``strip_prefixes`` entry
+    removed (e.g. Reddit's ``t1_``/``t3_`` fullname prefixes).
+
+    Attributes
+    ----------
+    target_field : str
+        Provenance field naming the other endpoint (e.g. ``"parent_id"``).
+    edge_type : str
+        Edge type slug for the created edge (e.g. ``"reply-to"``).
+    edge_type_uri : str | None
+        Optional canonical edge-type URI.
+    strip_prefixes : tuple[str, ...]
+        Prefixes to strip from the field value to recover the bare node id.
+    directed : bool
+        Whether the created edge is directed.
+    """
+
+    target_field: str
+    edge_type: str
+    edge_type_uri: str | None = None
+    strip_prefixes: tuple[str, ...] = ()
+    directed: bool = True
+
+    @dx.validates("target_field", "edge_type")
+    def _check_non_empty(self, value: str) -> str:
+        if not value or not value.strip():
+            raise ValueError("must be non-empty")
+        return value.strip()
+
+
+def _strip_prefix(value: str, prefixes: tuple[str, ...]) -> str:
+    """Strip the first matching prefix from *value*."""
+    for prefix in prefixes:
+        if prefix and value.startswith(prefix):
+            return value[len(prefix) :]
+    return value
+
+
+def assemble_graph(
+    records: Iterable[CorpusRecord],
+    *,
+    node_id_field: str,
+    edge_specs: Sequence[EdgeSpec] = (),
+    edge_fn: Callable[[CorpusRecord, str], Iterable[CorpusEdge]] | None = None,
+) -> CorpusGraph:
+    """Buffer a record stream into a typed multidigraph.
+
+    Each record with a ``node_id_field`` value becomes one expression node;
+    records without one are skipped. Edges are derived from the declarative
+    ``edge_specs`` and/or a runtime ``edge_fn`` (given the record and its node id).
+
+    Parameters
+    ----------
+    records : Iterable[CorpusRecord]
+        The records to buffer (typically a streaming source).
+    node_id_field : str
+        Provenance field holding each record's stable node id.
+    edge_specs : Sequence[EdgeSpec]
+        Declarative field-to-edge rules (the common case).
+    edge_fn : Callable[[CorpusRecord, str], Iterable[CorpusEdge]] | None
+        Optional function yielding extra edges for arbitrary structure.
+
+    Returns
+    -------
+    CorpusGraph
+        The assembled graph. Edges may reference target ids that have no node
+        (dangling references are preserved, not dropped).
+    """
+    nodes: list[CorpusNode] = []
+    edges: list[CorpusEdge] = []
+    for record in records:
+        node_id_raw = record.provenance.get(node_id_field)
+        if node_id_raw is None:
+            continue
+        node_id = str(node_id_raw)
+        nodes.append(CorpusNode(node_id=node_id, record=record))
+        for spec in edge_specs:
+            target_raw = record.provenance.get(spec.target_field)
+            if target_raw is None:
+                continue
+            target_id = _strip_prefix(str(target_raw), spec.strip_prefixes)
+            edges.append(
+                CorpusEdge(
+                    source_id=node_id,
+                    target_id=target_id,
+                    edge_type=spec.edge_type,
+                    edge_type_uri=spec.edge_type_uri,
+                    directed=spec.directed,
+                )
+            )
+        if edge_fn is not None:
+            edges.extend(edge_fn(record, node_id))
+    return CorpusGraph(nodes=tuple(nodes), edges=tuple(edges))
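Reviewer note on `assemble_graph`: the edge-derivation loop can be traced on a toy Reddit thread without installing bead. The sketch below mirrors the logic in this file using plain dicts in place of `CorpusRecord.provenance` and tuples in place of `CorpusEdge`. The names `strip_prefix`, `reply_edges`, and the sample `comments` data are illustrative assumptions, not part of the package.

```python
from collections.abc import Iterable, Mapping


def strip_prefix(value: str, prefixes: tuple[str, ...]) -> str:
    """Remove the first matching non-empty prefix, mirroring ``_strip_prefix``."""
    for prefix in prefixes:
        if prefix and value.startswith(prefix):
            return value[len(prefix):]
    return value


def reply_edges(
    provenances: Iterable[Mapping[str, object]],
    *,
    node_id_field: str = "id",
    target_field: str = "parent_id",
    prefixes: tuple[str, ...] = ("t1_", "t3_"),
) -> list[tuple[str, str]]:
    """Return ``(source_id, target_id)`` pairs for records that name a parent."""
    edges: list[tuple[str, str]] = []
    for prov in provenances:
        node_id = prov.get(node_id_field)
        if node_id is None:
            continue  # no id: the record yields neither a node nor an edge
        target = prov.get(target_field)
        if target is None:
            continue  # no parent field: node only, no edge
        edges.append((str(node_id), strip_prefix(str(target), prefixes)))
    return edges


comments = [
    {"id": "c1", "parent_id": "t3_post"},  # top-level comment on a submission
    {"id": "c2", "parent_id": "t1_c1"},    # reply to c1
    {"parent_id": "t1_c2"},                # missing id, so skipped entirely
    {"id": "c3"},                          # no parent, so no edge
]
edges = reply_edges(comments)
```

Note that the edge from `c1` points at `post`, which has no node in the stream. That matches the documented behavior: dangling references are preserved, not dropped. Only one prefix is stripped per value, so a value such as `t1_t3_x` becomes `t3_x`.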
diff --git a/bead/corpus/base.py b/bead/corpus/base.py
@@ -0,0 +1,30 @@
+"""Corpus source protocol.
+
+A ``CorpusSource`` is anything that streams ``CorpusRecord``s. It is modeled as
+a runtime-checkable ``Protocol`` (behavior, not data) rather than a didactic
+model, mirroring the transform protocols elsewhere in bead.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Iterator
+from typing import Protocol, runtime_checkable
+
+from bead.corpus.records import CorpusRecord
+
+
+@runtime_checkable
+class CorpusSource(Protocol):
+    """A streaming source of corpus records.
+
+    Attributes
+    ----------
+    source_name : str
+        Identifier stamped onto every record's ``source_name``.
+    """
+
+    source_name: str
+
+    def __iter__(self) -> Iterator[CorpusRecord]:
+        """Iterate the records of the source."""
+        ...
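Reviewer note on `CorpusSource`: because the protocol is `runtime_checkable`, any object with a `source_name` attribute and an `__iter__` method passes an `isinstance` check, with no inheritance required. The sketch below shows that pattern with stand-in types. `Record`, `Source`, and `ListSource` are hypothetical names used so the example runs without bead; they are not the package's classes.

```python
from collections.abc import Iterator
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for ``CorpusRecord``."""

    text: str
    source_name: str


@runtime_checkable
class Source(Protocol):
    """Stand-in with the same shape as the ``CorpusSource`` protocol."""

    source_name: str

    def __iter__(self) -> Iterator[Record]:
        """Iterate the records of the source."""
        ...


class ListSource:
    """In-memory source; it satisfies ``Source`` structurally, not by subclassing."""

    def __init__(self, source_name: str, texts: list[str]) -> None:
        self.source_name = source_name
        self._texts = texts

    def __iter__(self) -> Iterator[Record]:
        # Stamp the source's name onto every record it yields.
        for text in self._texts:
            yield Record(text=text, source_name=self.source_name)


source = ListSource("demo", ["First sentence.", "Second sentence."])
records = list(source)
```

The runtime check only tests that the members exist. It does not verify that `source_name` is a `str` or that iteration yields records, so a static type checker is still what enforces the full contract.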