Merged
23 commits
1393f1f
Upgrades didactic 0.6.2 -> 0.7.2 and panproto 0.44.0 -> 0.51.0
aaronstevenwhite May 29, 2026
e14c68f
Adds dependency parsing into standoff spans + DSL structural querying
aaronstevenwhite May 29, 2026
846ba69
Adds streaming corpus ingestion + structural rejection sampling
aaronstevenwhite May 29, 2026
d25328b
Adds LM completion corpus source + Reddit/markdown text transforms
aaronstevenwhite May 29, 2026
49c76a5
Strengthens Stanza integration tests to exercise the real pipeline
aaronstevenwhite May 29, 2026
39f4898
Removes all type/lint suppressions and Any/object hints from the new …
aaronstevenwhite May 29, 2026
a28d4de
Documents corpus ingestion, dependency parsing, structural DSL, and t…
aaronstevenwhite May 29, 2026
812d23a
Rewrites the DSL evaluator's Any as a precise DslValue type
aaronstevenwhite May 29, 2026
07f7138
Removes redundant test suppressions
aaronstevenwhite May 29, 2026
e313253
Makes streaming corpus ingestion lossless by default
aaronstevenwhite May 29, 2026
2438905
Adds buffering corpus graph tier (typed multidigraph + assembler)
aaronstevenwhite May 29, 2026
8e754d3
Adds lossless CorpusGraph <-> layers graph lens (didactic dx.Lens)
aaronstevenwhite May 29, 2026
785ee7a
Adds CorpusRecord <-> layers expression bridge lens
aaronstevenwhite May 29, 2026
4ced3c6
Adds ParsedSentence <-> layers annotation iso (dependency parse)
aaronstevenwhite May 29, 2026
5cf8a65
Adds faithful mirror models + generic lossless iso for layers shared …
aaronstevenwhite May 29, 2026
7c9f26b
Adds faithful mirrors + isos for the linguistic layers record types
aaronstevenwhite May 29, 2026
c7d5a85
Documents layers interop + buffering graph; adds coverage-guard test
aaronstevenwhite May 29, 2026
7df5f5c
Adds resource-overlap lenses (lexical item / lexicon / template <-> l…
aaronstevenwhite May 29, 2026
b52cdfb
Rewrites interop docstrings as plain documentation
aaronstevenwhite May 29, 2026
8f7da28
Validates layers mappings against vendored lexicons (ATProto)
aaronstevenwhite May 29, 2026
9d93672
Bumps version to 0.6.0 and updates the changelog
aaronstevenwhite May 29, 2026
6425481
Updates uv.lock for 0.6.0 version bump
aaronstevenwhite May 29, 2026
39fefae
Fixes CI: hypothesis in dev extra, ruff formatting
aaronstevenwhite May 29, 2026
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -180,6 +180,8 @@ jobs:
     needs: [ts-build] # Python tests need TypeScript compiled
     steps:
       - uses: actions/checkout@v4
+        with:
+          submodules: true # vendor/layers lexicons for interop validation

       - name: Install pnpm
         uses: pnpm/action-setup@v4
4 changes: 3 additions & 1 deletion .gitignore
@@ -34,4 +34,6 @@ tests/fixtures/cli_work/
 /exports/
 /trial_config_*.json
 /*.jzip
-.claude/
+.claude/
+# Hypothesis example database
+.hypothesis/
5 changes: 5 additions & 0 deletions .gitmodules
@@ -0,0 +1,5 @@
[submodule "vendor/layers"]
	path = vendor/layers
	url = https://github.com/layers-pub/layers.git
	branch = main
	shallow = true
94 changes: 93 additions & 1 deletion CHANGELOG.md
@@ -5,6 +5,94 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.6.0] - 2026-05-29

### Added

#### `bead.corpus` — streaming corpus ingestion and structural sampling

- New subpackage `bead.corpus` for turning raw text corpora into experimental
`Item`s. `CorpusRecord` carries text plus flat provenance; `CorpusSource` is
a streaming-source protocol.
- Sources: `JsonlCorpusSource` (JSON Lines, transparently decompressing
Zstandard `.zst` files), `CsvCorpusSource` (CSV/TSV), and
`CompletionCorpusSource` (a language model as a corpus source, via the new
`TextGenerator` protocol on the OpenAI and Anthropic adapters).
- Lazy pipeline: `parse_records`, `filter_by_structure`, `sample_corpus`, and
`record_to_item` stream records through a dependency parser and keep only
those whose parse satisfies a structural DSL constraint, producing `Item`s
with standoff parse annotations and source provenance. The pipeline never
loads the full corpus into memory.
- New `corpus` optional-dependency extra (`zstandard`).
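
The lazy pipeline described above can be illustrated with plain Python generators. This is a minimal sketch of the streaming idea only, not the `bead.corpus` implementation: `read_records`, `parse_all`, `keep_matching`, and `ends_in_even_number` are hypothetical stand-ins, and whitespace splitting stands in for a real dependency parser.

```python
from itertools import islice

consumed = []  # records actually pulled from the source, for illustration


def read_records():
    """Hypothetical streaming source: yields one record at a time."""
    for i in range(1_000_000):
        consumed.append(i)
        yield f"record {i}"


def parse_all(records):
    """Stand-in for parsing: whitespace tokenization instead of a parser."""
    for text in records:
        yield text, text.split()


def keep_matching(parsed, predicate):
    """Yield only the texts whose 'parse' satisfies the predicate."""
    for text, tokens in parsed:
        if predicate(tokens):
            yield text


def ends_in_even_number(tokens):
    """Toy structural constraint over the token list."""
    return int(tokens[-1]) % 2 == 0


# Ask for three matches; the source is only advanced as far as needed.
pipeline = keep_matching(parse_all(read_records()), ends_in_even_number)
sample = list(islice(pipeline, 3))
```

Because every stage is a generator, requesting three matches pulls only five records from the million-record source; nothing is buffered beyond the record currently in flight.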

#### Dependency parsing in `bead.tokenization`

- New `bead.tokenization.parsers`: `SpacyParser`, `StanzaParser`, and
`create_parser` produce a per-sentence `ParsedSentence` of `ParsedToken`
records (token, lemma, upos, xpos, head, deprel, morphology, offsets).
- `parse_to_spans` projects a dependency parse onto the standoff `Span` +
`SpanRelation` models: one single-token span per token (with its governor as
`head_index` and its features in `span_metadata`) and one directed
head-to-dependent relation per syntactic arc.
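
As a rough illustration of the projection described above, the sketch below turns a hand-written parse into one span per token and one head-to-dependent relation per arc. `Tok`, `SketchSpan`, `SketchRelation`, and `project` are hypothetical stand-ins, not the `ParsedToken`, `Span`, `SpanRelation`, or `parse_to_spans` definitions; the 1-based heads with 0 for the root, and the choice to emit no relation for the root, are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Tok:
    """Hypothetical parsed token; head is 1-based and 0 marks the root."""

    index: int
    form: str
    upos: str
    head: int
    deprel: str
    start: int  # character offset where the token begins
    end: int  # character offset just past the token


@dataclass
class SketchSpan:
    span_id: str
    start: int
    end: int
    head_index: int  # governor of this token (0 for the root)
    metadata: dict = field(default_factory=dict)


@dataclass
class SketchRelation:
    source: str  # span id of the head
    target: str  # span id of the dependent
    label: str


def project(tokens):
    """Build one single-token span per token and one relation per arc."""
    spans = [
        SketchSpan(
            span_id=f"t{t.index}",
            start=t.start,
            end=t.end,
            head_index=t.head,
            metadata={"upos": t.upos, "deprel": t.deprel},
        )
        for t in tokens
    ]
    relations = [
        SketchRelation(source=f"t{t.head}", target=f"t{t.index}", label=t.deprel)
        for t in tokens
        if t.head != 0  # assumption: the root has no incoming arc
    ]
    return spans, relations


# "Dogs chase cats" with character offsets into the raw string.
tokens = [
    Tok(1, "Dogs", "NOUN", 2, "nsubj", 0, 4),
    Tok(2, "chase", "VERB", 0, "root", 5, 10),
    Tok(3, "cats", "NOUN", 2, "obj", 11, 15),
]
spans, relations = project(tokens)
```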

#### Structural-query builtins in the constraint DSL

- New `bead.dsl` standard-library functions query a dependency parse stored on
an `Item`: `upos`, `xpos`, `lemma_of`, `form_of`, `deprel`, `morph`, `head`,
`dependents`, `has_relation`, `root`, `subtree`, `path_to_root`,
`tokens_with_upos`, `tokens_with_deprel`, `any_deprel`, and `filter_upos`.
Constraints can now match syntactic structure, e.g.
`upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0`.
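
To make the example constraint concrete, the sketch below evaluates the same condition in plain Python over a hand-written parse. The `Token` dataclass and the `root`, `upos`, and `dependents` helpers are hypothetical stand-ins that share only their names with the DSL builtins; they are not the `bead.dsl` implementations, and the 1-based heads with 0 for the root are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int  # 1-based position in the sentence
    form: str
    upos: str
    head: int  # index of the governor; 0 marks the root
    deprel: str


def root(tokens):
    """Return the index of the token whose head is 0."""
    return next(t.index for t in tokens if t.head == 0)


def upos(tokens, index):
    """Return the universal POS tag of the token at *index*."""
    return next(t.upos for t in tokens if t.index == index)


def dependents(tokens, index, deprel=None):
    """Return indices of tokens governed by *index*, optionally by relation."""
    return [
        t.index
        for t in tokens
        if t.head == index and (deprel is None or t.deprel == deprel)
    ]


def has_verbal_root_with_object(tokens):
    """Plain-Python analogue of the example constraint above."""
    r = root(tokens)
    return upos(tokens, r) == "VERB" and len(dependents(tokens, r, "obj")) > 0


transitive = [
    Token(1, "Dogs", "NOUN", 2, "nsubj"),
    Token(2, "chase", "VERB", 0, "root"),
    Token(3, "cats", "NOUN", 2, "obj"),
]
intransitive = [
    Token(1, "Dogs", "NOUN", 2, "nsubj"),
    Token(2, "sleep", "VERB", 0, "root"),
]
```

Under these assumptions the constraint accepts the transitive sentence and rejects the intransitive one, which is the kind of filtering the structural sampler performs on each parsed record.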

#### Text transforms for corpus cleanup

- New transforms in `bead.transforms.text`: `MarkdownStripTransform`,
`RedditCleanupTransform`, and the `split_sentences` helper (parser-backed or
regex fallback). The first two are registered in the default transform
registry.

#### `bead.corpus` buffering graph tier

- New `bead.corpus.graph`: `CorpusGraph`, a typed directed multidigraph of
`CorpusNode`s and `CorpusEdge`s (parallel typed edges allowed; trees are a
special case), with traversal helpers (`children`, `parents`, `roots`,
`out_edges`, `in_edges`, `subtree`, `node_by_id`).
- New `bead.corpus.assemble`: `assemble_graph` buffers a record stream into a
`CorpusGraph`, building edges from declarative `EdgeSpec`s or a runtime edge
function. Reconstructs thread structure such as Reddit reply trees from
`parent_id`/`link_id`. This tier is opt-in and layered on top of the
streaming pipeline, which is untouched.

#### `bead.interop.layers` — lossless layers interop

- New subpackage mapping bead data to and from the
[layers](https://github.com/layers-pub/layers) linguistic-annotation schema
as law-verified didactic lenses (`dx.Iso` for bijections, `dx.Lens` with a
complement for projections), so every round-trip is exact and verified.
- Faithful mirror models for the layers shared defs and record types, each with
a generic lossless `MirrorIso` to and from layers-shaped JSON (snake/camel
case, feature maps, slug+uri enums, integer confidence, `$type` unions).
- Bridge lenses map bead-native models onto layers constructs: `CorpusRecord`
to an `expression`, `CorpusGraph` to a property graph (`expression`s,
`graphNode`s, and a `graphEdgeSet`), and a dependency-parsed `ParsedSentence`
to a `tokenization` plus part-of-speech and dependency `annotationLayer`s. The
lens complement holds the bead-only remainder (framework identity and fields
layers has no slot for). Resource-overlap lenses map lexical items, lexicons,
and templates to the layers resource constructs.
- Mappings are validated against the layers lexicons, vendored as the
`vendor/layers` git submodule, using the ATProto lexicon validator
(`@atproto/lexicon`), proving every mapping produces schema-valid layers.

### Changed

- Minimum `didactic` raised to `>=0.7.2` and `panproto` to `>=0.51.0`.
- Streaming corpus ingestion is now lossless by default: `JsonlCorpusSource`
and `CsvCorpusSource` retain every field (not just a configured subset), and
non-scalar values round-trip through JSON rather than being stringified, so
no source information is dropped at ingestion.
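
The difference between stringifying a non-scalar value and round-tripping it through JSON can be seen with the standard library alone. This sketch only illustrates the general point; how the sources encode values internally is not shown in this diff, and the `value` dictionary is made up.

```python
import json

# A made-up non-scalar field value, as might appear in a JSONL record.
value = {"flair": ["a", "b"], "score": 3}

# Stringification produces a Python repr, which is not valid JSON.
stringified = str(value)
try:
    json.loads(stringified)
    repr_is_recoverable = True
except json.JSONDecodeError:
    repr_is_recoverable = False

# JSON encoding can be decoded back to an equal value.
encoded = json.dumps(value, sort_keys=True)
restored = json.loads(encoded)
```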

## [0.5.0] - 2026-05-12

### Added
@@ -440,6 +528,10 @@ guards as type-checkers.
 - CI/CD: GitHub Actions for testing, docs, PyPI publishing
 - Read the Docs integration

-[Unreleased]: https://github.com/FACTSlab/bead/compare/v0.2.0...HEAD
+[Unreleased]: https://github.com/FACTSlab/bead/compare/v0.6.0...HEAD
+[0.6.0]: https://github.com/FACTSlab/bead/compare/v0.5.0...v0.6.0
+[0.5.0]: https://github.com/FACTSlab/bead/compare/v0.4.0...v0.5.0
+[0.4.0]: https://github.com/FACTSlab/bead/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/FACTSlab/bead/compare/v0.2.0...v0.3.0
 [0.2.0]: https://github.com/FACTSlab/bead/compare/v0.1.0...v0.2.0
 [0.1.0]: https://github.com/FACTSlab/bead/releases/tag/v0.1.0
8 changes: 7 additions & 1 deletion README.md
@@ -30,12 +30,18 @@ uv pip install bead[training] # PyTorch Lightning, TensorBoard
 ### Development

 ```bash
-git clone https://github.com/FACTSlab/bead.git
+git clone --recurse-submodules https://github.com/FACTSlab/bead.git
 cd bead
 uv sync --all-extras
 uv run pytest tests/
 ```

+The `vendor/layers` submodule holds the layers lexicons that the interop tests
+validate against. If you cloned without `--recurse-submodules`, fetch them with
+`git submodule update --init vendor/layers`, and refresh to the latest published
+lexicons with `git submodule update --remote vendor/layers`. The lexicon
+validation tests skip automatically when the submodule is absent.
+
 Always use `uv run` to execute commands.

 ## Quick Start
2 changes: 1 addition & 1 deletion bead/__init__.py
@@ -6,6 +6,6 @@

 from __future__ import annotations

-__version__ = "0.5.0"
+__version__ = "0.6.0"
 __author__ = "Aaron Steven White"
 __email__ = "aaron.white@rochester.edu"
43 changes: 43 additions & 0 deletions bead/corpus/__init__.py
@@ -0,0 +1,43 @@
"""Streaming corpus ingestion and structural sampling.

Turns raw external text (JSONL, optionally Zstandard-compressed; CSV/TSV) into
structurally filtered experimental ``Item``s: stream ``CorpusRecord``s from a
``CorpusSource``, dependency-parse them, and keep only those whose parse
satisfies a structural DSL constraint.
"""

from __future__ import annotations

from bead.corpus.assemble import EdgeSpec, assemble_graph
from bead.corpus.base import CorpusSource
from bead.corpus.graph import CorpusEdge, CorpusGraph, CorpusNode
from bead.corpus.pipeline import (
    filter_by_structure,
    parse_records,
    record_to_item,
    sample_corpus,
)
from bead.corpus.records import CorpusRecord, ProvenanceValue
from bead.corpus.sources import (
    CompletionCorpusSource,
    CsvCorpusSource,
    JsonlCorpusSource,
)

__all__ = [
    "CompletionCorpusSource",
    "CorpusEdge",
    "CorpusGraph",
    "CorpusNode",
    "CorpusRecord",
    "CorpusSource",
    "CsvCorpusSource",
    "EdgeSpec",
    "JsonlCorpusSource",
    "ProvenanceValue",
    "assemble_graph",
    "filter_by_structure",
    "parse_records",
    "record_to_item",
    "sample_corpus",
]
118 changes: 118 additions & 0 deletions bead/corpus/assemble.py
@@ -0,0 +1,118 @@
"""Buffer a record stream into a typed multidigraph.

``assemble_graph`` is the opt-in buffering tier that sits on top of the lazy
streaming sources: it consumes ``CorpusRecord``s and reconstructs the structure
between them (e.g. a Reddit reply tree from ``parent_id``, or an arbitrary typed
graph) as a :class:`~bead.corpus.graph.CorpusGraph`. It holds the records in
memory, so it is a deliberate, explicit step distinct from streaming.
"""

from __future__ import annotations

from collections.abc import Callable, Iterable, Sequence

import didactic.api as dx

from bead.corpus.graph import CorpusEdge, CorpusGraph, CorpusNode
from bead.corpus.records import CorpusRecord
from bead.data.base import BeadBaseModel


class EdgeSpec(BeadBaseModel):
    """Declarative rule for deriving one typed edge per record from a field.

    For each record, if ``target_field`` is present in the record's provenance,
    an edge ``record_node -> target`` is created with type ``edge_type``. The
    target id is the field value with any matching ``strip_prefixes`` removed
    (e.g. Reddit's ``t1_``/``t3_`` fullname prefixes).

    Attributes
    ----------
    target_field : str
        Provenance field naming the other endpoint (e.g. ``"parent_id"``).
    edge_type : str
        Edge type slug for the created edge (e.g. ``"reply-to"``).
    edge_type_uri : str | None
        Optional canonical edge-type URI.
    strip_prefixes : tuple[str, ...]
        Prefixes to strip from the field value to recover the bare node id.
    directed : bool
        Whether the created edge is directed.
    """

    target_field: str
    edge_type: str
    edge_type_uri: str | None = None
    strip_prefixes: tuple[str, ...] = ()
    directed: bool = True

    @dx.validates("target_field", "edge_type")
    def _check_non_empty(self, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("must be non-empty")
        return value.strip()


def _strip_prefix(value: str, prefixes: tuple[str, ...]) -> str:
    """Strip the first matching prefix from *value*."""
    for prefix in prefixes:
        if prefix and value.startswith(prefix):
            return value[len(prefix) :]
    return value


def assemble_graph(
    records: Iterable[CorpusRecord],
    *,
    node_id_field: str,
    edge_specs: Sequence[EdgeSpec] = (),
    edge_fn: Callable[[CorpusRecord, str], Iterable[CorpusEdge]] | None = None,
) -> CorpusGraph:
    """Buffer a record stream into a typed multidigraph.

    Each record with a ``node_id_field`` value becomes one expression node.
    Edges are derived from the declarative ``edge_specs`` and/or a runtime
    ``edge_fn`` (given the record and its node id) for arbitrary extraction.

    Parameters
    ----------
    records : Iterable[CorpusRecord]
        The records to buffer (typically a streaming source).
    node_id_field : str
        Provenance field holding each record's stable node id.
    edge_specs : Sequence[EdgeSpec]
        Declarative field-to-edge rules (the common case).
    edge_fn : Callable[[CorpusRecord, str], Iterable[CorpusEdge]] | None
        Optional function yielding extra edges for arbitrary structure.

    Returns
    -------
    CorpusGraph
        The assembled graph. Edges may reference target ids that have no node
        (dangling references are preserved, not dropped).
    """
    nodes: list[CorpusNode] = []
    edges: list[CorpusEdge] = []
    for record in records:
        node_id_raw = record.provenance.get(node_id_field)
        if node_id_raw is None:
            continue
        node_id = str(node_id_raw)
        nodes.append(CorpusNode(node_id=node_id, record=record))
        for spec in edge_specs:
            target_raw = record.provenance.get(spec.target_field)
            if target_raw is None:
                continue
            target_id = _strip_prefix(str(target_raw), spec.strip_prefixes)
            edges.append(
                CorpusEdge(
                    source_id=node_id,
                    target_id=target_id,
                    edge_type=spec.edge_type,
                    edge_type_uri=spec.edge_type_uri,
                    directed=spec.directed,
                )
            )
        if edge_fn is not None:
            edges.extend(edge_fn(record, node_id))
    return CorpusGraph(nodes=tuple(nodes), edges=tuple(edges))
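
The edge-derivation rule in `assemble_graph` can be tried out without bead installed. The following is a standard-library sketch that mirrors the prefix stripping and the child-to-parent edge direction from the code above; `Edge`, `strip_prefix`, and `reply_edges` are hypothetical stand-ins operating on plain dictionaries rather than `CorpusRecord`s, and the comment ids are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a typed edge."""

    source_id: str
    target_id: str
    edge_type: str


def strip_prefix(value, prefixes):
    """Strip the first matching prefix, as in the diff's helper."""
    for prefix in prefixes:
        if prefix and value.startswith(prefix):
            return value[len(prefix):]
    return value


def reply_edges(rows, prefixes=("t1_", "t3_")):
    """Derive one reply-to edge per row that has both an id and a parent_id."""
    edges = []
    for row in rows:
        node_id = row.get("id")
        parent = row.get("parent_id")
        if node_id is None or parent is None:
            continue
        target = strip_prefix(str(parent), prefixes)
        edges.append(Edge(str(node_id), target, "reply-to"))
    return edges


comments = [
    {"id": "c1", "parent_id": "t3_post9"},  # top-level reply to a post
    {"id": "c2", "parent_id": "t1_c1"},
    {"id": "c3", "parent_id": "t1_c1"},
    {"id": "c4"},  # no parent field, so no edge
]
edges = reply_edges(comments)
children_of_c1 = sorted(e.source_id for e in edges if e.target_id == "c1")
```

Note that the edge from `c1` points at `post9`, which has no node in the input; this corresponds to the dangling references that the docstring says are preserved rather than dropped.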
30 changes: 30 additions & 0 deletions bead/corpus/base.py
@@ -0,0 +1,30 @@
"""Corpus source protocol.

A ``CorpusSource`` is anything that streams ``CorpusRecord``s. It is modeled as
a runtime-checkable ``Protocol`` (behavior, not data) rather than a didactic
model, mirroring the transform protocols elsewhere in bead.
"""

from __future__ import annotations

from collections.abc import Iterator
from typing import Protocol, runtime_checkable

from bead.corpus.records import CorpusRecord


@runtime_checkable
class CorpusSource(Protocol):
    """A streaming source of corpus records.

    Attributes
    ----------
    source_name : str
        Identifier stamped onto every record's ``source_name``.
    """

    source_name: str

    def __iter__(self) -> Iterator[CorpusRecord]:
        """Iterate the records of the source."""
        ...
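
Because the protocol is runtime-checkable, any object with a `source_name` attribute and an `__iter__` method passes an `isinstance` check without inheriting from it. The sketch below shows this with standard-library stand-ins: `Record`, `RecordSource`, and `ListSource` are hypothetical names, not bead classes, and the real `CorpusRecord` fields may differ.

```python
from collections.abc import Iterator
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a corpus record."""

    text: str
    source_name: str


@runtime_checkable
class RecordSource(Protocol):
    """Same shape as the protocol in the diff, over the stand-in record."""

    source_name: str

    def __iter__(self) -> Iterator[Record]: ...


class ListSource:
    """An in-memory source; it does not subclass RecordSource."""

    def __init__(self, source_name, texts):
        self.source_name = source_name
        self._texts = list(texts)

    def __iter__(self):
        for text in self._texts:
            yield Record(text=text, source_name=self.source_name)


source = ListSource("demo", ["one", "two"])
records = list(source)
```

The runtime check only confirms that the members exist, not that their types match, so a static type checker is still needed to verify that `__iter__` actually yields records.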