[content-hash 1/5] refactor: record passage_id_scheme in meta.json #330

Open
raoabinav wants to merge 1 commit into StarTrail-org:main from raoabinav:refactor/passage-id-scheme-field

raoabinav (Contributor) commented May 20, 2026

Sub-PR 1 of 5 from #329.

Purely additive. Writes a new passage_id_scheme: "sequential" field into the .meta.json produced by both build_index and build_index_from_arrays. Existing index loaders ignore the field, so this changes nothing for any caller.

Also bumps meta_data["version"] from "1.0" to "1.1". No code currently reads version, so the bump is safe; it's documentation of the schema evolution for future migration logic.

Two module-level constants (PASSAGE_ID_SCHEME_SEQUENTIAL, PASSAGE_ID_SCHEME_CONTENT_HASH) document the value space. The content-hash scheme itself lands in sub-PR 2.
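For reviewers, a minimal sketch of the shape of this change, not the actual diff. The constant names come from the description above; the value `"content-hash"` is assumed from the follow-up sub-PR, and the helpers `stamp_meta` and `read_passage_id_scheme` as well as the `example_key` entry are hypothetical names used only for illustration.

```python
import json

# Constant names are from the PR description. The "content-hash" value is an
# assumption based on the scheme name used in sub-PR 2.
PASSAGE_ID_SCHEME_SEQUENTIAL = "sequential"
PASSAGE_ID_SCHEME_CONTENT_HASH = "content-hash"


def stamp_meta(meta_data: dict) -> dict:
    """Hypothetical helper: return a copy of the metadata with the new fields set."""
    stamped = dict(meta_data)  # copy so the caller's dict is not mutated
    stamped["version"] = "1.1"
    stamped["passage_id_scheme"] = PASSAGE_ID_SCHEME_SEQUENTIAL
    return stamped


def read_passage_id_scheme(meta_json: str) -> str:
    """Hypothetical reader: older meta.json files have no field, so default to sequential."""
    meta = json.loads(meta_json)
    return meta.get("passage_id_scheme", PASSAGE_ID_SCHEME_SEQUENTIAL)


if __name__ == "__main__":
    old_meta = {"version": "1.0", "example_key": "unchanged"}
    print(json.dumps(stamp_meta(old_meta), indent=2))
```

A reader that falls back to `"sequential"` when the field is absent would treat indexes built before this change the same way as indexes built after it, which is the sense in which the field is purely additive.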

Content-hash passage IDs train (#329)

ASuresh0524 (Collaborator) commented

@raoabinav thanks for the PR! Please fix the CI error on all of the content-hash PRs so that I can merge.

raoabinav force-pushed the refactor/passage-id-scheme-field branch from a922ff4 to 4b0b883 on May 25, 2026 18:29
Sub-PR 1 of 5 from the plan in StarTrail-org#329. Purely additive — no behavior change
for any caller, existing index loaders ignore the field.

Writes a new `passage_id_scheme: "sequential"` field into the .meta.json
produced by both build_index and build_index_from_arrays. Bumps version
to "1.1" for human-inspectable schema tracking (no code reads version today,
so the bump is safe).

Module-level constants PASSAGE_ID_SCHEME_SEQUENTIAL / _CONTENT_HASH document
the value space; the content-hash scheme itself ships in sub-PR 2.
raoabinav added a commit to raoabinav/LEANN that referenced this pull request Jun 1, 2026
Sub-PR 2 of 5 from StarTrail-org#329. Builds on StarTrail-org#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key
  passages by sha256(text)[:16] instead of insertion index. Stable across file
  moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work
  identically; no migration triggered.
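A rough sketch of the ID derivation described in the bullets above, under two assumptions the commit message does not state: that the text is UTF-8 encoded before hashing, and that `[:16]` means the first 16 characters of the hex digest. The function name `content_hash_id` is hypothetical.

```python
import hashlib


def content_hash_id(text: str) -> str:
    """Hypothetical helper: first 16 hex characters of sha256 over the UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    corpus = ["first passage", "second passage"]
    # The ID depends only on the text, so reordering the corpus
    # does not change which ID each passage receives.
    forward = {text: content_hash_id(text) for text in corpus}
    backward = {text: content_hash_id(text) for text in reversed(corpus)}
    print(forward == backward)
```

Because the ID is a function of the passage text alone, it stays the same when files move or the corpus is re-ingested in a different order, whereas an insertion index would shift.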

Identical-text chunks collide (same hash). For this sub-PR the second
occurrence overwrites the first in the offset map — that's the dedup
behavior I'd want by default. A `--preserve-duplicates` escape hatch can
land later if needed (see the open question in StarTrail-org#329).
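To make the overwrite behavior concrete, here is a small sketch under stated assumptions: the offset map is modeled as a plain dict from passage ID to byte offset, and `build_offset_map` is a hypothetical stand-in rather than the builder's real code.

```python
import hashlib


def build_offset_map(chunks: list[str]) -> dict[str, int]:
    """Hypothetical model of the offset map: passage ID -> byte offset of the chunk."""
    offsets: dict[str, int] = {}
    position = 0
    for text in chunks:
        encoded = text.encode("utf-8")
        passage_id = hashlib.sha256(encoded).hexdigest()[:16]
        # Plain dict assignment: a later chunk with identical text replaces
        # the earlier entry, which gives the dedup-by-default behavior.
        offsets[passage_id] = position
        position += len(encoded)
    return offsets


if __name__ == "__main__":
    offset_map = build_offset_map(["alpha", "beta", "alpha"])
    print(len(offset_map))  # two distinct texts, so two entries
```

In this model the duplicate `"alpha"` leaves a single entry that points at its second occurrence (byte offset 9), so lookups by ID resolve to the most recently added copy.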
raoabinav force-pushed the refactor/passage-id-scheme-field branch from 4b0b883 to 2a308f2 on June 1, 2026 18:18

2 participants