openspec: text-extraction-office-completeness scaffolding #1590
Open
rjzondervan wants to merge 1 commit into
Scaffold the openspec change for a deeper walker over .docx and .odt documents, covering tables (recursive), lists, headers, footers, footnotes, endnotes, and text frames on both extraction and anonymisation paths. Adds ODT as a first-class supported format alongside DOCX. Pairs with office-document-sanitization (sister change): sanitiser strips identity-bearing wrappers; walker traverses surviving content. Walker is coverage-expansion; sanitiser is correctness-fix. Pure openspec — proposal, design, capability spec (9 ADDED Requirements), tasks (10 sections). No implementation in this PR; awaiting team review before /opsx:apply.
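The idea of one walker serving both the extraction and the anonymisation path can be sketched in a few lines. The following is an illustrative Python model only, not the project's PHP code: `Text`, `Container`, `DocumentWalker`, and `translate_longest_first` are hypothetical names, and the tree stands in for whatever element model the real walker traverses. The replacement helper tries longer keys first in a single pass, which is one plausible reading of the longest-match-first behaviour the design refers to.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, Optional, Union


@dataclass
class Text:
    """Leaf node holding a run of text (hypothetical model)."""
    text: str


@dataclass
class Container:
    """Any structure that can hold content: table, cell, header, footnote, ..."""
    kind: str
    children: list["Node"] = field(default_factory=list)


Node = Union[Text, Container]


def translate_longest_first(text: str, mapping: dict[str, str]) -> str:
    """Replace keys in one pass, preferring the longest key at each position."""
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    if not keys:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


class DocumentWalker:
    """One traversal shared by the read path and the mutate path."""

    def __init__(self, root: Container) -> None:
        self.root = root

    def walk(self, node: Optional[Node] = None) -> Iterator[Text]:
        node = self.root if node is None else node
        if isinstance(node, Text):
            yield node
            return
        for child in node.children:
            # Recursion is what reaches tables nested inside table cells.
            yield from self.walk(child)

    def extract_text(self) -> str:
        return "\n".join(t.text for t in self.walk())

    def replace(self, mapping: dict[str, str]) -> int:
        """Mutate every text leaf in place; return how many leaves changed."""
        changed = 0
        for leaf in self.walk():
            new_text = translate_longest_first(leaf.text, mapping)
            if new_text != leaf.text:
                leaf.text = new_text
                changed += 1
        return changed


doc = Container("document", [
    Container("header", [Text("Dossier for Jan de Vries")]),
    Container("table", [
        Container("cell", [
            Container("table", [
                Container("cell", [Text("Jan de Vries, Utrecht")]),
            ]),
        ]),
    ]),
    Container("footnote", [Text("Contact: Jan")]),
])

walker = DocumentWalker(doc)
changed = walker.replace({"Jan": "[FIRSTNAME]", "Jan de Vries": "[PERSON]"})
print(changed)
print(walker.extract_text())
```

Because `extract_text` and `replace` share `walk`, any structure the walker learns to visit is covered on both paths at once, so extraction and anonymisation cannot drift apart in coverage.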
7 tasks
Quality Report — ConductionNL/openregister @
| Check | PHP | Vue | Security | License | Tests |
|---|---|---|---|---|---|
| lint | ✅ | | | | |
| phpcs | ✅ | | | | |
| phpmd | ✅ | | | | |
| psalm | ✅ | | | | |
| phpstan | ✅ | | | | |
| phpmetrics | ✅ | | | | |
| eslint | | ✅ | | | |
| stylelint | | ✅ | | | |
| composer | | | ✅ | ✅ 162/162 | |
| npm | | | ✅ | ✅ 602/602 | |
| PHPUnit | | | | | ✅ |
| Newman | | | | | ⏭️ |
| Playwright | | | | | ⏭️ |
Quality workflow — 2026-05-19 09:29 UTC
Download the full PDF report from the workflow artifacts.
Summary
- `text-extraction-office-completeness`: a single `OfficeDocumentWalker` driving both extraction and anonymisation over the same content surface.
- … `ListItemRun`), headers (per section, per variant), footers (per section, per variant), footnotes, endnotes, and text frames.

What's in this PR
Pure openspec scaffolding. Four artifacts:
- `proposal.md`: gap analysis (the two-level depth, missing structures, ODT-unsupported reality), pipeline shape, capability scope
- `design.md`: 10 decisions: single walker class for both read+mutate (D1), element visitation rules (D2), output ordering with section markers (D3), ODT integration via PhpWord's ODText reader/writer (D4), `strtr` longest-match-first semantics (D5), ADR-005 logging (D6), backwards-compat-for-DOCX-consumers (D7), `forceReExtract` opt-in for stale records (D8), `extractWord` → `extractOfficeDocument` rename with one-cycle alias (D9), fixture strategy (D10)
- `specs/text-extraction-office-completeness/spec.md`: 9 ADDED Requirements with scenarios: walker coverage, `extractText` shape, `replace` mutation, ODT extraction, ODT writer-back, DOCX entity substitution in all walker-covered structures, ADR-005 logging, pre-change DOCX superset guarantee, BLOCKING reopen-clean validation gate
- `tasks.md`: 10 sections from the walker class through `TextExtractionService` + `DocumentProcessingHandler` refactors, ODT-specific writer-back path, fixtures, unit + integration tests, the manual Word/LibreOffice validation gate (BLOCKING), docs, and quality gates

`openspec validate text-extraction-office-completeness`: clean.

Why now
Two specific bugs surfaced in recent operator testing on Woo dossiers:
- … `replaceWordsInTextDocument` path that does `str_ireplace` on the binary ZIP container, which corrupts the file. Operators see "Anonymisation succeeded" then can't open the result.

This change closes both. Per design D7, DOCX extraction is strictly additive: no regression for existing consumers; ODT goes from broken to working.
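To make the ODT failure mode concrete, here is a small Python sketch (not the project's PHP code) contrasting a byte-level replacement on the ZIP container with a replacement inside the XML entry followed by a rebuild of the archive. The helper names are hypothetical. The entry is written uncompressed and the replacement string has the same byte length as the original, purely so the effect is reproducible: that is the mildest case, where only the stored CRC-32 stops matching. A replacement of a different length would also invalidate the recorded sizes and offsets. The sketch is case-sensitive, unlike `str_ireplace`.

```python
import io
import zipfile


def build_container(xml: str) -> bytes:
    """Build a minimal one-entry ZIP, stored uncompressed for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("content.xml", xml)
    return buf.getvalue()


def replace_in_entry(container: bytes, entry: str, old: str, new: str) -> bytes:
    """Rewrite one entry's text and emit a fresh archive with valid metadata."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(container)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == entry:
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Writing by name lets zipfile recompute sizes and the CRC-32.
            dst.writestr(info.filename, data)
    return out.getvalue()


def is_readable(container: bytes) -> bool:
    """True if every entry can be read back and passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(container)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


original = build_container("<text:p>Jan de Vries</text:p>")

# Container-level replacement: the payload changes, the stored CRC-32 does not.
broken = original.replace(b"Jan de Vries", b"[REDACTED-1]")

# Entry-level replacement: edit the XML, then rebuild the archive.
fixed = replace_in_entry(original, "content.xml", "Jan de Vries", "[PERSON]")

print(is_readable(original), is_readable(broken), is_readable(fixed))
```

A real ODT writer-back has further constraints this sketch ignores (for example the `mimetype` entry and XML-aware substitution), which is why the design routes ODT through a proper reader/writer rather than string operations on the file.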
Composition
Pairs with sister `office-document-sanitization` (separate PR #1589). Sanitiser strips wrappers; walker traverses surviving content. Both target DOCX + ODT. Can land independently or together.

Status
Ready for team review. No implementation has started: `/opsx:apply` runs only after this PR merges.

Test plan
- `extractWord` deprecation window (D9): one cycle, or longer?
- … `text-extraction-office-completeness` is acceptable
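For reviewers weighing the D9 question, a one-cycle alias typically amounts to a thin forwarding method that warns on use. The sketch below is in Python with hypothetical snake_case method names and a placeholder return value; the actual PHP service and its deprecation mechanism may look different.

```python
import warnings


class TextExtractionService:
    """Hypothetical stand-in for the service that gets the renamed method."""

    def extract_office_document(self, path: str) -> str:
        # Placeholder body: the real method would run the walker over the file.
        return f"text of {path}"

    def extract_word(self, path: str) -> str:
        """Deprecated alias kept for one release cycle."""
        warnings.warn(
            "extract_word is deprecated; use extract_office_document",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not the alias
        )
        return self.extract_office_document(path)


service = TextExtractionService()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_result = service.extract_word("dossier.odt")

print(legacy_result == service.extract_office_document("dossier.odt"))
print([str(w.message) for w in caught])
```

The length of the window then only decides when the alias method is deleted; callers see identical results through either name in the meantime.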