
WIP: 1.0.0 — major release (roadmap)#693

Draft
bosd wants to merge 78 commits into master from next

Conversation

bosd (Collaborator) commented May 23, 2026

Long-lived integration branch for the 1.0.0 major release. Draft until ready; do not squash-merge to master until cutting 1.0.0 (that triggers the PyPI release).

Roadmap & rationale: see the approved plan. Phased: A foundation/breaking → B parsers/perf → C build/features.

Landed

  • A2 — faster regex: compile-once cache + pluggable engine (extract/_regex.py); stdlib re default, regex opt-in via INVOICE2DATA_REGEX_ENGINE=regex. Golden suite passes under both engines.

Next (Phase A)

  • A1 breaking-change audit + deprecation policy
  • A3 field schema + validation (Odoo-aligned)
  • A4 tax_lines standardization + CSV flatten
  • A6 pluggable input-backend interface
  • A5 uv-forge (copier) migration

Then

  • Phase B: pypdfium2/pdfsink-rs/hotpdf backends + benchmark harness + restore gvision
  • Phase C: Camelot + Excalibur, mypyc + cibuildwheel, cut 1.0.0

(Python 3.10→3.11 floor = separate later minor. AI features = next session, pre-1.0.)

bosd added 30 commits May 23, 2026 02:57
All extract-layer regex calls now go through extract/_regex.py, which
caches compiled patterns (lru_cache) instead of recompiling on every
call across parsers/regex.py, parsers/lines.py, plugins/tables.py and
invoice_template.py.

The engine is selected once at import: stdlib re by default (behaviour
unchanged), or the API-compatible regex package when
INVOICE2DATA_REGEX_ENGINE=regex — which also gives the previously-declared
but unused 'regex' dependency a purpose. RE2/pyre2 deliberately avoided
(no lookaround/backrefs that user templates rely on).

Golden suite passes under both engines. First step of the 1.0.0 roadmap
(prerequisite for the parser benchmark harness). Part of #v1 / next branch.
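
The compile-once cache and the import-time engine switch described above can be sketched roughly as follows. This is a minimal illustration, not the contents of extract/_regex.py: the helper names `_select_engine`, `compile_pattern` and `search`, the cache size, and the silent fall-back to stdlib `re` when the `regex` package is missing are all assumptions.

```python
import os
import re
from functools import lru_cache


def _select_engine():
    """Pick the regex engine once, at import time."""
    if os.environ.get("INVOICE2DATA_REGEX_ENGINE") == "regex":
        try:
            import regex  # third-party package, API-compatible with `re`

            return regex
        except ImportError:
            pass  # assumption: fall back quietly to the stdlib engine
    return re


_ENGINE = _select_engine()


@lru_cache(maxsize=512)  # cache size is an arbitrary choice for the sketch
def compile_pattern(pattern, flags=0):
    """Compile a pattern once; repeat calls return the cached object."""
    return _ENGINE.compile(pattern, flags)


def search(pattern, text, flags=0):
    """Search through the cache instead of recompiling on every call."""
    return compile_pattern(pattern, flags).search(text)
```

Because the cache key is the `(pattern, flags)` pair, parsers that reuse the same template pattern across many files share one compiled object.
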
Add input/__interface__.py documenting the backend contract and move the
hand-maintained input_mapping into a single registry in input/__init__.py
(INPUT_MODULES) with helpers supports_area()/is_available()/available_modules().

Backends now declare capabilities/availability instead of being hardcoded:
- SUPPORTS_AREA on pdftotext/tesseract/ocrmypdf replaces the hardcoded
  '(pdftotext, ocrmypdf, tesseract)' tuple in invoice_template._handle_area.
- is_available() added per backend (binary check for pdftotext/tesseract,
  import check for pdfplumber/pdfminer, alias to existing checks for
  ocrmypdf/gvision) so backends self-exclude when deps are missing.

__main__.py now sources input_mapping from the registry. Fully backward
compatible: extract_data() still accepts both module objects and string keys.
Adds a contract test. Gates the Phase B backends (pypdfium2/pdfsink-rs/hotpdf).
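
A rough sketch of how such a registry with capability and availability helpers could look. The backends here are hypothetical stand-ins built with `SimpleNamespace` and fixed return values; the real modules perform binary or import checks, and the defaults used when an attribute is missing are assumptions.

```python
from types import SimpleNamespace


def _fake_backend(supports_area, available):
    """Hypothetical stand-in for a backend module (not a real reader)."""
    return SimpleNamespace(
        SUPPORTS_AREA=supports_area,
        is_available=lambda: available,
    )


# Single registry keyed by the name accepted on the command line.
INPUT_MODULES = {
    "pdftotext": _fake_backend(supports_area=True, available=True),
    "pdfminer": _fake_backend(supports_area=False, available=False),
    "tesseract": _fake_backend(supports_area=True, available=True),
}


def supports_area(name):
    """A backend supports area extraction only if it declares so."""
    return bool(getattr(INPUT_MODULES[name], "SUPPORTS_AREA", False))


def is_available(name):
    """Ask the backend itself whether its dependencies are present."""
    check = getattr(INPUT_MODULES[name], "is_available", None)
    return bool(check()) if callable(check) else True


def available_modules():
    """Names of the backends that did not self-exclude, in registry order."""
    return [name for name in INPUT_MODULES if is_available(name)]
```
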
- Add docs/migration-1.0.md (in the toctree): Python support, the one
  deprecation (the legacy 'lines' plugin / top-level 'lines:' key -> use
  'parser: lines'), the prefix-magic family explicitly RETAINED in 1.0
  (static_ is used by ~half the built-in templates), clarified contracts,
  and forward-looking 1.0 changes (validation, tax_lines/CSV, regex, backends).
- Emit a DeprecationWarning from the 'lines' plugin (it already self-documents
  as superseded by the parser); deduped once per process.
- Fix the extract_data() docstring: it returns {} on failure, not False.
- Add a deprecation-warning test.

Deliberately does NOT deprecate the widely-used prefix magic (static_/sum_/
auto-typing) to avoid ecosystem churn; documented as under-review instead.
- output/to_csv.py: JSON-encode list/dict cells (lines/tax_lines) so the CSV
  is valid and machine-readable instead of a Python repr. New --csv-lines
  option: 'json' (default) or 'explode' (one row per line item, line_<key>
  columns). Dates inside arrays are formatted too (reuses to_json.format_item).
- invoice_template: compute a missing tax_lines line_tax_amount from
  price_subtotal * line_tax_percent/100 (never overwrites; product 'lines'
  left untouched to avoid golden churn), plus an advisory tolerance warning
  when tax_lines don't sum to amount_tax.
- Wire --csv-lines through the CLI.
- Tests for json/explode CSV and tax computation.

Golden suite unchanged (57 passed). Bulk migration of the 7 tax_lines
templates to a single canonical schema is left as a follow-up.
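
The two `--csv-lines` modes can be illustrated with a small sketch. It is not the code in output/to_csv.py: `flatten_rows`, `to_csv_text` and `_encode_cell` are hypothetical names, date formatting is left out, and how the real exporter orders columns is not stated above.

```python
import csv
import io
import json


def _encode_cell(value):
    """JSON-encode nested values so a cell is never a Python repr."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, sort_keys=True)
    return value


def flatten_rows(invoice, csv_lines="json"):
    """Return the CSV rows for one invoice under the chosen mode."""
    lines = invoice.get("lines") or []
    if csv_lines == "explode" and lines:
        # One row per line item; header fields repeat on every row.
        header = {k: _encode_cell(v) for k, v in invoice.items() if k != "lines"}
        return [
            {**header, **{f"line_{key}": val for key, val in line.items()}}
            for line in lines
        ]
    # Default mode: a single row, nested cells stored as JSON text.
    return [{k: _encode_cell(v) for k, v in invoice.items()}]


def to_csv_text(rows):
    """Serialise rows with a header built from the union of their keys."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In `json` mode a consumer can recover the line items with `json.loads` on the `lines` cell; in `explode` mode each item becomes its own row with `line_<key>` columns.
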
Add extract/schema.py: the canonical invoice/line/tax_line field vocabulary
(mirrors docs/recommended-template-fields.md + the OCA Odoo module) and
validate_output(), the single source of truth for field names.

invoice_template now validates output field names after extraction:
- Quiet by default — a field is only warned about when it looks like a TYPO of
  a canonical name (custom fields are legitimate and stay silent). Verified
  zero false positives across all 215 built-in templates.
- Opt-in per template via options.strict_fields: true (raises on any
  unrecognized field) with options.extra_fields: [...] to whitelist customs.

Updated docs/migration-1.0.md (validation/tax/CSV/regex/backends now landed).
Tests for the schema + validation.
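
The quiet-by-default behaviour (warn only on likely typos, raise only in strict mode) can be sketched as below. How extract/schema.py actually detects a typo is not described above; this sketch swaps in `difflib.get_close_matches` with an arbitrary cutoff, and the field set is a small illustrative subset rather than the canonical vocabulary.

```python
import difflib

# Illustrative subset only; the real vocabulary lives in extract/schema.py.
CANONICAL_FIELDS = ["invoice_number", "date", "amount", "amount_tax", "currency", "issuer"]


def validate_output(output, strict_fields=False, extra_fields=()):
    """Return typo warnings; raise on unknown fields when strict_fields is set."""
    allowed = set(CANONICAL_FIELDS) | set(extra_fields)
    warnings = []
    for name in output:
        if name in allowed:
            continue
        if strict_fields:
            raise ValueError(f"unrecognized field: {name!r}")
        # Assumption: a field counts as a typo when it is very close to a
        # canonical name; anything else is treated as a legitimate custom field.
        close = difflib.get_close_matches(name, CANONICAL_FIELDS, n=1, cutoff=0.85)
        if close:
            warnings.append(f"{name!r} looks like a typo of {close[0]!r}")
    return warnings
```
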
… (B1, B2)

B1: add input/pdfium.py (pypdfium2) and input/hotpdf.py behind the A6
interface (is_available via importlib; optional-deps extras 'pdfium'/'hotpdf';
registered in INPUT_MODULES). pdfsink-rs is not on PyPI, so not included.

B2: benchmarks/run.py scores each backend on speed AND accuracy (field-match
vs the golden outputs, via extract_data — i.e. real template compatibility).

Findings (11 compare PDFs): pdftotext 85.9% acc / 19.5 ms; pdfium 20.6% /
4.4 ms; pdfminer 24.7% / 77.5 ms; hotpdf 4.1% / 367.9 ms; pdfplumber backend
errors (0%, separate bug). Conclusion: pdftotext stays the default — the fast
backends lose too much accuracy against pdftotext-tuned templates; pypdfium2
is a good optional fast backend for re-tuned templates / triage.

(incl. a ruff-format wrap fix in test_schema.py)
pdf-oxide (Rust, MIT/Apache, py3.8-3.14) as input/pdfoxide.py behind the A6
interface, registered as --input-reader pdfoxide, optional extra 'pdfoxide'.

Benchmark (11 compare PDFs): pdf-oxide is the best of the fast backends —
~4.5 ms/file (pypdfium2-class speed) at 31.2% accuracy vs pdfium 20.6% /
pdfminer 24.7% / hotpdf 4.1%. Still below pdftotext (85.9%) since templates
are pdftotext-tuned. Uses basic extract_text; its auto/markdown/table modes
are worth exploring for higher accuracy.
Benchmark of pdf-oxide's extraction modes: to_plain_text_all scores 34.7%
vs 31.2% for the basic extract_text (and is layout-aware, the better fit for
invoices). Still below pdftotext (85.9%) — no fast backend approaches it on
the current pdftotext-tuned templates without re-tuning.
`to_text()` accumulated per-page text into `raw_text` but then overwrote
it with `res_to_raw_text([res])`, where `res` only had "all"/"first" keys
and never a "text" key, so the helper always returned "". The pdfplumber
backend was effectively dead: `extract_data` logged "Failed to extract
text" for every file routed through it.

Drop the dead `res` dict, the `res_to_raw_text([res])` overwrite, and the
now-unused `res_to_raw_text` helper; return the text gathered in the page
loop. Guard against `extract_text()` returning None for empty pages with
`or ""`. The layout/tolerance params are kept unchanged: they emulate
`pdftotext -layout`, which the templates rely on.

Add a regression test asserting `pdfplumber.to_text()` returns non-empty
text for a bundled invoice.

(cherry picked from commit c27248a)
A commented-out YAML example in the `tables` plugin contained a line
`#      type: float`, which mypy parses as a PEP 484 type comment. mypy
1.10.1 then reports a spurious "expected an indented block after 'elif'"
syntax error and aborts, so `nox -s mypy` failed before checking any
file — masking the rest of the type errors in the tree.

Reword the example onto one inline line so no comment starts with
`# type:`. No behaviour change (it was already dead comment text).
Generalise the hard-coded ocrmypdf fallback in `extract_data` into a
configurable, ordered backend cascade. When no backend is forced
(`input_module=None`), each backend in `DEFAULT_INPUT_READERS`
(currently `pdftotext`, then `pdfium`/pypdfium2) is tried until a
template matches with all required fields; OCR (ocrmypdf) remains the
last resort. Unavailable backends self-exclude via `is_available()`.

A template may pin the backend it was authored for with a top-level
`input_module:` key (e.g. a layout-sensitive or area template that needs
poppler). When a template matches under one backend but declares another,
we re-extract with the declared one. This is the correctness escape hatch
for backends that *silently* mis-extract (which the retry can't detect),
and it short-circuits straight to the right backend once a faster default
leads the cascade.

The cascade order is deliberately left `pdftotext`-first; flipping to a
faster default is a benchmark-gated follow-up so we don't risk silent
extraction regressions on the existing pdftotext-tuned templates.

Behaviour: a matched-but-incomplete extraction now returns `{}` (the
documented contract) instead of propagating `ValueError`. Forcing
`--input-reader`/`input_module=` keeps the single-pass behaviour.

Refactors `extract_data` into small helpers (`_resolve_readers`,
`_safe_to_text`, `_match_template`, `_preferred_module`, `_run_template`,
`_ocr_last_resort`); removes the tuple-returning
`extract_data_fallback_ocrmypdf`. Adds tests and migration notes.
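
The control flow of the cascade can be sketched in a few lines. This is a simplified illustration under stated assumptions: `extract_with_cascade` and its parameters are hypothetical, readers are plain callables rather than backend modules, and template pinning is left out.

```python
def extract_with_cascade(path, readers, match_template, ocr_reader=None):
    """Try each (name, to_text) reader in order; return {} if nothing matches."""
    for _name, to_text in readers:
        try:
            text = to_text(path)
        except Exception:
            # A failing backend must not abort the cascade.
            continue
        if not text:
            continue
        result = match_template(text)
        if result:  # matched with all required fields
            return result
    if ocr_reader is not None:
        # OCR stays the last resort because it is the slowest option.
        text = ocr_reader(path)
        if text:
            return match_template(text) or {}
    return {}
```

Returning `{}` for a matched-but-incomplete extraction mirrors the documented contract, so callers only need a truthiness check.
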
The `mypy --strict` session had been silently failing (and is disabled in
CI). With the tables.py parse error fixed, it surfaced 22 pre-existing
errors. Clear them and turn the gate back on:

- Centralise optional-backend import handling in `[[tool.mypy.overrides]]`
  (pdfminer, pdfplumber, pypdfium2, pdf_oxide, hotpdf, docutils,
  sphinxmermaid) with `ignore_missing_imports`; google.cloud is a partial
  namespace package so it gets a `follow_imports = "skip"` override. Remove
  the now-redundant inline `# type: ignore[import-*]` comments.
- Type `_regex.compile` -> `re.Pattern[str]` (cast) and `_regex.search` ->
  `re.Match[str] | None`, killing the "Returning Any" errors in the regex
  wrappers and `parsers/lines.py`; update the regex-cache test for the
  (correct) Optional return.
- Export `Invoice2Data`/`extract_data` via `__all__` so the re-export is
  explicit. Annotate `docs/conf.py` (setup/skip_mermaid) and the
  CSV/deprecation tests; use an OrderedDict template in the deprecation test
  to match the plugin signature.

Verified green with `mypy --python-version` 3.10/3.11/3.12/3.13. Re-enable
the `mypy` session in CI on the floor + latest (3.10, 3.13), matching the
noxfile decorator.
Normalise PDFium's text artifacts so its output plays well with the line
parser and templates (which work in terms of `\n`): collapse `\r\n`/`\r`
to `\n` and strip stray zero-width markers (U+FEFF / U+FFFE) that PDFium
emits on some documents (e.g. around hyphenated line breaks; see the
py-pdf/benchmarks post-processing). On oyo.pdf this removes 41 carriage
returns.

Make the signature interface-conformant (`area_details=None`): PDFium has
no layout mode and uses a different coordinate system, so area extraction
is unsupported — log a warning and ignore it. Area/layout-sensitive
templates should pin `input_module: pdftotext` (see the backend cascade).

Adds a regression test asserting non-empty, CR-free output.
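
The normalisation step amounts to two string operations. A minimal sketch, assuming a helper named `normalize_pdfium_text` (the real function name in input/pdfium.py is not given above):

```python
def normalize_pdfium_text(text):
    """Collapse CRLF/CR to LF and strip U+FEFF / U+FFFE markers."""
    # Replace "\r\n" first so a CRLF pair does not become two newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.translate({0xFEFF: None, 0xFFFE: None})
```
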
…lates

Run the B2 benchmark and record results in docs/backend-benchmark.md.
Headline: pdftotext is the accuracy anchor (85.9%); pypdfium2 is the best
fast backend (42.9% in isolation, ~5x faster) and the cascade recovers the
rest via fallback. With the four layout/area/table-sensitive bundled
templates pinned to pdftotext, a pypdfium2-first cascade reaches the same
85.9% accuracy at ~1.5x the speed.

Pin `input_module: pdftotext` on those templates (com.amazon.aws,
nl.be.coolblue, fr.free.adsl-fiber, fr.publicationannoncelegale) — their
line-item tables / area-based date need poppler's -layout, which pypdfium2
cannot reproduce. No behaviour change under the current pdftotext-first
default; this readies them for a future default flip.
A template's `input_module:` pin is a default-mode hint, but it was also
honoured when a backend was explicitly forced
(`extract_data(..., "pdfium")` / `--input-reader`). That silently
re-routed a forced backend to the pinned one — surprising for the API and,
worse, it made the benchmark unable to measure a backend on pinned
templates (it secretly used pdftotext).

Gate the pin behind auto mode (`input_module is None`). An explicit
backend is now taken at face value. Add a test asserting a forced backend
bypasses the pin.
Flip DEFAULT_INPUT_READERS to [pdfium, pdftotext]: the fast,
dependency-light pypdfium2 backend now leads, with pdftotext (poppler
-layout) as the fallback. The benchmark shows this keeps accuracy at the
pdftotext level (85.9%) while running faster, because the cascade falls
back automatically.

Add a soft-completeness signal: if a matched template declares a
lines/tables block (or a `parser: lines` field) but the backend produced
no line items, keep the result as a fallback and try the next backend,
which may recover the table. This auto-recovers layout-less backends that
return an empty table (e.g. AWS, free_fiber) without a per-template pin.

Pin the remaining area template (nl.buijtendijk) — pypdfium2 cannot do
area extraction at all. Document the flip and the (small) migration for
template authors whose non-required field comes back populated-but-wrong.

Remaining manual pins are only for "populated-but-wrong" degraders
(area fields, column-aligned tables); total failures and empty tables are
handled automatically by the cascade.
Add a new optional `camelot` plugin that detects ruled/whitespace-aligned
tables by re-reading the PDF with camelot-py (the current read_pdf API,
mapping each table to a list of dicts under a configurable output field).
It is opt-in: install `invoice2data[camelot]`, add a top-level `camelot:`
block to a template, and the plugin self-excludes via is_available() when
camelot is absent.

Plugin interface: plugins now receive the source `invoice_file` (the C1
prerequisite — text plugins ignore it, path-based ones like camelot need
it). lines/tables gain the (ignored) parameter; the doc is updated.

Packaging: the published camelot-py requires pdfminer.six>=20240706 while
pdfplumber==0.11.4 hard-pins ==20231228 — a genuine library conflict — so
the `camelot` extra is declared mutually exclusive with the `pdfplumber`
and `pdfminer-six` extras via `[tool.uv] conflicts`. camelot is treated as
Any by mypy (untyped, implicit re-exports).

Tests: header/no-header row mapping (always run) + a lattice-table
integration test on a bundled fixture (skipped when camelot is absent).
The fixture is kept out of the generic sample sweep via inputparser_specific.
Normalize line-item / tax-line output keys to one vocabulary, two ways:

(B) Output normalization layer: schema.normalize_line_fields() maps
non-canonical keys in lines/tax_lines to canonical ones at extraction time
(description→name, unit_price/unitprice→price_unit,
vat_rate/tax_percent→line_tax_percent). Wired into InvoiceTemplate.extract
before tax computation. So any template (incl. community ones) emits the
standard vocabulary with no manual change.

(A) Tidy the bundled templates that used those aliases (14 of them) to the
canonical names directly, so they ship as good examples. Update the one
affected golden (AmazonWebServices: description→name).

Corrected against the OCA Odoo v14 module source: `product` is a DISTINCT
line field (product matching), NOT a synonym for `name`, so it is added to
the canonical LINE_FIELDS (with `taxes`) and never aliased — the 17
`product` templates are already correct. A line's label is `name`;
`description` is only an invoice-level field, hence description→name.

Docs: add `product`/`taxes` to the recommended line-fields table, clarify
`name` is the label, note `description` is an accepted alias, and fix the
misleading Example Usage (it used `(?P<description>)` for a line item).

Tests: normalize_line_fields cases (aliases, product-not-aliased, canonical
wins). All 215 templates still load; suite green.
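
The alias mapping and the "canonical wins" rule can be sketched as follows. The alias table is taken from the message above; the signature and the choice to return new dicts rather than mutate in place are assumptions, not the actual schema.normalize_line_fields().

```python
LINE_ALIASES = {
    "description": "name",
    "unit_price": "price_unit",
    "unitprice": "price_unit",
    "vat_rate": "line_tax_percent",
    "tax_percent": "line_tax_percent",
}


def normalize_line_fields(lines):
    """Map alias keys to canonical ones; an existing canonical key wins."""
    normalized = []
    for line in lines:
        # First pass: keep every key that is not an alias (incl. `product`).
        out = {key: val for key, val in line.items() if key not in LINE_ALIASES}
        # Second pass: fill canonical keys from aliases only when absent.
        for key, val in line.items():
            canonical = LINE_ALIASES.get(key)
            if canonical is not None and canonical not in out:
                out[canonical] = val
        normalized.append(out)
    return normalized
```
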
Establish the copier link to the uv-forge template:
- Add .copier-answers.yml (gh:bosd/uv-forge @ b387b46) so `copier update`
  works going forward; retire the old .cookiecutter.json.

Switch the build backend setuptools -> hatchling (what uv-forge generates;
the build-backend change is done as its own validated step, ahead of mypyc):
- [tool.hatch.build.targets.wheel] packages = ["src/invoice2data"] — the
  wheel ships the package + py.typed + the 215 bundled templates and nothing
  else (test fixtures live outside src/, so no PDFs/PNGs in the wheel).
- [tool.hatch.build.targets.sdist] excludes the heavy binary test fixtures
  (PDFs/PNGs) and build/cache cruft, keeping the sdist lean (~328K vs ~1.5M).

Bump the typing toolchain to uv-forge's versions (mypy >= 1.13) and add the
`ty` group (Astral's type checker — the template switch the ty follow-up was
waiting on). typeguard/xdoctest bumped to match.

Deliberately deferred to separate validated steps (kept current for safety):
the expanded ruff rule set + ruff bump, coverage fail_under=100, the
docs theme (furo->shibuya/sphinx 8), and regenerating the workflows
(the auto-publishing release flow + CI system deps stay as-is).
The uv-forge toolchain bump (mypy >= 1.13 -> resolves 2.x, + the new `ty`
checker) surfaced real typing issues; fix them so both checkers pass.

mypy 2.x:
- Drop the now-redundant `cast(dict[str, Any], tpl)` in loader.py.
- Guard ColorLogFormatter: `LOG_LEVEL_COLOR.get(level, {}).get(...)` so an
  unknown log level can't hit None (latent crash).
- Route the defusedxml/ocrmypdf optional imports through the mypy
  ignore_missing_imports overrides; drop their inline ignores.

ty (newly adopted — the template switch the ty follow-up waited for):
- Type the line/table extractors' `self`/`template` as `InvoiceTemplate`
  (via a TYPE_CHECKING import) so `coerce_type`/`parse_date`/`parse_number`
  resolve in both checkers; this lets us delete the `# type: ignore
  [attr-defined]` suppressions in tables.py and parsers/lines.py.
- `# ty: ignore[unresolved-import]` on the guarded optional imports
  (ocrmypdf, defusedxml) that ty can't see when the extra isn't installed.
- Replace deprecated `codecs.open` with builtin `open`.
- Build a real InvoiceTemplate in the deprecation test.

Green: mypy --strict (3.10 + 3.13), `ty check src`, ruff, pydoclint,
79 passed / 3 skipped.
Continue the uv-forge template adoption:

- ty gate: add a `ty` nox session (checks src; mypy stays authoritative) +
  a 3.13 CI job + `nox.options.sessions`. `[tool.ty.rules] unresolved-import
  = "ignore"` because invoice2data has conflicting optional backends ty
  can't all resolve. Drop the now-redundant inline `# ty: ignore` comments.
- Release model: replace the salsify token auto-publish with uv-forge's
  flow — on push to master: draft release notes + publish to TestPyPI; on a
  published GitHub Release: build + provenance attestation + sigstore signing
  + publish to PyPI via Trusted Publishing (no long-lived token). NOTE: needs
  PyPI Trusted Publishing configured for invoice-x/invoice2data + a
  TEST_PYPI_TOKEN secret before the next master release.
- Docs theme furo -> shibuya (in the pyproject docs group AND
  docs/requirements.txt for RTD AND conf.py), which requires sphinx >= 8 +
  myst-parser >= 4; sphinx-mermaid / rsvg badge handling preserved. HTML
  build verified.
- Fix pre-commit: trailing blank line in .copier-answers.yml (end-of-file).

Green: ty (src), mypy --strict (3.10/3.13), 79 passed, docs HTML build,
workflows valid YAML.
…ild)

The README badges were pulled into the docs via the README include, and the
LaTeX/PDF builder ran them through rsvg -> PDF, which produced corrupt /
PDF-1.7 badge PDFs (pdflatex caps included PDFs at 1.5) AND left extra PDFs
in the output dir -> RTD "Build output directory contains multiple files".

Add a `<!-- docs-body -->` marker after the badge block and `start-after`
it in docs/index.md (with an explicit page title), so the badges stay on
GitHub/PyPI but never enter the Sphinx build. HTML verified: badges gone,
title + content intact, build succeeds. PDF/LaTeX verifies on RTD.
In regex.parse, `result` starts as a list of matches then becomes a coerced
scalar/grouped value. Annotate it `Any` so mypyc doesn't strict-check it
against the initial list type (it raised TypeError at runtime on the compiled
build). No behaviour change; mypy + suite stay green. Part of C2 (mypyc).
Add optional mypyc compilation of the hot-path leaf modules (extract/utils,
_regex, parsers/regex, parsers/lines, plugins/tables) — the codebase is
mypy --strict clean so mypyc compiles them; the suite passes against the
compiled build (the regex.parse `result: Any` fix in e1a479a was the only
type-precision change needed).

Following uv-forge's mypyc extension:
- Switch build-backend hatchling -> setuptools; add a conditional setup.py
  that mypycifies the modules only when INVOICE2DATA_COMPILE_MYPYC=1, so the
  default `uv build`/sdist install stays pure Python. Build requires list the
  deps mypyc needs to see types (click/dateparser/PyYAML/regex/mypy).
- Restore [tool.setuptools] package-data (py.typed + templates) — wheel/sdist
  stay clean (no PDFs/tests).
- release.yml: build cross-platform compiled wheels with cibuildwheel
  (linux x86_64/aarch64, windows, macos intel/arm) + a pure-Python sdist on a
  published GitHub Release, then sign + publish via Trusted Publishing. Keep
  the TestPyPI-on-push staging step. [tool.cibuildwheel] skips PyPy/musllinux
  and smoke-tests that the compiled extensions import on every platform.
- .copier-answers: extension none -> mypyc.

Verified locally: default build = pure-Python py3-none-any wheel (215
templates, 0 .so); INVOICE2DATA_COMPILE_MYPYC=1 build = cp313 wheel with 6
.so that installs and extracts correctly. Cross-platform wheels verify in CI
(cibuildwheel runs only on release).
- A6 conformance: declare SUPPORTS_AREA = False (gvision has no area mode;
  is_available alias was already present).
- Wire the previously-dead `language` arg into the request via
  vision.ImageContext(language_hints=[language]).
- Replace print() with logger.info for the "waiting for OCR" message.
- Harden: raise a clear OSError when Vision returns no result blob (was an
  AttributeError on None).
- Document the modern bucket-free alternative (Google Document AI OCR
  processor; cf. OCA account_invoice_google_document_ai) as a future backend.

Mocked tests still pass; mypy/ruff/pydoclint clean. The live GCS/Vision flow
remains creds-gated (mocked test is the CI gate).
Adopt the larger uv-forge lint families on top of the existing set:
ASYNC, C4, FURB, LOG, PERF, PIE, RET, SIM, TRY (ruff bumped >=0.8 ->
resolves 0.15.14; FURB/LOG need >=0.8).

Fixed the ~38 surfaced violations (auto + manual): drop superfluous
else-after-return (RET505), collapse same-arm/nested ifs (SIM102/114),
list->generator in all()/any() (C419), ternaries (SIM108), `key in dict`
(SIM118), return-without-temp (RET504), drop unused unpacked vars
(RUF059), logging.exception in except (TRY400). One PERF203 noqa for the
intentional per-file error-isolation try/except in the CLI loop.

Deferred (documented in extend-ignore):
- PTH (~70 os.path -> pathlib sites) — a focused follow-up migration.
- TRY003 (long messages in raise) — would need custom exception classes.

ruff format (0.15) reflowed 3 test files. Suite green (78 passed);
mypy/pydoclint/ty clean.
Convert the 17 production os.path/open sites to pathlib.Path: open() ->
Path.open(), os.path.exists -> Path.exists, os.path.join -> Path/"x",
os.path.dirname -> Path.parent, os.path.abspath -> Path.resolve,
os.path.basename -> Path.name. Drop now-unused `import os` /
`from os.path import join` in __main__ and pdftotext; keep os in loader
(os.walk) and gvision (os.getenv).

Removes the global PTH ignore. Test scaffolding (~55 fixture sites,
mostly test_cli.py) is exempted via a `tests/** = ["PTH"]` per-file-ignore
since pathlib there is churn with no runtime benefit.

Suite green (78 passed); ruff/mypy/pydoclint/ty clean.
…s) + docs

Templates can already be loaded from a string instead of disk
(extract_data(file, templates=ordered_load(db_text))), the Odoo "templates
from the DB" path. But ordered_load's `loader` param was dead — the body
hardcoded json.loads, so YAML-from-string silently failed.

- Use the supplied `loader` (default json.loads; pass yaml.safe_load for YAML);
  broaden the parse-error catch to (ValueError, YAMLError); generic warning.
- prepare_template: tolerate a streamed template lacking template_name in the
  missing-keywords warning (was a KeyError).
- Add a YAML-stream test; document the string/DB pattern in the README library
  section.

Suite green (79 passed); ruff/mypy/pydoclint clean.
…iguation (AUTH-2b)

First piece of the template-authoring toolkit: a dependency-free validator layer
that classifies a captured value by *validating* it, not by regex alone — since
field-type patterns overlap (an IBAN and an EU VAT number can match each other).

- validate_iban: ISO 13616 structure + mod-97 checksum (whitespace/case tolerant).
- validate_vat: EU VAT format check (full country set; format, not per-country
  checksum — python-stdnum can layer on later).
- validate_bic: ISO 9362 8/11-char format.
- classify_identifier: tries validators strongest-first (IBAN checksum before
  VAT/BIC format) so an IBAN is never mistaken for a VAT number.

New extract/validators.py + 26 tests. Will feed AUTH candidate typing/ranking,
AI-1 drafting, and an optional runtime soft-validation. Suite green.
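
For reference, the mod-97 check behind an IBAN validator works like this: move the first four characters to the end, map letters to numbers (A=10 ... Z=35), and test that the resulting integer leaves remainder 1 when divided by 97. The sketch below is an independent illustration, not the code in extract/validators.py, and it checks only the generic 15 to 34 character shape, not per-country lengths.

```python
import re


def validate_iban(value):
    """Structure check plus mod-97 checksum; whitespace and case tolerant."""
    iban = re.sub(r"\s+", "", value).upper()
    # Country code, two check digits, then an 11-30 character BBAN.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps "0"-"9" to 0-9 and "A"-"Z" to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

Because a format-only VAT pattern can accept the same string, running this checksum first is what lets a classifier prefer the IBAN interpretation.
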
…AUTH-1)

New extract/candidates.py: scans extracted text for typed candidates with their
positions, the engine the copier-style CLI builder and AI template generation
will consume.

- find_dates (dateparser-validated), find_amounts (separator-aware float parse),
  find_identifiers (IBAN/VAT/BIC, typed via the AUTH-2b validators; two-pass so a
  label like "IBAN" never swallows the number, and space-grouped IBANs are caught).
- find_candidates merges them, drops amounts that fall inside a date (the 12.05 in
  12.05.2024), sorted by position.

Candidate dataclass carries kind/value/start/end/parsed. 6 tests; suite green.
bosd added 15 commits May 24, 2026 20:30
…rop rework)

Area extraction had zero golden coverage, which made the planned "parse once +
crop in Python" optimization (#2) unverifiable/risky. These tests lock the
current pdftotext area contract on a real PDF (oyo.pdf): an `area` returns ONLY
the text inside the requested rectangle (header band present; content lower on the
page excluded; smaller than the full page). When #2 is implemented, the Python
crop must satisfy the same assertions -- so it can be done safely.

Skipped when pdftotext (poppler) is absent. Suite green (171 passed).
Date parsing was ~58% of extraction time. New extract/_dates.py parses
fastest-applicable-tier first:
  1. the template's date_formats via stdlib strptime (microseconds, deterministic)
  2. dateutil (fast, fuzzy, English-centric)
  3. dateparser (multilingual / localized month names) -- now OPTIONAL

- dateparser moved from core deps to the [dateparser] extra; python-dateutil added
  to core. Without dateparser, numeric/English dates still parse (tiers 1-2);
  localized month names need `pip install invoice2data[dateparser]`.
- Centralized date parsing in _dates.parse_date (lru_cached). invoice_template,
  candidates and ai/fallback all route through it; no module imports dateparser at
  top level any more (it's lazily, guardedly imported in _dates only, returning
  None when absent).

Benchmark (11 compare PDFs x3): parse_date cumulative 0.685s -> ~0 (compare
templates set date_formats -> strptime tier); wall ~1.19s -> 0.80s. Goldens
unchanged. 5 tests (each tier + the dateparser-absent path). Suite green (175).
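
The tier order can be sketched as below. This is an illustration rather than the contents of extract/_dates.py: the signature, the `fuzzy=True` choice for dateutil, and the cache size are assumptions, and both third-party imports are guarded so the sketch also runs with only the standard library installed.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=1024)
def parse_date(value, date_formats=()):
    """Parse with the cheapest applicable tier first; None if all fail."""
    # Tier 1: explicit template formats via stdlib strptime.
    for fmt in date_formats:  # must be a tuple so the call is hashable
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    # Tier 2: dateutil (skipped if it is not installed).
    try:
        from dateutil import parser as dateutil_parser
    except ImportError:
        dateutil_parser = None
    if dateutil_parser is not None:
        try:
            return dateutil_parser.parse(value, fuzzy=True)
        except (ValueError, OverflowError):
            pass
    # Tier 3: dateparser, imported lazily because it is optional.
    try:
        import dateparser
    except ImportError:
        return None
    return dateparser.parse(value)
```

Templates that declare `date_formats` never leave tier 1, which is consistent with the benchmark note above that parse_date time drops to roughly zero for them.
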
…eparser

- noxfile tests session now also syncs the `ai` and `dateparser` extras so the
  httpx-mocked AI request test and the dateparser tier-3 date test run in CI
  (both importorskip otherwise).
- README: document the tiered date parsing (strptime -> dateutil -> dateparser)
  and that dateparser is now an optional extra for localized dates.
Two hotspots from re-profiling (after the date fast-path):
- _match_template re-sorted all ~215 templates on every call (per reader, per
  cascade step). Now sorted once via _by_priority() in extract_data;
  _match_template iterates the pre-sorted list. (#4 from the perf list)
- _check_required_fields built pformat(output) as a logger.debug ARGUMENT, so it
  ran on every successful extraction even with debug disabled. Guarded with
  logger.isEnabledFor(DEBUG).

Warm benchmark (11 compare PDFs x3): ~0.174s -> 0.133s. (The earlier enum/regex
compile overhead was dateparser's internals -- gone with the date fast-path; we
use stdlib `re` by default, so "re vs regex" is moot.) Suite green (175).
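
The logging fix follows a common pattern: an expensive argument such as `pformat(output)` is evaluated before `logger.debug` is even called, so it has to be guarded explicitly. A minimal sketch, with a hypothetical logger name and helper:

```python
import logging
from pprint import pformat

logger = logging.getLogger("invoice2data.sketch")  # hypothetical logger name


def log_output(output):
    """Format the output only when a DEBUG record would actually be handled."""
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("extracted fields:\n%s", pformat(output))
```
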
…otext)

Area (region) fields re-ran `pdftotext -x -y -W -H` per area (a subprocess each).
Now pdftotext.py reads word positions once via `pdftotext -bbox-layout` (cached
per file by mtime) and crops the requested rectangle in Python:
- _words(): parse + cache the bbox-layout word boxes.
- _crop(): convert the area (pixels at dpi r) to points, select words in the rect,
  group into lines, join left-to-right.
- to_text(area) routes through them; full-page extraction is unchanged.

Proven: 3 distinct areas on one doc -> 1 subprocess (was 3). Verified by the area
harness (region content present, out-of-region excluded) + a new end-to-end test
(cropped text feeds a field regex -> correct value). Suite green (176).

NOTE: output is single-space-joined, not pdftotext -layout column spacing. The 2
built-in area templates (nl.buijtendijk, fr.publicationannoncelegale) have no test
PDFs and couldn't be auto-verified -- their \s-based regexes should still match,
but worth a manual check.
pypdfium2 can extract a bounded region in-process, so area/region fields no longer
require the poppler pdftotext binary on the default backend.

- SUPPORTS_AREA = True for pdfium.
- _crop_pages(): convert the area (pixels at dpi r, top-left origin) to PDFium's
  points / bottom-left coords (pt = px*72/r, y flipped by page height) and call
  get_text_bounded(left, bottom, right, top) per page in the range.

The 2 built-in area templates pin input_module: pdftotext, so they're unaffected;
this adds area support to the pdfium-first cascade for new templates. PDFium text
isn't byte-identical to pdftotext -layout, so an area template targets one backend.

Test: pdfium area crop on oyo.pdf (region present, out-of-region excluded). Suite
green (177).
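
The coordinate conversion stated above (pt = px*72/r, y flipped by page height) works out as in this sketch. The function name and argument order are assumptions; the returned tuple follows the `(left, bottom, right, top)` order that `get_text_bounded` is called with.

```python
def area_to_pdfium_box(x, y, w, h, r, page_height_pt):
    """Convert a top-left-origin pixel area at dpi r to bottom-left-origin points."""

    def to_pt(px):
        return px * 72.0 / r

    left = to_pt(x)
    right = to_pt(x + w)
    # Flip the y axis: pixel y grows downward, PDF points grow upward.
    top = page_height_pt - to_pt(y)
    bottom = page_height_pt - to_pt(y + h)
    return left, bottom, right, top
```

For example, an area at x=100, y=50 of size 200x100 pixels at 100 dpi on a 792 pt tall page maps to left=72, bottom=684, right=216, top=756.
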
Product lines that carry a line_tax_code are now enriched with the
line_tax_percent from the matching tax_lines summary row, and their
line_tax_amount is computed from price_subtotal. Existing line values
and code-less lines are left untouched. Completes the second task of
issue #535 (the first, documenting tax_lines, landed via #536).
When invoice lines carry no tax code but the tax_lines summary has
exactly one active (non-zero) rate, apply that rate to every line and
compute line_tax_amount. Mixed-rate summaries remain ambiguous and are
left untouched (use the tax_lines global-adjustment fallback). Adds
_to_float/_set_line_rate/_single_active_rate helpers and 7 tests.
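
The single-active-rate rule can be sketched as follows. The function names are hypothetical simplifications of the helpers listed above, values are assumed to be floats already, and rounding to two decimals is an assumption.

```python
def single_active_rate(tax_lines):
    """Return the only non-zero rate in the summary, or None if ambiguous."""
    rates = {
        row["line_tax_percent"]
        for row in tax_lines
        if isinstance(row, dict) and row.get("line_tax_percent")
    }
    return next(iter(rates)) if len(rates) == 1 else None


def apply_single_rate(lines, tax_lines):
    """Fill rate and tax amount on code-less lines; never overwrite values."""
    rate = single_active_rate(tax_lines)
    if rate is None:
        return lines  # mixed rates (or none): leave lines untouched
    for line in lines:
        if "line_tax_code" in line:
            continue  # coded lines are matched per code, not handled here
        line.setdefault("line_tax_percent", rate)
        if "line_tax_amount" not in line and "price_subtotal" in line:
            amount = line["price_subtotal"] * line["line_tax_percent"] / 100
            line["line_tax_amount"] = round(amount, 2)
    return lines
```
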
Provide an environment.yml so conda/mamba users get invoice2data with the
common PDF backends and the OCR system tools (poppler, tesseract,
ghostscript) from conda-forge in one step, no manual system-library
install. Documented in installation.md.
mypy (src tests docs/conf.py):
- annotate test_issue_535 output dicts as dict[str, Any] (heterogeneous
  literals inferred as object, not indexable)
- narrow _match_template's InvoiceTemplate|None before indexing in
  test_issue_608
- yaml import-untyped ignores in test_loader/test_template_builder
  (matching src/template_builder pattern); arg-type ignore in
  test_lines_replace stub call

xdoctest:
- extract_data docstring example now includes the template_name key
  added to default output (#618)

pre-commit (ruff, expanded ruleset):
- benchmarks/run.py: open()->Path.open() (PTH123),
  try/except/pass->contextlib.suppress (SIM105)
- docs/conf.py: os.path.abspath()->Path.resolve() (PTH100), drop os

to_json/to_csv/to_xml write_to_file doctests wrote invoice.{json,csv,xml}
to the cwd and never cleaned up (xdoctest left them in the repo root).
Write to a tempfile dir and assert path.exists() instead. Add an
anchored .gitignore safety net for the default output names.
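A minimal sketch of the doctest pattern described above: write into a temporary directory and assert on path.exists() so nothing is left in the working directory. `write_json` is a hypothetical stand-in and not the project's write_to_file.

```python
import json
import tempfile
from pathlib import Path


def write_json(data, path):
    """Write ``data`` as JSON to ``path`` (hypothetical stand-in).

    Example:
        >>> import tempfile
        >>> from pathlib import Path
        >>> with tempfile.TemporaryDirectory() as tmp:
        ...     path = Path(tmp) / "invoice.json"
        ...     write_json({"amount": 42}, path)
        ...     path.exists()
        True
    """
    with Path(path).open("w", encoding="utf-8") as handle:
        json.dump(data, handle)


# The temporary directory and the file inside it are removed on exit.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "invoice.json"
    write_json({"amount": 42}, target)
    print(target.exists())
```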
- static.py 73->100%: missing-value warning path
- output/__init__.py 92->100%: /dev/stderr stream alias
- to_xml.py 79->94%: date + list value branches in dict_to_tags
- invoice_template: cover _apply_tax_to_lines non-list guards and
  _single_active_rate non-dict-row skip

6 new tests (191 passing); overall coverage steady at 89%.

Fixes the lone 'document isn't included in any toctree' warning so the
shibuya docs build is warning-free.

CI:
- coverage job posts a PR coverage comment via
  py-cov-action/python-coverage-comment-action (GITHUB_TOKEN only, no
  external account/secret); badge published to its data branch on master
- companion coverage-comment.yml (workflow_run) posts comments for
  fork PRs; drop the codecov step (needed an org CODECOV_TOKEN + had a
  misconfigured file: input)
- coverage.run.relative_files=true so the action maps paths

Coverage (90% total):
- _regex.py 85->100% (engine-selector env toggle)
- loader.py 88->95% (malformed-JSON skip + missing-keywords path)

github-actions Bot commented May 24, 2026

Coverage report

| File | Lines missing |
| --- | --- |
| src/invoice2data/__main__.py | 92-103, 132, 141, 205, 253, 326-327, 413-418, 422-428, 475-483, 508, 538, 550-552, 571-572, 575-577, 593, 731, 733, 736, 764-765, 832-833 |
| src/invoice2data/ai/__interface__.py | 33, 52 |
| src/invoice2data/ai/fallback.py | 83, 86-87 |
| src/invoice2data/ai/template_generator.py | 76, 83, 133 |
| src/invoice2data/extract/_dates.py | 34-35, 115 |
| src/invoice2data/extract/candidates.py | 77-78, 146, 148 |
| src/invoice2data/extract/invoice_template.py | 222, 230, 496, 505-506, 523-529, 556-557 |
| src/invoice2data/extract/loader.py | |
| src/invoice2data/extract/template_builder.py | 48, 64 |
| src/invoice2data/extract/parsers/lines.py | 145 |
| src/invoice2data/extract/parsers/regex.py | 45-46, 52, 119 |
| src/invoice2data/extract/plugins/tables.py | 148, 154, 198-199, 210 |
| src/invoice2data/input/gvision.py | 124 |
| src/invoice2data/input/hotpdf.py | 36-43 |
| src/invoice2data/input/ocrmypdf.py | 83-87, 145-146 |
| src/invoice2data/input/paddleocr.py | 32, 66-72 |
| src/invoice2data/input/pdfoxide.py | 36-42 |
| src/invoice2data/input/pdfplumber.py | 36-38 |
| src/invoice2data/input/pdftotext.py | 54-55, 165 |
| tests/test_camelot.py | 157-165 |
| tests/test_cli.py | 23 |
| tests/test_extraction.py | |
| tests/test_invoice_template.py | 43 |
| tests/test_lib.py | 37-38, 52, 238, 253, 287-295, 339 |
| tests/test_template_generator.py | 63 |
| Project Total | |

The report is truncated to 25 files out of 81. To see the full report, please visit the workflow summary page.

This report was generated by python-coverage-comment-action

bosd added 14 commits May 25, 2026 01:02
Mock the tesseract/ImageMagick/pdftotext binaries + Popen pipeline so the
backend is fully exercised without real OCR tooling or image fixtures
(is_available, PDF + image paths, area_details, both timeout branches,
get_languages parse/error paths).

Also fix a latent bug: when pdftotext timed out, extracted_str was never
assigned -> UnboundLocalError at return; initialise it to b"" so a
timeout returns empty text and just logs a warning.
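The shape of that fix can be sketched in isolation. This is an illustrative stand-in rather than the backend's code: `run_capture` is a hypothetical name, and a sleeping Python subprocess plays the role of a slow pdftotext.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_capture(cmd, timeout):
    """Return the command's stdout, or b"" if it times out."""
    # Assigned up front so the timeout path still has a value to return;
    # without this line the return below raises UnboundLocalError.
    extracted_str = b""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        extracted_str, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed process
        logger.warning("command timed out after %s s", timeout)
    return extracted_str


slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_capture(slow, timeout=0.5))  # b''
```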

13 tests; tesseract.py 100%; overall coverage 90%->92%.

Mock the camelot module + read_pdf so extract()'s logic (rule list,
kwarg forwarding, header/field/tables handling, table selection,
read_pdf failure, not-installed + no-file guards) is exercised in CI
where camelot-py is absent. Plus a real, skipif-camelot integration
test against the bol.com invoice.

Salvage tests/files/camelot-bol100649863.pdf from the
bosd/camelot-test-files branch (camelot-example.pdf there was identical
to the existing one). The 'camelot-' name prefix keeps it out of the
get_sample_files sweep (inputparser_specific), so test_copy stays exact.

11 tests; camelot.py 100%.

…banner

- reference.md: document the whole current surface — pdfium (default),
  doctr/paddleocr/hotpdf/pdfoxide/tesseract, camelot, the ai/* subsystem,
  schema/validators/candidates/suggestions/_dates/_regex/template_builder,
  output streams, backend registry. Library API section for extract_data.
  (:no-index: on InvoiceTemplate works around an autodoc dup with the
  OrderedDict[str, Any] generic base; loader gets __all__ so the imported
  class isn't re-documented.)
- how-it-works.md: refreshed mermaid (pdfium default, docTR/paddleocr,
  cascade, schema/validation, optional AI), internal cross-refs instead of
  fragile GitHub-master links, MyST header.
- installation.md: pdfium is the default (no system deps); poppler/OCR are
  optional; add extras table + docTR/PaddleOCR; MyST headers.
- faq.md: fix typos, replace the dated 'Gemini chat' section with the
  built-in AI (--new-template --ai, --ai-fallback); add a 'Comparison with
  other tools' section; MyST header.
- new docs/ai.md (provider config + the three AI entry points), in toctree.
- README: add SVG banner, drop broken 'Test' badge, swap codecov->coverage
  badge, fix run-on intro + stale backend list + stale output example.
- new docs/_static/banner.svg.

Docs build is warning-free.

The README doubled as the docs landing page and had grown to ~370 lines.
Move the detailed CLI how-to into docs/usage.md (input readers, output
streaming, --copy, debug flags, AI, library use) and rely on the already
comprehensive docs/tutorial.md for the template system; the README now
keeps a short Quickstart + a Documentation index and the repo-meta
sections. Net: 369 -> 150 lines.

Also migrated the two genuinely newer notes into tutorial.md: the
fastest-first date parsing tiers (strptime->dateutil->dateparser, with
dateparser optional) and the area-extraction DPI math + supported
backends. Added reference-style link targets in both README (github-only
block) and docs/index.md so the shared body resolves in both contexts.
Docs build is warning-free.
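The fastest-first date tiers mentioned above (strptime, then dateutil, then dateparser) might look roughly like this. It is a sketch under assumptions: `parse_date` and its default format list are hypothetical, and both third-party libraries are treated as optional.

```python
from datetime import datetime


def parse_date(text, formats=("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")):
    """Try the cheapest parser first and fall back to slower, more lenient ones."""
    # Tier 1: strptime with explicit formats (stdlib, fastest).
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue

    # Tier 2: dateutil, if it is installed.
    try:
        from dateutil import parser as dateutil_parser
    except ImportError:
        dateutil_parser = None
    if dateutil_parser is not None:
        try:
            return dateutil_parser.parse(text)
        except (ValueError, OverflowError):
            pass

    # Tier 3: dateparser, optional and the slowest; returns None on failure.
    try:
        import dateparser
    except ImportError:
        return None
    return dateparser.parse(text)


print(parse_date("2026-05-24"))
```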
Replace the old logo.png with a vector logo.svg in the same style as the
new banner — a self-contained dark app-icon tile (invoice card + accent
bar + extraction arrow) that stays crisp at any size and reads on both
light and dark sidebars. Point html_logo at it; drop the orphaned PNG.
Docs build is warning-free.

- Fix the one real latent typing issue both ty and mypy flag when the
  ocrmypdf extra is installed: annotate OPTIONS_DEFAULT as dict[str, Any]
  so **ocrmypdf_conf matches ocrmypdf.ocr()'s typed kwargs. Type-checkers
  are now clean with or without the extra present.
- Pin ty (== 0.0.39): beta + fast-moving, so a new release can't silently
  add diagnostics that break CI (same approach as the pinned ruff).
- Document why ty stays on src (it doesn't honour mypy's # type: ignore in
  the test fixtures); mypy remains authoritative for tests.

ty check src: All checks passed. mypy: 93 files clean.

The Windows job was disabled due to flaky Chocolatey system-dep installs.
Re-enable it, but make it green whether or not the binaries land:

- Windows system-dep install is now best-effort (continue-on-error); the
  default backend (pypdfium2) needs no system binaries.
- Skip-guard the poppler-dependent tests when pdftotext is absent:
  the whole TestCLI golden suite (templates pin input_module: pdftotext)
  and test_lib.test_extract_data_pdftotext. tesseract/area tests were
  already guarded; the OCR/DL backend tests are fully mocked.
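A skip-guard of the kind described can be sketched as below. The example uses the stdlib unittest decorator to stay self-contained; in a pytest suite the equivalent would be pytest.mark.skipif with the same condition. The class and test names are hypothetical.

```python
import shutil
import unittest

# True only when poppler's pdftotext binary is on PATH.
HAS_PDFTOTEXT = shutil.which("pdftotext") is not None


@unittest.skipUnless(HAS_PDFTOTEXT, "poppler's pdftotext binary is not installed")
class TestNeedsPoppler(unittest.TestCase):
    """Skipped as a whole when pdftotext is absent; runs normally otherwise."""

    def test_binary_is_on_path(self):
        self.assertIsNotNone(shutil.which("pdftotext"))


# Run the guarded class and report the outcome without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNeedsPoppler)
result = unittest.TestResult()
suite.run(result)
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

When the binary is present the guard is a no-op, which matches the note that Linux/macOS runs are unchanged.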

So Windows validates the pure-Python core + pypdfium2 path reliably, and
runs the full golden suite too when choco succeeds. Linux/macOS unchanged
(217 passed, 5 skipped; guards are no-ops when poppler is present).

Add tests for CLI helpers that lacked coverage: the three log formatters
(Color/Plain/JSON), _default_template_path slugification, _preferred_module
unknown/unavailable input_module warnings, and _run_new_template's no-text
SystemExit. 8 tests; overall coverage 93%.

conda-recipe/meta.yaml + README. mypyc compilation is opt-in
(INVOICE2DATA_COMPILE_MYPYC=1, off by default), so a plain pip install is
pure-Python and the package ships as a single noarch build — no compiler,
no wheel matrix. The README documents the staged-recipes submission flow
(fill sha256 from the released sdist, open the PR, feedstock is created).
Recipe renders + parses; conda-build validation noted.

extract_data(..., raise_on_error=True) now raises instead of returning {}:
- NoTemplateFoundError when nothing matched
- RequiredFieldsMissingError (with .fields + .template_name) when a template
  matched but a required field couldn't be parsed

New invoice2data/exceptions.py: InvoiceProcessingError base; the two errors
subclass it, and RequiredFieldsMissingError also subclasses ValueError so the
cascade's existing 'except ValueError' retry is unaffected. _check_required_fields
now raises the typed error; the cascade records missing-field reasons and the
boundary re-raises the most specific one. Default stays {} (non-breaking).
Re-exported from invoice2data; documented (reference/usage/migration-1.0).
4 tests; full suite 229 passed; mypy/ty/ruff/pydoclint clean.
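The hierarchy described above can be sketched like this. The class names and the .fields / .template_name attributes come from the description; the constructor signature and message format are assumptions made for illustration.

```python
class InvoiceProcessingError(Exception):
    """Base class for extraction failures."""


class NoTemplateFoundError(InvoiceProcessingError):
    """Raised when no template matched the document."""


class RequiredFieldsMissingError(InvoiceProcessingError, ValueError):
    """Raised when a template matched but a required field could not be parsed."""

    def __init__(self, fields, template_name):
        # Constructor signature is an assumption made for this sketch.
        super().__init__(f"{template_name}: missing required fields {sorted(fields)}")
        self.fields = list(fields)
        self.template_name = template_name


# Because the error also subclasses ValueError, an existing
# 'except ValueError' retry path keeps catching it.
try:
    raise RequiredFieldsMissingError(["amount"], "com.example.invoice")
except ValueError as err:
    print(type(err).__name__, err.fields, err.template_name)
```

Callers that want one handler for every extraction failure can catch InvoiceProcessingError instead.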
Invoices are full of predictable 'label: value' pairs. New extract/labels.py
recognises known labels with multilingual synonyms (BTW/VAT, KvK/CoC/Chamber
of Commerce, Invoice No/Factuurnummer, IBAN, BIC, Due Date, ...) and captures
the value next to them. This complements the value-pattern candidates: the
label both disambiguates the value (a CoC number is just digits — only the
label says what it is) and gives a robust regex anchor.

suggested_template now merges these in, so --new-template drafts label-only
fields (partner_coc, invoice_number) it previously couldn't, each anchored on
its label. Also fixes a latent bug in the regex anchor builder: re.escape
escapes spaces, so multi-word labels were mangled — now split + escape + join
with \s+ (new _anchor helper, used by both code paths).
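The anchor fix can be shown with a short sketch. `anchor` here is a hypothetical re-creation of the _anchor helper named above, not its actual source. Python's re.escape turns each space into an escaped literal space, so an escaped multi-word label only matches single spaces; splitting first and joining with \s+ tolerates any run of whitespace.

```python
import re


def anchor(label):
    """Escape each word separately, then allow any whitespace between words."""
    return r"\s+".join(re.escape(word) for word in label.split())


print(re.escape("Chamber of Commerce"))  # spaces come back escaped
print(anchor("Chamber of Commerce"))

# The label is split over two lines and has a double space in the text.
pattern = re.compile(anchor("Chamber of Commerce") + r"\s*:?\s*(\d+)", re.IGNORECASE)
match = pattern.search("Chamber  of\nCommerce: 12345678")
print(match.group(1))  # 12345678
```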

12 tests; full suite 234 passed; mypy/ty/ruff/pydoclint clean.

--new-template --interactive: copier-style review of each drafted field --
shows what it captures (after cleanup), keep / edit-regex / skip, then offer to
add fields the builder missed (_interactive_template + preview_field).

Capture-then-clean: label specs can carry a 'cleanup' so the drafted field is a
{regex, replace} dict that captures the noisy form and sanitises it -- VAT keeps
its separators in the capture but strips them on output (NL12.34.56.789.B01 ->
NL123456789B01), and a CoC number drops an appended place (12345678 Amsterdam ->
12345678). suggested_template now prefers the labeled (anchored + cleaned) field
for ids and label-only fields, keeping the value heuristics for date/amount.
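The capture-then-clean idea can be sketched with the two examples given above. The `apply_field` helper, the concrete regexes, and the shape of 'replace' as a list of [pattern, replacement] pairs are all assumptions made for illustration, not the drafted templates themselves.

```python
import re


def apply_field(field, text):
    """Capture with field['regex'], then run each replace pair over the value."""
    match = re.search(field["regex"], text)
    if match is None:
        return None
    value = match.group(1)
    for pattern, replacement in field.get("replace", []):
        value = re.sub(pattern, replacement, value)
    return value


# VAT: keep the separators in the capture, strip them on output.
vat = {"regex": r"VAT\s*:?\s*(NL[\d.]+B\d{2})", "replace": [[r"\.", ""]]}
# CoC: capture the noisy form, then drop the appended place name.
coc = {"regex": r"KvK\s*:?\s*(\d{8}[^\n]*)", "replace": [[r"\s+\D.*$", ""]]}

print(apply_field(vat, "VAT: NL12.34.56.789.B01"))  # NL123456789B01
print(apply_field(coc, "KvK: 12345678 Amsterdam"))  # 12345678
```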

template_builder gains field_regex/set_field_regex/preview_field helpers (str or
dict fields); preview_template handles dict fields + replace. New --interactive
flag. 9 tests; full suite 240 passed; mypy/ty/ruff/pydoclint clean.

The two fast spike backends were exercised only by the benchmark; add
mocked unit tests (registration, is_available, to_text via an injected
fake module) that lock their contract without installing the optional
deps. Both modules now 100% covered, promoting them from spikes to
fully-committed backends.
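Testing a backend through an injected fake module can be sketched as follows. The module name `fastpdf_hypothetical`, its `extract_text` function, and the two backend functions are invented for this example; they do not describe the real optional dependencies or their APIs.

```python
import importlib
import sys
import types
from unittest import mock


def is_available():
    """Report whether the optional dependency can be imported."""
    try:
        importlib.import_module("fastpdf_hypothetical")
    except ImportError:
        return False
    return True


def to_text(path):
    """Stand-in backend that delegates to the optional module."""
    fastpdf = importlib.import_module("fastpdf_hypothetical")
    return fastpdf.extract_text(path)


# Build a fake module and inject it only for the duration of the test.
fake = types.ModuleType("fastpdf_hypothetical")
fake.extract_text = lambda path: f"text of {path}"

with mock.patch.dict(sys.modules, {"fastpdf_hypothetical": fake}):
    assert is_available()
    assert to_text("invoice.pdf") == "text of invoice.pdf"

# Outside the patch the module is gone again, so the backend reports unavailable.
assert not is_available()
```

Since patch.dict restores sys.modules on exit, the fake does not leak into other tests.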
Combined coverage is ~94%; the uv-forge template's old fail_under=75 no
longer reflects the suite. Set it to 90 — a regression guard with a small
buffer for cross-matrix variance (the template default of 100 is
unrealistic here). PTH migration already landed (229ecf9); the
publishing opt-in remains a uv-forge template TODO, not a change in this
repo.