added 30 commits
May 23, 2026 02:57
All extract-layer regex calls now go through extract/_regex.py, which caches compiled patterns (lru_cache) instead of recompiling on every call across parsers/regex.py, parsers/lines.py, plugins/tables.py and invoice_template.py. The engine is selected once at import: stdlib re by default (behaviour unchanged), or the API-compatible regex package when INVOICE2DATA_REGEX_ENGINE=regex — which also gives the previously-declared but unused 'regex' dependency a purpose. RE2/pyre2 deliberately avoided (no lookaround/backrefs that user templates rely on). Golden suite passes under both engines. First step of the 1.0.0 roadmap (prerequisite for the parser benchmark harness). Part of #v1 / next branch.
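A minimal sketch of the cached-compile wrapper described above. The function names and the fallback to stdlib `re` when the `regex` package is missing are assumptions for illustration, not the actual contents of extract/_regex.py.

```python
import os
import re
from functools import lru_cache

# Engine is chosen once at import time. Assumption: silently fall back to
# stdlib `re` if the optional `regex` package is not installed.
if os.environ.get("INVOICE2DATA_REGEX_ENGINE") == "regex":
    try:
        import regex as _engine
    except ImportError:
        _engine = re
else:
    _engine = re


@lru_cache(maxsize=1024)
def compile_pattern(pattern: str, flags: int = 0):
    """Compile once per (pattern, flags) pair; later calls hit the cache."""
    return _engine.compile(pattern, flags)


def search(pattern: str, text: str, flags: int = 0):
    """Drop-in for re.search that reuses the cached compiled pattern."""
    return compile_pattern(pattern, flags).search(text)
```

Because the cache key is the `(pattern, flags)` pair, template regexes that are evaluated for every document are compiled only once per process.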
Add input/__interface__.py documenting the backend contract and move the hand-maintained input_mapping into a single registry in input/__init__.py (INPUT_MODULES) with helpers supports_area()/is_available()/available_modules(). Backends now declare capabilities/availability instead of being hardcoded:
- SUPPORTS_AREA on pdftotext/tesseract/ocrmypdf replaces the hardcoded '(pdftotext, ocrmypdf, tesseract)' tuple in invoice_template._handle_area.
- is_available() added per backend (binary check for pdftotext/tesseract, import check for pdfplumber/pdfminer, alias to existing checks for ocrmypdf/gvision) so backends self-exclude when deps are missing.
__main__.py now sources input_mapping from the registry. Fully backward compatible: extract_data() still accepts both module objects and string keys. Adds a contract test. Gates the Phase B backends (pypdfium2/pdfsink-rs/hotpdf).
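The registry idea can be sketched as follows. The `Backend` dataclass and the two helper factories are hypothetical; the real backends are modules exposing `SUPPORTS_AREA` and `is_available()`, and only a subset of them is shown here.

```python
import importlib.util
import shutil
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backend:
    """Hypothetical stand-in for a backend module's declared capabilities."""
    name: str
    supports_area: bool
    is_available: Callable[[], bool]


def _has_binary(binary: str) -> Callable[[], bool]:
    # Binary check: the executable must be on PATH.
    return lambda: shutil.which(binary) is not None


def _has_module(module: str) -> Callable[[], bool]:
    # Import check: the package must be importable (without importing it).
    return lambda: importlib.util.find_spec(module) is not None


INPUT_MODULES = {
    "pdftotext": Backend("pdftotext", True, _has_binary("pdftotext")),
    "tesseract": Backend("tesseract", True, _has_binary("tesseract")),
    "pdfplumber": Backend("pdfplumber", False, _has_module("pdfplumber")),
    "pdfminer": Backend("pdfminer", False, _has_module("pdfminer")),
}


def supports_area(name: str) -> bool:
    return INPUT_MODULES[name].supports_area


def available_modules() -> list:
    """Backends whose dependencies are missing drop out of the list."""
    return [name for name, b in INPUT_MODULES.items() if b.is_available()]
```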
- Add docs/migration-1.0.md (in the toctree): Python support, the one
deprecation (the legacy 'lines' plugin / top-level 'lines:' key -> use
'parser: lines'), the prefix-magic family explicitly RETAINED in 1.0
(static_ is used by ~half the built-in templates), clarified contracts,
and forward-looking 1.0 changes (validation, tax_lines/CSV, regex, backends).
- Emit a DeprecationWarning from the 'lines' plugin (it already self-documents
as superseded by the parser); deduped once per process.
- Fix the extract_data() docstring: it returns {} on failure, not False.
- Add a deprecation-warning test.
Deliberately does NOT deprecate the widely-used prefix magic (static_/sum_/
auto-typing) to avoid ecosystem churn; documented as under-review instead.
- output/to_csv.py: JSON-encode list/dict cells (lines/tax_lines) so the CSV is valid and machine-readable instead of a Python repr. New --csv-lines option: 'json' (default) or 'explode' (one row per line item, line_<key> columns). Dates inside arrays are formatted too (reuses to_json.format_item).
- invoice_template: compute a missing tax_lines line_tax_amount from price_subtotal * line_tax_percent/100 (never overwrites; product 'lines' left untouched to avoid golden churn), plus an advisory tolerance warning when tax_lines don't sum to amount_tax.
- Wire --csv-lines through the CLI.
- Tests for json/explode CSV and tax computation.
Golden suite unchanged (57 passed). Bulk migration of the 7 tax_lines templates to a single canonical schema is left as a follow-up.
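To illustrate the two `--csv-lines` modes, here is a rough sketch of the cell encoding and the row explosion. The helper names `encode_cell` and `explode_rows` are hypothetical, and date formatting is reduced to `default=str`; the actual to_csv.py may differ.

```python
import json


def encode_cell(value):
    """'json' mode: nested lists/dicts become JSON text, not a Python repr."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False, default=str)
    return value


def explode_rows(invoice: dict) -> list:
    """'explode' mode: one row per line item, with line_<key> columns."""
    base = {k: encode_cell(v) for k, v in invoice.items() if k != "lines"}
    lines = invoice.get("lines") or [{}]  # an invoice without lines still yields one row
    return [
        {**base, **{f"line_{key}": val for key, val in line.items()}}
        for line in lines
    ]


invoice = {
    "issuer": "ACME",
    "lines": [{"name": "Widget", "qty": 1}, {"name": "Gadget", "qty": 2}],
}
rows = explode_rows(invoice)
```

Each exploded row repeats the invoice-level columns, which keeps the CSV flat enough for spreadsheet tools at the cost of some duplication.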
Add extract/schema.py: the canonical invoice/line/tax_line field vocabulary (mirrors docs/recommended-template-fields.md + the OCA Odoo module) and validate_output(), the single source of truth for field names. invoice_template now validates output field names after extraction: - Quiet by default — a field is only warned about when it looks like a TYPO of a canonical name (custom fields are legitimate and stay silent). Verified zero false positives across all 215 built-in templates. - Opt-in per template via options.strict_fields: true (raises on any unrecognized field) with options.extra_fields: [...] to whitelist customs. Updated docs/migration-1.0.md (validation/tax/CSV/regex/backends now landed). Tests for the schema + validation.
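One way to get the "warn only on likely typos" behaviour is fuzzy matching against the canonical names. The sketch below uses `difflib` with an assumed cutoff and a small illustrative subset of field names; the real `validate_output()` may use a different heuristic.

```python
import difflib

# Illustrative subset only; the full vocabulary lives in extract/schema.py.
CANONICAL = ("amount", "amount_tax", "currency", "date", "invoice_number", "issuer")


def validate_output(fields, strict=False, extra_fields=()):
    """Return typo warnings; in strict mode raise on any unrecognized field."""
    allowed = set(CANONICAL) | set(extra_fields)
    warnings = []
    for name in fields:
        if name in allowed:
            continue
        if strict:
            raise ValueError(f"unrecognized field: {name!r}")
        # Quiet by default: custom fields stay silent unless they are
        # suspiciously close to a canonical name (cutoff is an assumption).
        close = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=0.85)
        if close:
            warnings.append(f"{name!r} looks like a typo of {close[0]!r}")
    return warnings
```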
… (B1, B2) B1: add input/pdfium.py (pypdfium2) and input/hotpdf.py behind the A6 interface (is_available via importlib; optional-deps extras 'pdfium'/'hotpdf'; registered in INPUT_MODULES). pdfsink-rs is not on PyPI, so not included. B2: benchmarks/run.py scores each backend on speed AND accuracy (field-match vs the golden outputs, via extract_data — i.e. real template compatibility). Findings (11 compare PDFs): pdftotext 85.9% acc / 19.5 ms; pdfium 20.6% / 4.4 ms; pdfminer 24.7% / 77.5 ms; hotpdf 4.1% / 367.9 ms; pdfplumber backend errors (0%, separate bug). Conclusion: pdftotext stays the default — the fast backends lose too much accuracy against pdftotext-tuned templates; pypdfium2 is a good optional fast backend for re-tuned templates / triage. (incl. a ruff-format wrap fix in test_schema.py)
pdf-oxide (Rust, MIT/Apache, py3.8-3.14) as input/pdfoxide.py behind the A6 interface, registered as --input-reader pdfoxide, optional extra 'pdfoxide'. Benchmark (11 compare PDFs): pdf-oxide is the best of the fast backends — ~4.5 ms/file (pypdfium2-class speed) at 31.2% accuracy vs pdfium 20.6% / pdfminer 24.7% / hotpdf 4.1%. Still below pdftotext (85.9%) since templates are pdftotext-tuned. Uses basic extract_text; its auto/markdown/table modes are worth exploring for higher accuracy.
Benchmark of pdf-oxide's extraction modes: to_plain_text_all scores 34.7% vs 31.2% for the basic extract_text (and is layout-aware, the better fit for invoices). Still below pdftotext (85.9%): no fast backend approaches it on the current pdftotext-tuned templates without re-tuning.
`to_text()` accumulated per-page text into `raw_text` but then overwrote it with `res_to_raw_text([res])`, where `res` only had "all"/"first" keys and never a "text" key, so the helper always returned "". The pdfplumber backend was effectively dead: `extract_data` logged "Failed to extract text" for every file routed through it. Drop the dead `res` dict, the `res_to_raw_text([res])` overwrite, and the now-unused `res_to_raw_text` helper; return the text gathered in the page loop. Guard against `extract_text()` returning None for empty pages with `or ""`. The layout/tolerance params are kept unchanged: they emulate `pdftotext -layout`, which the templates rely on. Add a regression test asserting `pdfplumber.to_text()` returns non-empty text for a bundled invoice. (cherry picked from commit c27248a)
A commented-out YAML example in the `tables` plugin contained a line `# type: float`, which mypy parses as a PEP 484 type comment. mypy 1.10.1 then reports a spurious "expected an indented block after 'elif'" syntax error and aborts, so `nox -s mypy` failed before checking any file — masking the rest of the type errors in the tree. Reword the example onto one inline line so no comment starts with `# type:`. No behaviour change (it was already dead comment text).
Generalise the hard-coded ocrmypdf fallback in `extract_data` into a
configurable, ordered backend cascade. When no backend is forced
(`input_module=None`), each backend in `DEFAULT_INPUT_READERS`
(currently `pdftotext`, then `pdfium`/pypdfium2) is tried until a
template matches with all required fields; OCR (ocrmypdf) remains the
last resort. Unavailable backends self-exclude via `is_available()`.
A template may pin the backend it was authored for with a top-level
`input_module:` key (e.g. a layout-sensitive or area template that needs
poppler). When a template matches under one backend but declares another,
we re-extract with the declared one. This is the correctness escape hatch
for backends that *silently* mis-extract (which the retry can't detect),
and it short-circuits straight to the right backend once a faster default
leads the cascade.
The cascade order is deliberately left `pdftotext`-first; flipping to a
faster default is a benchmark-gated follow-up so we don't risk silent
extraction regressions on the existing pdftotext-tuned templates.
Behaviour: a matched-but-incomplete extraction now returns `{}` (the
documented contract) instead of propagating `ValueError`. Forcing
`--input-reader`/`input_module=` keeps the single-pass behaviour.
Refactors `extract_data` into small helpers (`_resolve_readers`,
`_safe_to_text`, `_match_template`, `_preferred_module`, `_run_template`,
`_ocr_last_resort`); removes the tuple-returning
`extract_data_fallback_ocrmypdf`. Adds tests and migration notes.
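
The cascade described above can be sketched in a few lines. The function name `extract_with_cascade` and the injected `readers`/`run_template`/`ocr` callables are hypothetical simplifications; template pinning and the real helper split are left out.

```python
from typing import Callable, Dict, Optional, Sequence


def extract_with_cascade(
    path: str,
    readers: Dict[str, Callable[[str], str]],
    order: Sequence[str],
    run_template: Callable[[str], dict],
    ocr: Optional[Callable[[str], str]] = None,
) -> dict:
    """Try each backend in order; return the first complete extraction."""
    for name in order:
        reader = readers.get(name)
        if reader is None:
            continue  # unavailable backends self-exclude
        try:
            text = reader(path)
        except Exception:
            continue  # a failing backend must not abort the cascade
        result = run_template(text)
        if result:  # template matched with all required fields
            return result
    if ocr is not None:
        return run_template(ocr(path)) or {}  # OCR is the last resort
    return {}  # documented contract: {} on failure, never an exception
```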
The `mypy --strict` session had been silently failing (and was disabled in CI). With the tables.py parse error fixed, it surfaced 22 pre-existing errors. Clear them and turn the gate back on:
- Centralise optional-backend import handling in `[[tool.mypy.overrides]]` (pdfminer, pdfplumber, pypdfium2, pdf_oxide, hotpdf, docutils, sphinxmermaid) with `ignore_missing_imports`; google.cloud is a partial namespace package so it gets a `follow_imports = "skip"` override. Remove the now-redundant inline `# type: ignore[import-*]` comments.
- Type `_regex.compile` -> `re.Pattern[str]` (cast) and `_regex.search` -> `re.Match[str] | None`, killing the "Returning Any" errors in the regex wrappers and `parsers/lines.py`; update the regex-cache test for the (correct) Optional return.
- Export `Invoice2Data`/`extract_data` via `__all__` so the re-export is explicit. Annotate `docs/conf.py` (setup/skip_mermaid) and the CSV/deprecation tests; use an OrderedDict template in the deprecation test to match the plugin signature.
Verified green with `mypy --python-version` 3.10/3.11/3.12/3.13. Re-enable the `mypy` session in CI on the floor + latest (3.10, 3.13), matching the noxfile decorator.
Normalise PDFium's text artifacts so its output plays well with the line parser and templates (which work in terms of `\n`): collapse `\r\n`/`\r` to `\n` and strip stray zero-width markers (U+FEFF / U+FFFE) that PDFium emits on some documents (e.g. around hyphenated line breaks; see the py-pdf/benchmarks post-processing). On oyo.pdf this removes 41 carriage returns. Make the signature interface-conformant (`area_details=None`): PDFium has no layout mode and uses a different coordinate system, so area extraction is unsupported — log a warning and ignore it. Area/layout-sensitive templates should pin `input_module: pdftotext` (see the backend cascade). Adds a regression test asserting non-empty, CR-free output.
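The normalisation step amounts to a few string replacements. A minimal sketch follows; the function name `normalize_text` is illustrative and not necessarily what input/pdfium.py uses.

```python
def normalize_text(text: str) -> str:
    """Collapse CRLF/CR to LF and strip stray U+FEFF / U+FFFE markers."""
    # Order matters: handle "\r\n" first so it does not become two newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\ufeff", "").replace("\ufffe", "")
```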
…lates Run the B2 benchmark and record results in docs/backend-benchmark.md. Headline: pdftotext is the accuracy anchor (85.9%); pypdfium2 is the best fast backend (42.9% in isolation, ~5x faster) and the cascade recovers the rest via fallback. With the four layout/area/table-sensitive bundled templates pinned to pdftotext, a pypdfium2-first cascade reaches the same 85.9% accuracy at ~1.5x the speed. Pin `input_module: pdftotext` on those templates (com.amazon.aws, nl.be.coolblue, fr.free.adsl-fiber, fr.publicationannoncelegale) — their line-item tables / area-based date need poppler's -layout, which pypdfium2 cannot reproduce. No behaviour change under the current pdftotext-first default; this readies them for a future default flip.
A template's `input_module:` pin is a default-mode hint, but it was also honoured when a backend was explicitly forced (`extract_data(..., "pdfium")` / `--input-reader`). That silently re-routed a forced backend to the pinned one — surprising for the API and, worse, it made the benchmark unable to measure a backend on pinned templates (it secretly used pdftotext). Gate the pin behind auto mode (`input_module is None`). An explicit backend is now taken at face value. Add a test asserting a forced backend bypasses the pin.
Flip DEFAULT_INPUT_READERS to [pdfium, pdftotext]: the fast, dependency-light pypdfium2 backend now leads, with pdftotext (poppler -layout) as the fallback. The benchmark shows this keeps accuracy at the pdftotext level (85.9%) while running faster, because the cascade falls back automatically. Add a soft-completeness signal: if a matched template declares a lines/tables block (or a `parser: lines` field) but the backend produced no line items, keep the result as a fallback and try the next backend, which may recover the table. This auto-recovers layout-less backends that return an empty table (e.g. AWS, free_fiber) without a per-template pin. Pin the remaining area template (nl.buijtendijk) — pypdfium2 cannot do area extraction at all. Document the flip and the (small) migration for template authors whose non-required field comes back populated-but-wrong. Remaining manual pins are only for "populated-but-wrong" degraders (area fields, column-aligned tables); total failures and empty tables are handled automatically by the cascade.
Add a new optional `camelot` plugin that detects ruled/whitespace-aligned tables by re-reading the PDF with camelot-py (the current read_pdf API, mapping each table to a list of dicts under a configurable output field). It is opt-in: install `invoice2data[camelot]`, add a top-level `camelot:` block to a template, and the plugin self-excludes via is_available() when camelot is absent. Plugin interface: plugins now receive the source `invoice_file` (the C1 prerequisite — text plugins ignore it, path-based ones like camelot need it). lines/tables gain the (ignored) parameter; the doc is updated. Packaging: the published camelot-py requires pdfminer.six>=20240706 while pdfplumber==0.11.4 hard-pins ==20231228 — a genuine library conflict — so the `camelot` extra is declared mutually exclusive with the `pdfplumber` and `pdfminer-six` extras via `[tool.uv] conflicts`. camelot is treated as Any by mypy (untyped, implicit re-exports). Tests: header/no-header row mapping (always run) + a lattice-table integration test on a bundled fixture (skipped when camelot is absent). The fixture is kept out of the generic sample sweep via inputparser_specific.
Normalize line-item / tax-line output keys to one vocabulary, two ways: (B) Output normalization layer: schema.normalize_line_fields() maps non-canonical keys in lines/tax_lines to canonical ones at extraction time (description→name, unit_price/unitprice→price_unit, vat_rate/tax_percent→line_tax_percent). Wired into InvoiceTemplate.extract before tax computation. So any template (incl. community ones) emits the standard vocabulary with no manual change. (A) Tidy the bundled templates that used those aliases (14 of them) to the canonical names directly, so they ship as good examples. Update the one affected golden (AmazonWebServices: description→name). Corrected against the OCA Odoo v14 module source: `product` is a DISTINCT line field (product matching), NOT a synonym for `name`, so it is added to the canonical LINE_FIELDS (with `taxes`) and never aliased — the 17 `product` templates are already correct. A line's label is `name`; `description` is only an invoice-level field, hence description→name. Docs: add `product`/`taxes` to the recommended line-fields table, clarify `name` is the label, note `description` is an accepted alias, and fix the misleading Example Usage (it used `(?P<description>)` for a line item). Tests: normalize_line_fields cases (aliases, product-not-aliased, canonical wins). All 215 templates still load; suite green.
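A sketch of the alias mapping, using only the aliases listed above. The dict name `LINE_ALIASES` and the exact precedence handling are assumptions about how `schema.normalize_line_fields()` could work.

```python
# Alias -> canonical key, taken from the list in the commit message.
LINE_ALIASES = {
    "description": "name",
    "unit_price": "price_unit",
    "unitprice": "price_unit",
    "vat_rate": "line_tax_percent",
    "tax_percent": "line_tax_percent",
}


def normalize_line_fields(line: dict) -> dict:
    """Rename aliased keys; an existing canonical key always wins."""
    out = {k: v for k, v in line.items() if k not in LINE_ALIASES}
    for alias, canonical in LINE_ALIASES.items():
        if alias in line and canonical not in out:
            out[canonical] = line[alias]
    return out
```

Note that `product` passes through unchanged, since it is a distinct canonical field and not an alias of `name`.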
Establish the copier link to the uv-forge template: - Add .copier-answers.yml (gh:bosd/uv-forge @ b387b46) so `copier update` works going forward; retire the old .cookiecutter.json. Switch the build backend setuptools -> hatchling (what uv-forge generates; the build-backend change is done as its own validated step, ahead of mypyc): - [tool.hatch.build.targets.wheel] packages = ["src/invoice2data"] — the wheel ships the package + py.typed + the 215 bundled templates and nothing else (test fixtures live outside src/, so no PDFs/PNGs in the wheel). - [tool.hatch.build.targets.sdist] excludes the heavy binary test fixtures (PDFs/PNGs) and build/cache cruft, keeping the sdist lean (~328K vs ~1.5M). Bump the typing toolchain to uv-forge's versions (mypy >= 1.13) and add the `ty` group (Astral's type checker — the template switch the ty follow-up was waiting on). typeguard/xdoctest bumped to match. Deliberately deferred to separate validated steps (kept current for safety): the expanded ruff rule set + ruff bump, coverage fail_under=100, the docs theme (furo->shibuya/sphinx 8), and regenerating the workflows (the auto-publishing release flow + CI system deps stay as-is).
The uv-forge toolchain bump (mypy >= 1.13 -> resolves 2.x, + the new `ty`
checker) surfaced real typing issues; fix them so both checkers pass.
mypy 2.x:
- Drop the now-redundant `cast(dict[str, Any], tpl)` in loader.py.
- Guard ColorLogFormatter: `LOG_LEVEL_COLOR.get(level, {}).get(...)` so an
unknown log level can't hit None (latent crash).
- Route the defusedxml/ocrmypdf optional imports through the mypy
ignore_missing_imports overrides; drop their inline ignores.
ty (newly adopted — the template switch the ty follow-up waited for):
- Type the line/table extractors' `self`/`template` as `InvoiceTemplate`
(via a TYPE_CHECKING import) so `coerce_type`/`parse_date`/`parse_number`
resolve in both checkers; this lets us delete the `# type: ignore
[attr-defined]` suppressions in tables.py and parsers/lines.py.
- `# ty: ignore[unresolved-import]` on the guarded optional imports
(ocrmypdf, defusedxml) that ty can't see when the extra isn't installed.
- Replace deprecated `codecs.open` with builtin `open`.
- Build a real InvoiceTemplate in the deprecation test.
Green: mypy --strict (3.10 + 3.13), `ty check src`, ruff, pydoclint,
79 passed / 3 skipped.
Continue the uv-forge template adoption: - ty gate: add a `ty` nox session (checks src; mypy stays authoritative) + a 3.13 CI job + `nox.options.sessions`. `[tool.ty.rules] unresolved-import = "ignore"` because invoice2data has conflicting optional backends ty can't all resolve. Drop the now-redundant inline `# ty: ignore` comments. - Release model: replace the salsify token auto-publish with uv-forge's flow — on push to master: draft release notes + publish to TestPyPI; on a published GitHub Release: build + provenance attestation + sigstore signing + publish to PyPI via Trusted Publishing (no long-lived token). NOTE: needs PyPI Trusted Publishing configured for invoice-x/invoice2data + a TEST_PYPI_TOKEN secret before the next master release. - Docs theme furo -> shibuya (in the pyproject docs group AND docs/requirements.txt for RTD AND conf.py), which requires sphinx >= 8 + myst-parser >= 4; sphinx-mermaid / rsvg badge handling preserved. HTML build verified. - Fix pre-commit: trailing blank line in .copier-answers.yml (end-of-file). Green: ty (src), mypy --strict (3.10/3.13), 79 passed, docs HTML build, workflows valid YAML.
…ild) The README badges were pulled into the docs via the README include, and the LaTeX/PDF builder ran them through rsvg -> PDF, which produced corrupt / PDF-1.7 badge PDFs (pdflatex caps included PDFs at 1.5) AND left extra PDFs in the output dir -> RTD "Build output directory contains multiple files". Add a `<!-- docs-body -->` marker after the badge block and `start-after` it in docs/index.md (with an explicit page title), so the badges stay on GitHub/PyPI but never enter the Sphinx build. HTML verified: badges gone, title + content intact, build succeeds. PDF/LaTeX verifies on RTD.
In regex.parse, `result` starts as a list of matches then becomes a coerced scalar/grouped value. Annotate it `Any` so mypyc doesn't strict-check it against the initial list type (it raised TypeError at runtime on the compiled build). No behaviour change; mypy + suite stay green. Part of C2 (mypyc).
Add optional mypyc compilation of the hot-path leaf modules (extract/utils, _regex, parsers/regex, parsers/lines, plugins/tables) — the codebase is mypy --strict clean so mypyc compiles them; the suite passes against the compiled build (the regex.parse `result: Any` fix in e1a479a was the only type-precision change needed). Following uv-forge's mypyc extension: - Switch build-backend hatchling -> setuptools; add a conditional setup.py that mypycifies the modules only when INVOICE2DATA_COMPILE_MYPYC=1, so the default `uv build`/sdist install stays pure Python. Build requires list the deps mypyc needs to see types (click/dateparser/PyYAML/regex/mypy). - Restore [tool.setuptools] package-data (py.typed + templates) — wheel/sdist stay clean (no PDFs/tests). - release.yml: build cross-platform compiled wheels with cibuildwheel (linux x86_64/aarch64, windows, macos intel/arm) + a pure-Python sdist on a published GitHub Release, then sign + publish via Trusted Publishing. Keep the TestPyPI-on-push staging step. [tool.cibuildwheel] skips PyPy/musllinux and smoke-tests that the compiled extensions import on every platform. - .copier-answers: extension none -> mypyc. Verified locally: default build = pure-Python py3-none-any wheel (215 templates, 0 .so); INVOICE2DATA_COMPILE_MYPYC=1 build = cp313 wheel with 6 .so that installs and extracts correctly. Cross-platform wheels verify in CI (cibuildwheel runs only on release).
- A6 conformance: declare SUPPORTS_AREA = False (gvision has no area mode; is_available alias was already present).
- Wire the previously-dead `language` arg into the request via vision.ImageContext(language_hints=[language]).
- Replace print() with logger.info for the "waiting for OCR" message.
- Harden: raise a clear OSError when Vision returns no result blob (was an AttributeError on None).
- Document the modern bucket-free alternative (Google Document AI OCR processor; cf. OCA account_invoice_google_document_ai) as a future backend.
Mocked tests still pass; mypy/ruff/pydoclint clean. The live GCS/Vision flow remains creds-gated (mocked test is the CI gate).
Adopt the larger uv-forge lint families on top of the existing set: ASYNC, C4, FURB, LOG, PERF, PIE, RET, SIM, TRY (ruff bumped >=0.8 -> resolves 0.15.14; FURB/LOG need >=0.8). Fixed the ~38 surfaced violations (auto + manual): drop superfluous else-after-return (RET505), collapse same-arm/nested ifs (SIM102/114), list->generator in all()/any() (C419), ternaries (SIM108), `key in dict` (SIM118), return-without-temp (RET504), drop unused unpacked vars (RUF059), logging.exception in except (TRY400). One PERF203 noqa for the intentional per-file error-isolation try/except in the CLI loop. Deferred (documented in extend-ignore): - PTH (~70 os.path -> pathlib sites) — a focused follow-up migration. - TRY003 (long messages in raise) — would need custom exception classes. ruff format (0.15) reflowed 3 test files. Suite green (78 passed); mypy/pydoclint/ty clean.
Convert the 17 production os.path/open sites to pathlib.Path: open() -> Path.open(), os.path.exists -> Path.exists, os.path.join -> Path/"x", os.path.dirname -> Path.parent, os.path.abspath -> Path.resolve, os.path.basename -> Path.name. Drop now-unused `import os` / `from os.path import join` in __main__ and pdftotext; keep os in loader (os.walk) and gvision (os.getenv). Removes the global PTH ignore. Test scaffolding (~55 fixture sites, mostly test_cli.py) is exempted via a `tests/** = ["PTH"]` per-file-ignore since pathlib there is churn with no runtime benefit. Suite green (78 passed); ruff/mypy/pydoclint/ty clean.
…s) + docs Templates can already be loaded from a string instead of disk (extract_data(file, templates=ordered_load(db_text))), the Odoo "templates from the DB" path. But ordered_load's `loader` param was dead — the body hardcoded json.loads, so YAML-from-string silently failed. - Use the supplied `loader` (default json.loads; pass yaml.safe_load for YAML); broaden the parse-error catch to (ValueError, YAMLError); generic warning. - prepare_template: tolerate a streamed template lacking template_name in the missing-keywords warning (was a KeyError). - Add a YAML-stream test; document the string/DB pattern in the README library section. Suite green (79 passed); ruff/mypy/pydoclint clean.
…iguation (AUTH-2b) First piece of the template-authoring toolkit: a dependency-free validator layer that classifies a captured value by *validating* it, not by regex alone — since field-type patterns overlap (an IBAN and an EU VAT number can match each other). - validate_iban: ISO 13616 structure + mod-97 checksum (whitespace/case tolerant). - validate_vat: EU VAT format check (full country set; format, not per-country checksum — python-stdnum can layer on later). - validate_bic: ISO 9362 8/11-char format. - classify_identifier: tries validators strongest-first (IBAN checksum before VAT/BIC format) so an IBAN is never mistaken for a VAT number. New extract/validators.py + 26 tests. Will feed AUTH candidate typing/ranking, AI-1 drafting, and an optional runtime soft-validation. Suite green.
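For reference, the mod-97 check works by moving the first four characters to the end, mapping letters to numbers (A=10 ... Z=35) and testing that the resulting integer is congruent to 1 modulo 97. The sketch below is a simplified take on such a validator: it checks the generic structure only and omits per-country length tables, which the real `validate_iban` may or may not include.

```python
import re


def validate_iban(value: str) -> bool:
    """Structure check plus ISO 13616 mod-97 checksum (whitespace/case tolerant)."""
    iban = re.sub(r"\s+", "", value).upper()
    # Country code, two check digits, then 11-30 alphanumeric BBAN characters.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```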
…AUTH-1) New extract/candidates.py: scans extracted text for typed candidates with their positions, the engine the copier-style CLI builder and AI template generation will consume. - find_dates (dateparser-validated), find_amounts (separator-aware float parse), find_identifiers (IBAN/VAT/BIC, typed via the AUTH-2b validators; two-pass so a label like "IBAN" never swallows the number, and space-grouped IBANs are caught). - find_candidates merges them, drops amounts that fall inside a date (the 12.05 in 12.05.2024), sorted by position. Candidate dataclass carries kind/value/start/end/parsed. 6 tests; suite green.
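The "drop amounts that fall inside a date" merge can be sketched with two plain regexes. These patterns are deliberately naive stand-ins; the commit describes dateparser-validated dates and a separator-aware amount parser, neither of which is reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str
    value: str
    start: int
    end: int


# Simplified stand-in patterns (assumption, not the real finders).
DATE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")
AMOUNT_RE = re.compile(r"\b\d+[.,]\d{2}\b")


def find_candidates(text: str) -> list:
    dates = [Candidate("date", m.group(), m.start(), m.end())
             for m in DATE_RE.finditer(text)]
    amounts = [Candidate("amount", m.group(), m.start(), m.end())
               for m in AMOUNT_RE.finditer(text)]
    # An amount whose span lies inside a date span is a false positive
    # (the 12.05 in 12.05.2024), so it is dropped.
    kept = [a for a in amounts
            if not any(d.start <= a.start and a.end <= d.end for d in dates)]
    return sorted(dates + kept, key=lambda c: c.start)
```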
added 15 commits
May 24, 2026 20:30
…rop rework) Area extraction had zero golden coverage, which made the planned "parse once + crop in Python" optimization (#2) unverifiable/risky. These tests lock the current pdftotext area contract on a real PDF (oyo.pdf): an `area` returns ONLY the text inside the requested rectangle (header band present; content lower on the page excluded; smaller than the full page). When #2 is implemented, the Python crop must satisfy the same assertions -- so it can be done safely. Skipped when pdftotext (poppler) is absent. Suite green (171 passed).
Date parsing was ~58% of extraction time. New extract/_dates.py parses fastest-applicable-tier first: 1. the template's date_formats via stdlib strptime (microseconds, deterministic) 2. dateutil (fast, fuzzy, English-centric) 3. dateparser (multilingual / localized month names) -- now OPTIONAL - dateparser moved from core deps to the [dateparser] extra; python-dateutil added to core. Without dateparser, numeric/English dates still parse (tiers 1-2); localized month names need `pip install invoice2data[dateparser]`. - Centralized date parsing in _dates.parse_date (lru_cached). invoice_template, candidates and ai/fallback all route through it; no module imports dateparser at top level any more (it's lazily, guardedly imported in _dates only, returning None when absent). Benchmark (11 compare PDFs x3): parse_date cumulative 0.685s -> ~0 (compare templates set date_formats -> strptime tier); wall ~1.19s -> 0.80s. Goldens unchanged. 5 tests (each tier + the dateparser-absent path). Suite green (175).
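A simplified stand-in for the tiered parser, showing the fastest-first ordering and the lazy, guarded imports. The signature and caching details are assumptions; the real `_dates.parse_date` may differ.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=4096)
def parse_date(value: str, formats: tuple = ()):
    """Try strptime, then dateutil, then dateparser; None if nothing parses."""
    # Tier 1: the template's explicit formats (deterministic, stdlib only).
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    # Tier 2: dateutil, imported lazily so this sketch runs without it.
    try:
        from dateutil import parser as dateutil_parser
    except ImportError:
        dateutil_parser = None
    if dateutil_parser is not None:
        try:
            return dateutil_parser.parse(value, fuzzy=True)
        except (ValueError, OverflowError):
            pass
    # Tier 3: dateparser is optional; without it, give up gracefully.
    try:
        import dateparser
    except ImportError:
        return None
    return dateparser.parse(value)
```

`formats` has to be a tuple rather than a list so that the arguments stay hashable for `lru_cache`.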
…eparser - noxfile tests session now also syncs the `ai` and `dateparser` extras so the httpx-mocked AI request test and the dateparser tier-3 date test run in CI (both importorskip otherwise). - README: document the tiered date parsing (strptime -> dateutil -> dateparser) and that dateparser is now an optional extra for localized dates.
Two hotspots from re-profiling (after the date fast-path): - _match_template re-sorted all ~215 templates on every call (per reader, per cascade step). Now sorted once via _by_priority() in extract_data; _match_template iterates the pre-sorted list. (#4 from the perf list) - _check_required_fields built pformat(output) as a logger.debug ARGUMENT, so it ran on every successful extraction even with debug disabled. Guarded with logger.isEnabledFor(DEBUG). Warm benchmark (11 compare PDFs x3): ~0.174s -> 0.133s. (The earlier enum/regex compile overhead was dateparser's internals -- gone with the date fast-path; we use stdlib `re` by default, so "re vs regex" is moot.) Suite green (175).
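The debug-logging hotspot is a common pitfall: arguments to `logger.debug(...)` are evaluated before the logger decides whether to emit anything. A small sketch of the guard follows; the logger name and function are illustrative.

```python
import logging
from pprint import pformat

logger = logging.getLogger("invoice2data.sketch")


def log_output(output: dict) -> None:
    # pformat() is only evaluated when DEBUG is actually enabled; as a bare
    # argument to logger.debug() it would run on every call regardless.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("extracted fields:\n%s", pformat(output))
```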
…otext) Area (region) fields re-ran `pdftotext -x -y -W -H` per area (a subprocess each). Now pdftotext.py reads word positions once via `pdftotext -bbox-layout` (cached per file by mtime) and crops the requested rectangle in Python: - _words(): parse + cache the bbox-layout word boxes. - _crop(): convert the area (pixels at dpi r) to points, select words in the rect, group into lines, join left-to-right. - to_text(area) routes through them; full-page extraction is unchanged. Proven: 3 distinct areas on one doc -> 1 subprocess (was 3). Verified by the area harness (region content present, out-of-region excluded) + a new end-to-end test (cropped text feeds a field regex -> correct value). Suite green (176). NOTE: output is single-space-joined, not pdftotext -layout column spacing. The 2 built-in area templates (nl.buijtendijk, fr.publicationannoncelegale) have no test PDFs and couldn't be auto-verified -- their \s-based regexes should still match, but worth a manual check.
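A rough sketch of the Python-side crop over already-parsed word boxes. The `Word` type, the "fully inside the rectangle" selection rule and the line tolerance are assumptions; the parsing and caching of `pdftotext -bbox-layout` output is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x_min: float  # all coordinates in PDF points, top-left origin
    y_min: float
    x_max: float
    y_max: float


def crop(words, x, y, w, h, dpi=72, line_tol=2.0) -> str:
    """Select words inside the area (pixels at `dpi`) and rebuild lines."""
    scale = 72.0 / dpi  # pixels -> points
    left, top = x * scale, y * scale
    right, bottom = (x + w) * scale, (y + h) * scale
    inside = [wd for wd in words
              if wd.x_min >= left and wd.x_max <= right
              and wd.y_min >= top and wd.y_max <= bottom]
    inside.sort(key=lambda wd: (wd.y_min, wd.x_min))
    lines = []
    for wd in inside:
        # Words whose tops are within the tolerance share a line.
        if lines and abs(lines[-1][0].y_min - wd.y_min) <= line_tol:
            lines[-1].append(wd)
        else:
            lines.append([wd])
    return "\n".join(
        " ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x_min))
        for line in lines
    )
```

As the note above says, the result is single-space-joined, so regexes that depend on `-layout` column spacing would need `\s+` rather than fixed widths.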
pypdfium2 can extract a bounded region in-process, so area/region fields no longer require the poppler pdftotext binary on the default backend. - SUPPORTS_AREA = True for pdfium. - _crop_pages(): convert the area (pixels at dpi r, top-left origin) to PDFium's points / bottom-left coords (pt = px*72/r, y flipped by page height) and call get_text_bounded(left, bottom, right, top) per page in the range. The 2 built-in area templates pin input_module: pdftotext, so they're unaffected; this adds area support to the pdfium-first cascade for new templates. PDFium text isn't byte-identical to pdftotext -layout, so an area template targets one backend. Test: pdfium area crop on oyo.pdf (region present, out-of-region excluded). Suite green (177).
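The coordinate conversion is the error-prone part, so here it is in isolation. The helper name is hypothetical; the returned tuple is ordered to match the `get_text_bounded(left, bottom, right, top)` call named above, which is not invoked in this sketch.

```python
def area_to_pdf_points(x, y, w, h, dpi, page_height_pt):
    """Convert a top-left-origin pixel rectangle to bottom-left-origin points."""
    scale = 72.0 / dpi  # pt = px * 72 / dpi
    left = x * scale
    right = (x + w) * scale
    # Flip the y axis: pixel y grows downward, PDF y grows upward.
    top = page_height_pt - y * scale
    bottom = page_height_pt - (y + h) * scale
    return left, bottom, right, top
```

For example, a 300x150 px header band at 300 dpi on an 842 pt tall page maps to the rectangle from (0, 806) to (72, 842) in PDF points.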
Product lines that carry a line_tax_code are now enriched with the line_tax_percent from the matching tax_lines summary row, and their line_tax_amount is computed from price_subtotal. Existing line values and code-less lines are left untouched. Completes the second task of issue #535 (the first, documenting tax_lines, landed via #536).
When invoice lines carry no tax code but the tax_lines summary has exactly one active (non-zero) rate, apply that rate to every line and compute line_tax_amount. Mixed-rate summaries remain ambiguous and are left untouched (use the tax_lines global-adjustment fallback). Adds _to_float/_set_line_rate/_single_active_rate helpers and 7 tests.
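The single-rate fallback can be sketched as below. The function names echo the helpers mentioned above but the bodies are assumptions, including the rounding to two decimals and the rule that lines carrying a tax code are skipped.

```python
def single_active_rate(tax_lines):
    """Return the only non-zero rate in the summary, or None if ambiguous."""
    rates = set()
    for row in tax_lines:
        if not isinstance(row, dict):
            continue  # non-dict rows are skipped
        try:
            rate = float(row.get("line_tax_percent"))
        except (TypeError, ValueError):
            continue
        if rate != 0:
            rates.add(rate)
    return rates.pop() if len(rates) == 1 else None


def apply_single_rate(lines, tax_lines):
    rate = single_active_rate(tax_lines)
    if rate is None:
        return  # zero or mixed rates: leave lines untouched
    for line in lines:
        if "line_tax_code" in line:
            continue  # coded lines are matched per code elsewhere
        line.setdefault("line_tax_percent", rate)  # never overwrite
        if "line_tax_amount" not in line and "price_subtotal" in line:
            # Rounding to 2 decimals is an assumption of this sketch.
            line["line_tax_amount"] = round(
                float(line["price_subtotal"]) * line["line_tax_percent"] / 100, 2
            )
```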
Provide an environment.yml so conda/mamba users get invoice2data with the common PDF backends and the OCR system tools (poppler, tesseract, ghostscript) from conda-forge in one step, no manual system-library install. Documented in installation.md.
mypy (src tests docs/conf.py):
- annotate test_issue_535 output dicts as dict[str, Any] (heterogeneous literals inferred as object, not indexable)
- narrow _match_template's InvoiceTemplate|None before indexing in test_issue_608
- yaml import-untyped ignores in test_loader/test_template_builder (matching src/template_builder pattern); arg-type ignore in test_lines_replace stub call
xdoctest:
- extract_data docstring example now includes the template_name key added to default output (#618)
pre-commit (ruff, expanded ruleset):
- benchmarks/run.py: open()->Path.open() (PTH123), try/except/pass->contextlib.suppress (SIM105)
- docs/conf.py: os.path.abspath()->Path.resolve() (PTH100), drop os
to_json/to_csv/to_xml write_to_file doctests wrote invoice.{json,csv,xml}
to the cwd and never cleaned up (xdoctest left them in the repo root).
Write to a tempfile dir and assert path.exists() instead. Add an
anchored .gitignore safety net for the default output names.
- static.py 73->100%: missing-value warning path
- output/__init__.py 92->100%: /dev/stderr stream alias
- to_xml.py 79->94%: date + list value branches in dict_to_tags
- invoice_template: cover _apply_tax_to_lines non-list guards and _single_active_rate non-dict-row skip
6 new tests (191 passing); overall coverage steady at 89%.
Fixes the lone 'document isn't included in any toctree' warning so the shibuya docs build is warning-free.
CI:
- coverage job posts a PR coverage comment via py-cov-action/python-coverage-comment-action (GITHUB_TOKEN only, no external account/secret); badge published to its data branch on master
- companion coverage-comment.yml (workflow_run) posts comments for fork PRs; drop the codecov step (needed an org CODECOV_TOKEN + had a misconfigured file: input)
- coverage.run.relative_files=true so the action maps paths
Coverage (90% total):
- _regex.py 85->100% (engine-selector env toggle)
- loader.py 88->95% (malformed-JSON skip + missing-keywords path)
added 14 commits
May 25, 2026 01:02
Mock the tesseract/ImageMagick/pdftotext binaries + Popen pipeline so the backend is fully exercised without real OCR tooling or image fixtures (is_available, PDF + image paths, area_details, both timeout branches, get_languages parse/error paths). Also fix a latent bug: when pdftotext timed out, extracted_str was never assigned -> UnboundLocalError at return; initialise it to b"" so a timeout returns empty text and just logs a warning. 13 tests; tesseract.py 100%; overall coverage 90%->92%.
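The timeout fix described above follows a common pattern: bind the result before the call that may time out, so the return statement always has a value. The following is a minimal sketch of that pattern only, not the backend's actual code; `run_with_timeout` and its arguments are hypothetical names chosen for illustration.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_with_timeout(cmd, timeout):
    """Run cmd and return its stdout, or b"" if it times out (hypothetical helper)."""
    # Initialise first: without this line the except branch below would leave
    # `extracted` unbound and the return would raise UnboundLocalError.
    extracted = b""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        extracted, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed process
        logger.warning("command timed out after %s s: %r", timeout, cmd)
    return extracted


if __name__ == "__main__":
    # A child that sleeps longer than the timeout yields empty output.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout=0.2))
```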
Mock the camelot module + read_pdf so extract()'s logic (rule list, kwarg forwarding, header/field/tables handling, table selection, read_pdf failure, not-installed + no-file guards) is exercised in CI where camelot-py is absent. Plus a real, skipif-camelot integration test against the bol.com invoice. Salvage tests/files/camelot-bol100649863.pdf from the bosd/camelot-test-files branch (camelot-example.pdf there was identical to the existing one). The 'camelot-' name prefix keeps it out of the get_sample_files sweep (inputparser_specific), so test_copy stays exact. 11 tests; camelot.py 100%.
…banner
- reference.md: document the whole current surface: pdfium (default), doctr/paddleocr/hotpdf/pdfoxide/tesseract, camelot, the ai/* subsystem, schema/validators/candidates/suggestions/_dates/_regex/template_builder, output streams, backend registry. Library API section for extract_data. (:no-index: on InvoiceTemplate works around an autodoc dup with the OrderedDict[str, Any] generic base; loader gets __all__ so the imported class isn't re-documented.)
- how-it-works.md: refreshed mermaid (pdfium default, docTR/paddleocr, cascade, schema/validation, optional AI), internal cross-refs instead of fragile GitHub-master links, MyST header.
- installation.md: pdfium is the default (no system deps); poppler/OCR are optional; add extras table + docTR/PaddleOCR; MyST headers.
- faq.md: fix typos, replace the dated 'Gemini chat' section with the built-in AI (--new-template --ai, --ai-fallback); add a 'Comparison with other tools' section; MyST header.
- new docs/ai.md (provider config + the three AI entry points), in toctree.
- README: add SVG banner, drop broken 'Test' badge, swap codecov->coverage badge, fix run-on intro + stale backend list + stale output example.
- new docs/_static/banner.svg.
Docs build is warning-free.
The README doubled as the docs landing page and had grown to ~370 lines. Move the detailed CLI how-to into docs/usage.md (input readers, output streaming, --copy, debug flags, AI, library use) and rely on the already comprehensive docs/tutorial.md for the template system; the README now keeps a short Quickstart + a Documentation index and the repo-meta sections. Net: 369 -> 150 lines. Also migrated the two genuinely newer notes into tutorial.md: the fastest-first date parsing tiers (strptime->dateutil->dateparser, with dateparser optional) and the area-extraction DPI math + supported backends. Added reference-style link targets in both README (github-only block) and docs/index.md so the shared body resolves in both contexts. Docs build is warning-free.
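The fastest-first date parsing tiers mentioned above (strptime, then dateutil, then dateparser, with dateparser optional) can be sketched roughly as follows. This is an illustration of the tiering idea under assumptions, not the project's implementation: `parse_date` and its default format list are hypothetical, and in this sketch both third-party libraries are treated as optional.

```python
from datetime import datetime


def parse_date(value, formats=("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")):
    """Try the cheapest parser first and fall through to slower ones (hypothetical helper)."""
    # Tier 1: explicit strptime formats (stdlib, fastest).
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue

    # Tier 2: dateutil's generic parser, if installed.
    try:
        from dateutil import parser as dateutil_parser
    except ImportError:
        dateutil_parser = None
    if dateutil_parser is not None:
        try:
            return dateutil_parser.parse(value)
        except (ValueError, OverflowError):
            pass

    # Tier 3: dateparser (slowest, most lenient), only if installed.
    try:
        import dateparser
    except ImportError:
        return None
    return dateparser.parse(value)  # returns None when nothing matches
```

Inputs that match one of the explicit formats never reach the slower tiers, which is where the ordering is meant to pay off.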
Replace the old logo.png with a vector logo.svg in the same style as the new banner — a self-contained dark app-icon tile (invoice card + accent bar + extraction arrow) that stays crisp at any size and reads on both light and dark sidebars. Point html_logo at it; drop the orphaned PNG. Docs build is warning-free.
- Fix the one real latent typing issue both ty and mypy flag when the ocrmypdf extra is installed: annotate OPTIONS_DEFAULT as dict[str, Any] so **ocrmypdf_conf matches ocrmypdf.ocr()'s typed kwargs. Type-checkers are now clean with or without the extra present.
- Pin ty (== 0.0.39): beta + fast-moving, so a new release can't silently add diagnostics that break CI (same approach as the pinned ruff).
- Document why ty stays on src (it doesn't honour mypy's # type: ignore in the test fixtures); mypy remains authoritative for tests.
ty check src: All checks passed. mypy: 93 files clean.
The Windows job was disabled due to flaky Chocolatey system-dep installs. Re-enable it, but make it green whether or not the binaries land:
- Windows system-dep install is now best-effort (continue-on-error); the default backend (pypdfium2) needs no system binaries.
- Skip-guard the poppler-dependent tests when pdftotext is absent: the whole TestCLI golden suite (templates pin input_module: pdftotext) and test_lib.test_extract_data_pdftotext. tesseract/area tests were already guarded; the OCR/DL backend tests are fully mocked.
So Windows validates the pure-Python core + pypdfium2 path reliably, and runs the full golden suite too when choco succeeds. Linux/macOS unchanged (217 passed, 5 skipped; guards are no-ops when poppler is present).
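A skip-guard of the kind described above can be sketched with the standard library alone. This is an assumption-laden illustration: the repository's own tests may use a different framework or marker, and `HAS_PDFTOTEXT` and `TestNeedsPoppler` are hypothetical names.

```python
import shutil
import unittest

# True only when the poppler `pdftotext` binary is on PATH.
HAS_PDFTOTEXT = shutil.which("pdftotext") is not None


@unittest.skipUnless(HAS_PDFTOTEXT, "pdftotext (poppler) is not installed")
class TestNeedsPoppler(unittest.TestCase):
    """Stand-in for a suite whose templates pin input_module: pdftotext."""

    def test_binary_on_path(self):
        self.assertIsNotNone(shutil.which("pdftotext"))


if __name__ == "__main__":
    unittest.main()
```

When the binary is present the guard is a no-op and the tests run normally. When it is absent the whole class is reported as skipped rather than failed.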
Add tests for CLI helpers that lacked coverage: the three log formatters (Color/Plain/JSON), _default_template_path slugification, _preferred_module unknown/unavailable input_module warnings, and _run_new_template's no-text SystemExit. 8 tests; overall coverage 93%.
conda-recipe/meta.yaml + README. mypyc compilation is opt-in (INVOICE2DATA_COMPILE_MYPYC=1, off by default), so a plain pip install is pure-Python and the package ships as a single noarch build — no compiler, no wheel matrix. The README documents the staged-recipes submission flow (fill sha256 from the released sdist, open the PR, feedstock is created). Recipe renders + parses; conda-build validation noted.
extract_data(..., raise_on_error=True) now raises instead of returning {}:
- NoTemplateFoundError when nothing matched
- RequiredFieldsMissingError (with .fields + .template_name) when a template
matched but a required field couldn't be parsed
New invoice2data/exceptions.py: InvoiceProcessingError base; the two errors
subclass it, and RequiredFieldsMissingError also subclasses ValueError so the
cascade's existing 'except ValueError' retry is unaffected. _check_required_fields
now raises the typed error; the cascade records missing-field reasons and the
boundary re-raises the most specific one. Default stays {} (non-breaking).
Re-exported from invoice2data; documented (reference/usage/migration-1.0).
4 tests; full suite 229 passed; mypy/ty/ruff/pydoclint clean.
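The exception hierarchy described above can be sketched as below. The class names and the .fields / .template_name attributes come from the commit message; the constructor signature and message text are assumptions made for illustration, not the contents of invoice2data/exceptions.py.

```python
class InvoiceProcessingError(Exception):
    """Base class for extraction failures."""


class NoTemplateFoundError(InvoiceProcessingError):
    """No template matched the document."""


class RequiredFieldsMissingError(InvoiceProcessingError, ValueError):
    """A template matched but a required field could not be parsed.

    Also a ValueError, so an existing `except ValueError` retry still catches it.
    """

    def __init__(self, fields, template_name):
        # Signature is hypothetical; only the two attributes are from the source.
        super().__init__(
            f"template {template_name!r} is missing required fields: {', '.join(fields)}"
        )
        self.fields = list(fields)
        self.template_name = template_name


if __name__ == "__main__":
    try:
        raise RequiredFieldsMissingError(["amount"], "acme")
    except ValueError as exc:  # the legacy handler keeps working
        print(type(exc).__name__, exc.fields, exc.template_name)
```

Callers that want a single catch-all can handle InvoiceProcessingError, while code written against the older ValueError behaviour is unaffected.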
Invoices are full of predictable 'label: value' pairs. New extract/labels.py recognises known labels with multilingual synonyms (BTW/VAT, KvK/CoC/Chamber of Commerce, Invoice No/Factuurnummer, IBAN, BIC, Due Date, ...) and captures the value next to them. This complements the value-pattern candidates: the label both disambiguates the value (a CoC number is just digits — only the label says what it is) and gives a robust regex anchor. suggested_template now merges these in, so --new-template drafts label-only fields (partner_coc, invoice_number) it previously couldn't, each anchored on its label. Also fixes a latent bug in the regex anchor builder: re.escape escapes spaces, so multi-word labels were mangled — now split + escape + join with \s+ (new _anchor helper, used by both code paths). 12 tests; full suite 234 passed; mypy/ty/ruff/pydoclint clean.
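The split + escape + join approach for multi-word labels can be sketched as follows. This is a minimal illustration under assumptions, not the project's _anchor helper: the function name `anchor` and the value pattern appended in the usage lines are hypothetical.

```python
import re


def anchor(label):
    """Build a regex anchor for a label, tolerant of variable whitespace.

    Each word is escaped on its own and the words are joined with \\s+,
    instead of calling re.escape on the whole label (which would escape
    the spaces themselves).
    """
    return r"\s+".join(re.escape(word) for word in label.split())


if __name__ == "__main__":
    pattern = anchor("Invoice No") + r"[:\s]*(\S+)"
    match = re.search(pattern, "Invoice   No: 12345")
    print(match.group(1))
```

Because the separator is `\s+`, the same anchor matches a label whose words are split by several spaces, a tab, or a line break in the extracted text.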
--new-template --interactive: copier-style review of each drafted field --
shows what it captures (after cleanup), keep / edit-regex / skip, then offer to
add fields the builder missed (_interactive_template + preview_field).
Capture-then-clean: label specs can carry a 'cleanup' so the drafted field is a
{regex, replace} dict that captures the noisy form and sanitises it -- VAT keeps
its separators in the capture but strips them on output (NL12.34.56.789.B01 ->
NL123456789B01), and a CoC number drops an appended place (12345678 Amsterdam ->
12345678). suggested_template now prefers the labeled (anchored + cleaned) field
for ids and label-only fields, keeping the value heuristics for date/amount.
template_builder gains field_regex/set_field_regex/preview_field helpers (str or
dict fields); preview_template handles dict fields + replace. New --interactive
flag. 9 tests; full suite 240 passed; mypy/ty/ruff/pydoclint clean.
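The capture-then-clean idea above can be illustrated with a field that carries both a capturing regex and a list of replacements. The dict shape below ({regex, replace} with [pattern, replacement] pairs), the two patterns, and the `apply_field` helper are assumptions for illustration; the drafted templates may use different patterns and a different replace format.

```python
import re

# Hypothetical drafted fields: capture the noisy form, then sanitise it.
vat_field = {
    "regex": r"VAT\s*:?\s*(NL[\d.]+B\d{2})",
    "replace": [[r"\.", ""]],  # strip the separators on output
}
coc_field = {
    "regex": r"KvK\s*:?\s*(\d{8}(?:\s+[A-Za-z]+)?)",
    "replace": [[r"\s+[A-Za-z]+$", ""]],  # drop an appended place name
}


def apply_field(field, text):
    """Return the cleaned capture for a {regex, replace} field, or None if no match."""
    match = re.search(field["regex"], text)
    if match is None:
        return None
    value = match.group(1)
    for pattern, replacement in field.get("replace", []):
        value = re.sub(pattern, replacement, value)
    return value


if __name__ == "__main__":
    print(apply_field(vat_field, "VAT: NL12.34.56.789.B01"))
    print(apply_field(coc_field, "KvK: 12345678 Amsterdam"))
```

Keeping the separators in the capture lets the regex match the value as printed, while the replace step normalises what ends up in the output.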
The two fast spike backends were exercised only by the benchmark; add mocked unit tests (registration, is_available, to_text via an injected fake module) that lock their contract without installing the optional deps. Both modules now 100% covered, promoting them from spikes to fully-committed backends.
Combined coverage is ~94%; the uv-forge template's old fail_under=75 no longer reflects the suite. Set it to 90 — a regression guard with a small buffer for cross-matrix variance (the template default of 100 is unrealistic here). PTH migration already landed (229ecf9); the publishing opt-in remains a uv-forge template TODO, not a change in this repo.
Long-lived integration branch for the 1.0.0 major release. Draft until ready; do not squash-merge to master until cutting 1.0.0 (that triggers the PyPI release).
Roadmap & rationale: see the approved plan. Phased: A foundation/breaking → B parsers/perf → C build/features.
Landed
extract/_regex.py); stdlib re default, regex opt-in via INVOICE2DATA_REGEX_ENGINE=regex. Golden suite passes under both engines.
Next (Phase A)
Then
(Python 3.10→3.11 floor = separate later minor. AI features = next session, pre-1.0.)