Add number extraction method and line skipping functionality by avonbuttlar · Pull Request #652 · invoice-x/invoice2data

avonbuttlar · 2025-09-19T10:29:31Z

Implemented extract_number_from_text to extract numeric values from strings with text.
Updated coerce_type to utilize the new extraction method for converting strings to int/float.
Added support for skipping lines based on provided patterns in the parse_block function.
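
For readers who want a concrete picture of the line-skipping idea, here is a minimal sketch. It is not the code from this PR: the helper name `drop_skipped_lines` and its parameters are hypothetical, and it only assumes that a block arrives as a list of text lines and that the skip patterns are regular expressions.

```python
import re
from typing import Iterable, List


def drop_skipped_lines(lines: Iterable[str], skip_patterns: Iterable[str]) -> List[str]:
    """Return the lines that match none of the skip patterns.

    Hypothetical helper for illustration only; the real parse_block
    integration in the PR may look different.
    """
    compiled = [re.compile(p) for p in skip_patterns]
    # Keep a line only if no pattern matches anywhere in it.
    return [ln for ln in lines if not any(rx.search(ln) for rx in compiled)]


block = [
    "Item A 10.00",
    "Page 1 of 3",
    "Item B 5.00",
    "--- carried forward ---",
]
kept = drop_skipped_lines(block, [r"^Page \d+ of \d+$", r"carried forward"])
print(kept)
```

Filtering before the line regexes run keeps page footers and carry-over rows from being parsed as invoice lines.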

Since I can't get the tests to run (I do not have tesseract), I hope someone can help :).

Since this is my first PR to a public repo ever, please give feedback. I know I still need to write some documentation, but I thought I would add the functionality first.


bosd · 2026-05-22T16:59:21Z

Thanks for the PR @avonbuttlar, and welcome — nice first contribution! 🎉

The skip_line half is a useful, well-contained addition. The extract_number_from_text half needs rework before it can be merged, though, because coerce_type applies it to every int/float field globally, and the current regex truncates numbers ≥ 1000 that do not use a thousands separator.

The pattern `[-+]?\d{1,3}(?:[.,\s']\d{3})*(?:[.,]\d+)?` assumes thousands-grouping, so:

| input | extracted | expected |
| --- | --- | --- |
| `1234` | `123` | `1234` |
| `1234.56` | `123` | `1234.56` |
| `1939,00` | `193` | `1939` |
| `12123 Stk.` | `121` | `12123` |
So even the 12123 Stk. example from the description comes out as 121. Applied to all numeric fields, this would change/break extraction of thousands-range amounts across existing templates (and the comparison fixtures in tests/compare/).
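
The truncation can be reproduced with a few lines of Python. This sketch assumes the PR takes the first match of the quoted pattern via `re.search`; the helper name `first_match` is made up for the demonstration.

```python
import re
from typing import Optional

# The pattern quoted above, copied verbatim.
PATTERN = re.compile(r"[-+]?\d{1,3}(?:[.,\s']\d{3})*(?:[.,]\d+)?")


def first_match(text: str) -> Optional[str]:
    """Return the first substring the pattern matches, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in ["1234", "1234.56", "1939,00", "12123 Stk."]:
    # \d{1,3} consumes at most three digits; the fourth digit is not a
    # separator, so neither optional group can continue the match.
    print(f"{sample!r} -> {first_match(sample)!r}")
```

Input that does use grouping, such as `1.234,56`, still matches in full, which is why the problem only shows up on ungrouped numbers.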

Suggestions:

- Make the number extraction opt-in (a template/field option) instead of changing the global `coerce_type` path, so existing templates are unaffected.
- Fix the regex so it does not require thousands separators (the `\d{1,3}(...)*` grouping is the culprit), and add unit tests covering `1234`, `1234.56`, `1.234,56`, `12123 Stk.`, `€25.50`, and negatives.
- On running tests without tesseract: that test is now skipped automatically when the tesseract binary is absent (latest master), so pytest should run for you after a rebase.
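
One possible shape for the first two suggestions, as a rough sketch rather than a proposed patch. The names `extract_number`, `coerce_number`, `extract`, and `decimal_sep` are all hypothetical and are not part of invoice2data's API; the sketch assumes the caller knows the decimal separator, which avoids guessing whether `1.234` means 1234 or 1.234.

```python
import re
from typing import Optional, Union

# Leading digit run of any length, optional 3-digit groups, optional fraction.
NUMBER_RE = re.compile(r"[-+]?\d+(?:[.,']\d{3})*(?:[.,]\d+)?")


def extract_number(text: str, decimal_sep: str = ".") -> Optional[float]:
    """Return the first number found in text, or None if there is none."""
    m = NUMBER_RE.search(text)
    if m is None:
        return None
    raw = m.group(0)
    sign = -1.0 if raw.startswith("-") else 1.0
    # Split at the last decimal separator; everything before it is the integer part.
    head, sep, tail = raw.rpartition(decimal_sep)
    if not sep:  # no decimal separator present
        head, tail = raw, ""
    int_digits = re.sub(r"\D", "", head)  # drops grouping characters and the sign
    frac_digits = re.sub(r"\D", "", tail)
    return sign * float(f"{int_digits or '0'}.{frac_digits or '0'}")


def coerce_number(value: str, target: str = "float", extract: bool = False,
                  decimal_sep: str = ".") -> Union[int, float, None]:
    """Opt-in wrapper: extraction runs only when extract=True (hypothetical flag)."""
    number = extract_number(value, decimal_sep) if extract else float(value)
    if number is None:
        return None
    return int(number) if target == "int" else number


print(extract_number("12123 Stk."))
print(extract_number("1.234,56", decimal_sep=","))
print(coerce_number("12123 Stk.", "int", extract=True))
```

With `extract=False` the strict path is unchanged, so a value like `12123 Stk.` still raises `ValueError` and existing templates behave as before. Note that this sketch reads a hyphen directly before digits as a minus sign, so it would treat `INV-2024` as negative; that is another reason to enable it per field only.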

This is feature-sized, so it would land in a future minor release rather than the current maintenance line. Happy to help get it over the finish line — thanks again!

avonbuttlar wants to merge 1 commit into invoice-x:master from avonbuttlar:master
