Add number extraction method and line skipping functionality #652

Open
avonbuttlar wants to merge 1 commit into invoice-x:master from avonbuttlar:master

Conversation

@avonbuttlar

  • Implemented extract_number_from_text to extract numeric values from strings with text.
  • Updated coerce_type to utilize the new extraction method for converting strings to int/float.
  • Added support for skipping lines based on provided patterns in the parse_block function.

Since I can't get the tests to run (I do not have tesseract), I hope someone can help :).

Since this is my first PR to a public repo ever, please give feedback. I know I still need to write some documentation, but I thought I would add the functionality first.

@bosd
Collaborator

bosd commented May 22, 2026

Thanks for the PR @avonbuttlar, and welcome — nice first contribution! 🎉

The skip_line half is a useful, well-contained addition. The extract_number_from_text half needs rework before it can be merged, though, because coerce_type applies it to every int/float field globally, and the current regex truncates numbers ≥ 1000 that do not use a thousands separator.

The pattern `[-+]?\d{1,3}(?:[.,\s']\d{3})*(?:[.,]\d+)?` assumes thousands-grouping, so:

| input | extracted | expected |
| --- | --- | --- |
| 1234 | 123 | 1234 |
| 1234.56 | 123 | 1234.56 |
| 1939,00 | 193 | 1939 |
| 12123 Stk. | 121 | 12123 |

So even the 12123 Stk. example from the description comes out as 121. Applied to all numeric fields, this would change/break extraction of thousands-range amounts across existing templates (and the comparison fixtures in tests/compare/).
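The truncation can be reproduced in isolation with a few lines of Python. This is a minimal sketch that assumes the PR takes the first `re.search` match of the pattern; `first_match` is a hypothetical helper written for illustration, not a function from the PR.

```python
import re

# The pattern quoted above, copied verbatim.
PATTERN = re.compile(r"[-+]?\d{1,3}(?:[.,\s']\d{3})*(?:[.,]\d+)?")


def first_match(text):
    """Return the first substring matched by PATTERN, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


# \d{1,3} consumes at most three digits, and the next character is a digit
# rather than a separator, so the match stops early.
for sample in ["1234", "1234.56", "1939,00", "12123 Stk."]:
    print(repr(sample), "->", first_match(sample))
```

Inputs that do use grouping, such as `1.234,56`, are matched in full, which is why the problem only shows up for ungrouped values.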

Suggestions:

  • Make the number extraction opt-in (a template/field option) instead of changing the global coerce_type path, so existing templates are unaffected.
  • Fix the regex so it does not require thousands separators (the \d{1,3}(...)* grouping is the culprit), and add unit tests covering 1234, 1234.56, 1.234,56, 12123 Stk., €25.50, and negatives.
  • Re: running tests without tesseract — that test is now skipped automatically when the tesseract binary is absent (latest master), so pytest should run for you after a rebase.
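One possible shape for the regex fix, as a rough sketch rather than a drop-in patch: the integer part becomes `\d+` so that grouping is optional. `NUMBER_RE` and `extract_number` are hypothetical names, the function is standalone (not wired into `coerce_type`), and the rule for telling decimal separators from thousands separators is an assumption that would need review.

```python
import re

# Hypothetical pattern: \d+ instead of \d{1,3}, so ungrouped numbers match whole.
NUMBER_RE = re.compile(r"[-+]?\d+(?:[.,\s']\d{3})*(?:[.,]\d+)?")


def extract_number(text):
    """Return the first number found in text as a float, or None."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    # Spaces and apostrophes can only be grouping characters; drop them.
    raw = re.sub(r"[\s']", "", match.group(0))
    last_sep = max(raw.rfind("."), raw.rfind(","))
    fraction = ""
    # Assumption: a final separator followed by exactly three digits is
    # thousands grouping (1.234 -> 1234); otherwise it starts the decimals.
    if last_sep != -1 and len(raw) - last_sep - 1 != 3:
        raw, fraction = raw[:last_sep], raw[last_sep + 1:]
    digits = raw.replace(".", "").replace(",", "")
    return float(f"{digits}.{fraction}" if fraction else digits)


# The cases listed in the suggestion above.
for sample in ["1234", "1234.56", "1.234,56", "12123 Stk.", "€25.50", "-42,5"]:
    print(repr(sample), "->", extract_number(sample))
```

Two limitations are worth covering in the unit tests: a value like `1.234` is inherently ambiguous and is read as 1234 here, and a hyphen directly before digits (as in `INV-1234`) is read as a minus sign.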

This is feature-sized, so it would land in a future minor release rather than the current maintenance line. Happy to help get it over the finish line — thanks again!
