feat: add read_pdf_table() to extract PDF tables via docling from R (#36434) #1527
Open
hidekoji wants to merge 5 commits into
Runs exp-cli's docling_extract.py in a dedicated Python subprocess (not reticulate inside RServe -- avoids the docling-stdout pipe deadlock) and returns the chosen table as a data frame. Lets AI Data PDF data sources re-extract from the live PDF on Update / Re-import instead of re-reading a stale temp CSV.
- Python resolved via EXPLORATORY_PYTHON_BIN, else python3/python (each verified with import docling). Script path from EXPLORATORY_DOCLING_SCRIPT (injected by Desktop).
- Bump exploratory 15.1.20 -> 15.1.22 (15.1.21 is claimed by the Oracle fix #36429).
- testthat: input validation, env guards, index selection, complex column names.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on work (#36434)
system2 pastes args into a shell command line WITHOUT quoting, so the docling
availability check c("-c", "import docling") was word-split -> python ran `-c import`
-> SyntaxError -> docling always reported missing (read_pdf_table failed even with
EXPLORATORY_PYTHON_BIN set to a python that has docling). Confirmed in the desktop
RServe: unquoted -> SyntaxError; shQuoted -> exit 0. shQuote the -c code and every
extraction arg (script/PDF/out-dir paths, which could also contain spaces).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
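The word-splitting described in this commit can be reproduced outside R. The sketch below is a Python analogue, not the package's code: `shlex.quote` stands in for R's `shQuote`, and `shlex.split` stands in for a POSIX shell's tokenizer (Windows command-line parsing differs, so this only illustrates the POSIX case).

```python
import shlex

# Arguments as they were passed to the availability check.
args = ["-c", "import docling"]

# Pasting the args into one command line without quoting, which is the
# behaviour the commit message attributes to system2.
unquoted = " ".join(["python"] + args)

# Quoting each argument first (Python analogue of R's shQuote).
quoted = " ".join(["python"] + [shlex.quote(a) for a in args])

# A POSIX shell splits the unquoted line into four words, so python sees
# `-c import` followed by a stray `docling` argument.
unquoted_words = shlex.split(unquoted)

# With quoting, the code string survives as a single argument.
quoted_words = shlex.split(quoted)

print(unquoted_words)
print(quoted_words)
```

The same reasoning applies to the script, PDF, and out-dir paths: any of them containing a space would be split the same way unless quoted.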
…meout) (#36434)
The availability check now uses importlib.util.find_spec (locates docling without executing it / loading torch) and a 20s timeout, so missing docling is reported quickly and a hanging python invocation (e.g. the Windows 'python' Microsoft Store stub) can't stall the tool for the full 120s delegation timeout before retry.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
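A minimal sketch of a `find_spec`-based availability check with a timeout, written in Python for illustration. The PR performs this check from R via `system2`; the helper name `has_module` and the exact one-liner passed to `-c` are assumptions, not the package's code.

```python
import subprocess
import sys

# Code run in the candidate interpreter. find_spec locates a top-level
# module without importing it, so heavy dependencies are never loaded.
CHECK_CODE = (
    "import importlib.util, sys; "
    "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
)


def has_module(python_bin, module="docling", timeout=20):
    """Return True if `python_bin` can locate `module` within `timeout` seconds.

    Hypothetical helper: a missing binary, a non-zero exit, or a hang that
    exceeds the timeout all count as "not available".
    """
    try:
        proc = subprocess.run(
            [python_bin, "-c", CHECK_CODE, module],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


if __name__ == "__main__":
    # Check the interpreter running this script.
    print(has_module(sys.executable))
```

Passing the arguments as a list avoids the shell entirely, so no quoting is needed in this variant.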
What
Adds `exploratory::read_pdf_table(pdf_path, table_index = NULL, out_dir = NULL, encoding = "UTF-8", timeout = 600)`, which extracts tables from a PDF with docling and returns the chosen table as a data frame.
Why
The AI Data "Create Data with AI" PDF flow currently extracts tables to a temp CSV via the `extract_pdf_tables` client tool, then the generated R script just `read_csv`s that CSV. So Update / Re-import does NOT re-read the PDF; it re-reads a stale CSV (which also vanishes after an app restart). With `read_pdf_table` the generated source script runs docling itself, so re-import re-extracts from the live PDF.
How
- Runs `docling_extract.py` in a dedicated Python subprocess (`system2`), not reticulate inside RServe, which avoids the docling-stdout pipe deadlock. The authoritative result is `metadata.json`, never parsed from stdout, so docling log noise can't corrupt it.
- Python resolved via `EXPLORATORY_PYTHON_BIN`, else `python3`/`python` (each verified with `-c "import docling"`).
- Script path from `EXPLORATORY_DOCLING_SCRIPT`, injected into the R session by Exploratory Desktop (tam) at session init.
- Table chosen by `table_index` (matched against metadata's `index`), or the first table when omitted.
Version
Bumps `exploratory` 15.1.20 → 15.1.22. (15.1.21 is claimed by the in-flight Oracle connection-reuse fix #36429; built from master, 15.1.22 is a superset: Oracle fix already in master + `read_pdf_table`.) `readr` added to Imports.
Tests
`tests/testthat/test_pdf.R`: input validation, missing-PDF, env guards, negative/non-integer `table_index`, table selection by index vs default-first, out-of-range index, and complex column names (航空 会社 !#$%&'()). 11/11 pass against the installed package.
Paired changes
- Inject `EXPLORATORY_DOCLING_SCRIPT` at R session init + bump bundled exploratory to 15.1.22 (PR #36447 / branch `fix/determined-khorana-72fda0`).
- Generated source script calls `read_pdf_table()` instead of `read_csv` of the temp CSV.
Deploy
Build + publish the 15.1.22 per-platform binaries (mac intel / mac arm / windows) to `download2.exploratory.io/.../contrib/4.5/`; coordinate with #36429 so master carries the final Oracle code before the build.
🤖 Generated with Claude Code
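For reviewers, the table-selection rule described under How (match `table_index` against the metadata's `index`, default to the first table) can be sketched as below. This is an illustration in Python rather than the package's R code, and the shape of the metadata entries (a list of dicts with an integer `index` key) is an assumption; the actual `metadata.json` layout is not shown in this description.

```python
def select_table(tables, table_index=None):
    """Pick one table entry from a list of metadata entries.

    Assumption: each entry is a dict with an integer "index" key.
    """
    if not tables:
        raise ValueError("no tables were extracted")
    if table_index is None:
        # Default: the first table listed in the metadata.
        return tables[0]
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(table_index, bool)
        or not isinstance(table_index, int)
        or table_index < 0
    ):
        raise ValueError("table_index must be a non-negative integer")
    # Match on the entry's own index value, not its position in the list.
    for entry in tables:
        if entry.get("index") == table_index:
            return entry
    raise LookupError(f"no table with index {table_index}")


# Hypothetical metadata with non-contiguous indices.
tables = [{"index": 1, "name": "a"}, {"index": 3, "name": "b"}]
first = select_table(tables)
third = select_table(tables, 3)
```

Matching on the stored `index` rather than list position keeps the selection stable if the extractor skips or reorders tables.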