feat: add read_pdf_table() to extract PDF tables via docling from R (#36434) #1527

Open
hidekoji wants to merge 5 commits into master from feature/read-pdf-table-36434


@hidekoji

Collaborator

What

Adds exploratory::read_pdf_table(pdf_path, table_index = NULL, out_dir = NULL, encoding = "UTF-8", timeout = 600) — extracts tables from a PDF with docling and returns the chosen table as a data frame.

Why

The AI Data "Create Data with AI" PDF flow currently extracts tables to a temp CSV via the extract_pdf_tables client tool, and the generated R script then just calls read_csv on that CSV. As a result, Update / Re-import does NOT re-read the PDF; it re-reads a stale CSV (which also vanishes after an app restart). With read_pdf_table, the generated source script runs docling itself, so re-import re-extracts from the live PDF.

How

  • Runs exp-cli's docling_extract.py in a dedicated Python subprocess (system2), not reticulate inside RServe — avoids the docling-stdout pipe deadlock. The authoritative result is metadata.json, never parsed from stdout, so docling log noise can't corrupt it.
  • Python resolved via EXPLORATORY_PYTHON_BIN, else python3 / python (each verified with -c "import docling").
  • Script path from EXPLORATORY_DOCLING_SCRIPT, injected into the R session by Exploratory Desktop (tam) at session init.
  • Picks the table by table_index (matched against metadata's index), or the first table when omitted.
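The selection rule in the last bullet can be sketched as follows. This is only an illustration in Python, not the R implementation in this PR. The layout of metadata.json is not shown here, so the `tables` list and the per-table `index` key are assumptions, and `pick_table` / `load_metadata` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_metadata(out_dir):
    """Read metadata.json from the output directory (file name taken from the PR text)."""
    with open(Path(out_dir) / "metadata.json", encoding="utf-8") as fh:
        return json.load(fh)


def pick_table(metadata, table_index=None):
    """Return the metadata entry for the requested table.

    Assumed schema (hypothetical): {"tables": [{"index": <int>, ...}, ...]}.
    """
    tables = metadata.get("tables", [])
    if not tables:
        raise ValueError("no tables were extracted from the PDF")
    if table_index is None:
        # Default: the first table in the metadata.
        return tables[0]
    for table in tables:
        # Match against the metadata's own index, not the list position.
        if table.get("index") == table_index:
            return table
    raise ValueError(f"table_index {table_index} not found in metadata")
```

Matching on the recorded index rather than the list position keeps the choice stable even if the extractor skips or reorders entries.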

Version

Bumps exploratory 15.1.20 → 15.1.22. Version 15.1.21 is claimed by the in-flight Oracle connection-reuse fix #36429. Because 15.1.22 is built from master, it is a superset: the Oracle fix already in master plus read_pdf_table. readr is added to Imports.

Tests

tests/testthat/test_pdf.R: input validation, missing-PDF, env guards, negative/non-integer table_index, table selection by index vs default-first, out-of-range index, and complex column names (航空 会社 !#$%&'()). 11/11 pass against the installed package.

Paired changes

  • tam: inject EXPLORATORY_DOCLING_SCRIPT at R session init + bump bundled exploratory to 15.1.22 (PR #36447 / branch fix/determined-khorana-72fda0).
  • datablog #3133 + AI-Data: prompt generates read_pdf_table() instead of read_csv of the temp CSV.

Deploy

Build + publish the 15.1.22 per-platform binaries (mac intel / mac arm / windows) to download2.exploratory.io/.../contrib/4.5/; coordinate with #36429 so master carries the final Oracle code before the build.

🤖 Generated with Claude Code

Runs exp-cli's docling_extract.py in a dedicated Python subprocess (not reticulate
inside RServe -- avoids the docling-stdout pipe deadlock) and returns the chosen table
as a data frame. Lets AI Data PDF data sources re-extract from the live PDF on
Update / Re-import instead of re-reading a stale temp CSV.

- Python resolved via EXPLORATORY_PYTHON_BIN, else python3/python (each verified with
  import docling). Script path from EXPLORATORY_DOCLING_SCRIPT (injected by Desktop).
- Bump exploratory 15.1.20 -> 15.1.22 (15.1.21 is claimed by the Oracle fix #36429).
- testthat: input validation, env guards, index selection, complex column names.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hidekoji hidekoji requested a review from kei51e June 21, 2026 23:48
@hidekoji hidekoji self-assigned this Jun 21, 2026
@hidekoji hidekoji assigned kei51e and unassigned hidekoji Jun 22, 2026
@hidekoji hidekoji closed this Jun 22, 2026
@hidekoji hidekoji reopened this Jun 22, 2026
hidekoji and others added 3 commits June 22, 2026 00:12
…on work (#36434)

system2 pastes args into a shell command line WITHOUT quoting, so the docling
availability check c("-c", "import docling") was word-split -> python ran `-c import`
-> SyntaxError -> docling always reported missing (read_pdf_table failed even with
EXPLORATORY_PYTHON_BIN set to a python that has docling). Confirmed in the desktop
RServe: unquoted -> SyntaxError; shQuoted -> exit 0. shQuote the -c code and every
extraction arg (script/PDF/out-dir paths, which could also contain spaces).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
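
The word-splitting described in this commit can be reproduced outside R. The sketch below is a Python analogue using POSIX shell rules via `shlex`; it does not call R's system2 or shQuote, and the helper names `naive_command` and `quoted_command` are hypothetical. Windows command-line parsing follows different rules, so treat this as an illustration of the failure mode only.

```python
import shlex


def naive_command(program, args):
    """Join program and args with spaces and no quoting, like an unquoted paste."""
    return " ".join([program, *args])


def quoted_command(program, args):
    """Quote every piece so each argument survives shell word splitting."""
    return " ".join([shlex.quote(program), *(shlex.quote(a) for a in args)])


args = ["-c", "import docling"]

# Unquoted: the shell sees four words, so python would run the code "import".
print(shlex.split(naive_command("python3", args)))

# Quoted: the code string stays a single argument.
print(shlex.split(quoted_command("python3", args)))
```

The same reasoning applies to script, PDF, and output-directory paths that contain spaces.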
…meout) (#36434)

The availability check now uses importlib.util.find_spec (locates docling without
executing it / loading torch) and a 20s timeout, so missing docling is reported
quickly and a hanging python invocation (e.g. the Windows 'python' Microsoft Store
stub) can't stall the tool for the full 120s delegation timeout before retry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
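
A minimal sketch of this kind of availability check, written in Python rather than the R code in this PR. The helper name `has_module` is hypothetical; the 20 second default mirrors the timeout mentioned above. `importlib.util.find_spec` locates a top-level package without running its code, so heavy dependencies are not loaded.

```python
import subprocess
import sys


def has_module(python_bin, module="docling", timeout=20):
    """Return True if `python_bin` can locate `module` without importing it."""
    code = (
        "import importlib.util, sys; "
        f"sys.exit(0 if importlib.util.find_spec({module!r}) else 1)"
    )
    try:
        result = subprocess.run(
            [python_bin, "-c", code],  # list form: no shell, no word splitting
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hanging interpreter or a missing binary both count as "not available".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print(has_module(sys.executable))
```

Treating a timeout the same as a missing module lets the caller move on to the next candidate interpreter quickly.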