Extracts text and metadata from a PDF, given either a local file path or an HTTP(S) URL. Supports page-range filtering (e.g. 1-5, 3, 1,3,5-7) and exposes per-page text plus the concatenated full text. Backed by pypdf — pure Python, no native binaries needed.
- Python 3.12 or later
atomic-agentspydanticpypdfrequests
- Use the Atomic Assembler CLI: run
atomicand pickpdf_reader. - Or copy the
tool/folder directly into your project.
source(str): a local file path or HTTP(S) URL pointing at a PDF.page_range(str, optional):'1','1-5','1,3,5-7', etc. (1-based). Out-of-range pages are silently skipped.include_metadata(bool): defaultTrue.
source(str): echo of the input source.text(str): concatenated text for the extracted pages, separated by\f(form feed).pages(list[PdfPage]): per-page extracted text with 1-basedpage_number.page_count(int): number of pages extracted (after page_range filter).total_page_count(int): total pages in the source document.metadata(PdfMetadata, optional): document info dictionary.error(str, optional): set when the operation failed.
from tool.pdf_reader import PdfReaderTool, PdfReaderToolInputSchema
tool = PdfReaderTool()
# Read remote PDF, only first two pages
out = tool.run(PdfReaderToolInputSchema(
source="https://arxiv.org/pdf/1706.03762",
page_range="1-2",
))
print(out.text[:1000])
# Read a local PDF, every page
out = tool.run(PdfReaderToolInputSchema(source="./paper.pdf"))- The tool fails gracefully and returns
errorset in the output rather than raising. - For URL sources, the download is hard-capped by
max_size_bytesin the config (default 50 MB). - Encrypted PDFs are not currently supported.
PRs welcome — see the main repo CONTRIBUTING.md.
Same as the main Atomic Agents project.