145 changes: 73 additions & 72 deletions README.md
@@ -1,72 +1,73 @@
# Awesome PDF [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

> Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.



## Contents

- [Parsers, OCR and extraction](#parsers-ocr-and-extraction)
- [Creation and production](#creation-and-production)
- [Readers and viewers](#readers-and-viewers)
- [Datasets](#datasets)

---

## Parsers, OCR and extraction

- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parser gateway that exposes different parsers through a unified API.
- [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
- [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
- [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLMs.
- [VikParuchuri/surya](https://github.com/VikParuchuri/surya) - OCR, layout analysis, reading order, and table recognition in 90+ languages.
- [UniModal4Reasoning/StructEqTable-Deploy](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - A high-efficiency open-source toolkit for the table-to-LaTeX task.
- [huridocs/pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis) - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
- [Reducto](https://reducto.ai/) - Document Ingestion API.
- [adithya-s-k/omniparse](https://github.com/adithya-s-k/omniparse) - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
- [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision-model-based PDF chunking.
- [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
- [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
- [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library that provides various tools to extract data from a PDF file.
- [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning.
- [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Makes it easier to extract PDF content in the format you need for LLM and RAG environments. Supports Markdown extraction as well as LlamaIndex document output.
- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.


## Creation and production

- [shipsaas/docking](https://github.com/shipsaas/docking) - Shared microservice that handles document template management and PDF rendering/export.
- [WeasyPrint](https://weasyprint.org/) - Generate PDFs using HTML and CSS.
- [qpdf/qpdf](https://github.com/qpdf/qpdf) - A content-preserving PDF document transformer.
- [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
- [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
- [PdfRest](https://pdfrest.com/) - PDF API to create, shrink, and compress PDFs.
- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
- [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
- [PDFKit](https://pdfkit.davrapps.dev/) - Free browser-based PDF toolkit to merge, split, compress, rotate, convert, add page numbers, watermark, and password-protect PDFs. No signup; runs fully client-side.
- [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
- [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
- [renamed.to](https://www.renamed.to) - AI-powered tool that renames files based on their content, available as a web app, a command-line tool, and an integration for your application.
- [veraPDF](https://openpreservation.org/tools/verapdf/) - Verify compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).

## Readers and viewers

- [mozilla/pdf.js](https://github.com/mozilla/pdf.js) - PDF reader in JavaScript.
- [agentcooper/react-pdf-highlighter](https://github.com/agentcooper/react-pdf-highlighter) - Set of React components for PDF annotation.
- [Sioyek](https://sioyek.info/) - PDF viewer with a focus on technical books and research papers (desktop app).




## Datasets

- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically oriented PDF collection (papers, specs, decks, manuals, etc.).
- [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
- [DS4SD/DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
- [DocBank Dataset](https://github.com/doc-analysis/DocBank) - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.

## Contributing

See [Contributing](./contributing.md) for details.