OneOffTech · Dave93 · Apr 7, 2026
diff --git a/README.md b/README.md
@@ -1,72 +1,73 @@
-# Awesome PDF [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
-
-> Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.
-
-
-
-## Contents
-
-- [Parsers, OCR and extraction](#parsers-ocr-and-extraction)
-- [Creation and production](#creation-and-production)
-- [Readers and viewers](#readers-and-viewers)
-- [Datasets](#datasets)
-
----
-
-## Parsers, OCR and extraction
-
-- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
-- [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
-- [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
-- [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLMs.
-- [VikParuchuri/surya](https://github.com/VikParuchuri/surya) - OCR, layout analysis, reading order, table recognition in 90+ languages.
-- [UniModal4Reasoning/StructEqTable-Deploy](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - A High-efficiency Open-source Toolkit for Table-to-Latex Task.
-- [huridocs/pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis) - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
-- [Reducto](https://reducto.ai/) - Document Ingestion API.
-- [adithya-s-k/omniparse](https://github.com/adithya-s-k/omniparse) - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
-- [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
-- [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
-- [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
-- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
-- [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library, provides various tools to extract data from a PDF file.
-- [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
-- [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
-- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
-- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
-
-
-## Creation and production
-
-- [shipsaas/docking](https://github.com/shipsaas/docking) - Shared-microservice that takes over the document templates management & render/export PDF.
-- [WeasyPrint](https://weasyprint.org/) - Generate PDF using html and CSS.
-- [qpdf/qpdf](https://github.com/qpdf/qpdf) - A content-preserving PDF document transformer.
-- [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
-- [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
-- [PdfRest](https://pdfrest.com/) - PDF Api to create, shrink and compress.
-- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
-- [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
-- [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
-- [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
-- [renamed.to](https://www.renamed.to) - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
-- [veraPDF](https://openpreservation.org/tools/verapdf/) - Verify compliance with PDF/A and PDF/UA specification (via Open Preservation Foundation).
-
-## Readers and viewers
-
-- [mozilla/pdf.js](https://github.com/mozilla/pdf.js) - PDF Reader in JavaScript.
-- [agentcooper/react-pdf-highlighter](https://github.com/agentcooper/react-pdf-highlighter) - Set of React components for PDF annotation.
-- [Sioyek](https://sioyek.info/) - PDF viewer with a focus on technical books and research papers (desktop app).
-
-
-
-
-## Datasets
-
-- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
-- [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
-- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
-- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
-- [DocBank Dataset](https://github.com/doc-analysis/DocBank) - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
-
-## Contributing
-
-See [Contributing](./contributing.md) for details.
+# Awesome PDF [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
+
+> Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.
+
+
+
+## Contents
+
+- [Parsers, OCR and extraction](#parsers-ocr-and-extraction)
+- [Creation and production](#creation-and-production)
+- [Readers and viewers](#readers-and-viewers)
+- [Datasets](#datasets)
+
+---
+
+## Parsers, OCR and extraction
+
+- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
+- [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
+- [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
+- [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLMs.
+- [VikParuchuri/surya](https://github.com/VikParuchuri/surya) - OCR, layout analysis, reading order, table recognition in 90+ languages.
+- [UniModal4Reasoning/StructEqTable-Deploy](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - A High-efficiency Open-source Toolkit for Table-to-Latex Task.
+- [huridocs/pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis) - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
+- [Reducto](https://reducto.ai/) - Document Ingestion API.
+- [adithya-s-k/omniparse](https://github.com/adithya-s-k/omniparse) - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
+- [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
+- [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
+- [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
+- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
+- [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library that provides various tools to extract data from a PDF file.
+- [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
+- [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
+- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
+- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
+
+
+## Creation and production
+
+- [shipsaas/docking](https://github.com/shipsaas/docking) - Shared-microservice that takes over the document templates management & render/export PDF.
+- [WeasyPrint](https://weasyprint.org/) - Generate PDFs using HTML and CSS.
+- [qpdf/qpdf](https://github.com/qpdf/qpdf) - A content-preserving PDF document transformer.
+- [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
+- [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
+- [PdfRest](https://pdfrest.com/) - PDF API to create, shrink, and compress.
+- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
+- [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
+- [PDFKit](https://pdfkit.davrapps.dev/) - Free browser-based PDF toolkit to merge, split, compress, rotate, convert, add page numbers, watermark, and password-protect files. No signup; runs fully client-side.
+- [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
+- [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
+- [renamed.to](https://www.renamed.to) - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
+- [veraPDF](https://openpreservation.org/tools/verapdf/) - Verify compliance with PDF/A and PDF/UA specification (via Open Preservation Foundation).
+
+## Readers and viewers
+
+- [mozilla/pdf.js](https://github.com/mozilla/pdf.js) - PDF Reader in JavaScript.
+- [agentcooper/react-pdf-highlighter](https://github.com/agentcooper/react-pdf-highlighter) - Set of React components for PDF annotation.
+- [Sioyek](https://sioyek.info/) - PDF viewer with a focus on technical books and research papers (desktop app).
+
+
+
+
+## Datasets
+
+- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
+- [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
+- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
+- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
+- [DocBank Dataset](https://github.com/doc-analysis/DocBank) - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
+
+## Contributing
+
+See [Contributing](./contributing.md) for details.