diff --git a/README.md b/README.md
index ba39add..ee8d570 100644
--- a/README.md
+++ b/README.md
@@ -1,72 +1,73 @@
-# Awesome PDF [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
-
-> Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.
-
-
-
-## Contents
-
-- [Parsers, OCR and extraction](#parsers-ocr-and-extraction)
-- [Creation and production](#creation-and-production)
-- [Readers and viewers](#readers-and-viewers)
-- [Datasets](#datasets)
-
----
-
-## Parsers, OCR and extraction
-
-- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
-- [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
-- [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
-- [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLMs.
-- [VikParuchuri/surya](https://github.com/VikParuchuri/surya) - OCR, layout analysis, reading order, table recognition in 90+ languages.
-- [UniModal4Reasoning/StructEqTable-Deploy](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - A High-efficiency Open-source Toolkit for Table-to-Latex Task.
-- [huridocs/pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis) - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
-- [Reducto](https://reducto.ai/) - Document Ingestion API.
-- [adithya-s-k/omniparse](https://github.com/adithya-s-k/omniparse) - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
-- [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
-- [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
-- [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
-- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
-- [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library, provides various tools to extract data from a PDF file.
-- [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
-- [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
-- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
-- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
-
-
-## Creation and production
-
-- [shipsaas/docking](https://github.com/shipsaas/docking) - Shared-microservice that takes over the document templates management & render/export PDF.
-- [WeasyPrint](https://weasyprint.org/) - Generate PDF using html and CSS.
-- [qpdf/qpdf](https://github.com/qpdf/qpdf) - A content-preserving PDF document transformer.
-- [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
-- [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
-- [PdfRest](https://pdfrest.com/) - PDF Api to create, shrink and compress.
-- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
-- [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
-- [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
-- [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
-- [renamed.to](https://www.renamed.to) - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
-- [veraPDF](https://openpreservation.org/tools/verapdf/) - Verify compliance with PDF/A and PDF/UA specification (via Open Preservation Foundation).
-
-## Readers and viewers
-
-- [mozilla/pdf.js](https://github.com/mozilla/pdf.js) - PDF Reader in JavaScript.
-- [agentcooper/react-pdf-highlighter](https://github.com/agentcooper/react-pdf-highlighter) - Set of React components for PDF annotation.
-- [Sioyek](https://sioyek.info/) - PDF viewer with a focus on technical books and research papers (desktop app).
-
-
-
-
-## Datasets
-
-- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
-- [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
-- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
-- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
-- [DocBank Dataset](https://github.com/doc-analysis/DocBank) - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
-
-## Contributing
-
-See [Contributing](./contributing.md) for details.
+# Awesome PDF [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
+
+> Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.
+
+
+
+## Contents
+
+- [Parsers, OCR and extraction](#parsers-ocr-and-extraction)
+- [Creation and production](#creation-and-production)
+- [Readers and viewers](#readers-and-viewers)
+- [Datasets](#datasets)
+
+---
+
+## Parsers, OCR and extraction
+
+- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
+- [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
+- [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
+- [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLMs.
+- [VikParuchuri/surya](https://github.com/VikParuchuri/surya) - OCR, layout analysis, reading order, table recognition in 90+ languages.
+- [UniModal4Reasoning/StructEqTable-Deploy](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - A High-efficiency Open-source Toolkit for Table-to-Latex Task.
+- [huridocs/pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis) - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
+- [Reducto](https://reducto.ai/) - Document Ingestion API.
+- [adithya-s-k/omniparse](https://github.com/adithya-s-k/omniparse) - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
+- [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
+- [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
+- [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
+- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
+- [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library, provides various tools to extract data from a PDF file.
+- [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
+- [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
+- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
+- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
+
+
+## Creation and production
+
+- [shipsaas/docking](https://github.com/shipsaas/docking) - Shared-microservice that takes over the document templates management & render/export PDF.
+- [WeasyPrint](https://weasyprint.org/) - Generate PDF using html and CSS.
+- [qpdf/qpdf](https://github.com/qpdf/qpdf) - A content-preserving PDF document transformer.
+- [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
+- [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
+- [PdfRest](https://pdfrest.com/) - PDF Api to create, shrink and compress.
+- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
+- [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
+- [PDFKit](https://pdfkit.davrapps.dev/) - Free browser-based PDF toolkit — merge, split, compress, rotate, convert, add page numbers, watermark, password protect. No signup, runs fully client-side.
+- [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
+- [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
+- [renamed.to](https://www.renamed.to) - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
+- [veraPDF](https://openpreservation.org/tools/verapdf/) - Verify compliance with PDF/A and PDF/UA specification (via Open Preservation Foundation).
+
+## Readers and viewers
+
+- [mozilla/pdf.js](https://github.com/mozilla/pdf.js) - PDF Reader in JavaScript.
+- [agentcooper/react-pdf-highlighter](https://github.com/agentcooper/react-pdf-highlighter) - Set of React components for PDF annotation.
+- [Sioyek](https://sioyek.info/) - PDF viewer with a focus on technical books and research papers (desktop app).
+
+
+
+
+## Datasets
+
+- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
+- [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
+- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
+- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
+- [DocBank Dataset](https://github.com/doc-analysis/DocBank) - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
+
+## Contributing
+
+See [Contributing](./contributing.md) for details.