🚀 B.L.A.S.T. OCR Engine

Blueprint. Link. Architect. Stylize. Trigger.

B.L.A.S.T. is a high-performance, deterministic, and self-healing OCR automation agent designed to extract high-quality text from PDFs, PowerPoints (PPTX), and Images. It leverages a rigorous 3-Layer Architecture to ensure reliability, maintainability, and exceptional error handling.

🌟 Key Features

  • 🛡️ Robust & Self-Healing: Automatically retries failed OCR operations with exponential backoff and gracefully handles engine failures.
  • 📄 Multi-Format Support: Native support for PDF (via Poppler), PPTX (Slide & Table extraction), and standard Images (PNG, JPG, BMP).
  • 🔧 Deterministic Output: Produces structured Markdown and formatted DOCX files for every processed document.
  • ⚡ Parallel Processing: Optimized for performance with threaded execution for multi-page documents.
  • 🖥️ Dual Interface:
    • CLI: Powerful command-line tool for batch processing.
    • GUI: Premium Streamlit Dashboard with Job History, Analytics, and Modern UI.
  • 📊 SQLite Integration: Built-in database that tracks jobs, processing time, and confidence scores.
  • 🛡️ 100% Branch Coverage: A branch-level test suite covers the Core, Cache, Pipeline, and UI modules to guard against regressions.
  • 🛡️ Forensic Audit (v2.0): 100% resolution of 17 critical security, concurrency, and memory bugs identified in the forensic audit.
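
The self-healing behaviour listed above (retries with exponential backoff) can be illustrated with a minimal sketch. This is not the project's actual Healer module: the function name `retry_with_backoff` and the parameters `max_attempts`, `base_delay`, and `sleep` are assumptions made for illustration, and only the backoff factor of 2 mirrors a documented default.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,          # hypothetical limit, not a documented setting
    backoff_factor: float = 2.0,    # mirrors the documented BLAST_OCR_RETRY_BACKOFF default
    base_delay: float = 0.5,        # hypothetical first delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay before the next try: base_delay * backoff_factor ** (attempt - 1)
            sleep(base_delay * backoff_factor ** (attempt - 1))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real implementation would likely also restrict which exception types are retried.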

📦 Installation

Prerequisites

  • Python 3.9+
  • Poppler (Required for PDF conversion)

Note: The default runtime uses EasyOCR. Tesseract is not required for standard deployment.

Setup

  1. Use a supported Python runtime. Recommended: Python 3.11 (Streamlit Cloud uses runtime.txt).

  2. Clone the repository:

    git clone https://github.com/your-username/blast-ocr.git
    cd blast-ocr

  3. Install dependencies:

    pip install -r requirements.txt

  4. Configure the environment (optional): copy .env.example to .env to customize settings such as GPU usage or the database URL.

🕹️ Usage

Command Line Interface (CLI)

Process a single file or an entire directory:

# Process a single file (Output saved to same directory)
python run.py document.pdf

# Process a directory of images
python run.py data/pages/ --out results/

# Process and specify output folder
python run.py scan.jpg --out my_scans/
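
For readers curious how a command line of this shape could be parsed, here is a hedged sketch using the standard library's argparse. It is not the contents of run.py: the function names `parse_args` and `resolve_output_dir` are hypothetical, and the fallback used for a directory input without `--out` is an assumption.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `<input> [--out DIR]`, matching the commands shown above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("input", type=Path, help="File or directory to process")
    parser.add_argument("--out", type=Path, default=None, help="Output directory")
    return parser.parse_args(argv)


def resolve_output_dir(input_path: Path, out: Optional[Path]) -> Path:
    """Pick the output folder: --out if given, else the input's own directory."""
    if out is not None:
        return out
    # Assumption: a directory input writes into itself, a file input next to itself.
    return input_path if input_path.is_dir() else input_path.parent


if __name__ == "__main__":
    args = parse_args()
    print(resolve_output_dir(args.input, args.out))
```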

Graphical User Interface (GUI)

Launch the interactive dashboard:

python run_gui.py

Or directly via Streamlit:

streamlit run blast_ocr/ui/web_app.py

For Streamlit Community Cloud, use streamlit_app.py as the app entrypoint.

🏗️ Architecture & Documentation

B.L.A.S.T. is fully documented across several technical modules.

The project follows the A.N.T. (Architect, Navigate, Tool) philosophy:

  • Layer 1: Architect (SOPs & Logic): Located in architecture/, defining the core protocols.
  • Layer 2: Navigator (Routing & Control): main.py acts as the central router, directing data flows and handling high-level errors.
  • Layer 3: Tools (Execution): Pure, specialized modules in blast_ocr/core/ (Extractor, Healer, Parallel) that perform the work.

See ARCHITECTURE.md for a deep dive.
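
As a rough illustration of the Navigator and Tools layers, the sketch below routes files to an extractor by extension and runs the work on a thread pool. Everything here is hypothetical: the extractor functions are placeholders that return dummy strings, the names `ROUTES`, `route`, and `process_all` are not taken from the codebase, and the real pipeline parallelizes pages within a document, where this sketch parallelizes across files for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


# Layer 3 (Tools): placeholder extractors standing in for the real modules.
def extract_pdf(path: Path) -> str:
    return f"[pdf text from {path.name}]"


def extract_pptx(path: Path) -> str:
    return f"[pptx text from {path.name}]"


def extract_image(path: Path) -> str:
    return f"[image text from {path.name}]"


# Layer 2 (Navigator): a routing table from file extension to tool.
ROUTES: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".pptx": extract_pptx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".bmp": extract_image,
}


def route(path: Path) -> str:
    """Dispatch one file to the matching tool, or fail loudly."""
    handler = ROUTES.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return handler(path)


def process_all(paths: List[Path], max_workers: int = 4) -> List[str]:
    """Process files on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(route, paths))
```

Keeping the routing table as plain data means a new format only needs one extractor function and one dictionary entry.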

🛡️ Forensic Remediation

This project underwent a comprehensive Forensic Audit in March 2026, resolving 17 critical vulnerabilities. Key improvements include:

  • XXE Protection: XML parsing is hardened against XML External Entity (XXE) attack vectors.
  • Thread Isolation: No data leakage across concurrent user sessions.
  • Memory Stability: VRAM is cleaned up and Autograd graphs are detached in long-running processes.

See AUDIT.md and bug_report_v2.md for full technical details.

⚙️ Configuration

Settings are managed via blast_ocr/config.py and .env.

| Variable | Default | Description |
|---|---|---|
| BLAST_OCR_MAX_WORKERS | 4 | Number of parallel threads |
| BLAST_OCR_MIN_CONFIDENCE | 0.6 | Threshold for low-confidence warnings |
| BLAST_OCR_OCR_GPU | False | Enable GPU acceleration for EasyOCR |
| BLAST_OCR_EASYOCR_DOWNLOAD_ENABLED | True | Allow EasyOCR model download at startup (0/false/off to disable once preloaded) |
| BLAST_OCR_EASYOCR_MODEL_DIR | auto | Optional explicit EasyOCR model cache path (Linux cloud default is /tmp/.EasyOCR/model) |
| BLAST_OCR_POPPLER_PATH | None | (Optional) Path to Poppler bin directory for PDF support |
| BLAST_OCR_RETRY_BACKOFF | 2 | Backoff factor for self-healing retries |
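
To show how variables like these could be read, here is a minimal sketch that loads them from the environment with the documented defaults. It is not the contents of blast_ocr/config.py: the `Settings` dataclass, `load_settings`, and the helper `_as_bool` are hypothetical names, and representing "auto" as `None` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_FALSY = {"0", "false", "off"}  # the documented ways to disable a flag


def _as_bool(raw: str) -> bool:
    """Treat 0/false/off (any case) as False and everything else as True."""
    return raw.strip().lower() not in _FALSY


@dataclass(frozen=True)
class Settings:
    max_workers: int = 4
    min_confidence: float = 0.6
    ocr_gpu: bool = False
    easyocr_download_enabled: bool = True
    easyocr_model_dir: Optional[str] = None  # None stands in for "auto"
    poppler_path: Optional[str] = None
    retry_backoff: float = 2.0


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from BLAST_OCR_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = Settings()

    def raw(name: str) -> Optional[str]:
        return env.get(f"BLAST_OCR_{name}")

    def flag(name: str, default: bool) -> bool:
        value = raw(name)
        return default if value is None else _as_bool(value)

    return Settings(
        max_workers=int(raw("MAX_WORKERS") or d.max_workers),
        min_confidence=float(raw("MIN_CONFIDENCE") or d.min_confidence),
        ocr_gpu=flag("OCR_GPU", d.ocr_gpu),
        easyocr_download_enabled=flag("EASYOCR_DOWNLOAD_ENABLED", d.easyocr_download_enabled),
        easyocr_model_dir=raw("EASYOCR_MODEL_DIR") or d.easyocr_model_dir,
        poppler_path=raw("POPPLER_PATH") or d.poppler_path,
        retry_backoff=float(raw("RETRY_BACKOFF") or d.retry_backoff),
    )
```

Calling `load_settings()` with no argument reads `os.environ`; passing a plain dictionary makes the loader easy to exercise in tests.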

🧪 Testing

B.L.A.S.T. uses a rigorous pytest suite with pytest-cov for branch coverage validation.

To run the full test suite (160+ tests):

python -m pytest tests/ --cov=blast_ocr --cov-report=term-missing

The suite covers:

  • Core Engine: Thread-safety, VRAM management, and preprocessing fallbacks.
  • Cache System: Windows file-lock retry logic and atomic writes.
  • Pipeline: PDF batching, multi-format routing, and temp-dir cleanup.
  • UI & UX: Mocked Streamlit session state and secure upload handlers.
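
As an example of the kind of behaviour the cache tests target, the sketch below pairs an atomic write helper with a pytest-style test. Both are illustrative assumptions and not code from the repository: `atomic_write_text`, its `retries` and `delay` parameters, and the test name are hypothetical. The helper writes to a temporary file in the target's directory and swaps it in with `os.replace`, retrying on `PermissionError`, which is how a file lock typically surfaces on Windows.

```python
import os
import tempfile
import time
from pathlib import Path


def atomic_write_text(target: Path, text: str, retries: int = 3, delay: float = 0.1) -> None:
    """Write `text` to `target` so readers never see a half-written file."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    target = Path(target)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        for attempt in range(retries):
            try:
                os.replace(tmp_name, target)  # swaps the new file in as a single step
                return
            except PermissionError:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)  # target may be briefly locked by another process
    finally:
        # Clean up the temp file if the replace never happened.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)


def test_atomic_write_replaces_whole_file(tmp_path):
    """pytest injects `tmp_path`, a fresh temporary directory."""
    cache_file = tmp_path / "cache.json"
    atomic_write_text(cache_file, "first")
    atomic_write_text(cache_file, "second")
    assert cache_file.read_text(encoding="utf-8") == "second"
    # No stray temp files are left behind.
    assert [p.name for p in tmp_path.iterdir()] == ["cache.json"]
```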

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on testing and code style.

📝 License

MIT License. See LICENSE for details.
