Blueprint. Link. Architect. Stylize. Trigger.
B.L.A.S.T. is a high-performance, deterministic, and self-healing OCR automation agent designed to extract high-quality text from PDFs, PowerPoints (PPTX), and Images. It leverages a rigorous 3-Layer Architecture to ensure reliability, maintainability, and exceptional error handling.
- 🛡️ Robust & Self-Healing: Automatically retries failed OCR operations with exponential backoff and gracefully handles engine failures.
- 📄 Multi-Format Support: Native support for PDF (via Poppler), PPTX (Slide & Table extraction), and standard Images (PNG, JPG, BMP).
- 🔧 Deterministic Output: Produces structured Markdown and formatted DOCX files for every processed document.
- ⚡ Parallel Processing: Optimized for performance with threaded execution for multi-page documents.
- 🖥️ Dual Interface:
  - CLI: Powerful command-line tool for batch processing.
  - GUI: Premium Streamlit Dashboard with Job History, Analytics, and Modern UI.
- 📊 SQLite Integration: Built-in database to track jobs, processing time, and confidence scores.
- 🛡️ 100% Reliability Coverage: Full branch-level test suite for Core, Cache, Pipeline, and UI modules to guard against regressions.
- 🛡️ Forensic Audit (v2.0): All 17 critical security, concurrency, and memory bugs identified in the audit have been resolved.
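
As a rough illustration of how retry-with-backoff and threaded page processing can fit together, here is a minimal sketch. It is not the project's actual implementation: `with_retry`, `ocr_pages`, and the `ocr_page` callable are hypothetical names, and the defaults of 4 workers and a backoff factor of 2 are only meant to mirror the documented configuration defaults.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def with_retry(func: Callable[[], str], attempts: int = 3,
               backoff: float = 2.0, base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Call func, retrying on failure with exponentially growing delays.

    Hypothetical helper: the delay starts at base_delay and is multiplied
    by `backoff` after every failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= backoff


def ocr_pages(pages: List[bytes], ocr_page: Callable[[bytes], str],
              max_workers: int = 4, backoff: float = 2.0) -> List[str]:
    """OCR pages in parallel threads; results keep the input page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so the output
        # order does not depend on which thread finishes first.
        return list(pool.map(
            lambda page: with_retry(lambda: ocr_page(page), backoff=backoff),
            pages,
        ))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; a real engine call would replace the `ocr_page` stand-in.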
- Python 3.9+
- Poppler (Required for PDF conversion)
Note: The default runtime uses EasyOCR. Tesseract is not required for standard deployment.
- Use a supported Python runtime:
  - Recommended: Python 3.11 (Streamlit Cloud uses `runtime.txt`).
- Clone the repository:

      git clone https://github.com/your-username/blast-ocr.git
      cd blast-ocr

- Install dependencies:

      pip install -r requirements.txt

- Configure Environment (Optional): Copy `.env.example` to `.env` to customize settings like GPU usage or Database URL.
Process a single file or an entire directory:

    # Process a single file (Output saved to same directory)
    python run.py document.pdf
    # Process a directory of images
    python run.py data/pages/ --out results/
    # Process and specify output folder
    python run.py scan.jpg --out my_scans/

Launch the interactive dashboard:

    python run_gui.py

Or directly via Streamlit:

    streamlit run blast_ocr/ui/web_app.py

For Streamlit Community Cloud, use `streamlit_app.py` as the app entrypoint.
B.L.A.S.T. is fully documented across several technical modules:
- 🚀 Introduction: Core vision and acronym breakdown.
- 🏗️ Architecture Deep Dive: The A.N.T. model, sequence diagrams, and DB schema.
- 🛡️ Security Hardening: Forensic remediation of XXE, SQLi, and session isolation.
- ⚡ Performance Tuning: VRAM management and parallelism strategies.
- 📖 API Reference: Technical breakdown of core modules.
- 🚀 Deployment Guide: Windows/Linux production setup.
- 🛠️ Troubleshooting: Solutions for common errors and self-healing logic.
- 🧭 OCR Engine Evaluation (2026): Web-backed CPU-first engine analysis.
- 🔁 OCR Transition Playbook: Safe migration and rollback methodology.
- 🗺️ OCR Integration Map: Exact code touchpoints and contracts.
The project follows the A.N.T. (Architect, Navigate, Tool) philosophy:
- Layer 1: Architect (SOPs & Logic): Located in `architecture/`, defining the core protocols.
- Layer 2: Navigator (Routing & Control): `main.py` acts as the central router, directing data flows and handling high-level errors.
- Layer 3: Tools (Execution): Pure, specialized modules in `blast_ocr/core/` (Extractor, Healer, Parallel) that perform the work.
See ARCHITECTURE.md for a deep dive.
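
The Navigator/Tools split can be pictured as a small dispatch table. The sketch below is an assumption about the general shape of such a router, not the code in `main.py`: `route`, `HANDLERS`, and the three `extract_*` functions are hypothetical stand-ins, and the suffix list only reflects the formats named in the feature list.

```python
from pathlib import Path
from typing import Callable, Dict


# Layer 3 (Tools): hypothetical stand-ins for the real extractors.
def extract_pdf(path: Path) -> str:
    return f"[pdf text from {path.name}]"


def extract_pptx(path: Path) -> str:
    return f"[pptx text from {path.name}]"


def extract_image(path: Path) -> str:
    return f"[image text from {path.name}]"


# Layer 2 (Navigator): map file suffixes to the tool that handles them.
HANDLERS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".pptx": extract_pptx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".bmp": extract_image,
}


def route(path: Path) -> str:
    """Pick a tool by file suffix and report unsupported types clearly."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return handler(path)
```

Keeping the tools as plain functions behind one lookup means a new format only needs a new handler and one table entry, while error handling stays in the router.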
This project underwent a comprehensive Forensic Audit in March 2026, resolving 17 critical vulnerabilities. Key improvements include:
- XXE Protection: Full defusal of XML-based attack vectors.
- Thread Isolation: Zero data-leakage across concurrent user sessions.
- Memory Stability: Guaranteed VRAM cleanup and Autograd graph breakage for long-running processes.
See AUDIT.md and bug_report_v2.md for full technical details.
Settings are managed via blast_ocr/config.py and .env.
| Variable | Default | Description |
|---|---|---|
| `BLAST_OCR_MAX_WORKERS` | `4` | Number of parallel threads |
| `BLAST_OCR_MIN_CONFIDENCE` | `0.6` | Threshold for low-confidence warnings |
| `BLAST_OCR_OCR_GPU` | `False` | Enable GPU acceleration for EasyOCR |
| `BLAST_OCR_EASYOCR_DOWNLOAD_ENABLED` | `True` | Allow EasyOCR model download at startup (`0`/`false`/`off` to disable once preloaded) |
| `BLAST_OCR_EASYOCR_MODEL_DIR` | `auto` | Optional explicit EasyOCR model cache path (Linux cloud default is `/tmp/.EasyOCR/model`) |
| `BLAST_OCR_POPPLER_PATH` | `None` | (Optional) Path to Poppler bin directory for PDF support |
| `BLAST_OCR_RETRY_BACKOFF` | `2` | Backoff factor for self-healing retries |
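
To make the variables above concrete, here is a minimal sketch of reading them from the environment with the documented defaults. It is an illustration only and may not match `blast_ocr/config.py`: the `Settings` class, `load_settings`, and `_as_bool` are hypothetical names, and accepting `1`/`true`/`on` as truthy is an assumption that mirrors the documented `0`/`false`/`off` values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "on"}   # assumption: mirror of the documented falsy values
_FALSY = {"0", "false", "off"}  # documented for the download flag


@dataclass(frozen=True)
class Settings:
    max_workers: int = 4
    min_confidence: float = 0.6
    ocr_gpu: bool = False
    easyocr_download_enabled: bool = True
    easyocr_model_dir: Optional[str] = None  # None stands for "auto"
    poppler_path: Optional[str] = None
    retry_backoff: float = 2.0


def _as_bool(raw: Optional[str], default: bool) -> bool:
    """Parse a boolean flag; unset or unrecognized values keep the default."""
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    return default


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from BLAST_OCR_* variables, falling back to defaults."""
    d = Settings()
    return Settings(
        max_workers=int(env.get("BLAST_OCR_MAX_WORKERS", d.max_workers)),
        min_confidence=float(env.get("BLAST_OCR_MIN_CONFIDENCE", d.min_confidence)),
        ocr_gpu=_as_bool(env.get("BLAST_OCR_OCR_GPU"), d.ocr_gpu),
        easyocr_download_enabled=_as_bool(
            env.get("BLAST_OCR_EASYOCR_DOWNLOAD_ENABLED"), d.easyocr_download_enabled
        ),
        easyocr_model_dir=env.get("BLAST_OCR_EASYOCR_MODEL_DIR") or None,
        poppler_path=env.get("BLAST_OCR_POPPLER_PATH") or None,
        retry_backoff=float(env.get("BLAST_OCR_RETRY_BACKOFF", d.retry_backoff)),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes it possible to exercise the parser with a plain dictionary, for example `load_settings({"BLAST_OCR_OCR_GPU": "true"})`.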
B.L.A.S.T. uses a rigorous pytest suite with pytest-cov for branch coverage validation.
To run the full test suite (160+ tests):

    python -m pytest tests/ --cov=blast_ocr --cov-report=term-missing

The suite covers:
- Core Engine: Thread-safety, VRAM management, and preprocessing fallbacks.
- Cache System: Windows file-lock retry logic and atomic writes.
- Pipeline: PDF batching, multi-format routing, and temp-dir cleanup.
- UI & UX: Mocked Streamlit session state and secure upload handlers.
We welcome contributions! Please see CONTRIBUTING.md for guidelines on testing and code style.
MIT License. See LICENSE for details.