🚀 B.L.A.S.T. OCR Engine

Blueprint. Link. Architect. Stylize. Trigger.

B.L.A.S.T. is a high-performance, deterministic, and self-healing OCR automation agent that extracts text from PDFs, PowerPoint files (PPTX), and images. It uses a 3-Layer Architecture to keep the system reliable, maintainable, and robust in its error handling.

🌟 Key Features

🛡️ Robust & Self-Healing: Automatically retries failed OCR operations with exponential backoff and gracefully handles engine failures.
📄 Multi-Format Support: Native support for PDF (via Poppler), PPTX (Slide & Table extraction), and standard Images (PNG, JPG, BMP).
🔧 Deterministic Output: Produces structured Markdown and formatted DOCX files for every processed document.
⚡ Parallel Processing: Optimized for performance with threaded execution for multi-page documents.
🖥️ Dual Interface:
- CLI: Powerful command-line tool for batch processing.
- GUI: Premium Streamlit Dashboard with Job History, Analytics, and Modern UI.
📊 SQLite Integration: Built-in database that tracks jobs, processing time, and confidence scores.
🛡️ 100% Reliability Coverage: Full branch-level test suite for the Core, Cache, Pipeline, and UI modules to guard against regressions.
🛡️ Forensic Audit (v2.0): All 17 critical security, concurrency, and memory bugs identified in the forensic audit have been resolved.
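
The retry and threading features listed above can be illustrated with a minimal sketch. It is not the project's implementation: `retry_with_backoff`, `process_pages`, and their parameters are hypothetical names, and the per-page OCR call is passed in as a plain function. The defaults mirror `BLAST_OCR_RETRY_BACKOFF` (2) and `BLAST_OCR_MAX_WORKERS` (4) from the Configuration section.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
P = TypeVar("P")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    backoff_factor: float = 2.0,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), waiting base_delay * backoff_factor**attempt between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * backoff_factor ** attempt)
    raise AssertionError("unreachable")


def process_pages(
    pages: Sequence[P],
    ocr_page: Callable[[P], T],
    max_workers: int = 4,
) -> List[T]:
    """Run ocr_page on every page in a thread pool, retrying each page on failure."""

    def run_one(page: P) -> T:
        return retry_with_backoff(lambda: ocr_page(page))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the output order is stable.
        return list(pool.map(run_one, pages))
```

Because `pool.map` preserves input order, the page results come back in the same sequence regardless of which thread finishes first, which is one simple way to keep threaded output deterministic.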

📦 Installation

Prerequisites

Python 3.9+
Poppler (Required for PDF conversion)

Note: The default runtime uses EasyOCR. Tesseract is not required for standard deployment.

Setup

Use a supported Python runtime:
- Recommended: Python 3.11 (Streamlit Cloud uses runtime.txt).

Clone the repository:

git clone https://github.com/your-username/blast-ocr.git
cd blast-ocr

Install dependencies:
```
pip install -r requirements.txt
```
Configure Environment (Optional): Copy .env.example to .env to customize settings like GPU usage or Database URL.

🕹️ Usage

Command Line Interface (CLI)

Process a single file or an entire directory:

# Process a single file (Output saved to same directory)
python run.py document.pdf

# Process a directory of images
python run.py data/pages/ --out results/

# Process and specify output folder
python run.py scan.jpg --out my_scans/

Graphical User Interface (GUI)

Launch the interactive dashboard:

python run_gui.py

Or directly via Streamlit:

streamlit run blast_ocr/ui/web_app.py

For Streamlit Community Cloud, use streamlit_app.py as the app entrypoint.

🏗️ Architecture & Documentation

B.L.A.S.T. is fully documented across several technical modules:

🚀 Introduction: Core vision and acronym breakdown.
🏗️ Architecture Deep Dive: The A.N.T. model, sequence diagrams, and DB schema.
🛡️ Security Hardening: Forensic remediation of XXE, SQLi, and session isolation.
⚡ Performance Tuning: VRAM management and parallelism strategies.
📖 API Reference: Technical breakdown of core modules.
🚀 Deployment Guide: Windows/Linux production setup.
🛠️ Troubleshooting: Solutions for common errors and self-healing logic.
🧭 OCR Engine Evaluation (2026): Web-backed CPU-first engine analysis.
🔁 OCR Transition Playbook: Safe migration and rollback methodology.
🗺️ OCR Integration Map: Exact code touchpoints and contracts.

The project follows the A.N.T. (Architect, Navigate, Tool) philosophy:

Layer 1: Architect (SOPs & Logic): Located in architecture/, defining the core protocols.
Layer 2: Navigator (Routing & Control): main.py acts as the central router, directing data flows and handling high-level errors.
Layer 3: Tools (Execution): Pure, specialized modules in blast_ocr/core/ (Extractor, Healer, Parallel) that perform the work.
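
As a rough illustration of how a Layer 2 router might hand work to Layer 3 tools, the sketch below dispatches on file extension. Everything in it is an assumption made for the example: the `route` function, the `ROUTES` table, and the stub handlers are hypothetical and do not reflect the actual contents of main.py or blast_ocr/core/.

```python
from pathlib import Path
from typing import Callable, Dict, Union


# Hypothetical stand-ins for Layer 3 tools; real tools would perform OCR.
def extract_pdf(path: Path) -> str:
    return f"[pdf] {path.name}"


def extract_pptx(path: Path) -> str:
    return f"[pptx] {path.name}"


def extract_image(path: Path) -> str:
    return f"[image] {path.name}"


# Layer 2: a routing table from file extension to the tool that handles it.
ROUTES: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".pptx": extract_pptx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".bmp": extract_image,
}


def route(path: Union[str, Path]) -> str:
    """Pick a tool by extension and surface unsupported formats as a clear error."""
    path = Path(path)
    handler = ROUTES.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return handler(path)
```

Keeping the routing table separate from the handlers means a new format only needs one new entry and one new tool function.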

See ARCHITECTURE.md for a deep dive.

🛡️ Forensic Remediation

This project underwent a comprehensive Forensic Audit in March 2026, resolving 17 critical vulnerabilities. Key improvements include:

XXE Protection: Full defusal of XML-based attack vectors.
Thread Isolation: Zero data-leakage across concurrent user sessions.
Memory Stability: VRAM cleanup and release of Autograd graph references in long-running processes.
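
To show what XXE protection generally involves, here is a minimal standard-library sketch that refuses any XML document containing a DOCTYPE declaration before handing it to ElementTree. It is an illustration of the general technique only, not the fix described in AUDIT.md; `safe_fromstring` and `DTDForbidden` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when an XML document declares a DOCTYPE."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Entity definitions live inside the DTD, so rejecting the DOCTYPE
    # blocks both external-entity and entity-expansion payloads.
    raise DTDForbidden(f"DOCTYPE declarations are not allowed: {name}")


def safe_fromstring(data: bytes) -> ET.Element:
    """Parse XML only after a first pass confirms it has no DOCTYPE."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(data, True)  # raises DTDForbidden or expat.ExpatError
    return ET.fromstring(data)
```

The document is parsed twice, which trades some speed for a simple check that needs no third-party packages.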

See AUDIT.md and bug_report_v2.md for full technical details.

⚙️ Configuration

Settings are managed via blast_ocr/config.py and .env.

| Variable | Default | Description |
| --- | --- | --- |
| `BLAST_OCR_MAX_WORKERS` | 4 | Number of parallel threads |
| `BLAST_OCR_MIN_CONFIDENCE` | 0.6 | Threshold for low-confidence warnings |
| `BLAST_OCR_OCR_GPU` | False | Enable GPU acceleration for EasyOCR |
| `BLAST_OCR_EASYOCR_DOWNLOAD_ENABLED` | True | Allow EasyOCR model download at startup (`0/false/off` to disable once preloaded) |
| `BLAST_OCR_EASYOCR_MODEL_DIR` | auto | Optional explicit EasyOCR model cache path (Linux cloud default is `/tmp/.EasyOCR/model`) |
| `BLAST_OCR_POPPLER_PATH` | None | (Optional) Path to Poppler `bin` directory for PDF support |
| `BLAST_OCR_RETRY_BACKOFF` | 2 | Backoff factor for self-healing retries |
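
The sketch below shows one way these variables could be read from the environment with the defaults listed above. It is an assumption-laden illustration rather than the contents of blast_ocr/config.py: the `Settings` class and `load_settings` function are hypothetical, and any non-empty value other than `0`, `false`, or `off` is treated as true.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_FALSE_VALUES = {"0", "false", "off"}


def _as_bool(raw: Optional[str], default: bool) -> bool:
    """Interpret 0/false/off as False; fall back to the default when unset."""
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() not in _FALSE_VALUES


@dataclass(frozen=True)
class Settings:
    max_workers: int = 4
    min_confidence: float = 0.6
    ocr_gpu: bool = False
    easyocr_download_enabled: bool = True
    easyocr_model_dir: Optional[str] = None  # None stands for "auto"
    poppler_path: Optional[str] = None
    retry_backoff: float = 2.0


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        max_workers=int(env.get("BLAST_OCR_MAX_WORKERS", "4")),
        min_confidence=float(env.get("BLAST_OCR_MIN_CONFIDENCE", "0.6")),
        ocr_gpu=_as_bool(env.get("BLAST_OCR_OCR_GPU"), False),
        easyocr_download_enabled=_as_bool(
            env.get("BLAST_OCR_EASYOCR_DOWNLOAD_ENABLED"), True
        ),
        easyocr_model_dir=env.get("BLAST_OCR_EASYOCR_MODEL_DIR") or None,
        poppler_path=env.get("BLAST_OCR_POPPLER_PATH") or None,
        retry_backoff=float(env.get("BLAST_OCR_RETRY_BACKOFF", "2")),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.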

🧪 Testing

B.L.A.S.T. uses a rigorous pytest suite with pytest-cov for branch coverage validation.

To run the full test suite (160+ tests):

python -m pytest tests/ --cov=blast_ocr --cov-report=term-missing

The suite covers:

Core Engine: Thread-safety, VRAM management, and preprocessing fallbacks.
Cache System: Windows file-lock retry logic and atomic writes.
Pipeline: PDF batching, multi-format routing, and temp-dir cleanup.
UI & UX: Mocked Streamlit session state and secure upload handlers.
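
For readers unfamiliar with the cache behaviour being tested, the following sketch shows the general pattern of an atomic write with a retry for transient Windows file locks. It is only an assumed illustration of the technique; `atomic_write_text` and its parameters are hypothetical and not taken from the project's cache module.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Union


def atomic_write_text(
    target: Union[str, Path],
    text: str,
    attempts: int = 5,
    delay: float = 0.1,
) -> None:
    """Write text to a temp file, then swap it into place in one step."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    target = Path(target)
    # The temp file lives in the target's directory so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        for attempt in range(attempts):
            try:
                os.replace(tmp_name, target)
                return
            except PermissionError:
                # On Windows another process may briefly hold the target open.
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)
    finally:
        # Clean up the temp file if the replace never succeeded.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

Readers of the target file see either the old contents or the new contents, never a partially written file.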

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on testing and code style.

📝 License

MIT License. See LICENSE for details.

