DataFog · sidmohan0 · May 26, 2026
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 4.3.0
+current_version = 4.4.0a5
 commit = True
 tag = True
 tag_name = v{new_version}
@@ -20,7 +20,3 @@ values =
 [bumpversion:file:datafog/__about__.py]
 search = __version__ = "{current_version}"
 replace = __version__ = "{new_version}"
-
-[bumpversion:file:setup.py]
-search = version="{current_version}"
-replace = version="{new_version}"
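
Reviewer note: after this change, `datafog/__about__.py` is the only file the config rewrites. The following is a minimal Python sketch of the `search`/`replace` template substitution that section describes. The function name `apply_bump` is hypothetical, and this is not bumpversion's actual implementation; it only mirrors the template semantics visible in the config.

```python
# Templates copied from the [bumpversion:file:datafog/__about__.py] section.
SEARCH = '__version__ = "{current_version}"'
REPLACE = '__version__ = "{new_version}"'


def apply_bump(text: str, current_version: str, new_version: str) -> str:
    """Swap the rendered search string for the rendered replace string."""
    needle = SEARCH.format(current_version=current_version)
    if needle not in text:
        # A file that lacks the expected version string cannot be bumped.
        raise ValueError(f"search pattern not found: {needle!r}")
    return text.replace(needle, REPLACE.format(new_version=new_version))


if __name__ == "__main__":
    before = '__version__ = "4.3.0"\n'
    print(apply_bump(before, "4.3.0", "4.4.0a5"), end="")
    # prints: __version__ = "4.4.0a5"
```

In this sketch, a file that does not contain the rendered search string raises an error instead of being silently skipped.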
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -31,14 +31,6 @@ jobs:
       matrix:
         python-version: ["3.10", "3.11", "3.12", "3.13"]
         install-profile: ["core", "nlp", "nlp-advanced"]
-        exclude:
-          # v4.4.0 claims Python 3.13 support for core + CLI first.
-          # Optional heavyweight profiles remain validated separately before
-          # we advertise Python 3.13 support for them.
-          - python-version: "3.13"
-            install-profile: "nlp"
-          - python-version: "3.13"
-            install-profile: "nlp-advanced"
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python
@@ -159,6 +151,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
+        python-version: ["3.11"]
         install-profile:
           - core
           - cli
@@ -167,18 +160,31 @@ jobs:
           - ocr
           - distributed
           - web
+        include:
+          - python-version: "3.13"
+            install-profile: nlp
+          - python-version: "3.13"
+            install-profile: nlp-advanced
+          - python-version: "3.13"
+            install-profile: ocr
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: "3.11"
+          python-version: ${{ matrix.python-version }}
           cache: "pip"
 
       - name: Upgrade pip
         run: |
           python -m pip install --upgrade pip
 
+      - name: Install Tesseract OCR
+        if: matrix.install-profile == 'ocr'
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y tesseract-ocr libtesseract-dev
+
       - name: Install dependencies (core)
         if: matrix.install-profile == 'core'
         run: |
@@ -192,6 +198,7 @@ jobs:
       - name: Run install profile smoke test
         env:
           DATAFOG_INSTALL_PROFILE: ${{ matrix.install-profile }}
+          DATAFOG_REQUIRE_TESSERACT: ${{ matrix.install-profile == 'ocr' && '1' || '' }}
         run: |
           pytest tests/test_install_profiles.py -q
 

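Reviewer note: the expression `${{ matrix.install-profile == 'ocr' && '1' || '' }}` sets `DATAFOG_REQUIRE_TESSERACT` to `1` for the `ocr` profile and to an empty string for every other profile. The diff does not show how `tests/test_install_profiles.py` reads that variable, so the Python sketch below is only one plausible way a smoke test could consume it. The helper names `tesseract_required` and `check_tesseract` are hypothetical.

```python
import os
import shutil


def tesseract_required(env=None):
    """Return True only when the CI job set DATAFOG_REQUIRE_TESSERACT to '1'."""
    env = os.environ if env is None else env
    return env.get("DATAFOG_REQUIRE_TESSERACT", "") == "1"


def check_tesseract(env=None, which=shutil.which):
    """Return 'ok' if the binary is on PATH, 'skipped' if it is missing but
    optional, and raise if it is missing while the job requires it."""
    if which("tesseract") is not None:
        return "ok"
    if tesseract_required(env):
        raise RuntimeError("tesseract is required for this profile but is not on PATH")
    return "skipped"
```

Under this assumption, a missing binary fails the `ocr` job and is tolerated on every other profile.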
diff --git a/.gitignore b/.gitignore
@@ -24,6 +24,7 @@ error_log.txt
 # Environment
 .env
 .venv
+.venv*/
 venv/
 env/
 examples/venv/
@@ -58,14 +59,15 @@ docs/*
 !docs/conf.py
 !docs/Makefile
 !docs/make.bat
+!docs/optional-surfaces.rst
+!docs/agents/
+!docs/agents/**
 !docs/audit/
 !docs/audit/**
 
 # Keep all directories but ignore their contents
 */**/__pycache__/
 
-# Keep all files but ignore their contents
-Claude.md
 notes/benchmarking_notes.md
 Roadmap.md
 notes/*
diff --git a/Claude.md → AGENTS.md b/Claude.md → AGENTS.md
@@ -1,18 +1,26 @@
-# DataFog - Claude Development Guide
+# DataFog - Agent Development Guide
 
 ## Project Overview
+
 **DataFog** is an open-source Python library for PII detection and anonymization with a focus on speed and lightweight architecture.
 
 ## Core Value Proposition
+
 - **Ultra-Fast Performance**: 190x faster than spaCy for structured PII, 32x faster with GLiNER
 - **Lightweight Core**: <2MB package with optional ML extras
 - **Modern Engine Options**: Regex, GLiNER, spaCy, and smart cascading
 - **Production Ready**: Comprehensive testing, CI/CD, and performance validation
 
 ## Current Project Status
-**Version: 4.3.0**
+
+**Stable version: 4.4.0**
+
+**Development version: 4.4.0a5**
+
+**Next minor target: 4.5.0**
 
 ### ✅ Recently Completed (Latest)
+
 - **GLiNER Integration**: Modern NER engine with PII-specialized models
 - **Smart Cascading**: Intelligent regex → GLiNER → spaCy progression
 - **Enhanced CLI**: Model management with `--engine` flags
@@ -43,6 +51,7 @@ python -c "from datafog.services.text_service import TextService; print('✅ All
 ## Architecture Overview
 
 ### Engine Ecosystem (Updated with GLiNER)
+
 ```python
 from datafog.services.text_service import TextService
 
@@ -59,37 +68,42 @@ auto_service = TextService(engine="auto")        # Legacy: regex→spaCy
 ```
 
 ### Performance Comparison (Validated)
-| Engine  | Speed vs spaCy | Accuracy | Use Case | Install |
-|---------|----------------|----------|----------|---------|
-| `regex` | **190x faster** | High (structured) | Emails, phones, SSNs | Core only |
-| `gliner` | **32x faster** | Very High | Modern NER, custom entities | `[nlp-advanced]` |
-| `spacy` | 1x (baseline) | Good | Traditional NLP | `[nlp]` |
-| `smart` | **60x faster** | Highest | Best balance | `[nlp-advanced]` |
+
+| Engine   | Speed vs spaCy  | Accuracy          | Use Case                    | Install          |
+| -------- | --------------- | ----------------- | --------------------------- | ---------------- |
+| `regex`  | **190x faster** | High (structured) | Emails, phones, SSNs        | Core only        |
+| `gliner` | **32x faster**  | Very High         | Modern NER, custom entities | `[nlp-advanced]` |
+| `spacy`  | 1x (baseline)   | Good              | Traditional NLP             | `[nlp]`          |
+| `smart`  | **60x faster**  | Highest           | Best balance                | `[nlp-advanced]` |
 
 ### Dependency Strategy
+
 ```python
 # Lightweight core (<2MB)
 pip install datafog
 
 # Optional ML engines
 pip install datafog[nlp]           # spaCy (traditional NLP)
-pip install datafog[nlp-advanced]  # GLiNER (modern NER) 
+pip install datafog[nlp-advanced]  # GLiNER (modern NER)
 pip install datafog[ocr]           # Image processing
 pip install datafog[all]           # Everything
 ```
 
 ## GLiNER Integration (NEW)
 
 ### Overview
+
 GLiNER (Generalist Model for Named Entity Recognition) provides modern, accurate NER capabilities optimized for PII detection.
 
 ### Key Features
+
 - **PII-Specialized Models**: `urchade/gliner_multi_pii-v1` trained specifically for PII
 - **Custom Entity Types**: Configurable entity detection beyond default PII types
 - **Smart Cascading**: Automatically tries regex first, GLiNER second, spaCy last
 - **CLI Management**: Download and manage GLiNER models via CLI
 
 ### Usage Examples
+
 ```python
 # GLiNER engine
 from datafog.services.text_service import TextService
@@ -108,6 +122,7 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
 ```
 
 ### Available GLiNER Models
+
 - `urchade/gliner_multi_pii-v1` - PII-specialized (recommended)
 - `urchade/gliner_base` - General purpose starter
 - `urchade/gliner_large-v2` - Higher accuracy
@@ -116,17 +131,19 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
 ## Development Workflow
 
 ### Git Branch Strategy
+
 - **main**: Production releases only
 - **dev**: Main development branch (use this)
-- **feature/***: New features from dev
-- **fix/***: Bug fixes from dev
+- **feature/\***: New features from dev
+- **fix/\***: Bug fixes from dev
 
 ### Making Changes
+
 ```bash
 # Start from dev
 git checkout dev && git pull origin dev
 
-# Create feature branch  
+# Create feature branch
 git checkout -b feature/your-change
 
 # Make changes, test, commit
@@ -137,6 +154,7 @@ git push -u origin feature/your-change
 ```
 
 ### Testing
+
 ```bash
 # Run specific test suites
 pytest tests/test_text_service.py -v           # Core functionality
@@ -149,13 +167,14 @@ PYTEST_DONUT=yes pytest tests/test_ocr_integration.py  # OCR with real models
 
 # Performance requirements
 # - Regex: 150x+ faster than spaCy
-# - GLiNER: 25x+ faster than spaCy  
+# - GLiNER: 25x+ faster than spaCy
 # - Package size: Core <2MB, full <8MB
 ```
 
 ## Key Implementation Patterns
 
 ### Simple API (Recommended)
+
 ```python
 # Always available, lightweight
 from datafog import detect, process
@@ -164,6 +183,7 @@ result = process("john@example.com", method="redact")
 ```
 
 ### Advanced Engine Selection
+
 ```python
 # For specialized use cases
 from datafog.services.text_service import TextService
@@ -173,7 +193,7 @@ service = TextService(engine="regex")
 
 # Modern NER with custom entities
 service = TextService(
-    engine="gliner", 
+    engine="gliner",
     gliner_model="urchade/gliner_base"
 )
 
@@ -182,6 +202,7 @@ service = TextService(engine="smart")
 ```
 
 ### Graceful Degradation
+
 ```python
 # Handles missing dependencies elegantly
 try:
@@ -194,18 +215,21 @@ except ImportError:
 ## Common Tasks
 
 ### Adding New Entity Types
+
 1. Update regex patterns in `regex_annotator.py`
 2. Add GLiNER entity types in `gliner_annotator.py`
 3. Update tests and benchmarks
 4. Validate performance doesn't regress >10%
 
 ### Performance Optimization
+
 1. Profile with existing benchmarks
 2. Maintain speed thresholds (regex 150x+, GLiNER 25x+)
 3. Update baselines when making improvements
 4. Test across all engines
 
 ### CLI Enhancements
+
 1. Update `client.py` with new commands
 2. Support `--engine` flag for multi-engine commands
 3. Add comprehensive help text and examples
@@ -215,31 +239,36 @@ except ImportError:
 
 ### Workflow Architecture (3 workflows)
 
-| Workflow | Purpose | Trigger |
-|----------|---------|---------|
-| `ci.yml` | Lint + Test + Coverage + Wheel size | Push/PR to main/dev |
-| `release.yml` | Alpha/Beta/Stable publishing | Schedule + manual dispatch |
-| `benchmark.yml` | Performance benchmarks | Push/PR/weekly |
+| Workflow        | Purpose                             | Trigger                    |
+| --------------- | ----------------------------------- | -------------------------- |
+| `ci.yml`        | Lint + Test + Coverage + Wheel size | Push/PR to main/dev        |
+| `release.yml`   | Alpha/Beta/Stable publishing        | Schedule + manual dispatch |
+| `benchmark.yml` | Performance benchmarks              | Push/PR/weekly             |
 
 ### Release Cadence
+
 - **Alpha** (Mon-Wed 2AM UTC): Automatic from `dev`, date+commit versioning
 - **Beta** (Thursday 2AM UTC): Automatic from `dev`, incremental beta numbers
 - **Stable** (manual dispatch): From `main`, base version or override
 
 ### Release Pipeline
+
 `determine-release` → `test` → `publish` → `cleanup`
+
 - Tests are a hard gate — no tests = no publish
 - Stable releases check out `main`; alpha/beta check out `dev`
 - Old alphas pruned to 7, betas to 5
 - `[skip ci]` in version bump commits to prevent loops
 
 ### Pre-commit Hooks
+
 - **isort**, **black**, **flake8**, **ruff**: Code formatting and linting
 - **prettier**: Markdown, JSON, YAML formatting
 - **gitleaks**: Secret scanning
 - **pre-commit-hooks**: Large file checks, merge conflict detection, YAML validation
 
 ## Environment Variables
+
 ```bash
 # Testing configuration
 export PYTEST_DONUT=yes              # Enable real OCR testing
@@ -250,33 +279,51 @@ export PYTHONPATH=$(pwd)             # Local development imports
 ```
 
 ## Performance Requirements
+
 - **Core Package**: <2MB (from ~8MB in v4.0.x)
 - **Regex Engine**: 150x+ faster than spaCy (currently 190x)
-- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)  
+- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)
 - **Memory Usage**: Graceful handling of large texts (1MB+ chunks)
 - **Model Loading**: Cache GLiNER models to avoid repeated downloads
 
-## Best Practices for Claude Agents
+## Agent skills
+
+### Issue tracker
+
+Issues and PRDs are tracked in Linear under the DFPY team. See `docs/agents/issue-tracker.md`.
+
+### Triage labels
+
+Use the default five-label triage vocabulary. See `docs/agents/triage-labels.md`.
+
+### Domain docs
+
+Single-context repo: use root `CONTEXT.md` and root `docs/adr/` when present. See `docs/agents/domain.md`.
+
+## Best Practices for Agents
 
 Before beginning any task please checkout a branch from `dev` and create a pull request to `dev`.
 
 ### Code Quality
+
 - Follow existing patterns before implementing new approaches
 - Add comprehensive tests for all new functionality
 - Update documentation immediately with code changes
 - Run benchmarks for any text processing modifications
 
 ### GLiNER Development
+
 - Use PII-specialized models when available (`urchade/gliner_multi_pii-v1`)
 - Test graceful degradation when GLiNER dependencies missing
 - Validate smart cascading thresholds with real data
 - Consider model download time and caching strategies
 
 ### Release Preparation
+
 - Alpha/beta releases are automated via `release.yml` schedule
 - Stable releases: merge `dev` → `main`, then trigger `release.yml` with `stable` type
 - Use `dry_run: true` to validate before actual publish
 - Performance validation on realistic data sets
-- In Release Notes or Comments, do not reference that it was authored by Claude (all code is anonymously authored)
+- In Release Notes or Comments, do not reference that it was authored by an AI agent (all code is anonymously authored)
 
-This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.
+This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.
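
Reviewer note: the guide describes smart cascading as trying regex first, GLiNER second, and spaCy last. It does not show the stopping rule. The sketch below illustrates the general pattern with plain callables and assumes the cascade stops at the first stage that returns any entities. The names `regex_stage`, `stub_ner_stage`, and `cascade` are hypothetical, and none of this is DataFog's actual `TextService` implementation.

```python
import re
from typing import Callable, Dict, List, Sequence

Entities = Dict[str, List[str]]
Stage = Callable[[str], Entities]

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_stage(text: str) -> Entities:
    """Cheap first stage: structured PII via a regular expression."""
    found = EMAIL_RE.findall(text)
    return {"EMAIL": found} if found else {}


def stub_ner_stage(text: str) -> Entities:
    """Stand-in for a heavier model-backed stage (hypothetical stub)."""
    return {"PERSON": ["Jane Doe"]} if "Jane Doe" in text else {}


def cascade(text: str, stages: Sequence[Stage]) -> Entities:
    """Run stages in order and return the first non-empty result.

    Assumption: an empty result means "fall through to the next stage".
    """
    for stage in stages:
        result = stage(text)
        if result:
            return result
    return {}


if __name__ == "__main__":
    stages = [regex_stage, stub_ner_stage]
    print(cascade("mail john@example.com", stages))  # {'EMAIL': ['john@example.com']}
    print(cascade("Jane Doe called", stages))        # {'PERSON': ['Jane Doe']}
```

Ordering the stages from cheapest to most expensive is what lets this kind of cascade skip the heavier stages on structured input.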