6 changes: 1 addition & 5 deletions .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 4.3.0
current_version = 4.4.0a5
commit = True
tag = True
tag_name = v{new_version}
@@ -20,7 +20,3 @@ values =
[bumpversion:file:datafog/__about__.py]
search = __version__ = "{current_version}"
replace = __version__ = "{new_version}"

[bumpversion:file:setup.py]
search = version="{current_version}"
replace = version="{new_version}"
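The `search` / `replace` pairs above act as string templates that are filled in with the old and new version and then substituted into the target file. A minimal Python sketch of that idea follows. It is a simplified model for illustration, not bumpversion's actual implementation, and `bump_file_text` is a hypothetical helper name.

```python
def bump_file_text(text: str, search: str, replace: str,
                   current_version: str, new_version: str) -> str:
    """Apply a bumpversion-style search/replace template to file text."""
    context = {"current_version": current_version, "new_version": new_version}
    old = search.format(**context)
    new = replace.format(**context)
    if old not in text:
        # Failing loudly avoids silently leaving a stale version string behind.
        raise ValueError(f"pattern not found: {old!r}")
    return text.replace(old, new)


# Templates copied from the [bumpversion:file:datafog/__about__.py] section.
about = '__version__ = "4.3.0"\n'
bumped = bump_file_text(
    about,
    search='__version__ = "{current_version}"',
    replace='__version__ = "{new_version}"',
    current_version="4.3.0",
    new_version="4.4.0a5",
)
```

With the `setup.py` section removed, only `datafog/__about__.py` is rewritten on a bump.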
25 changes: 16 additions & 9 deletions .github/workflows/ci.yml
@@ -31,14 +31,6 @@ jobs:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13"]
install-profile: ["core", "nlp", "nlp-advanced"]
exclude:
# v4.4.0 claims Python 3.13 support for core + CLI first.
# Optional heavyweight profiles remain validated separately before
# we advertise Python 3.13 support for them.
- python-version: "3.13"
install-profile: "nlp"
- python-version: "3.13"
install-profile: "nlp-advanced"
steps:
- uses: actions/checkout@v4
- name: Set up Python
@@ -159,6 +151,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.11"]
install-profile:
- core
- cli
@@ -167,18 +160,31 @@ jobs:
- ocr
- distributed
- web
include:
- python-version: "3.13"
install-profile: nlp
- python-version: "3.13"
install-profile: nlp-advanced
- python-version: "3.13"
install-profile: ocr
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Upgrade pip
run: |
python -m pip install --upgrade pip

- name: Install Tesseract OCR
if: matrix.install-profile == 'ocr'
run: |
sudo apt-get update
sudo apt-get install -y tesseract-ocr libtesseract-dev

- name: Install dependencies (core)
if: matrix.install-profile == 'core'
run: |
@@ -192,6 +198,7 @@ jobs:
- name: Run install profile smoke test
env:
DATAFOG_INSTALL_PROFILE: ${{ matrix.install-profile }}
DATAFOG_REQUIRE_TESSERACT: ${{ matrix.install-profile == 'ocr' && '1' || '' }}
run: |
pytest tests/test_install_profiles.py -q
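The workflow sets `DATAFOG_REQUIRE_TESSERACT` to `1` only for the `ocr` profile and to an empty string otherwise. The diff does not show how `tests/test_install_profiles.py` reads the variable, so the following Python sketch is an assumption: one plausible way a test could turn a missing Tesseract binary into a hard failure when the flag is set and a skip when it is not. `tesseract_check` is a hypothetical helper.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def tesseract_check(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return 'ok', 'skip', or 'fail' for the Tesseract binary check."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "1" means "required".
    required = env.get("DATAFOG_REQUIRE_TESSERACT", "") == "1"
    if which("tesseract") is not None:
        return "ok"
    return "fail" if required else "skip"
```

In a pytest suite, `'skip'` would typically map to `pytest.skip(...)` and `'fail'` to `pytest.fail(...)`.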

6 changes: 4 additions & 2 deletions .gitignore
@@ -24,6 +24,7 @@ error_log.txt
# Environment
.env
.venv
.venv*/
venv/
env/
examples/venv/
@@ -58,14 +59,15 @@ docs/*
!docs/conf.py
!docs/Makefile
!docs/make.bat
!docs/optional-surfaces.rst
!docs/agents/
!docs/agents/**
!docs/audit/
!docs/audit/**

# Keep all directories but ignore their contents
*/**/__pycache__/

# Keep all files but ignore their contents
Claude.md
notes/benchmarking_notes.md
Roadmap.md
notes/*
93 changes: 70 additions & 23 deletions Claude.md → AGENTS.md
@@ -1,18 +1,26 @@
# DataFog - Claude Development Guide
# DataFog - Agent Development Guide

## Project Overview

**DataFog** is an open-source Python library for PII detection and anonymization with a focus on speed and lightweight architecture.

## Core Value Proposition

- **Ultra-Fast Performance**: 190x faster than spaCy for structured PII, 32x faster with GLiNER
- **Lightweight Core**: <2MB package with optional ML extras
- **Modern Engine Options**: Regex, GLiNER, spaCy, and smart cascading
- **Production Ready**: Comprehensive testing, CI/CD, and performance validation

## Current Project Status
**Version: 4.3.0**

**Stable version: 4.4.0**

**Development version: 4.4.0a5**

**Next minor target: 4.5.0**

### ✅ Recently Completed (Latest)

- **GLiNER Integration**: Modern NER engine with PII-specialized models
- **Smart Cascading**: Intelligent regex → GLiNER → spaCy progression
- **Enhanced CLI**: Model management with `--engine` flags
@@ -43,6 +51,7 @@ python -c "from datafog.services.text_service import TextService; print('✅ All
## Architecture Overview

### Engine Ecosystem (Updated with GLiNER)

```python
from datafog.services.text_service import TextService

@@ -59,37 +68,42 @@ auto_service = TextService(engine="auto") # Legacy: regex→spaCy
```

### Performance Comparison (Validated)
| Engine | Speed vs spaCy | Accuracy | Use Case | Install |
|---------|----------------|----------|----------|---------|
| `regex` | **190x faster** | High (structured) | Emails, phones, SSNs | Core only |
| `gliner` | **32x faster** | Very High | Modern NER, custom entities | `[nlp-advanced]` |
| `spacy` | 1x (baseline) | Good | Traditional NLP | `[nlp]` |
| `smart` | **60x faster** | Highest | Best balance | `[nlp-advanced]` |

| Engine | Speed vs spaCy | Accuracy | Use Case | Install |
| -------- | --------------- | ----------------- | --------------------------- | ---------------- |
| `regex` | **190x faster** | High (structured) | Emails, phones, SSNs | Core only |
| `gliner` | **32x faster** | Very High | Modern NER, custom entities | `[nlp-advanced]` |
| `spacy` | 1x (baseline) | Good | Traditional NLP | `[nlp]` |
| `smart` | **60x faster** | Highest | Best balance | `[nlp-advanced]` |

### Dependency Strategy

```python
# Lightweight core (<2MB)
pip install datafog

# Optional ML engines
pip install datafog[nlp] # spaCy (traditional NLP)
pip install datafog[nlp-advanced] # GLiNER (modern NER)
pip install datafog[nlp-advanced] # GLiNER (modern NER)
pip install datafog[ocr] # Image processing
pip install datafog[all] # Everything
```

## GLiNER Integration (NEW)

### Overview

GLiNER (Generalist Model for Named Entity Recognition) provides modern, accurate NER capabilities optimized for PII detection.

### Key Features

- **PII-Specialized Models**: `urchade/gliner_multi_pii-v1` trained specifically for PII
- **Custom Entity Types**: Configurable entity detection beyond default PII types
- **Smart Cascading**: Automatically tries regex first, GLiNER second, spaCy last
- **CLI Management**: Download and manage GLiNER models via CLI
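
The cascade order described above (regex first, GLiNER second, spaCy last) can be sketched as a list of detectors tried in turn. This is an illustration only, not DataFog's implementation: the stop condition (first non-empty result), the email pattern, and the stand-in `fake_ner` detector are all assumptions, and no GLiNER or spaCy model is loaded.

```python
import re
from typing import Callable, Dict, List

Detector = Callable[[str], Dict[str, List[str]]]

# Hypothetical, deliberately simple email pattern for the regex stage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def regex_detector(text: str) -> Dict[str, List[str]]:
    """Cheap first stage: structured PII via regular expressions."""
    found = EMAIL_RE.findall(text)
    return {"EMAIL": found} if found else {}


def fake_ner(text: str) -> Dict[str, List[str]]:
    """Stand-in for a model-backed stage such as GLiNER or spaCy."""
    return {"PERSON": ["Alice"]} if "Alice" in text else {}


def cascade(text: str, detectors: List[Detector]) -> Dict[str, List[str]]:
    """Try each detector in order; return the first non-empty result."""
    for detect in detectors:
        result = detect(text)
        if result:
            return result
    return {}


stages = [regex_detector, fake_ner]
structured = cascade("mail john@example.com", stages)
unstructured = cascade("Alice called yesterday", stages)
```

Putting the cheapest stage first means the slower model-backed stages only run when the regex stage finds nothing.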

### Usage Examples

```python
# GLiNER engine
from datafog.services.text_service import TextService
@@ -108,6 +122,7 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
```

### Available GLiNER Models

- `urchade/gliner_multi_pii-v1` - PII-specialized (recommended)
- `urchade/gliner_base` - General purpose starter
- `urchade/gliner_large-v2` - Higher accuracy
@@ -116,17 +131,19 @@ subprocess.run(["datafog", "list-models", "--engine", "gliner"])
## Development Workflow

### Git Branch Strategy

- **main**: Production releases only
- **dev**: Main development branch (use this)
- **feature/***: New features from dev
- **fix/***: Bug fixes from dev
- **feature/\***: New features from dev
- **fix/\***: Bug fixes from dev

### Making Changes

```bash
# Start from dev
git checkout dev && git pull origin dev

# Create feature branch
# Create feature branch
git checkout -b feature/your-change

# Make changes, test, commit
@@ -137,6 +154,7 @@ git push -u origin feature/your-change
```

### Testing

```bash
# Run specific test suites
pytest tests/test_text_service.py -v # Core functionality
@@ -149,13 +167,14 @@ PYTEST_DONUT=yes pytest tests/test_ocr_integration.py # OCR with real models

# Performance requirements
# - Regex: 150x+ faster than spaCy
# - GLiNER: 25x+ faster than spaCy
# - GLiNER: 25x+ faster than spaCy
# - Package size: Core <2MB, full <8MB
```

## Key Implementation Patterns

### Simple API (Recommended)

```python
# Always available, lightweight
from datafog import detect, process
@@ -164,6 +183,7 @@ result = process("john@example.com", method="redact")
```

### Advanced Engine Selection

```python
# For specialized use cases
from datafog.services.text_service import TextService
@@ -173,7 +193,7 @@ service = TextService(engine="regex")

# Modern NER with custom entities
service = TextService(
engine="gliner",
engine="gliner",
gliner_model="urchade/gliner_base"
)

@@ -182,6 +202,7 @@ service = TextService(engine="smart")
```

### Graceful Degradation

```python
# Handles missing dependencies elegantly
try:
@@ -194,18 +215,21 @@ except ImportError:
## Common Tasks

### Adding New Entity Types

1. Update regex patterns in `regex_annotator.py`
2. Add GLiNER entity types in `gliner_annotator.py`
3. Update tests and benchmarks
4. Validate performance doesn't regress >10%
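
Step 4 above can be expressed as a relative comparison against a stored baseline. The sketch below assumes timings in seconds and treats anything more than 10% slower than the baseline as a regression. The function name and the timing source are hypothetical, and the project's own benchmark suite may measure this differently.

```python
def regressed(baseline_seconds: float, current_seconds: float,
              tolerance: float = 0.10) -> bool:
    """Return True if the current run is more than `tolerance` slower."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    slowdown = (current_seconds - baseline_seconds) / baseline_seconds
    return slowdown > tolerance


within_budget = regressed(1.00, 1.05)   # 5% slower
over_budget = regressed(1.00, 1.20)     # 20% slower
```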

### Performance Optimization

1. Profile with existing benchmarks
2. Maintain speed thresholds (regex 150x+, GLiNER 25x+)
3. Update baselines when making improvements
4. Test across all engines
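
The speed thresholds in step 2 are ratios against the spaCy baseline, so a check reduces to dividing two timings. A minimal sketch, assuming both timings were taken on the same input. `MIN_SPEEDUP` and `meets_threshold` are hypothetical names, and only the 150x and 25x figures come from this guide.

```python
from typing import Dict

# Minimum speedup over spaCy required per engine (from the guide).
MIN_SPEEDUP: Dict[str, float] = {"regex": 150.0, "gliner": 25.0}


def speedup(spacy_seconds: float, engine_seconds: float) -> float:
    """How many times faster the engine is than the spaCy baseline."""
    if engine_seconds <= 0:
        raise ValueError("engine time must be positive")
    return spacy_seconds / engine_seconds


def meets_threshold(engine: str, spacy_seconds: float,
                    engine_seconds: float) -> bool:
    """Check a measured speedup against the engine's minimum."""
    return speedup(spacy_seconds, engine_seconds) >= MIN_SPEEDUP[engine]
```

For example, `meets_threshold("regex", 1.9, 0.01)` passes (roughly 190x), while `meets_threshold("gliner", 1.0, 0.05)` does not (20x against a 25x floor).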

### CLI Enhancements

1. Update `client.py` with new commands
2. Support `--engine` flag for multi-engine commands
3. Add comprehensive help text and examples
@@ -215,31 +239,36 @@ except ImportError:

### Workflow Architecture (3 workflows)

| Workflow | Purpose | Trigger |
|----------|---------|---------|
| `ci.yml` | Lint + Test + Coverage + Wheel size | Push/PR to main/dev |
| `release.yml` | Alpha/Beta/Stable publishing | Schedule + manual dispatch |
| `benchmark.yml` | Performance benchmarks | Push/PR/weekly |
| Workflow | Purpose | Trigger |
| --------------- | ----------------------------------- | -------------------------- |
| `ci.yml` | Lint + Test + Coverage + Wheel size | Push/PR to main/dev |
| `release.yml` | Alpha/Beta/Stable publishing | Schedule + manual dispatch |
| `benchmark.yml` | Performance benchmarks | Push/PR/weekly |

### Release Cadence

- **Alpha** (Mon-Wed 2AM UTC): Automatic from `dev`, date+commit versioning
- **Beta** (Thursday 2AM UTC): Automatic from `dev`, incremental beta numbers
- **Stable** (manual dispatch): From `main`, base version or override

### Release Pipeline

`determine-release` → `test` → `publish` → `cleanup`

- Tests are a hard gate — no tests = no publish
- Stable releases check out `main`; alpha/beta check out `dev`
- Old alphas pruned to 7, betas to 5
- `[skip ci]` in version bump commits to prevent loops
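
The pruning rule above (keep the newest 7 alphas and 5 betas) amounts to slicing a newest-first list. The sketch below shows only that selection logic. How `release.yml` actually lists and deletes releases is not shown here, and the version strings are hypothetical placeholders rather than the date+commit scheme alphas really use.

```python
from typing import Dict, List

# Retention counts taken from the release pipeline notes.
KEEP: Dict[str, int] = {"alpha": 7, "beta": 5}


def releases_to_prune(versions_newest_first: List[str], kind: str) -> List[str]:
    """Return the releases that fall outside the retention window."""
    keep = KEEP[kind]
    return versions_newest_first[keep:]


# Nine hypothetical alphas, newest first: a9, a8, ..., a1.
alphas = [f"4.4.0a{n}" for n in range(9, 0, -1)]
stale = releases_to_prune(alphas, "alpha")
```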

### Pre-commit Hooks

- **isort**, **black**, **flake8**, **ruff**: Code formatting and linting
- **prettier**: Markdown, JSON, YAML formatting
- **gitleaks**: Secret scanning
- **pre-commit-hooks**: Large file checks, merge conflict detection, YAML validation

## Environment Variables

```bash
# Testing configuration
export PYTEST_DONUT=yes # Enable real OCR testing
@@ -250,33 +279,51 @@ export PYTHONPATH=$(pwd) # Local development imports
```

## Performance Requirements

- **Core Package**: <2MB (from ~8MB in v4.0.x)
- **Regex Engine**: 150x+ faster than spaCy (currently 190x)
- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)
- **GLiNER Engine**: 25x+ faster than spaCy (currently 32x)
- **Memory Usage**: Graceful handling of large texts (1MB+ chunks)
- **Model Loading**: Cache GLiNER models to avoid repeated downloads

## Best Practices for Claude Agents
## Agent skills

### Issue tracker

Issues and PRDs are tracked in Linear under the DFPY team. See `docs/agents/issue-tracker.md`.

### Triage labels

Use the default five-label triage vocabulary. See `docs/agents/triage-labels.md`.

### Domain docs

Single-context repo: use root `CONTEXT.md` and root `docs/adr/` when present. See `docs/agents/domain.md`.

## Best Practices for Agents

Before beginning any task please checkout a branch from `dev` and create a pull request to `dev`.

### Code Quality

- Follow existing patterns before implementing new approaches
- Add comprehensive tests for all new functionality
- Update documentation immediately with code changes
- Run benchmarks for any text processing modifications

### GLiNER Development

- Use PII-specialized models when available (`urchade/gliner_multi_pii-v1`)
- Test graceful degradation when GLiNER dependencies missing
- Validate smart cascading thresholds with real data
- Consider model download time and caching strategies

### Release Preparation

- Alpha/beta releases are automated via `release.yml` schedule
- Stable releases: merge `dev` → `main`, then trigger `release.yml` with `stable` type
- Use `dry_run: true` to validate before actual publish
- Performance validation on realistic data sets
- In Release Notes or Comments, do not reference that it was authored by Claude (all code is anonymously authored)
- In Release Notes or Comments, do not reference that it was authored by an AI agent (all code is anonymously authored)

This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.
This guide provides the essential information for DataFog development while maintaining focus on current priorities and recent GLiNER integration work.