GitHub - openlobbying/muckrake: A framework for creating Follow The Money data.

A reusable framework for creating and storing FollowTheMoney entities.

Warning

This is a work in progress. Expect breaking changes and incomplete features.

Muckrake

Muckrake is the data pipeline. It is partially inspired by zavod and other tools in the FollowTheMoney ecosystem.

Run uv run muckrake --help for a full list of available commands.

Install Python dependencies with uv sync. This now includes the external org-id package used for structured organization identifiers.

OpenLobbying-specific code now lives in the sibling ../openlobbying/ repository. That repo owns:

the OpenLobbying dataset crawlers
the OpenLobbying FastAPI application
the OpenLobbying Svelte frontend
any project-specific FtM schema extensions
deployment assets for the public site

Crawlers

Muckrake discovers crawler configs from ./datasets/ in the current working directory and any paths listed in MUCKRAKE_DATASET_PATHS. At a minimum, each dataset consists of a config.yml with metadata and a crawler.py script that outputs FollowTheMoney statements in CSV format.
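The discovery rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not muckrake's actual implementation: the helper names `dataset_search_paths` and `discover_datasets` are hypothetical, entries in MUCKRAKE_DATASET_PATHS are assumed to be separated by the platform path separator, and the directory name is assumed to double as the dataset name.

```python
import os
from pathlib import Path


def dataset_search_paths(cwd: Path, env: dict[str, str]) -> list[Path]:
    """Return ./datasets/ plus any extra paths from MUCKRAKE_DATASET_PATHS."""
    paths = [cwd / "datasets"]
    extra = env.get("MUCKRAKE_DATASET_PATHS", "")
    # Assumption: entries are separated by os.pathsep (":" on POSIX, ";" on Windows).
    for entry in extra.split(os.pathsep):
        if entry.strip():
            paths.append(Path(entry.strip()))
    return paths


def discover_datasets(search_paths: list[Path]) -> dict[str, Path]:
    """Map dataset name to directory for every folder holding both required files."""
    found: dict[str, Path] = {}
    for root in search_paths:
        if not root.is_dir():
            continue
        for child in sorted(root.iterdir()):
            has_config = (child / "config.yml").is_file()
            has_crawler = (child / "crawler.py").is_file()
            if child.is_dir() and has_config and has_crawler:
                # Assumption: the directory name is the dataset name, and the
                # first match wins if a name appears in several search paths.
                found.setdefault(child.name, child)
    return found
```

Calling `discover_datasets(dataset_search_paths(Path.cwd(), dict(os.environ)))` would then list every directory that has both a config.yml and a crawler.py. The real tool may read the dataset name from config.yml instead, so treat this only as a mental model.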

To crawl a dataset, run uv run muckrake crawl {dataset_name}. Run uv run muckrake list to see available datasets.

Each crawl now creates a dataset_runs record in Postgres and stores immutable artifacts under MUCKRAKE_ARTIFACT_PATH (defaults to data/artifacts). The latest successful run remains mirrored into data/datasets/{name}/statements.pack.csv for local compatibility.
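As a small illustration of the two locations mentioned above, the paths could be resolved like this. The function names are hypothetical, the layout inside the artifact directory is not documented here and is deliberately left out, and the sketch assumes the default `data` directory (it ignores any MUCKRAKE_DATA_PATH override).

```python
from pathlib import Path


def artifact_root(env: dict[str, str]) -> Path:
    """Return MUCKRAKE_ARTIFACT_PATH, or data/artifacts when it is unset."""
    return Path(env.get("MUCKRAKE_ARTIFACT_PATH") or "data/artifacts")


def mirrored_statements_path(dataset_name: str) -> Path:
    """Return the local compatibility copy of the latest successful run."""
    # Assumption: the default data directory is used; MUCKRAKE_DATA_PATH is ignored.
    return Path("data") / "datasets" / dataset_name / "statements.pack.csv"
```

For example, `mirrored_statements_path("gb_political_finance")` points at the mirrored statements file for that dataset, while `artifact_root(dict(os.environ))` (after `import os`) gives the directory that holds the immutable run artifacts.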

AI-based NER

Many data sources have composite fields that contain multiple entities. We use LLMs to extract unique entities and relationships from these fields, and store them as candidates in the database for review and approval. See NER docs for details.

# Create extraction candidates for one dataset
uv run muckrake ner-extract open_access --extractor llm --limit 50

# Review candidates in a terminal UI
uv run muckrake ner-review

Dedupe

Our goal is to link entities across datasets to provide a unified view of lobbying and political finance for any given person, company, or organisation.

# Create dedupe candidates across all datasets
uv run muckrake xref

# Review candidates in a terminal UI
uv run muckrake dedupe

We also want to collapse duplicate relationship edges across datasets, especially for ORCL and PRCA. This happens automatically; no review step is required.

uv run muckrake dedupe-edges

Loading

Statements are loaded into Postgres with uv run muckrake load. This reads the statements CSV files and applies any approved NER candidates before materialising entities and relationships.

To load from a specific immutable crawl snapshot instead of the local workspace copy:

uv run muckrake load gb_political_finance --run-id 123

For the published site, prefer the release workflow instead of loading directly into the serving database:

uv run muckrake release-build
uv run muckrake release-publish 1

OpenLobbying

The primary user of Muckrake data is OpenLobbying, an open database of lobbying and political finance data. See ../openlobbying/README.md for app setup, API serving, and frontend development.

Environment setup

Copy .env.example to .env in the repo root.
By default, muckrake loads the nearest .env file, searching from the current working directory upward. Override that with MUCKRAKE_ENV_FILE if needed.
Default local database setup:
- working DB: sqlite:///data/muckrake.db
- published DB: same as the working DB unless MUCKRAKE_PUBLISHED_DATABASE_URL is set
Required only if you want Postgres:
- MUCKRAKE_DATABASE_URL
Common local settings:
- MUCKRAKE_PUBLISHED_DATABASE_URL for a separate published API database
Optional local overrides:
- MUCKRAKE_DATA_PATH
- MUCKRAKE_ARTIFACT_PATH
- MUCKRAKE_DATASET_PATHS
- MUCKRAKE_FTM_SCHEMA_PATHS
- MUCKRAKE_ENV_FILE
- FTM_MODEL_PATH if you need to override the merged FollowTheMoney model entirely
- OPENROUTER_API_KEY, LLM_MODEL, NER_LLM_PROMPT_FILE, LOGFIRE_TOKEN
Example:

cp .env.example .env
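The lookup and fallback rules described in this section can be sketched as below. This is a sketch under assumptions rather than muckrake's own code: `find_env_file` and `database_urls` are hypothetical names, and it is assumed that MUCKRAKE_DATABASE_URL, when set, replaces the default SQLite working database.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, env: dict[str, str]) -> Optional[Path]:
    """Return the .env file to load, or None if there is none."""
    override = env.get("MUCKRAKE_ENV_FILE")
    if override:
        # An explicit override takes precedence over the upward search.
        return Path(override)
    # Walk from the starting directory up to the filesystem root.
    for directory in [start, *start.parents]:
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def database_urls(env: dict[str, str]) -> tuple[str, str]:
    """Return (working_url, published_url) following the documented defaults."""
    # Assumption: MUCKRAKE_DATABASE_URL overrides the default SQLite working DB.
    working = env.get("MUCKRAKE_DATABASE_URL") or "sqlite:///data/muckrake.db"
    # The published DB falls back to the working DB unless set explicitly.
    published = env.get("MUCKRAKE_PUBLISHED_DATABASE_URL") or working
    return working, published
```

With no variables set, `database_urls({})` yields the same SQLite URL for both the working and published databases, which matches the default local setup listed above.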

Consumers

../openlobbying/: OpenLobbying application repo built on top of muckrake
../us-congress-lobbying/: project-specific investigative sandbox
