parselmouth: Conda mapping runner

Overview

parselmouth is a utility that maps Conda package names to their corresponding PyPI names and vice versa. It regenerates and updates the mappings on an hourly basis, so consumers always work from recent data.

Local Testing

Test the complete pipeline locally with MinIO (S3-compatible storage):

# One-command start (recommended) - starts MinIO + interactive mode
pixi run test-interactive

# Or run manually with more control:

# 1. Start MinIO
docker-compose up -d

# 2. Run with defaults (pytorch, noarch, package names starting with 't')
pixi run test-pipeline

# 3. Test with conda-forge, package names starting with 'n' (numpy, napari, etc.)
pixi run test-pipeline --channel conda-forge --letter n

# 4. Test incrementally (skip packages already in MinIO)
pixi run test-pipeline --mode incremental

# 5. Test with all packages in a subdir (warning: can be slow!)
pixi run test-pipeline --channel bioconda --letter all

# Multiple channels can coexist in the same bucket (separated by path prefixes)

# Access MinIO UI at http://localhost:9001 (minioadmin / minioadmin)

# Clean up when done
pixi run clean-all          # Everything + stop MinIO
pixi run clean-local-data   # Just cache + outputs (keep MinIO)

See docs/LOCAL_TESTING.md for detailed information.

Conda to PyPI

Example mapping for numpy-1.26.4-py311h64a7726_0.conda (sha256 3f4365e11b28e244c95ba8579942b0802761ba7bb31c026f50d1a9ea9c728149):

{
  "pypi_normalized_names": ["numpy"],
  "versions": {
    "numpy": "1.26.4"
  },
  "conda_name": "numpy",
  "package_name": "numpy-1.26.4-py311h64a7726_0.conda",
  "direct_url": [
    "https://github.com/numpy/numpy/releases/download/v1.26.4/numpy-1.26.4.tar.gz"
  ]
}
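
As a rough illustration, an entry like the one above can be loaded into a small typed structure. The sketch below assumes every entry carries the five fields shown in the example; `MappingEntry`, `parse_mapping_entry`, and `pypi_requirements` are hypothetical names and not part of parselmouth.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MappingEntry:
    """Hypothetical container for one hash-v0 mapping entry."""

    conda_name: str
    package_name: str
    pypi_normalized_names: List[str]
    versions: Dict[str, str]
    direct_url: List[str]


def parse_mapping_entry(raw: str) -> MappingEntry:
    """Parse the JSON text of one mapping entry (field names taken from the example)."""
    data = json.loads(raw)
    return MappingEntry(
        conda_name=data["conda_name"],
        package_name=data["package_name"],
        # Assumption: missing or null PyPI fields mean "no PyPI mapping known".
        pypi_normalized_names=data.get("pypi_normalized_names") or [],
        versions=data.get("versions") or {},
        direct_url=data.get("direct_url") or [],
    )


def pypi_requirements(entry: MappingEntry) -> List[str]:
    """Render each mapped PyPI package as a pinned requirement when a version is known."""
    return [
        f"{name}=={entry.versions[name]}" if name in entry.versions else name
        for name in entry.pypi_normalized_names
    ]
```

For the numpy entry above, `pypi_requirements` would yield a single pinned requirement, `numpy==1.26.4`.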

A simplified version of the mapping is stored in files/mapping_as_grayskull.json.

PyPI to conda

Example mapping of the PyPI package requests to conda packages. It lists the known conda names for each PyPI version; if a version is missing, it is not available on that conda channel:

{
  "2.10.0": ["requests"],
  "2.11.0": ["requests"],
  "2.11.1": ["requests"],
  "2.12.0": ["requests"],
  "2.12.1": ["requests"],
  "2.12.4": ["requests"],
  "2.12.5": ["requests"],
  "2.13.0": ["requests"],
  "2.17.3": ["requests"],
  "2.18.1": ["requests"],
  "2.18.2": ["requests"],
  "2.18.3": ["requests"],
  "2.18.4": ["requests"],
  "2.19.0": ["requests"],
  "2.19.1": ["requests"],
  "2.20.0": ["requests"],
  "2.20.1": ["requests"],
  "2.21.0": ["requests"],
  "2.22.0": ["requests"],
  "2.23.0": ["requests"],
  "2.9.2": ["requests"],
  "2.27.1": ["requests", "arm_pyart"],
  "2.24.0": ["requests", "google-cloud-bigquery-storage-core"],
  "2.26.0": ["requests"],
  "2.25.1": ["requests"],
  "2.25.0": ["requests"],
  "2.27.0": ["requests"],
  "2.28.0": ["requests"],
  "2.28.1": ["requests"],
  "2.31.0": ["requests", "jupyter-sphinx"],
  "2.28.2": ["requests"],
  "2.29.0": ["requests"],
  "2.32.1": ["requests"],
  "2.32.2": ["requests"],
  "2.32.3": ["requests"]
}
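
A consumer might query such a document as follows. This is a minimal sketch assuming the JSON has exactly the shape shown above (PyPI version mapped to a list of conda names); the function names are illustrative only.

```python
from typing import Dict, List


def conda_names_for(mapping: Dict[str, List[str]], pypi_version: str) -> List[str]:
    """Return the conda names known for a PyPI version, or [] if it is absent from the channel."""
    return mapping.get(pypi_version, [])


def versions_provided_by(mapping: Dict[str, List[str]], conda_name: str) -> List[str]:
    """List the PyPI versions that a given conda package is known to contain."""
    return [version for version, names in mapping.items() if conda_name in names]


# A small excerpt of the requests mapping shown above.
requests_mapping = {
    "2.27.1": ["requests", "arm_pyart"],
    "2.31.0": ["requests", "jupyter-sphinx"],
    "2.32.3": ["requests"],
}

print(conda_names_for(requests_mapping, "2.31.0"))  # ['requests', 'jupyter-sphinx']
print(conda_names_for(requests_mapping, "2.14.0"))  # [] (not in this excerpt)
print(versions_provided_by(requests_mapping, "arm_pyart"))  # ['2.27.1']
```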

Online availability

The public mapping data is served as static objects from:

https://conda-mapping.prefix.dev/

There is no directory listing or query API; clients fetch the object path they need. Supported channel names are conda-forge, bioconda, pytorch, and tango-controls unless noted otherwise.

Exposed endpoints

Purpose	Endpoint	Notes
Conda package hash → PyPI mapping	`https://conda-mapping.prefix.dev/hash-v0/{sha256}`	`{sha256}` is the SHA-256 from a conda package record in a channel `repodata.json`. Hash objects are shared across channels.
Channel hash index	`https://conda-mapping.prefix.dev/hash-v0/{channel}/index.json`	Large JSON object containing the package hashes known for a channel and their mapping entries.
PyPI package → conda mapping	`https://conda-mapping.prefix.dev/pypi-to-conda-v1/{channel}/{pypi-normalized-name}.json`	Per-PyPI-package lookup derived from the relations table. Currently produced for channels with `relations-v1` data.
Relations table	`https://conda-mapping.prefix.dev/relations-v1/{channel}/relations.jsonl.gz`	Gzipped JSON Lines source-of-truth table for package relations. Currently exposed for `conda-forge`, `bioconda`, and `pytorch`.
Relations metadata	`https://conda-mapping.prefix.dev/relations-v1/{channel}/metadata.json`	Generation time and counts for the relations table. Currently exposed for `conda-forge`, `bioconda`, and `pytorch`.
Legacy compressed mapping, top-level	`https://conda-mapping.prefix.dev/compressed-v0/compressed_mapping.json`	Flat compressed mapping for `conda-forge`, consumed by `pixi` and other tooling.
Legacy compressed mapping, per-channel	`https://conda-mapping.prefix.dev/compressed-v0/{channel}/compressed_mapping.json`	Flat compressed mapping for a specific channel.

Examples:

Conda hash lookup for numpy-1.26.4-py310h4bfa8fc_0.conda: https://conda-mapping.prefix.dev/hash-v0/914476e2d3273fdf9c0419a7bdcb7b31a5ec25949e4afbc847297ff3a50c62c8
conda-forge channel index: https://conda-mapping.prefix.dev/hash-v0/conda-forge/index.json
PyPI requests lookup on conda-forge: https://conda-mapping.prefix.dev/pypi-to-conda-v1/conda-forge/requests.json
Legacy compressed bioconda mapping: https://conda-mapping.prefix.dev/compressed-v0/bioconda/compressed_mapping.json
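
Because the endpoints are plain object paths, a client only needs to build the right URL. The sketch below uses the Python standard library; the helper names are hypothetical, and it assumes that `{pypi-normalized-name}` follows PEP 503 normalization and that the SHA-256 is read from the `packages` or `packages.conda` section of a parsed `repodata.json`.

```python
import json
import re
import urllib.request
from typing import Optional

BASE_URL = "https://conda-mapping.prefix.dev"


def normalize_pypi_name(name: str) -> str:
    """PEP 503 normalization (assumed to match {pypi-normalized-name})."""
    return re.sub(r"[-_.]+", "-", name).lower()


def hash_url(sha256: str) -> str:
    """URL of the conda package hash -> PyPI mapping object."""
    return f"{BASE_URL}/hash-v0/{sha256}"


def channel_index_url(channel: str) -> str:
    """URL of the per-channel hash index."""
    return f"{BASE_URL}/hash-v0/{channel}/index.json"


def pypi_to_conda_url(channel: str, pypi_name: str) -> str:
    """URL of the per-PyPI-package lookup file."""
    return f"{BASE_URL}/pypi-to-conda-v1/{channel}/{normalize_pypi_name(pypi_name)}.json"


def sha256_from_repodata(repodata: dict, filename: str) -> Optional[str]:
    """Look up a package's SHA-256 by filename in a parsed repodata.json."""
    for section in ("packages.conda", "packages"):
        record = repodata.get(section, {}).get(filename)
        if record is not None:
            return record.get("sha256")
    return None


def fetch_json(url: str, timeout: float = 30.0):
    """Download and decode one JSON object (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For instance, `fetch_json(pypi_to_conda_url("conda-forge", "requests"))` would request the same object as the PyPI requests example listed above.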

Infrastructure and Storage Architecture

Storage Locations

Parselmouth uses two primary storage locations:

1. R2 Bucket (Cloud Storage)

The main package mapping data is stored in Cloudflare R2 (S3-compatible storage), configured via the R2_PREFIX_BUCKET environment variable. The bucket contains:

Hash-based Mappings (v0):

hash-v0/{channel}/index.json - Channel-specific index containing all package hashes
hash-v0/{package_sha256} - Individual mapping entries keyed by conda package SHA256 hash

Relations Tables (v1):

relations-v1/{channel}/relations.jsonl.gz - Master relations table (JSONL format, gzipped)
relations-v1/{channel}/metadata.json - Metadata about the relations table
pypi-to-conda-v1/{channel}/{pypi_name}.json - Fast PyPI lookup files derived from relations table

Legacy Compressed Mappings (compressed-v0):

compressed-v0/compressed_mapping.json - Top-level conda-forge compressed mapping (the file pixi fetches)
compressed-v0/{channel}/compressed_mapping.json - Per-channel compressed mappings (conda-forge, bioconda, pytorch, tango-controls)

2. Git Repository Files

The files/ directory in the repository stores compressed mappings that are committed to version control:

files/mapping_as_grayskull.json - Legacy mapping format for Grayskull compatibility
files/compressed_mapping.json - Compressed mapping (legacy format)
files/v0/{channel}/compressed_mapping.json - Channel-specific compressed mappings (conda-forge, pytorch, bioconda)

Version System

Parselmouth uses a versioned approach to support multiple data formats:

v0 (Current Hash-based System):

Uses conda package SHA256 hashes as keys
Direct lookup: hash-v0/{sha256} returns a single mapping entry
Optimized for conda → PyPI lookups
Both old and new workflows write to this path

v1 (Relations System - New):

Stores package relationships in a normalized table format
Enables PyPI → conda lookups and dependency analysis
Three-tier structure:
1. Master relations table (source of truth)
2. Metadata (statistics, generation timestamp)
3. Derived lookup files (cached for performance)
Only new workflows that include the update_relations_table job write to this path

Workflow Pipeline Architecture

The GitHub Actions workflows are organized into stages:

Producer Stage (generate_hash_letters):
- Identifies missing packages by comparing upstream channel repodata with the existing index
- Outputs a matrix of subdir@letter combinations to process in parallel

Updater Stage (updater_of_records):
- Runs in parallel for each subdir@letter combination
- Downloads artifact metadata and extracts PyPI mappings
- Uploads individual package mappings to hash-v0/{sha256}

Merger Stage (updater_of_index):
- Combines all partial indices into a master index
- Uploads the consolidated index to hash-v0/{channel}/index.json

Relations Generation Stage (update_relations_table) - NEW:
- Runs after the merger stage completes
- Reads the updated index and generates the relations table
- Uploads to relations-v1/{channel}/ paths
- Only present in new workflows with relations support

Commit Stage (update_file):
- Updates local git repository files
- Runs mapping transformations (update-mapping-legacy, update-mapping)
- Commits compressed mappings to version control
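
The producer and merger stages can be pictured with a short sketch. This is not the code the workflows run. It assumes that the index maps SHA-256 hashes to entries, that repodata records carry `name` and `sha256` fields, and that the "letter" is the first character of the package name. Both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def missing_by_letter(
    subdir: str, records: Dict[str, dict], index: Dict[str, dict]
) -> Dict[str, List[str]]:
    """Producer sketch: group hashes absent from the index under 'subdir@letter' keys."""
    matrix: Dict[str, List[str]] = defaultdict(list)
    for record in records.values():
        sha256 = record["sha256"]
        if sha256 in index:
            continue  # already mapped by an earlier run
        letter = record["name"][0].lower()  # assumption: bucket by first character
        matrix[f"{subdir}@{letter}"].append(sha256)
    return dict(matrix)


def merge_partial_indices(
    existing: Dict[str, dict], partials: List[Dict[str, dict]]
) -> Dict[str, dict]:
    """Merger sketch: fold per-job partial indices into one master index."""
    merged = dict(existing)  # copy so the existing index is left untouched
    for partial in partials:
        merged.update(partial)
    return merged
```

Each key of the producer's result corresponds to one parallel updater job, and the merger runs once all of those jobs have produced their partial index.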

Bucket Isolation and Safety

The new workflows with relations support do NOT overwrite or interfere with old data:

Same bucket, different prefixes: Both old and new workflows use R2_PREFIX_BUCKET, but write to isolated path prefixes
v0 paths: Both systems continue to write hash-based mappings (backward compatible)
v1 paths: Only new workflows write relations data (additive, no conflicts)
No destructive operations: New workflows add functionality without removing or replacing existing data

This architecture allows for:

Zero-downtime deployment of relations features
Gradual migration from v0 to v1 APIs
Rollback capability if issues arise
Parallel operation of both systems during transition

Relations Table Structure

The RelationsTable is a normalized table that maps Conda packages to PyPI packages and vice versa. Think of it as a many-to-many relationship database.

Basic Concept

The table stores pairs of related packages:

Conda side: package name + version + build (e.g., numpy-1.26.4-py311h64a7726_0)
PyPI side: package name + version (e.g., numpy==1.26.4)

Each row in the table represents one relationship between a specific conda build and a PyPI package version.

How Lookups Work

Conda → PyPI (hash-based):

Given a conda package hash, find which PyPI packages it contains
Location: hash-v0/{sha256}
Example: numpy-1.26.4-py311h64a7726_0 → numpy==1.26.4

PyPI → Conda (aggregated files):

Given a PyPI package name, find all available conda versions
Location: pypi-to-conda-v1/{channel}/{pypi_name}.json
Example: requests → all conda packages containing requests

{
  "pypi_name": "requests",
  "conda_versions": {
    "2.31.0": ["requests", "jupyter-sphinx"],
    "2.32.3": ["requests"]
  }
}

Relationship Types

Many-to-One (common): Multiple conda builds for one PyPI version
- Example: numpy-1.26.4-py311h... and numpy-1.26.4-py310h... both map to numpy==1.26.4

One-to-Many (vendoring): One conda package contains multiple PyPI packages
- Example: arm_pyart vendors requests, creating two mappings from one conda build

Many-to-Many: A PyPI version appears in multiple conda packages
- Example: requests==2.31.0 is in both requests and jupyter-sphinx conda packages
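
These relationship types can be detected by grouping relation rows in each direction. The sketch below is only an illustration: it uses the field names from the Storage Format section, and the `example_0` build strings and the `0.0.0` jupyter-sphinx version are placeholders rather than real data.

```python
from collections import defaultdict

# Toy rows; values marked "example_0" or "0.0.0" are placeholders.
rows = [
    {"conda_name": "numpy", "conda_version": "1.26.4", "conda_build": "py311h64a7726_0",
     "pypi_name": "numpy", "pypi_version": "1.26.4"},
    {"conda_name": "numpy", "conda_version": "1.26.4", "conda_build": "py310h4bfa8fc_0",
     "pypi_name": "numpy", "pypi_version": "1.26.4"},
    {"conda_name": "requests", "conda_version": "2.31.0", "conda_build": "example_0",
     "pypi_name": "requests", "pypi_version": "2.31.0"},
    {"conda_name": "jupyter-sphinx", "conda_version": "0.0.0", "conda_build": "example_0",
     "pypi_name": "jupyter-sphinx", "pypi_version": "0.0.0"},
    {"conda_name": "jupyter-sphinx", "conda_version": "0.0.0", "conda_build": "example_0",
     "pypi_name": "requests", "pypi_version": "2.31.0"},
]


def vendoring_builds(rows):
    """One-to-many: conda builds that contain more than one PyPI package."""
    pypi_by_build = defaultdict(set)
    for row in rows:
        key = (row["conda_name"], row["conda_version"], row["conda_build"])
        pypi_by_build[key].add(row["pypi_name"])
    return {key: names for key, names in pypi_by_build.items() if len(names) > 1}


def shared_pypi_versions(rows):
    """Many-to-many: PyPI versions found under more than one conda package name."""
    conda_by_pypi = defaultdict(set)
    for row in rows:
        conda_by_pypi[(row["pypi_name"], row["pypi_version"])].add(row["conda_name"])
    return {key: names for key, names in conda_by_pypi.items() if len(names) > 1}
```

With the toy rows, `vendoring_builds` flags only the jupyter-sphinx build, and `shared_pypi_versions` flags only requests 2.31.0. The two numpy builds are the many-to-one case and appear in neither result.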

Storage Format

The table is stored as JSONL (JSON Lines) with gzip compression:

# Each line is one relation (a complete JSON object on a single line)
{"conda_name": "numpy", "conda_version": "1.26.4", "conda_build": "py311h64a7726_0", "pypi_name": "numpy", "pypi_version": "1.26.4", "channel": "conda-forge"}

Benefits:

Each relationship stored exactly once (no duplication)
Can query in either direction
Incremental updates are simple
Compact: ~10-30 MB compressed for conda-forge
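
To make the format concrete, here is a minimal standard-library sketch that writes and reads gzipped JSON Lines and derives a PyPI lookup in the shape shown under How Lookups Work. The function names are hypothetical, and sorting the versions and conda names is a choice made for this example, not something the source specifies.

```python
import gzip
import io
import json
from collections import defaultdict
from typing import Dict, List


def dump_relations(rows: List[dict]) -> bytes:
    """Serialize relation rows as gzipped JSON Lines (one JSON object per line)."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return buffer.getvalue()


def load_relations(blob: bytes) -> List[dict]:
    """Parse gzipped JSON Lines back into a list of relation rows."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_pypi_lookup(rows: List[dict], pypi_name: str) -> dict:
    """Aggregate rows for one PyPI package into version -> conda names."""
    versions: Dict[str, set] = defaultdict(set)
    for row in rows:
        if row["pypi_name"] == pypi_name:
            versions[row["pypi_version"]].add(row["conda_name"])
    return {
        "pypi_name": pypi_name,
        "conda_versions": {v: sorted(names) for v, names in sorted(versions.items())},
    }
```

Because each row is independent, an incremental update amounts to appending rows for newly seen builds and regenerating only the lookup files whose PyPI names were touched.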

Statistics (conda-forge example)

Total relationships: ~1.5 million
Unique conda packages: ~1.4 million
Unique PyPI packages: ~18,000

The ratio (~1.07 relationships per conda package) shows that most conda packages map to a single PyPI package, with occasional vendoring creating the extra relationships.
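
Figures like these could be computed from the relation rows roughly as follows. This is a sketch under one assumption: a "unique conda package" is a distinct (name, version, build) triple. The function name is hypothetical.

```python
def relation_stats(rows):
    """Count relations, distinct conda builds, and distinct PyPI names."""
    conda_builds = {
        (row["conda_name"], row["conda_version"], row["conda_build"]) for row in rows
    }
    pypi_names = {row["pypi_name"] for row in rows}
    ratio = len(rows) / len(conda_builds) if conda_builds else 0.0
    return {
        "total_relations": len(rows),
        "unique_conda_packages": len(conda_builds),
        "unique_pypi_packages": len(pypi_names),
        "relations_per_conda_package": ratio,
    }


# Checking the quoted ratio with the approximate figures above.
print(round(1_500_000 / 1_400_000, 2))  # 1.07
```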

Web UI

A small React frontend lives under frontend/ and is deployed via GitHub Pages at https://prefix-dev.github.io/parselmouth/. It lets you browse the conda ↔ PyPI mapping from a single search box and supports shareable deep links such as /parselmouth/?q=numpy&dir=conda.

Data flows from the live mapping artifacts: the frontend fetches the compressed conda → PyPI mapping from the repository's files/v0/{channel}/compressed_mapping.json files and fetches PyPI → conda detail from https://conda-mapping.prefix.dev/pypi-to-conda-v1/{channel}/{name}.json. The site does not need to be redeployed when the mapping updates — only when frontend code changes.

Local development:

pixi run -e frontend frontend
# open http://localhost:5173/parselmouth/

(frontend depends on frontend-install, so dependencies are installed on first run.)

The dev server proxies /api/r2 to https://conda-mapping.prefix.dev so CORS is not required for localhost.

Production requires CORS to allow https://prefix-dev.github.io on the conda-mapping.prefix.dev R2 bucket. GitHub Pages must also be enabled under Settings → Pages → Source: GitHub Actions.

Thanks!

Developed with ❤️ at prefix.dev.
