Most teams keep logs on expensive hot storage for weeks — not because they query them every day, but because cold storage can't answer questions when they need it.
PFC-JSONL stores a block-level timestamp index alongside every compressed file. Query any one-hour window from S3, Glacier Instant Retrieval, or any byte-range-capable storage — without decompressing the full archive.
Query any 1-hour window in under 6 seconds via the DuckDB Community Extension, pfc-gateway, or Grafana, directly from S3 and without downloading the full file. A 1-hour query reads ~30–80 MB; gzip reads the full archive every time. Benchmarked across 10 enterprise log types: 5%–13% compression ratio, 28%–58% smaller than gzip, 33%–58% smaller than zstd. → Full benchmark report
Engineering teams keep logs warm — on Elasticsearch, Datadog, Loki, or a fast S3 tier — for weeks longer than they need to. Not because they query them every day. Because the one time they need to look back — an incident discovered late, a customer dispute from last week, a security audit covering the past 90 days — cold storage offers nothing useful without a full download and decompress.
PFC-JSONL is built around a different assumption. Every archive carries a block-level timestamp index (.pfc.bidx). A time-range query reads only the blocks that cover the requested window — via HTTP Range requests, directly from S3, Glacier Instant Retrieval, or any byte-range-capable storage. Everything else stays on disk, untouched.
The warm tier exists because cold storage can't answer questions. Remove that constraint and the tier disappears.
A one-hour query on a 30-day, 1 GB archive — measured across 10 log types:
| Tool | 1-hour query time | Data read from storage | Full download required? |
|---|---|---|---|
| PFC-JSONL + DuckDB | 2.8–5.3 s | ~30–80 MB (1–2 of 32 blocks) | No |
| gzip | ~70–100 s | Entire 1 GB file | Yes |
| zstd | ~8–10 s | Entire 1 GB file | Yes |
Every team's numbers look different. The direction is always the same.
Teams with 3-day hot retention and 30-day warm storage can move warm to cold on day 3 — and still return a forensics query in seconds when it surfaces on day 22. Teams paying for 30-day Elasticsearch clusters to cover incident investigations they run twice a month can drop to S3 cold and keep the same query workflow via DuckDB. Compliance archives that live on S3 for 7 years as write-only stores become queryable for their entire retention period — no restore step, no re-hydration window.
Queries run via the DuckDB Community Extension (INSTALL pfc FROM community) — standard SQL, no new tooling to learn. For teams not running DuckDB, pfc-gateway exposes the same queries as a REST API, and the Grafana plugin connects directly to PFC archives for dashboard panels. The query interface fits whatever your team already uses.
The savings compound in two directions: smaller archives mean less data stored, and block-level queries mean far less data read per lookup.
S3 storage tiers and PFC compatibility (AWS us-east-1 reference pricing — rates vary):
| Tier | Cost/GB/month | HTTP Range access | PFC queryable |
|---|---|---|---|
| S3 Standard | $0.023 | ✓ | ✓ |
| S3 Standard-IA | $0.0125 | ✓ | ✓ |
| S3 Glacier Instant Retrieval | $0.004 | ✓ | ✓ |
| S3 Glacier Flexible Retrieval | $0.0036 | Restore required | — |
| S3 Glacier Deep Archive | $0.00099 | Restore required | — |
Query cost per 1-hour window on S3 Standard-IA (egress + retrieval):
- gzip/zstd: ~$0.10 per query (full 1 GB download, every time)
- PFC: ~$0.007 per query (1–2 blocks, ~30–80 MB)
At 1,000 queries/month: $100 with gzip vs. $7 with PFC. The storage savings and the query savings move in the same direction.
S3 Glacier Flexible Retrieval and Deep Archive require a restore step before access — those tiers are best suited for true long-term archival where queries are not expected.
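The per-query figures above are simple arithmetic on the amount of data read. The sketch below reproduces them under one assumption: a combined egress and retrieval rate of about $0.10 per GB, which is the rate implied by the ~$0.10 figure for a full 1 GB download and is not a quoted AWS price. The function names are illustrative.

```python
# Assumption: ~$0.10 per GB read (egress + retrieval), the rate implied by
# the ~$0.10 cost of a full 1 GB download. Check current pricing for your region.
COST_PER_GB_USD = 0.10


def query_cost_usd(mb_read: float, cost_per_gb: float = COST_PER_GB_USD) -> float:
    """Cost of one query that reads `mb_read` megabytes from storage."""
    return (mb_read / 1000.0) * cost_per_gb


def monthly_cost_usd(mb_read: float, queries_per_month: int) -> float:
    """Total monthly query cost at a fixed read size per query."""
    return query_cost_usd(mb_read) * queries_per_month


# Compare a full download against the low and high end of the PFC read range.
for label, mb in [("gzip/zstd, full 1 GB", 1000), ("PFC, 30 MB", 30), ("PFC, 80 MB", 80)]:
    print(f"{label}: ${query_cost_usd(mb):.4f} per query, "
          f"${monthly_cost_usd(mb, 1000):.2f} per 1,000 queries")
```

At this assumed rate the 30–80 MB read range works out to roughly $0.003–$0.008 per query, which brackets the ~$0.007 figure quoted above.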
PFC-JSONL is purpose-built for structured JSONL log data. It recognises the patterns that repeat across log types — field names, log levels, HTTP status codes, Kubernetes states, timestamps, IP addresses, and more — and encodes them efficiently before compression. The result is consistently better ratios than general-purpose compressors, across every log type, at 1 GB scale.
Smaller archives lower storage costs permanently on any tier. Combined with block-level queries, smaller blocks also mean less data read per lookup and lower per-query egress costs.
Results across 10 enterprise log types — infrastructure, Kubernetes, API access, auth, network, streaming, ops, cloud, application, and transaction logs — all measured at 1 GB:
- Best case: Infrastructure / System logs at 5.31% — 46% smaller than gzip-9, 58% smaller than zstd-3
- Typical range: Kubernetes 7.97%, Auth 9.13%, Application 10.56%
- Highest volume: API Access 12.76% — still 28% smaller than gzip, 33% smaller than zstd-3
- PFC-JSONL wins on every log type — by 26%–58% over gzip, 33%–58% over zstd
→ Full benchmark report — compress/decompress speed, DuckDB query times, all 10 log types
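To see how ratio and tier combine into a storage bill, here is a minimal sketch. The tier prices are copied from the table above and the ratios from the benchmark list; the 10 TB raw volume is a hypothetical input, and the gzip-9 ratio it prints is derived from the "46% smaller" statement rather than measured.

```python
# Per-GB monthly prices copied from the tier table above (us-east-1 reference).
TIER_USD_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Instant Retrieval": 0.004,
}


def monthly_storage_usd(raw_gb: float, ratio_pct: float, tier: str) -> float:
    """Monthly cost of storing `raw_gb` of logs compressed to `ratio_pct` percent."""
    stored_gb = raw_gb * ratio_pct / 100.0
    return stored_gb * TIER_USD_PER_GB_MONTH[tier]


def implied_baseline_ratio(pfc_ratio_pct: float, pct_smaller: float) -> float:
    """Baseline ratio implied by 'PFC output is pct_smaller percent smaller'."""
    return pfc_ratio_pct / (1.0 - pct_smaller / 100.0)


RAW_GB = 10_000  # hypothetical: 10 TB of uncompressed logs

for name, ratio in [("Infrastructure", 5.31), ("API Access", 12.76)]:
    for tier in TIER_USD_PER_GB_MONTH:
        cost = monthly_storage_usd(RAW_GB, ratio, tier)
        print(f"{name} at {ratio}% on {tier}: ${cost:.2f}/month")

# Derived, not measured: the gzip-9 ratio implied for infrastructure logs.
print(f"implied gzip-9 ratio: {implied_baseline_ratio(5.31, 46):.2f}%")
```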
The most common objection: "My 50 TB are already compressed on S3 — I'd have to download everything to convert."
You don't. pfc-migrate converts gzip, zstd, bzip2, and lz4 archives to PFC directly in your S3 bucket — reading and writing in-region, no egress, no local download:
pip install pfc-migrate[all]
# Convert an entire S3 prefix in-place — no download required
pfc-migrate s3 --bucket my-logs --prefix 2025/ \
  --out-bucket my-logs-pfc --out-prefix pfc/
The original files stay untouched. The converted .pfc archives land in the output prefix, immediately queryable with DuckDB.
Have legacy log formats? Apache CLF, nginx access logs, CSV, or old-style plain text logs — pfc-convert converts them to structured JSONL first, then to .pfc. Even if they're stored compressed:
pip install pfc-convert
# Convert Apache/nginx logs to JSONL → .pfc in one step
pfc-convert apache --input access.log.gz --output access.pfc
# NASA-95 / CLF format
pfc-convert clf --input NASA_access_log_Jul95.gz --output nasa95.pfc
For the common case, legacy-format archives already sitting compressed on S3, combine both:
- pfc-convert handles the schema transformation (CLF → JSONL)
- pfc-migrate handles the storage migration (S3 in-region, no egress)
The result: your existing archive — years of logs, in whatever format, wherever they sit — becomes a queryable PFC archive without a single byte leaving your cloud region.
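For readers who want to see what the CLF → JSONL step involves, the sketch below parses one Common Log Format line into a JSON object with a timestamp field. It is an illustration only, not pfc-convert's implementation: the output field names and the clf_to_json helper are assumptions, and the sample line is made up.

```python
import json
import re
from datetime import datetime
from typing import Optional

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def clf_to_json(line: str) -> Optional[str]:
    """Convert one CLF line to a JSON string, or return None if it does not match."""
    m = CLF_PATTERN.match(line.strip())
    if m is None:
        return None
    # %b (month abbreviation) is locale-dependent; this assumes an English locale.
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    record = {
        "timestamp": ts.isoformat(),  # hypothetical output schema
        "host": m["host"],
        "request": m["request"],
        "status": int(m["status"]),
        "bytes": 0 if m["size"] == "-" else int(m["size"]),
    }
    return json.dumps(record)


sample = '192.0.2.10 - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 6245'
print(clf_to_json(sample))
```

Lines that do not match return None, so a caller can count or log them separately instead of dropping them silently.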
1-hour query on a 30-day, 1 GB archive — measured:
| Tool | Query Time | Data Downloaded | Rows Returned |
|---|---|---|---|
| PFC-JSONL + DuckDB | 2.8–5.3 s | ~30–80 MB (1–2 blocks) | 73K–239K |
| gzip | ~70–100 s | Entire file (1 GB) | — |
| zstd | ~8–10 s | Entire file (1 GB) | — |
PFC reads 1–2 blocks out of 32. gzip and zstd decompress the entire archive before a single row can be returned — regardless of storage tier.
v5.6.5 is a complete rewrite. Same CLI, same file format compatibility — significantly faster and more efficient under the hood.
Measured on the same 200 MB API access log benchmark:
| | v3.4.x | v5.6.5 | Improvement |
|---|---|---|---|
| Compress speed | 22 MB/s | 40–61 MB/s | up to 2.8× faster |
| Decompress speed | 27 MB/s | 45–62 MB/s | up to 2.3× faster |
| Ratio (API access logs) | 8.98% | 8.44% | 6% better |
| Binary size | ~40 MB | 4 MB | 10× smaller |
| Runtime required | Python 3.8+ | None | drop-in binary |
| Platforms | Linux x64, macOS ARM64 | Linux x64, macOS ARM64, macOS Intel x64, Windows | 2 → 4 platforms |
v3.4 shipped as a ~40 MB Nuitka-compiled Python bundle. v5.6.5 is a single native 4 MB binary — no Python, no pip, no virtualenv. Download, make executable, run.
v5.6.5 is fully backward compatible. Every .pfc file created with v3.4 is readable by v5.6.5 without conversion. Upgrade the binary, keep your archives.
v3.4 ran on Linux x64 and macOS ARM64 only — the Python bundle made Windows and macOS Intel impractical.
v5.6.5 is written in Rust and cross-compiles cleanly to all major platforms:
| Platform | v3.4.x | v5.6.5 |
|---|---|---|
| Linux x86_64 | ✅ | ✅ |
| macOS ARM64 (Apple Silicon) | ✅ | ✅ |
| macOS Intel x64 | ❌ | ✅ |
| Windows x64 (native) | ❌ (WSL2 only) | ✅ |
The CLI is unchanged: compress, decompress, query, and seek-blocks work exactly as before. All ecosystem tools (pfc-gateway, pfc-duckdb, pfc-archiver-\*, pfc-export-\*, pfc-ingest-\*) are compatible without any code changes.
# Linux x64
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
  -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help

# macOS ARM64 (Apple Silicon)
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
  -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help

# macOS Intel x64
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
  -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help

# Windows x64 (PowerShell)
# Download to a folder in your PATH, e.g. C:\bin\ (create if it doesn't exist)
Invoke-WebRequest `
  -Uri "https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe" `
  -OutFile "C:\bin\pfc_jsonl.exe"
pfc_jsonl --help

Tip: Add C:\bin to your system PATH so pfc_jsonl works from any terminal.
Query .pfc files directly from DuckDB SQL — no intermediate decompression step:
INSTALL pfc FROM community;
LOAD pfc;
LOAD json;
-- Read all lines
SELECT line->>'$.level' AS level, line->>'$.message' AS msg
FROM read_pfc_jsonl('/path/to/events.pfc')
LIMIT 10;
-- Block-level timestamp filter: only decompress relevant blocks
-- Returns all rows from the matching blocks (block granularity ~30–40 min per block)
SELECT count(*)
FROM read_pfc_jsonl(
'/path/to/events.pfc',
ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
ts_to = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
);
-- For exact row-level filtering, add a WHERE clause:
SELECT line->>'$.level' AS level, line->>'$.timestamp' AS ts
FROM read_pfc_jsonl(
'/path/to/events.pfc',
ts_from = epoch(TIMESTAMPTZ '2026-01-01 10:00:00+00'),
ts_to = epoch(TIMESTAMPTZ '2026-01-01 11:00:00+00')
)
WHERE line->>'$.timestamp' >= '2026-01-01T10:00:00Z'
  AND line->>'$.timestamp' < '2026-01-01T11:00:00Z';
How block-level filtering works: ts_from/ts_to select which compressed blocks to decompress; blocks that have no overlap with the requested range are skipped entirely. Each block typically covers 30–40 minutes of data for a 1 GB, 1-day archive. The returned rows come from all matching blocks, which may extend slightly beyond the exact window boundary. For exact row-level time filtering, add a WHERE clause on the timestamp field as shown above.
The narrower your time window, the fewer blocks are read. A 5-minute incident window on a 30-day archive typically reads 1–2 blocks out of hundreds. Even a rough window ("sometime Tuesday") reduces the read to ~3% of the archive, whereas gzip and zstd decompress everything regardless.
Use pfc_jsonl info --blocks archive.pfc to see which time ranges each block covers.
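The two-stage behaviour described above (coarse block selection, then an exact row filter) can be illustrated with a small in-memory model. This is a sketch of the idea, not the extension's code: the Block class, the helper names, and the use of an integer ts field are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Block:
    ts_min: int   # earliest epoch second in the block
    ts_max: int   # latest epoch second in the block
    lines: list   # decompressed JSONL lines (strings)


def make_block(timestamps):
    """Build a toy block whose rows carry the given epoch timestamps."""
    lines = [json.dumps({"ts": t, "level": "INFO"}) for t in timestamps]
    return Block(min(timestamps), max(timestamps), lines)


def select_blocks(blocks, ts_from, ts_to):
    """Stage 1: keep blocks whose time range overlaps [ts_from, ts_to)."""
    return [b for b in blocks if b.ts_max >= ts_from and b.ts_min < ts_to]


def filter_rows(blocks, ts_from, ts_to):
    """Stage 2: exact row-level filter, the job of the WHERE clause."""
    rows = (json.loads(line) for b in blocks for line in b.lines)
    return [r for r in rows if ts_from <= r["ts"] < ts_to]


blocks = [
    make_block([0, 600, 1200]),
    make_block([1800, 2400, 3000]),
    make_block([3600, 4200, 4800]),
]
hit = select_blocks(blocks, 1000, 2000)
coarse_rows = sum(len(b.lines) for b in hit)
exact = filter_rows(hit, 1000, 2000)
print(len(hit), coarse_rows, [r["ts"] for r in exact])  # 2 6 [1200, 1800]
```

Stage 1 returns six rows from two blocks even though only two rows fall inside the window, which is why the exact filter is needed when boundaries matter.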
Dirty input: pfc_jsonl is a byte-faithful compressor. It accepts and preserves any input, including UTF-8, emoji, schema-drifted fields, and truncated lines. The DuckDB extension silently skips empty lines, non-JSON-object lines, and malformed JSON during queries (no panic, no abort). If you need to audit or count every raw line, including malformed ones, use pfc_jsonl decompress and parse the output directly.
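If you do decompress and audit the raw lines yourself, a tolerant parser along these lines counts what a query would have skipped. It is a minimal sketch: the three skip categories mirror the description above, but the parse_tolerant helper is hypothetical and is not the extension's own logic.

```python
import json


def parse_tolerant(lines):
    """Return (records, skipped) where skipped counts lines that are not JSON objects."""
    skipped = {"empty": 0, "non_object": 0, "malformed": 0}
    records = []
    for raw in lines:
        text = raw.strip()
        if not text:
            skipped["empty"] += 1
            continue
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            skipped["malformed"] += 1  # e.g. a truncated line
            continue
        if not isinstance(value, dict):
            skipped["non_object"] += 1  # valid JSON, but an array or scalar
            continue
        records.append(value)
    return records, skipped


sample_lines = [
    '{"level": "INFO", "msg": "ok"}',
    "",
    "[1, 2, 3]",
    '{"level": "ERR',
    "42",
]
records, skipped = parse_tolerant(sample_lines)
print(len(records), skipped)
```

In practice you would pass an open file handle over the decompressed output instead of the in-memory sample list.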
Full-scan queries (no ts_from/ts_to) decompress all blocks, which is useful for analytics like COUNT(*) WHERE status = 404 across a full archive. RAM usage scales with file size and parallel worker count (~350 MB/worker × 6 workers ≈ 2 GB per query). For concurrent multi-user workloads, plan accordingly.
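For capacity planning, the RAM figure above can be turned into a quick estimate. This sketch assumes memory scales linearly with the number of concurrent full-scan queries, which is a simplification; the per-worker and worker-count figures are the ones quoted above.

```python
MB_PER_WORKER = 350      # figure quoted above
WORKERS_PER_QUERY = 6    # figure quoted above


def full_scan_ram_gb(concurrent_queries: int) -> float:
    """Estimated RAM for N concurrent full scans, assuming linear scaling."""
    return concurrent_queries * WORKERS_PER_QUERY * MB_PER_WORKER / 1000.0


def max_concurrent_full_scans(available_ram_gb: float) -> int:
    """How many full scans fit into the given RAM budget under the same assumption."""
    return int(available_ram_gb // full_scan_ram_gb(1))


print(full_scan_ram_gb(1))            # 2.1
print(max_concurrent_full_scans(16))  # 7
```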
The DuckDB extension calls pfc_jsonl as a subprocess. Install the binary first (see above).
See pfc-duckdb on GitHub for manual install instructions.
Plug PFC-JSONL into your existing logging or metrics pipeline. All ingest tools buffer data locally, compress when the buffer is full, and optionally upload to S3.
| Tool | Protocol / Format | Port | Repo |
|---|---|---|---|
| pfc-fluentbit | Fluent Bit TCP output | — | Fluent Bit → .pfc |
| pfc-vector | HTTP sink (JSON / NDJSON) | 8766 | Vector.dev → .pfc |
| pfc-telegraf | HTTP (InfluxDB line protocol + JSON) | 8767 | Telegraf → .pfc |
| pfc-otel-collector | OTLP/HTTP (logs, traces, metrics) | 4318 | OpenTelemetry → .pfc |
| pfc-kafka-consumer | Kafka / Redpanda consumer | — | Kafka topic → .pfc |
| pfc-gateway ↕ | HTTP REST POST /ingest | 8765 | Any source → .pfc (+ query) |
pfc-gateway is bidirectional: it accepts ingest via POST /ingest and serves queries via POST /query. No DuckDB required.
The fastest way to query .pfc archives locally — see the DuckDB Extension section above.
Query .pfc archives over HTTP without DuckDB — works with any language, curl, Grafana, or PowerBI:
# Start the gateway (points at your archive directory)
PFC_ARCHIVE_DIR=/var/lib/pfc PFC_API_KEY=secret \
python3 pfc_gateway.py --port 8765
# Query a time range
curl -X POST http://localhost:8765/query \
-H "x-api-key: secret" \
-H "Content-Type: application/json" \
-d '{
"file": "/var/lib/pfc/logs_20260101.pfc",
"from_ts": "2026-01-01T10:00:00Z",
"to_ts": "2026-01-01T11:00:00Z"
}'
# Query multiple files at once
curl -X POST http://localhost:8765/query/batch \
-H "x-api-key: secret" \
-H "Content-Type: application/json" \
  -d '{"files": ["/var/lib/pfc/logs_20260101.pfc", "/var/lib/pfc/logs_20260102.pfc"]}'
Also supports Grafana: see pfc-grafana for the native Grafana datasource plugin. See pfc-gateway on GitHub for full documentation.
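For callers that prefer Python over curl, the same single-file query can be built with the standard library. This is a sketch: the endpoint, header names, and JSON fields are copied from the curl example above, while the build_query_request helper is hypothetical and nothing is assumed about the response format.

```python
import json
import urllib.request


def build_query_request(base_url, api_key, file_path, from_ts, to_ts):
    """Build (but do not send) a POST /query request for one archive file."""
    body = json.dumps(
        {"file": file_path, "from_ts": from_ts, "to_ts": to_ts}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


req = build_query_request(
    "http://localhost:8765",
    "secret",
    "/var/lib/pfc/logs_20260101.pfc",
    "2026-01-01T10:00:00Z",
    "2026-01-01T11:00:00Z",
)
print(req.get_method(), req.full_url)

# Sending requires a running gateway. The response is read as raw bytes
# because its format is not specified here:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     raw = resp.read()
```

Separating request construction from sending keeps the helper easy to test without a gateway running.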
Already have logs stored as gzip, zstd, bzip2, or lz4 — on disk, on S3, on Azure, or on GCS?
pfc-migrate converts them in one command, directly in your storage (no egress charges):
pip install pfc-migrate[all]
# Local
pfc-migrate convert --dir /var/log/archive/ --output-dir /var/log/pfc/ -v
# S3
pfc-migrate s3 --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc --out-prefix pfc/
# Azure Blob
pfc-migrate azure --container my-logs --prefix 2025/ --out-container my-logs-pfc --connection-string "..."
# GCS
pfc-migrate gcs --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc
Use the pfc Python package (PyPI: pfc-jsonl) to compress, decompress, and query .pfc files from Python:
pip install pfc-jsonl
import pfc
pfc.compress("logs/app.jsonl", "logs/app.pfc")
pfc.query("logs/app.pfc",
          from_ts="2026-01-15T08:00:00",
          to_ts="2026-01-15T09:00:00",
          output_path="logs/morning.jsonl")
| Command | Description |
|---|---|
| pfc_jsonl compress <input> <output> | Compress JSONL → .pfc + .pfc.bidx |
| pfc_jsonl decompress <input> <output> | Full decompression |
| pfc_jsonl query <input> --from X --to Y --out <output> | Decompress blocks matching time range |
| pfc_jsonl seek-block N <input> [output] | Extract single block by index |
| pfc_jsonl seek-blocks <input> --blocks N [N...] | Extract multiple blocks (DuckDB primitive) |
| pfc_jsonl info <input> | Show block table + timestamp ranges |
One JSON object per line with a timestamp field:
{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR", "service": "api", "msg": "timeout"}
{"timestamp": "2025-01-15T06:32:12Z", "level": "INFO", "service": "db", "msg": "query_ok"}
Supported timestamp fields: timestamp, ts, time, @timestamp (ISO 8601 or Unix epoch seconds).
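To check before compressing that your records carry a usable timestamp, a helper like the one below can normalise the supported fields to epoch seconds. It is a sketch under assumptions: the field lookup order and the treatment of timezone-less strings as UTC are choices made for this example, not documented PFC behaviour.

```python
from datetime import datetime, timezone

# Field names from the list above; the lookup order is an assumption.
TIMESTAMP_FIELDS = ("timestamp", "ts", "time", "@timestamp")


def extract_epoch(record: dict):
    """Return the record's timestamp as epoch seconds, or None if none is usable."""
    for field in TIMESTAMP_FIELDS:
        if field not in record:
            continue
        value = record[field]
        if isinstance(value, bool):
            continue  # bool is an int subclass; never a timestamp
        if isinstance(value, (int, float)):
            return float(value)  # already Unix epoch seconds
        if isinstance(value, str):
            try:
                # Older Python versions do not accept a trailing "Z".
                parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
            except ValueError:
                continue
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
            return parsed.timestamp()
    return None


print(extract_epoch({"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR"}))
print(extract_epoch({"ts": 1736922731, "level": "INFO"}))
print(extract_epoch({"msg": "no timestamp here"}))
```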
PFC divides JSONL logs into independent blocks (configurable, default 32 MiB).
Each block is compressed with a BWT-based transform pipeline optimized for structured log data.
Block timestamp ranges are stored in .pfc.bidx (32 bytes/block, binary).
To query a time range, only the relevant blocks are decompressed — the rest is never read. The block index enables HTTP Range requests on S3 and Glacier — fetch only the blocks you need.
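The step from block index to HTTP Range requests can be sketched as follows. The on-disk .pfc.bidx layout is not described here, so the IndexEntry fields (time range, byte offset, length) are hypothetical stand-ins held in memory; only the overlap test and the Range header syntax are standard.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Hypothetical in-memory view of one block index entry."""
    ts_min: int   # earliest epoch second in the block
    ts_max: int   # latest epoch second in the block
    offset: int   # byte offset of the compressed block in the archive
    length: int   # compressed block length in bytes


def byte_ranges(index, ts_from, ts_to):
    """Return merged (start, end_inclusive) byte ranges for overlapping blocks."""
    ranges = []
    for entry in sorted(index, key=lambda e: e.offset):
        if entry.ts_max < ts_from or entry.ts_min >= ts_to:
            continue  # no overlap with the requested window
        start, end = entry.offset, entry.offset + entry.length - 1
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)  # contiguous: extend previous range
        else:
            ranges.append((start, end))
    return ranges


def range_header(start, end):
    """Format an HTTP Range header value (end offset is inclusive)."""
    return f"bytes={start}-{end}"


index = [
    IndexEntry(0, 99, 0, 1000),
    IndexEntry(100, 199, 1000, 1000),
    IndexEntry(200, 299, 2000, 1000),
    IndexEntry(300, 399, 3000, 1000),
]
for start, end in byte_ranges(index, 150, 250):
    print(range_header(start, end))  # bytes=1000-2999
```

Merging adjacent blocks into one range keeps the number of requests low, which matters on storage that bills per request.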
Ingest
- pfc-fluentbit — Fluent Bit TCP output → PFC
- pfc-vector — Vector.dev HTTP sink → PFC (Rust)
- pfc-telegraf — Telegraf HTTP output plugin → PFC
- pfc-otel-collector — OpenTelemetry OTLP/HTTP → PFC
- pfc-kafka-consumer — Kafka / Redpanda consumer → PFC
Query & Gateway
- pfc-gateway — HTTP REST API: ingest + query, no DuckDB required
- pfc-duckdb — DuckDB community extension for SQL queries on PFC files
- pfc-grafana — Grafana datasource plugin for PFC archives
Archive & Migration
- pfc-migrate — convert gzip/zstd/lz4/bz2 archives → PFC (local, S3, Azure, GCS)
- pfc-migrate-parquet — convert Apache Parquet files → PFC (streaming, in-region, all compression variants)
- pfc-convert — convert Apache CLF, nginx, CSV, NDJSON → JSONL → PFC (schema conversion)
- pfc-ingest-watchdog — auto-convert when new files arrive in folder or S3 (calls pfc-convert or pfc-migrate)
- pfc-export-cratedb — one-shot CrateDB table export → PFC
- pfc-export-questdb — one-shot QuestDB table export → PFC
- pfc-export-clickhouse — one-shot ClickHouse table export → PFC
- pfc-export-influxdb — one-shot InfluxDB 2.x measurement export → PFC
- pfc-export-timescaledb — one-shot TimescaleDB table export → PFC
- pfc-archiver-cratedb — autonomous archive daemon for CrateDB
- pfc-archiver-questdb — autonomous archive daemon for QuestDB
- pfc-archiver-clickhouse — autonomous archive daemon for ClickHouse
- pfc-archiver-influxdb — autonomous archive daemon for InfluxDB 2.x
- pfc-archiver-timescaledb — autonomous archive daemon for TimescaleDB
SDK
- pfc-py — Python client library (PyPI: pfc-jsonl)
PFC-JSONL is free for personal and open-source use.
Commercial use (production pipelines, paid services, or business operations) requires a license. Contact: info@impossibleforge.com
Built by ImpossibleForge