PFC-JSONL — Queryable Cold Storage for JSONL Logs

Most teams keep logs on expensive hot storage for weeks — not because they query them every day, but because cold storage can't answer questions when they need it.

PFC-JSONL stores a block-level timestamp index alongside every compressed file. Query any one-hour window from S3, Glacier Instant Retrieval, or any byte-range-capable storage — without decompressing the full archive.

Query any 1-hour window in under 6 seconds via the DuckDB Community Extension, pfc-gateway, or Grafana, directly from S3 and without downloading the full file. A 1-hour query reads ~30–80 MB; gzip reads the full archive every time. Benchmarked across 10 enterprise log types: archives compress to 5%–13% of the original size, 28%–58% smaller than gzip and 33%–58% smaller than zstd. → Full benchmark report

Why PFC-JSONL?

Cold storage that queries like hot storage

Engineering teams keep logs warm — on Elasticsearch, Datadog, Loki, or a fast S3 tier — for weeks longer than they need to. Not because they query them every day. Because the one time they need to look back — an incident discovered late, a customer dispute from last week, a security audit covering the past 90 days — cold storage offers nothing useful without a full download and decompress.

PFC-JSONL is built around a different assumption. Every archive carries a block-level timestamp index (.pfc.bidx). A time-range query reads only the blocks that cover the requested window — via HTTP Range requests, directly from S3, Glacier Instant Retrieval, or any byte-range-capable storage. Everything else stays on disk, untouched.

The warm tier exists because cold storage can't answer questions. Remove that constraint and the tier disappears.

A one-hour query on a 30-day, 1 GB archive — measured across 10 log types:

Tool	1-hour query time	Data read from storage	Full download required?
PFC-JSONL + DuckDB	2.8–5.3 s	~30–80 MB (1–2 of 32 blocks)	No
gzip	~70–100 s	Entire 1 GB file	Yes
zstd	~8–10 s	Entire 1 GB file	Yes

What this means in practice — for any setup

Every team's numbers look different. The direction is always the same.

Teams with 3-day hot retention and 30-day warm storage can move warm to cold on day 3 — and still return a forensics query in seconds when it surfaces on day 22. Teams paying for 30-day Elasticsearch clusters to cover incident investigations they run twice a month can drop to S3 cold and keep the same query workflow via DuckDB. Compliance archives that live on S3 for 7 years as write-only stores become queryable for their entire retention period — no restore step, no re-hydration window.

Queries run via the DuckDB Community Extension (INSTALL pfc FROM community) — standard SQL, no new tooling to learn. For teams not running DuckDB, pfc-gateway exposes the same queries as a REST API, and the Grafana plugin connects directly to PFC archives for dashboard panels. The query interface fits whatever your team already uses.

The savings compound in two directions: smaller archives mean less data stored, and block-level queries mean far less data read per lookup.

S3 storage tiers and PFC compatibility (AWS us-east-1 reference pricing — rates vary):

Tier	Cost/GB/month	HTTP Range access	PFC queryable
S3 Standard	$0.023	✓	✓
S3 Standard-IA	$0.0125	✓	✓
S3 Glacier Instant Retrieval	$0.004	✓	✓
S3 Glacier Flexible Retrieval	$0.0036	Restore required	—
S3 Glacier Deep Archive	$0.00099	Restore required	—

Query cost per 1-hour window on S3 Standard-IA (egress + retrieval):

gzip/zstd: ~$0.10 per query (full 1 GB download, every time)
PFC: ~$0.007 per query (1–2 blocks, ~30–80 MB)

At 1,000 queries/month: $100 with gzip vs. $7 with PFC. The storage savings and the query savings move in the same direction.
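
The per-query figures above come down to simple arithmetic. The sketch below assumes a combined rate of about $0.10 per GB read ($0.09 internet egress plus $0.01 Standard-IA retrieval, both assumed us-east-1 list prices that vary by region and volume). The helper name `query_cost` is hypothetical.

```python
# Assumed rates (not taken from the benchmark): check current AWS pricing.
EGRESS_PER_GB = 0.09      # assumed internet egress, USD per GB
RETRIEVAL_PER_GB = 0.01   # assumed Standard-IA retrieval fee, USD per GB


def query_cost(megabytes_read: float) -> float:
    """Approximate USD cost of reading `megabytes_read` MB in one query."""
    gigabytes = megabytes_read / 1000
    return gigabytes * (EGRESS_PER_GB + RETRIEVAL_PER_GB)


full_download = query_cost(1000)   # gzip/zstd: the whole 1 GB archive
pfc_low = query_cost(30)           # PFC, lower end of the 30 to 80 MB range
pfc_high = query_cost(80)          # PFC, upper end of the range

print(f"gzip/zstd: ${full_download:.3f} per query")
print(f"PFC:       ${pfc_low:.3f} to ${pfc_high:.3f} per query")
print(f"1,000 queries: ${full_download * 1000:.0f} vs. at most ${pfc_high * 1000:.0f}")
```

Under these assumptions, the ~$0.007 figure corresponds to a query that reads roughly 70 MB.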

S3 Glacier Flexible Retrieval and Deep Archive require a restore step before access — those tiers are best suited for true long-term archival where queries are not expected.

Smaller archives — on every log type

PFC-JSONL is purpose-built for structured JSONL log data. It recognises the patterns that repeat across log types — field names, log levels, HTTP status codes, Kubernetes states, timestamps, IP addresses, and more — and encodes them efficiently before compression. The result is consistently better ratios than general-purpose compressors, across every log type, at 1 GB scale.

Smaller archives lower storage costs permanently on any tier. Combined with block-level queries, smaller blocks also mean less data read per lookup and lower per-query egress costs.

Results across 10 enterprise log types — infrastructure, Kubernetes, API access, auth, network, streaming, ops, cloud, application, and transaction logs — all measured at 1 GB:

Best case: Infrastructure / System logs at 5.31% — 46% smaller than gzip-9, 58% smaller than zstd-3
Typical range: Kubernetes 7.97%, Auth 9.13%, Application 10.56%
Highest volume: API Access 12.76% — still 28% smaller than gzip, 33% smaller than zstd-3
PFC-JSONL wins on every log type — by 26%–58% over gzip, 33%–58% over zstd

→ Full benchmark report — compress/decompress speed, DuckDB query times, all 10 log types

Already storing logs in gzip or zstd? You don't have to start from scratch.

The most common objection: "My 50 TB are already compressed on S3 — I'd have to download everything to convert."

You don't. pfc-migrate converts gzip, zstd, bzip2, and lz4 archives to PFC directly in your S3 bucket — reading and writing in-region, no egress, no local download:

# Quotes keep shells such as zsh from expanding the brackets
pip install "pfc-migrate[all]"

# Convert an entire S3 prefix in-place — no download required
pfc-migrate s3 --bucket my-logs --prefix 2025/ \
               --out-bucket my-logs-pfc --out-prefix pfc/

The original files stay untouched. The converted .pfc archives land in the output prefix, immediately queryable with DuckDB.

Have legacy log formats? Apache CLF, nginx access logs, CSV, or old-style plain text logs — pfc-convert converts them to structured JSONL first, then to .pfc. Even if they're stored compressed:

pip install pfc-convert

# Convert Apache/nginx logs to JSONL → .pfc in one step
pfc-convert apache --input access.log.gz --output access.pfc

# NASA-95 / CLF format
pfc-convert clf --input NASA_access_log_Jul95.gz --output nasa95.pfc

For the common case — legacy format archives already sitting compressed on S3 — combine both:

pfc-convert handles the schema transformation (CLF → JSONL)
pfc-migrate handles the storage migration (S3 in-region, no egress)

The result: your existing archive — years of logs, in whatever format, wherever they sit — becomes a queryable PFC archive without a single byte leaving your cloud region.

1-hour query on a 30-day, 1 GB archive — measured:

Tool	Query Time	Data Downloaded	Rows Returned
PFC-JSONL + DuckDB	2.8–5.3 s	~30–80 MB (1–2 blocks)	73K–239K
gzip	~70–100 s	Entire file (1 GB)	—
zstd	~8–10 s	Entire file (1 GB)	—

PFC reads 1–2 blocks out of 32. gzip and zstd decompress the entire archive before a single row can be returned — regardless of storage tier.

What's New in v5.6.5

v5.6.5 is a complete rewrite. Same CLI, same file format compatibility — significantly faster and more efficient under the hood.

Faster compression and decompression

Measured on the same 200 MB API access log benchmark:

Metric	v3.4.x	v5.6.5	Improvement
Compress speed	22 MB/s	40–61 MB/s	up to 2.8× faster
Decompress speed	27 MB/s	45–62 MB/s	up to 2.3× faster
Ratio (API access logs)	8.98%	8.44%	6% better
Binary size	~40 MB	4 MB	10× smaller
Runtime required	Python 3.8+	None	drop-in binary
Platforms	Linux x64, macOS ARM64	Linux x64, macOS ARM64, macOS Intel x64, Windows	2 → 4 platforms

A 4 MB binary. No runtime. No dependencies.

v3.4 shipped as a ~40 MB Nuitka-compiled Python bundle. v5.6.5 is a single native 4 MB binary — no Python, no pip, no virtualenv. Download, make executable, run.

Reads all existing v3.4 archives

v5.6.5 is fully backward compatible. Every .pfc file created with v3.4 is readable by v5.6.5 without conversion. Upgrade the binary, keep your archives.

Now available on macOS Intel and Windows

v3.4 ran on Linux x64 and macOS ARM64 only — the Python bundle made Windows and macOS Intel impractical.

v5.6.5 is written in Rust and cross-compiles cleanly to all major platforms:

Platform	v3.4.x	v5.6.5
Linux x86_64	✅	✅
macOS ARM64 (Apple Silicon)	✅	✅
macOS Intel x64	❌	✅
Windows x64 (native)	❌ (WSL2 only)	✅

Drop-in replacement for the entire ecosystem

The CLI is unchanged: compress, decompress, query, and seek-blocks work exactly as before. All ecosystem tools (pfc-gateway, pfc-duckdb, pfc-archiver-*, pfc-export-*, pfc-ingest-*) are compatible without any code changes.

Install

Linux x86_64

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

macOS (Apple Silicon — M1/M2/M3/M4)

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

macOS Intel (x64)

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

Windows x64 (native — no WSL2 required)

# Download to a folder in your PATH, e.g. C:\bin\ (create if it doesn't exist)
Invoke-WebRequest `
  -Uri "https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe" `
  -OutFile "C:\bin\pfc_jsonl.exe"

pfc_jsonl --help

Tip: Add C:\bin to your system PATH so pfc_jsonl works from any terminal.

DuckDB Extension

Query .pfc files directly from DuckDB SQL — no intermediate decompression step:

INSTALL pfc FROM community;
LOAD pfc;
LOAD json;

-- Read all lines
SELECT line->>'$.level' AS level, line->>'$.message' AS msg
FROM read_pfc_jsonl('/path/to/events.pfc')
LIMIT 10;

-- Block-level timestamp filter: only decompress relevant blocks
-- Returns all rows from the matching blocks (block granularity ~30–40 min per block)
SELECT count(*)
FROM read_pfc_jsonl(
    '/path/to/events.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
);

-- For exact row-level filtering, add a WHERE clause:
SELECT line->>'$.level' AS level, line->>'$.timestamp' AS ts
FROM read_pfc_jsonl(
    '/path/to/events.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 10:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-01 11:00:00+00')
)
WHERE line->>'$.timestamp' >= '2026-01-01T10:00:00Z'
  AND line->>'$.timestamp' <  '2026-01-01T11:00:00Z';

How block-level filtering works: ts_from/ts_to selects which compressed blocks to decompress — blocks that have no overlap with the requested range are skipped entirely. Each block typically covers 30–40 minutes of data for a 1 GB, 1-day archive. The returned rows come from all matching blocks, which may extend slightly beyond the exact window boundary. For exact row-level time filtering, add a WHERE clause on the timestamp field as shown above.
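
The two-stage behaviour described above (coarse block selection, then exact row filtering) can be sketched in plain Python. This is an illustration only: the `Block` tuple and the helper functions are hypothetical and do not reflect the actual `.pfc.bidx` layout or the extension's internals.

```python
import json
from datetime import datetime, timezone
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical index entry: the time range covered by one block."""
    index: int
    ts_min: float  # earliest timestamp in the block, epoch seconds
    ts_max: float  # latest timestamp in the block, epoch seconds


def overlapping_blocks(blocks, ts_from, ts_to):
    """Stage 1: keep blocks whose range overlaps the window [ts_from, ts_to)."""
    return [b for b in blocks if b.ts_max >= ts_from and b.ts_min < ts_to]


def rows_in_window(lines, ts_from, ts_to, field="timestamp"):
    """Stage 2: exact row-level filter, the equivalent of the WHERE clause."""
    for line in lines:
        row = json.loads(line)
        ts = datetime.fromisoformat(row[field].replace("Z", "+00:00")).timestamp()
        if ts_from <= ts < ts_to:
            yield row


def epoch(iso):
    """Interpret a naive ISO 8601 string as UTC and return epoch seconds."""
    return datetime.fromisoformat(iso).replace(tzinfo=timezone.utc).timestamp()


# Three 40-minute blocks, the first one starting at 09:20 UTC.
start = epoch("2026-01-01T09:20:00")
blocks = [Block(i, start + i * 2400, start + (i + 1) * 2400 - 1) for i in range(3)]

ts_from, ts_to = epoch("2026-01-01T10:00:00"), epoch("2026-01-01T11:00:00")
selected = overlapping_blocks(blocks, ts_from, ts_to)
print([b.index for b in selected])  # blocks 1 and 2 overlap the window

# Block 2 runs until 11:20, so it also holds rows past the window end.
lines = [
    '{"timestamp": "2026-01-01T10:59:59Z", "level": "INFO"}',
    '{"timestamp": "2026-01-01T11:05:00Z", "level": "ERROR"}',
]
print(list(rows_in_window(lines, ts_from, ts_to)))  # only the 10:59:59 row
```

Block 0 ends at 09:59:59 and is skipped without being read. Block 2 is read in full even though only its first 20 minutes fall inside the window, which is why the second stage is needed for exact results.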

The narrower your time window, the fewer blocks are read. A 5-minute incident window on a 30-day archive typically reads 1–2 blocks out of hundreds. Even a rough window ("sometime Tuesday") reduces the read to ~3% of the archive — versus full decompression with gzip or zstd regardless.

Use pfc_jsonl info --blocks archive.pfc to see which time ranges each block covers.

Dirty input: pfc_jsonl is a byte-faithful compressor — it accepts and preserves any input including UTF-8, emoji, schema-drifted fields, and truncated lines. The DuckDB extension silently skips empty lines, non-JSON-object lines, and malformed JSON during queries (no panic, no abort). If you need to audit or count every raw line including malformed ones, use pfc_jsonl decompress and parse the output directly.

Full-scan queries (no ts_from/ts_to) decompress all blocks — useful for analytics like COUNT(*) WHERE status = 404 across a full archive. RAM usage scales with file size and parallel worker count (~350 MB/worker × 6 workers = ~2 GB per query). For concurrent multi-user workloads, plan accordingly.

The DuckDB extension calls pfc_jsonl as a subprocess. Install the binary first (see above). See pfc-duckdb on GitHub for manual install instructions.

Ingest — Send Data to PFC

Plug PFC-JSONL into your existing logging or metrics pipeline. All ingest tools buffer data locally, compress when the buffer is full, and optionally upload to S3.

Tool	Protocol / Format	Port	Repo
pfc-fluentbit	Fluent Bit TCP output	—	Fluent Bit → `.pfc`
pfc-vector	HTTP sink (JSON / NDJSON)	8766	Vector.dev → `.pfc`
pfc-telegraf	HTTP (InfluxDB line protocol + JSON)	8767	Telegraf → `.pfc`
pfc-otel-collector	OTLP/HTTP (logs, traces, metrics)	4318	OpenTelemetry → `.pfc`
pfc-kafka-consumer	Kafka / Redpanda consumer	—	Kafka topic → `.pfc`
pfc-gateway ↕	HTTP REST `POST /ingest`	8765	Any source → `.pfc` (+ query)

pfc-gateway is bidirectional — it accepts ingest via POST /ingest and serves queries via POST /query. No DuckDB required.

Query — Read PFC Archives

DuckDB Extension

The fastest way to query .pfc archives locally — see the DuckDB Extension section above.

pfc-gateway — HTTP REST API

Query .pfc archives over HTTP without DuckDB. This works with any language, curl, Grafana, or Power BI:

# Start the gateway (points at your archive directory)
PFC_ARCHIVE_DIR=/var/lib/pfc PFC_API_KEY=secret \
  python3 pfc_gateway.py --port 8765

# Query a time range
curl -X POST http://localhost:8765/query \
  -H "x-api-key: secret" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "/var/lib/pfc/logs_20260101.pfc",
    "from_ts": "2026-01-01T10:00:00Z",
    "to_ts":   "2026-01-01T11:00:00Z"
  }'

# Query multiple files at once
curl -X POST http://localhost:8765/query/batch \
  -H "x-api-key: secret" \
  -H "Content-Type: application/json" \
  -d '{"files": ["/var/lib/pfc/logs_20260101.pfc", "/var/lib/pfc/logs_20260102.pfc"]}'

pfc-gateway also works with Grafana: see pfc-grafana for the native Grafana datasource plugin. See pfc-gateway on GitHub for full documentation.
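
For callers that prefer Python over curl, the single-file query above can be issued with the standard library alone. This is a minimal sketch: it reuses the endpoint, header, and body fields from the curl example, while the function names `build_query_request` and `run_query` are hypothetical. The response schema is not described here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_query_request(base_url, api_key, file, from_ts, to_ts):
    """Build the same POST /query request as the curl example."""
    body = json.dumps({"file": file, "from_ts": from_ts, "to_ts": to_ts})
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body.encode("utf-8"),
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


def run_query(request, timeout=30):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_query_request(
    base_url="http://localhost:8765",
    api_key="secret",
    file="/var/lib/pfc/logs_20260101.pfc",
    from_ts="2026-01-01T10:00:00Z",
    to_ts="2026-01-01T11:00:00Z",
)
print(request.get_method(), request.full_url)

# Sending requires a running gateway, so the call is left commented out:
# print(run_query(request))
```

Keeping request construction separate from sending makes the request easy to inspect or log before it goes over the wire.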

Migrate Existing Archives

Already have logs stored as gzip, zstd, bzip2, or lz4 — on disk, on S3, on Azure, or on GCS?

pfc-migrate converts them in one command, directly in your storage (no egress charges):

# Quotes keep shells such as zsh from expanding the brackets
pip install "pfc-migrate[all]"

# Local
pfc-migrate convert --dir /var/log/archive/ --output-dir /var/log/pfc/ -v

# S3
pfc-migrate s3 --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc --out-prefix pfc/

# Azure Blob
pfc-migrate azure --container my-logs --prefix 2025/ --out-container my-logs-pfc --connection-string "..."

# GCS
pfc-migrate gcs --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc

Python Package

Use the pfc Python package (PyPI: pfc-jsonl) to compress, decompress, and query .pfc files from Python:

pip install pfc-jsonl

import pfc

pfc.compress("logs/app.jsonl", "logs/app.pfc")
pfc.query("logs/app.pfc",
          from_ts="2026-01-15T08:00:00",
          to_ts="2026-01-15T09:00:00",
          output_path="logs/morning.jsonl")

Commands

Command	Description
`pfc_jsonl compress <input> <output>`	Compress JSONL → `.pfc` + `.pfc.bidx`
`pfc_jsonl decompress <input> <output>`	Full decompression
`pfc_jsonl query <input> --from X --to Y --out <output>`	Decompress blocks matching time range
`pfc_jsonl seek-block N <input> [output]`	Extract single block by index
`pfc_jsonl seek-blocks <input> --blocks N [N...]`	Extract multiple blocks (DuckDB primitive)
`pfc_jsonl info <input>`	Show block table + timestamp ranges

Input Format

One JSON object per line with a timestamp field:

{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR", "service": "api", "msg": "timeout"}
{"timestamp": "2025-01-15T06:32:12Z", "level": "INFO",  "service": "db",  "msg": "query_ok"}

Supported timestamp fields: timestamp, ts, time, @timestamp (ISO 8601 or Unix epoch seconds).
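
Before compressing, it can help to check that each line carries one of the supported timestamp fields. The sketch below is a hypothetical pre-flight helper, not part of pfc_jsonl. The order in which fields are tried is an assumption, and ISO 8601 strings without a timezone are interpreted in local time by Python.

```python
import json
from datetime import datetime

TIMESTAMP_FIELDS = ("timestamp", "ts", "time", "@timestamp")


def extract_epoch(line):
    """Return the line's timestamp as epoch seconds, or None if unusable.

    Hypothetical helper: the field precedence (first match in
    TIMESTAMP_FIELDS) is an assumption, not documented behaviour.
    """
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(row, dict):
        return None
    for field in TIMESTAMP_FIELDS:
        value = row.get(field)
        if isinstance(value, bool):
            continue  # bool is a subclass of int; never a timestamp
        if isinstance(value, (int, float)):
            return float(value)  # already Unix epoch seconds
        if isinstance(value, str):
            try:
                return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
            except ValueError:
                continue  # not ISO 8601; try the next field
    return None


sample = [
    '{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR"}',
    '{"ts": 1736922732, "level": "INFO"}',
    '{"level": "WARN"}',
    "not json",
]
for line in sample:
    print(extract_epoch(line))
```

Lines that return `None` here have no usable timestamp and are worth reviewing before archiving.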

How It Works

PFC divides JSONL logs into independent blocks (configurable, default 32 MiB). Each block is compressed with a BWT-based transform pipeline optimized for structured log data. Block timestamp ranges are stored in .pfc.bidx (32 bytes/block, binary).

To query a time range, only the relevant blocks are decompressed — the rest is never read. The block index enables HTTP Range requests on S3 and Glacier — fetch only the blocks you need.
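
The link between the block index and HTTP Range requests can be illustrated with a short sketch. The `BlockEntry` fields below are hypothetical: the actual 32-byte `.pfc.bidx` record layout is not documented in this README, so offsets and lengths are assumed to be available in some form.

```python
from typing import NamedTuple


class BlockEntry(NamedTuple):
    """Hypothetical index entry; the real .pfc.bidx fields may differ."""
    offset: int   # byte offset of the compressed block inside the .pfc file
    length: int   # compressed length in bytes
    ts_min: int   # earliest timestamp in the block, epoch seconds
    ts_max: int   # latest timestamp in the block, epoch seconds


def range_headers(index, ts_from, ts_to):
    """Return one HTTP Range header value per block overlapping [ts_from, ts_to)."""
    headers = []
    for entry in index:
        if entry.ts_max >= ts_from and entry.ts_min < ts_to:
            last_byte = entry.offset + entry.length - 1  # Range ends are inclusive
            headers.append(f"bytes={entry.offset}-{last_byte}")
    return headers


index = [
    BlockEntry(offset=0, length=1000, ts_min=0, ts_max=3599),
    BlockEntry(offset=1000, length=1200, ts_min=3600, ts_max=7199),
    BlockEntry(offset=2200, length=900, ts_min=7200, ts_max=10799),
]
print(range_headers(index, ts_from=3600, ts_to=7200))  # ['bytes=1000-2199']
```

Each returned value can be sent as the `Range` header of a GET request against the archive object, so only the selected blocks leave storage.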

Related Repos

Ingest

pfc-fluentbit — Fluent Bit TCP output → PFC
pfc-vector — Vector.dev HTTP sink → PFC (Rust)
pfc-telegraf — Telegraf HTTP output plugin → PFC
pfc-otel-collector — OpenTelemetry OTLP/HTTP → PFC
pfc-kafka-consumer — Kafka / Redpanda consumer → PFC

Query & Gateway

pfc-grafana — Grafana datasource plugin for PFC archives
pfc-gateway — HTTP REST API: ingest + query, no DuckDB required
pfc-duckdb — DuckDB community extension for SQL queries on PFC files

Archive & Migration

pfc-migrate — convert gzip/zstd/lz4/bz2 archives → PFC (local, S3, Azure, GCS)
pfc-migrate-parquet — convert Apache Parquet files → PFC (streaming, in-region, all compression variants)
pfc-convert — convert Apache CLF, nginx, CSV, NDJSON → JSONL → PFC (schema conversion)
pfc-ingest-watchdog — auto-convert when new files arrive in folder or S3 (calls pfc-convert or pfc-migrate)
pfc-export-cratedb — one-shot CrateDB table export → PFC
pfc-export-questdb — one-shot QuestDB table export → PFC
pfc-export-clickhouse — one-shot ClickHouse table export → PFC
pfc-export-influxdb — one-shot InfluxDB 2.x measurement export → PFC
pfc-export-timescaledb — one-shot TimescaleDB table export → PFC
pfc-archiver-cratedb — autonomous archive daemon for CrateDB
pfc-archiver-questdb — autonomous archive daemon for QuestDB
pfc-archiver-clickhouse — autonomous archive daemon for ClickHouse
pfc-archiver-influxdb — autonomous archive daemon for InfluxDB 2.x
pfc-archiver-timescaledb — autonomous archive daemon for TimescaleDB

SDK

pfc-py — Python client library (PyPI: pfc-jsonl)

License

PFC-JSONL is free for personal and open-source use.

Commercial use (production pipelines, paid services, or business operations) requires a license. Contact: info@impossibleforge.com

Built by ImpossibleForge
