ImpossibleForge/pfc-jsonl

PFC-JSONL — Queryable Cold Storage for JSONL Logs

Most teams keep logs on expensive hot storage for weeks — not because they query them every day, but because cold storage can't answer questions when they need it.

PFC-JSONL stores a block-level timestamp index alongside every compressed file. Query any one-hour window from S3, Glacier Instant Retrieval, or any byte-range-capable storage — without decompressing the full archive.

Query any 1-hour window in under 6 seconds via the DuckDB Community Extension, pfc-gateway, or Grafana, directly from S3 and without downloading the full file. A 1-hour query reads ~30–80 MB; gzip reads the full archive every time. Benchmarked across 10 enterprise log types: compressed size is 5%–13% of the original, 28%–58% smaller than gzip and 33%–58% smaller than zstd. → Full benchmark report


Why PFC-JSONL?

Cold storage that queries like hot storage

Engineering teams keep logs warm — on Elasticsearch, Datadog, Loki, or a fast S3 tier — for weeks longer than they need to. Not because they query them every day. Because the one time they need to look back — an incident discovered late, a customer dispute from last week, a security audit covering the past 90 days — cold storage offers nothing useful without a full download and decompress.

PFC-JSONL is built around a different assumption. Every archive carries a block-level timestamp index (.pfc.bidx). A time-range query reads only the blocks that cover the requested window — via HTTP Range requests, directly from S3, Glacier Instant Retrieval, or any byte-range-capable storage. Everything else stays on disk, untouched.

The warm tier exists because cold storage can't answer questions. Remove that constraint and the tier disappears.

A one-hour query on a 30-day, 1 GB archive — measured across 10 log types:

| Tool | 1-hour query time | Data read from storage | Full download required? |
|---|---|---|---|
| PFC-JSONL + DuckDB | 2.8–5.3 s | ~30–80 MB (1–2 of 32 blocks) | No |
| gzip | ~70–100 s | Entire 1 GB file | Yes |
| zstd | ~8–10 s | Entire 1 GB file | Yes |

What this means in practice — for any setup

Every team's numbers look different. The direction is always the same.

Teams with 3-day hot retention and 30-day warm storage can move warm to cold on day 3 — and still return a forensics query in seconds when it surfaces on day 22. Teams paying for 30-day Elasticsearch clusters to cover incident investigations they run twice a month can drop to S3 cold and keep the same query workflow via DuckDB. Compliance archives that live on S3 for 7 years as write-only stores become queryable for their entire retention period — no restore step, no re-hydration window.

Queries run via the DuckDB Community Extension (INSTALL pfc FROM community) — standard SQL, no new tooling to learn. For teams not running DuckDB, pfc-gateway exposes the same queries as a REST API, and the Grafana plugin connects directly to PFC archives for dashboard panels. The query interface fits whatever your team already uses.

The savings compound in two directions: smaller archives mean less data stored, and block-level queries mean far less data read per lookup.

S3 storage tiers and PFC compatibility (AWS us-east-1 reference pricing — rates vary):

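| Tier | Cost/GB/month | HTTP Range access | PFC queryable |
|---|---|---|---|
| S3 Standard | $0.023 | Yes | Yes |
| S3 Standard-IA | $0.0125 | Yes | Yes |
| S3 Glacier Instant Retrieval | $0.004 | Yes | Yes |
| S3 Glacier Flexible Retrieval | $0.0036 | Restore required | After restore |
| S3 Glacier Deep Archive | $0.00099 | Restore required | After restore |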

Query cost per 1-hour window on S3 Standard-IA (egress + retrieval):

  • gzip/zstd: ~$0.10 per query (full 1 GB download, every time)
  • PFC: ~$0.007 per query (1–2 blocks, ~30–80 MB)

At 1,000 queries/month: $100 with gzip vs. $7 with PFC. The storage savings and the query savings move in the same direction.
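
The per-query figures above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming S3 Standard-IA retrieval at $0.01/GB and internet egress at $0.09/GB (neither rate is stated in this README, and both vary), and taking 70 MB as a representative PFC read from the ~30–80 MB range:

```python
# Assumed reference rates (not from this README; check current AWS pricing).
RETRIEVAL_USD_PER_GB = 0.01   # S3 Standard-IA data retrieval (assumption)
EGRESS_USD_PER_GB = 0.09      # data transfer out to the internet (assumption)


def query_cost_usd(gb_read: float) -> float:
    """Cost of one query that reads `gb_read` GB from storage."""
    return gb_read * (RETRIEVAL_USD_PER_GB + EGRESS_USD_PER_GB)


full_download = query_cost_usd(1.0)   # gzip/zstd: entire 1 GB archive
pfc_read = query_cost_usd(0.07)       # PFC: roughly 70 MB (1-2 blocks)

print(f"gzip/zstd per query: ${full_download:.3f}")
print(f"PFC per query:       ${pfc_read:.3f}")
print(f"1,000 queries/month: ${full_download * 1000:.0f} vs ${pfc_read * 1000:.0f}")
```

Queries that run inside the same AWS region as the bucket would not pay the egress component, so the absolute numbers can be lower; the ratio between the two approaches stays the same.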

S3 Glacier Flexible Retrieval and Deep Archive require a restore step before access — those tiers are best suited for true long-term archival where queries are not expected.

Smaller archives — on every log type

PFC-JSONL is purpose-built for structured JSONL log data. It recognises the patterns that repeat across log types — field names, log levels, HTTP status codes, Kubernetes states, timestamps, IP addresses, and more — and encodes them efficiently before compression. The result is consistently better ratios than general-purpose compressors, across every log type, at 1 GB scale.

Smaller archives lower storage costs permanently on any tier. Combined with block-level queries, smaller blocks also mean less data read per lookup and lower per-query egress costs.
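
To see how the compression ratio feeds into the storage bill, here is a rough sketch using the per-GB prices from the tier table above. The 10 TB raw volume and the 10% ratio are hypothetical inputs chosen for illustration (the ratio sits inside the 5%–13% range reported here):

```python
# Per-GB monthly prices from the tier table above (AWS us-east-1 reference).
PRICE_USD_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Instant Retrieval": 0.004,
}


def monthly_storage_usd(raw_gb: float, ratio: float, tier: str) -> float:
    """Monthly cost of storing `raw_gb` of logs compressed to `ratio` of raw size."""
    return raw_gb * ratio * PRICE_USD_PER_GB_MONTH[tier]


raw_gb = 10_000  # hypothetical: 10 TB of raw JSONL
for tier in PRICE_USD_PER_GB_MONTH:
    cost = monthly_storage_usd(raw_gb, ratio=0.10, tier=tier)
    print(f"{tier}: ${cost:.2f}/month")
```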

Results across 10 enterprise log types — infrastructure, Kubernetes, API access, auth, network, streaming, ops, cloud, application, and transaction logs — all measured at 1 GB:

  • Best case: Infrastructure / System logs at 5.31% — 46% smaller than gzip-9, 58% smaller than zstd-3
  • Typical range: Kubernetes 7.97%, Auth 9.13%, Application 10.56%
  • Highest volume: API Access 12.76% — still 28% smaller than gzip, 33% smaller than zstd-3
  • PFC-JSONL wins on every log type: 28%–58% smaller than gzip, 33%–58% smaller than zstd

Full benchmark report — compress/decompress speed, DuckDB query times, all 10 log types

Already storing logs in gzip or zstd? You don't have to start from scratch.

The most common objection: "My 50 TB are already compressed on S3 — I'd have to download everything to convert."

You don't. pfc-migrate converts gzip, zstd, bzip2, and lz4 archives to PFC directly in your S3 bucket — reading and writing in-region, no egress, no local download:

pip install "pfc-migrate[all]"

# Convert an entire S3 prefix in-place — no download required
pfc-migrate s3 --bucket my-logs --prefix 2025/ \
               --out-bucket my-logs-pfc --out-prefix pfc/

The original files stay untouched. The converted .pfc archives land in the output prefix, immediately queryable with DuckDB.

Have legacy log formats? Apache CLF, nginx access logs, CSV, or old-style plain text logs — pfc-convert converts them to structured JSONL first, then to .pfc. Even if they're stored compressed:

pip install pfc-convert

# Convert Apache/nginx logs to JSONL → .pfc in one step
pfc-convert apache --input access.log.gz --output access.pfc

# NASA-95 / CLF format
pfc-convert clf --input NASA_access_log_Jul95.gz --output nasa95.pfc

For the common case — legacy format archives already sitting compressed on S3 — combine both:

  1. pfc-convert handles the schema transformation (CLF → JSONL)
  2. pfc-migrate handles the storage migration (S3 in-region, no egress)

The result: your existing archive — years of logs, in whatever format, wherever they sit — becomes a queryable PFC archive without a single byte leaving your cloud region.

1-hour query on a 30-day, 1 GB archive — measured:

| Tool | Query Time | Data Downloaded | Rows Returned |
|---|---|---|---|
| PFC-JSONL + DuckDB | 2.8–5.3 s | ~30–80 MB (1–2 blocks) | 73K–239K |
| gzip | ~70–100 s | Entire file (1 GB) | |
| zstd | ~8–10 s | Entire file (1 GB) | |

PFC reads 1–2 blocks out of 32. gzip and zstd decompress the entire archive before a single row can be returned — regardless of storage tier.


What's New in v5.6.5

v5.6.5 is a complete rewrite. Same CLI, same file format compatibility — significantly faster and more efficient under the hood.

Faster compression and decompression

Measured on the same 200 MB API access log benchmark:

| Metric | v3.4.x | v5.6.5 | Improvement |
|---|---|---|---|
| Compress speed | 22 MB/s | 40–61 MB/s | up to 2.8× faster |
| Decompress speed | 27 MB/s | 45–62 MB/s | up to 2.3× faster |
| Ratio (API access logs) | 8.98% | 8.44% | 6% better |
| Binary size | ~40 MB | 4 MB | 10× smaller |
| Runtime required | Python 3.8+ | None | drop-in binary |
| Platforms | Linux x64, macOS ARM64 | Linux x64, macOS ARM64, macOS Intel x64, Windows | 2 → 4 platforms |

A 4 MB binary. No runtime. No dependencies.

v3.4 shipped as a ~40 MB Nuitka-compiled Python bundle. v5.6.5 is a single native 4 MB binary — no Python, no pip, no virtualenv. Download, make executable, run.

Reads all existing v3.4 archives

v5.6.5 is fully backward compatible. Every .pfc file created with v3.4 is readable by v5.6.5 without conversion. Upgrade the binary, keep your archives.

Now available on macOS Intel and Windows

v3.4 ran on Linux x64 and macOS ARM64 only — the Python bundle made Windows and macOS Intel impractical.

v5.6.5 is written in Rust and cross-compiles cleanly to all major platforms:

| Platform | v3.4.x | v5.6.5 |
|---|---|---|
| Linux x86_64 | ✅ | ✅ |
| macOS ARM64 (Apple Silicon) | ✅ | ✅ |
| macOS Intel x64 | ❌ | ✅ |
| Windows x64 (native) | ❌ (WSL2 only) | ✅ |

Drop-in replacement for the entire ecosystem

The CLI is unchanged: compress, decompress, query, and seek-blocks work exactly as before. All ecosystem tools (pfc-gateway, pfc-duckdb, pfc-archiver-*, pfc-export-*, pfc-ingest-*) are compatible without any code changes.


Install

Linux x86_64

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

macOS (Apple Silicon — M1/M2/M3/M4)

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

macOS Intel (x64)

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

pfc_jsonl --help

Windows x64 (native — no WSL2 required)

# Download to a folder in your PATH, e.g. C:\bin\ (create if it doesn't exist)
Invoke-WebRequest `
  -Uri "https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe" `
  -OutFile "C:\bin\pfc_jsonl.exe"

pfc_jsonl --help

Tip: Add C:\bin to your system PATH so pfc_jsonl works from any terminal.


DuckDB Extension

Query .pfc files directly from DuckDB SQL — no intermediate decompression step:

INSTALL pfc FROM community;
LOAD pfc;
LOAD json;

-- Read all lines
SELECT line->>'$.level' AS level, line->>'$.message' AS msg
FROM read_pfc_jsonl('/path/to/events.pfc')
LIMIT 10;

-- Block-level timestamp filter: only decompress relevant blocks
-- Returns all rows from the matching blocks (block granularity ~30–40 min per block)
SELECT count(*)
FROM read_pfc_jsonl(
    '/path/to/events.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
);

-- For exact row-level filtering, add a WHERE clause:
SELECT line->>'$.level' AS level, line->>'$.timestamp' AS ts
FROM read_pfc_jsonl(
    '/path/to/events.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 10:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-01 11:00:00+00')
)
WHERE line->>'$.timestamp' >= '2026-01-01T10:00:00Z'
  AND line->>'$.timestamp' <  '2026-01-01T11:00:00Z';

How block-level filtering works: ts_from/ts_to selects which compressed blocks to decompress — blocks that have no overlap with the requested range are skipped entirely. Each block typically covers 30–40 minutes of data for a 1 GB, 1-day archive. The returned rows come from all matching blocks, which may extend slightly beyond the exact window boundary. For exact row-level time filtering, add a WHERE clause on the timestamp field as shown above.
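
The two-stage behaviour described above (coarse block selection, then an exact row filter) can be illustrated with a small in-memory sketch. The `Block` class and both helper functions are hypothetical stand-ins, not part of the extension, and the half-open window `[ts_from, ts_to)` is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Block:
    ts_min: int  # earliest timestamp in the block (epoch seconds)
    ts_max: int  # latest timestamp in the block (epoch seconds)
    rows: list   # (timestamp, payload) tuples standing in for decompressed lines


def select_blocks(blocks, ts_from, ts_to):
    """Keep blocks whose [ts_min, ts_max] range overlaps [ts_from, ts_to)."""
    return [b for b in blocks if b.ts_max >= ts_from and b.ts_min < ts_to]


def exact_rows(blocks, ts_from, ts_to):
    """Row-level filter, playing the role of the WHERE clause on the timestamp."""
    return [r for b in blocks for r in b.rows if ts_from <= r[0] < ts_to]


blocks = [
    Block(0, 1799, [(0, "a"), (1799, "b")]),
    Block(1800, 3599, [(1800, "c"), (3599, "d")]),
    Block(3600, 5399, [(3600, "e"), (5399, "f")]),
]

hit = select_blocks(blocks, ts_from=1700, ts_to=2000)
print(len(hit))                     # number of blocks that would be decompressed
print(exact_rows(hit, 1700, 2000))  # rows inside the exact window
```

In this toy data the window touches two of three blocks, so four rows are decompressed but only two survive the row-level filter. That gap is the reason for the extra WHERE clause.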

The narrower your time window, the fewer blocks are read. A 5-minute incident window on a 30-day archive typically reads 1–2 blocks out of hundreds. Even a rough window ("sometime Tuesday") reduces the read to ~3% of the archive — versus full decompression with gzip or zstd regardless.

Use pfc_jsonl info --blocks archive.pfc to see which time ranges each block covers.

Dirty input: pfc_jsonl is a byte-faithful compressor — it accepts and preserves any input including UTF-8, emoji, schema-drifted fields, and truncated lines. The DuckDB extension silently skips empty lines, non-JSON-object lines, and malformed JSON during queries (no panic, no abort). If you need to audit or count every raw line including malformed ones, use pfc_jsonl decompress and parse the output directly.
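
For the audit case, a short script over the decompressed output is usually enough. This is a sketch only: `classify_lines` is a hypothetical helper, and its four categories mirror the kinds of lines the extension is described as skipping:

```python
import json


def classify_lines(lines):
    """Count lines by category: JSON objects, empty, other JSON, malformed."""
    counts = {"object": 0, "empty": 0, "non_object": 0, "malformed": 0}
    for line in lines:
        text = line.strip()
        if not text:
            counts["empty"] += 1
            continue
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            counts["malformed"] += 1
            continue
        counts["object" if isinstance(value, dict) else "non_object"] += 1
    return counts


sample = [
    '{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR"}',
    "",
    "[1, 2, 3]",
    '{"timestamp": "2025-01-15T06:32:12Z", "level": "INF',  # truncated line
]
print(classify_lines(sample))
```

To audit a real archive, run pfc_jsonl decompress first and pass the opened output file to `classify_lines` in place of `sample`.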

Full-scan queries (no ts_from/ts_to) decompress all blocks — useful for analytics like COUNT(*) WHERE status = 404 across a full archive. RAM usage scales with file size and parallel worker count (~350 MB/worker × 6 workers = ~2 GB per query). For concurrent multi-user workloads, plan accordingly.
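
As a rough planning aid, the quoted figures translate into a ceiling on concurrent full scans for a given RAM budget. This sketch uses only the approximate numbers above and ignores every other consumer of memory on the host, so treat the result as an upper bound:

```python
MB_PER_WORKER = 350     # approximate figure quoted above
WORKERS_PER_QUERY = 6   # parallel workers per full-scan query, as quoted above


def max_concurrent_full_scans(available_ram_gb: int) -> int:
    """How many full-scan queries fit in the given RAM budget (upper bound)."""
    per_query_mb = MB_PER_WORKER * WORKERS_PER_QUERY
    return (available_ram_gb * 1024) // per_query_mb


print(max_concurrent_full_scans(16))
```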

The DuckDB extension calls pfc_jsonl as a subprocess. Install the binary first (see above). See pfc-duckdb on GitHub for manual install instructions.


Ingest — Send Data to PFC

Plug PFC-JSONL into your existing logging or metrics pipeline. All ingest tools buffer data locally, compress when the buffer is full, and optionally upload to S3.

| Tool | Protocol / Format | Port | Repo |
|---|---|---|---|
| pfc-fluentbit | Fluent Bit TCP output | | Fluent Bit → .pfc |
| pfc-vector | HTTP sink (JSON / NDJSON) | 8766 | Vector.dev → .pfc |
| pfc-telegraf | HTTP (InfluxDB line protocol + JSON) | 8767 | Telegraf → .pfc |
| pfc-otel-collector | OTLP/HTTP (logs, traces, metrics) | 4318 | OpenTelemetry → .pfc |
| pfc-kafka-consumer | Kafka / Redpanda consumer | | Kafka topic → .pfc |
| pfc-gateway | HTTP REST POST /ingest | 8765 | Any source → .pfc (+ query) |

pfc-gateway is bidirectional — it accepts ingest via POST /ingest and serves queries via POST /query. No DuckDB required.
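
The buffer-then-compress pattern shared by the ingest tools can be sketched in a few lines. `LineBuffer` is a hypothetical illustration, not code from any of the tools above; in a real pipeline `flush_fn` would write the batch to disk, compress it, and optionally upload it:

```python
class LineBuffer:
    """Collect JSONL lines and hand them to `flush_fn` once a size limit is hit."""

    def __init__(self, max_bytes, flush_fn):
        self.max_bytes = max_bytes
        self.flush_fn = flush_fn  # called with one newline-terminated batch
        self.lines = []
        self.size = 0

    def add(self, line: str):
        self.lines.append(line)
        self.size += len(line.encode("utf-8")) + 1  # +1 for the newline
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.lines:
            self.flush_fn("\n".join(self.lines) + "\n")
        self.lines, self.size = [], 0


batches = []
buf = LineBuffer(max_bytes=40, flush_fn=batches.append)
for i in range(5):
    buf.add('{"ts": %d, "msg": "x"}' % i)
buf.flush()  # flush the remainder, e.g. on shutdown
print(len(batches))
```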


Query — Read PFC Archives

DuckDB Extension

The fastest way to query .pfc archives locally — see the DuckDB Extension section above.

pfc-gateway — HTTP REST API

Query .pfc archives over HTTP without DuckDB — works with any language, curl, Grafana, or PowerBI:

# Start the gateway (points at your archive directory)
PFC_ARCHIVE_DIR=/var/lib/pfc PFC_API_KEY=secret \
  python3 pfc_gateway.py --port 8765

# Query a time range
curl -X POST http://localhost:8765/query \
  -H "x-api-key: secret" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "/var/lib/pfc/logs_20260101.pfc",
    "from_ts": "2026-01-01T10:00:00Z",
    "to_ts":   "2026-01-01T11:00:00Z"
  }'

# Query multiple files at once
curl -X POST http://localhost:8765/query/batch \
  -H "x-api-key: secret" \
  -H "Content-Type: application/json" \
  -d '{"files": ["/var/lib/pfc/logs_20260101.pfc", "/var/lib/pfc/logs_20260102.pfc"]}'

For Grafana, see pfc-grafana, the native Grafana datasource plugin. See pfc-gateway on GitHub for full documentation.
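
The same time-range query can be issued from Python with the standard library alone. This sketch mirrors the curl call above (same endpoint, headers, and JSON fields); `build_query_request` is a hypothetical helper, and since the response format is not described here the example only builds the request:

```python
import json
import urllib.request


def build_query_request(base_url, api_key, file, from_ts, to_ts):
    """Build (but do not send) a POST /query request for pfc-gateway."""
    body = json.dumps({"file": file, "from_ts": from_ts, "to_ts": to_ts})
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body.encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request(
    "http://localhost:8765", "secret",
    "/var/lib/pfc/logs_20260101.pfc",
    "2026-01-01T10:00:00Z", "2026-01-01T11:00:00Z",
)
print(req.get_method(), req.full_url)

# To send it against a running gateway:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```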


Migrate Existing Archives

Already have logs stored as gzip, zstd, bzip2, or lz4 — on disk, on S3, on Azure, or on GCS?

pfc-migrate converts them in one command, directly in your storage (no egress charges):

pip install "pfc-migrate[all]"

# Local
pfc-migrate convert --dir /var/log/archive/ --output-dir /var/log/pfc/ -v

# S3
pfc-migrate s3 --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc --out-prefix pfc/

# Azure Blob
pfc-migrate azure --container my-logs --prefix 2025/ --out-container my-logs-pfc --connection-string "..."

# GCS
pfc-migrate gcs --bucket my-logs --prefix 2025/ --out-bucket my-logs-pfc

Python Package

Use the pfc Python package (PyPI: pfc-jsonl) to compress, decompress, and query .pfc files from Python:

pip install pfc-jsonl

import pfc

pfc.compress("logs/app.jsonl", "logs/app.pfc")
pfc.query("logs/app.pfc",
          from_ts="2026-01-15T08:00:00",
          to_ts="2026-01-15T09:00:00",
          output_path="logs/morning.jsonl")

Commands

| Command | Description |
|---|---|
| pfc_jsonl compress <input> <output> | Compress JSONL → .pfc + .pfc.bidx |
| pfc_jsonl decompress <input> <output> | Full decompression |
| pfc_jsonl query <input> --from X --to Y --out <output> | Decompress blocks matching time range |
| pfc_jsonl seek-block N <input> [output] | Extract single block by index |
| pfc_jsonl seek-blocks <input> --blocks N [N...] | Extract multiple blocks (DuckDB primitive) |
| pfc_jsonl info <input> | Show block table + timestamp ranges |
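
When scripting the CLI, it can help to build the argument list in one place. The sketch below wraps the query command from the table above; `query_command` is a hypothetical helper, and passing ISO 8601 strings to --from and --to is an assumption based on the Python package example:

```python
import shlex
import subprocess


def query_command(archive, ts_from, ts_to, out_path, binary="pfc_jsonl"):
    """Build the argv for `pfc_jsonl query` as listed in the command table."""
    return [binary, "query", archive,
            "--from", ts_from, "--to", ts_to, "--out", out_path]


cmd = query_command("logs/app.pfc", "2026-01-15T08:00:00",
                    "2026-01-15T09:00:00", "logs/morning.jsonl")
print(shlex.join(cmd))

# Uncomment to run it; requires the pfc_jsonl binary on PATH.
# subprocess.run(cmd, check=True)
```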

Input Format

One JSON object per line with a timestamp field:

{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR", "service": "api", "msg": "timeout"}
{"timestamp": "2025-01-15T06:32:12Z", "level": "INFO",  "service": "db",  "msg": "query_ok"}

Supported timestamp fields: timestamp, ts, time, @timestamp (ISO 8601 or Unix epoch seconds).
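
To check ahead of time whether your records carry a usable timestamp, something like the following can normalise the supported fields to epoch seconds. It is a sketch under two assumptions: the lookup order of the fields and the treatment of timezone-less strings as UTC are illustrative choices, not documented behaviour of pfc_jsonl:

```python
import json
from datetime import datetime, timezone

TIMESTAMP_FIELDS = ("timestamp", "ts", "time", "@timestamp")


def extract_epoch(line: str):
    """Return the record's timestamp as epoch seconds, or None if absent."""
    record = json.loads(line)
    for field in TIMESTAMP_FIELDS:
        if field not in record:
            continue
        value = record[field]
        if isinstance(value, (int, float)):
            return float(value)  # already Unix epoch seconds
        # fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: UTC
        return parsed.timestamp()
    return None


print(extract_epoch('{"timestamp": "2025-01-15T06:32:11Z", "level": "ERROR"}'))
print(extract_epoch('{"ts": 1736922731, "level": "INFO"}'))
```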


How It Works

PFC divides JSONL logs into independent blocks (configurable, default 32 MiB). Each block is compressed with a BWT-based transform pipeline optimized for structured log data. Block timestamp ranges are stored in .pfc.bidx (32 bytes/block, binary).

To query a time range, only the relevant blocks are decompressed — the rest is never read. The block index enables HTTP Range requests on S3 and Glacier — fetch only the blocks you need.
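
To make the index-to-Range-request step concrete, here is a sketch built on an invented record layout. The real .pfc.bidx format is not documented in this README; the four 64-bit fields below (offset, length, ts_min, ts_max) are purely hypothetical and were chosen only because they add up to 32 bytes per block:

```python
import struct

# HYPOTHETICAL layout: four little-endian 64-bit integers = 32 bytes per block.
RECORD = struct.Struct("<QQqq")  # offset, length, ts_min, ts_max


def parse_index(data: bytes):
    """Split raw index bytes into (offset, length, ts_min, ts_max) tuples."""
    return [RECORD.unpack_from(data, i) for i in range(0, len(data), RECORD.size)]


def range_headers(entries, ts_from, ts_to):
    """One HTTP Range header value per block that overlaps the window."""
    return [
        f"bytes={off}-{off + length - 1}"
        for off, length, ts_min, ts_max in entries
        if ts_max >= ts_from and ts_min < ts_to
    ]


index = b"".join(RECORD.pack(*e) for e in [
    (0, 1000, 0, 3599),
    (1000, 1200, 3600, 7199),
    (2200, 900, 7200, 10799),
])
print(range_headers(parse_index(index), ts_from=4000, ts_to=5000))
```

The sketch emits one header per block rather than a single multi-range header because S3 serves only one byte range per GET request.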


Related Repos

Ingest

Query & Gateway

  • pfc-gateway — HTTP REST API: ingest + query, no DuckDB required
  • pfc-duckdb — DuckDB community extension for SQL queries on PFC files

Archive & Migration

SDK

  • pfc-py — Python client library (PyPI: pfc-jsonl)

License

PFC-JSONL is free for personal and open-source use.

Commercial use (production pipelines, paid services, or business operations) requires a license. Contact: info@impossibleforge.com


Built by ImpossibleForge
