feat: add pg_tiktoken_c - pure C tiktoken for PostgreSQL, 1700x faster than pg_tiktoken #18

Open
fasdwcx wants to merge 1 commit into pgsty:main from fasdwcx:add-pg_tiktoken_c

fasdwcx commented May 26, 2026

Summary

Add pg_tiktoken_c (id: 1875) to the RAG category, positioned after pg_tiktoken (1870).

What is pg_tiktoken_c?

A PostgreSQL extension that implements OpenAI's tiktoken BPE tokenizer in pure C, as a high-performance alternative to pg_tiktoken (Rust/pgrx).

Performance vs pg_tiktoken (Rust/pgrx)

Benchmark on Apple M-series · PostgreSQL 17 · cl100k_base · single connection:

| Text size | pg_tiktoken_c (C) | pg_tiktoken (Rust) | Speedup |
|---|---|---|---|
| Short (~3 tok) | 11,061 rows/s, 86 µs | 4 rows/s | 2,765× |
| Medium (~60 tok) | 6,779 rows/s, 141 µs | 4 rows/s | 1,695× |
| Long (~500 tok) | 1,202 rows/s, 810 µs | 4 rows/s | 301× |
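
For readers checking the Speedup column: assuming it is simply the ratio of the two throughput figures (the PR does not state the formula), the reported values work out as follows.

```latex
\[
\text{speedup} = \frac{\text{throughput}_{\text{C}}}{\text{throughput}_{\text{Rust}}}
\qquad
\frac{11061}{4} \approx 2765, \quad
\frac{6779}{4} \approx 1695, \quad
\frac{1202}{4} \approx 301
\]
```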

Root cause of gap: pg_tiktoken re-initialises the BPE encoder on every call (~220 ms overhead). pg_tiktoken_c caches the encoder in TopMemoryContext once per backend.
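
The caching idea can be illustrated with a minimal, standalone C sketch. This is not the extension's code: the `Encoder` struct, `encoder_build`, `encoder_get`, and the `init_calls` counter are hypothetical names, and plain `malloc` stands in for an allocation in `TopMemoryContext` so the example compiles without PostgreSQL headers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoder type; the real extension's structure is not shown in this PR. */
typedef struct Encoder {
    char name[32];
    int  vocab_size;
} Encoder;

/* Process-lifetime cache slot: one per backend, since each backend is a process. */
static Encoder *cached_encoder = NULL;

/* Counts how often the expensive build step actually runs (for illustration). */
static int init_calls = 0;

/* Stand-in for the expensive step (loading BPE ranks, building lookup tables). */
static Encoder *encoder_build(const char *name)
{
    Encoder *enc = malloc(sizeof(Encoder));
    if (enc == NULL)
        return NULL;
    strncpy(enc->name, name, sizeof(enc->name) - 1);
    enc->name[sizeof(enc->name) - 1] = '\0';
    enc->vocab_size = 0; /* placeholder; a real encoder would load its vocabulary here */
    init_calls++;
    return enc;
}

/* Return the cached encoder, building it only on first use or when the name changes. */
Encoder *encoder_get(const char *name)
{
    if (cached_encoder == NULL || strcmp(cached_encoder->name, name) != 0) {
        free(cached_encoder); /* free(NULL) is a no-op */
        cached_encoder = encoder_build(name);
    }
    return cached_encoder;
}
```

Calling `encoder_get("cl100k_base")` twice returns the same pointer and runs the build step once. Inside a real extension the allocation would presumably go through `MemoryContextAlloc(TopMemoryContext, ...)` so that the encoder outlives the per-query memory contexts; a single-slot cache is used here only to keep the sketch short.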

Features

  • tiktoken_count(encoding, text) — token counting
  • tiktoken_encode(encoding, text) — returns token ID arrays
  • chunk_text_table(text, size, overlap) — document chunking for RAG pipelines
  • All major OpenAI encodings: cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit
  • Model name aliases (gpt-4o, gpt-4, o1, etc.)
  • IMMUTABLE PARALLEL SAFE — works in indexes, generated columns, parallel queries
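
As a rough illustration of what chunking with a size and an overlap means, the sketch below computes sliding-window spans over a token sequence. It assumes that `size` and `overlap` are measured in tokens and that consecutive chunks advance by `size - overlap`; the PR does not spell out how `chunk_text_table` handles this, and `ChunkSpan` and `chunk_spans` are hypothetical names, not part of the extension.

```c
#include <stddef.h>

/* Half-open token range [start, end) covered by one chunk. */
typedef struct ChunkSpan {
    size_t start;
    size_t end;
} ChunkSpan;

/*
 * Compute sliding-window chunk spans over n_tokens tokens.
 * Writes at most max_out spans into out (out may be NULL to only count).
 * Returns the total number of chunks, or 0 for invalid arguments
 * (size == 0 or overlap >= size) or an empty input.
 */
size_t chunk_spans(size_t n_tokens, size_t size, size_t overlap,
                   ChunkSpan *out, size_t max_out)
{
    if (size == 0 || overlap >= size)
        return 0;

    size_t step = size - overlap; /* how far each new chunk advances */
    size_t count = 0;
    size_t start = 0;

    while (start < n_tokens) {
        size_t end = start + size;
        if (end > n_tokens)
            end = n_tokens; /* last chunk may be shorter */

        if (out != NULL && count < max_out) {
            out[count].start = start;
            out[count].end = end;
        }
        count++;

        if (end == n_tokens)
            break; /* input fully covered */
        start += step;
    }
    return count;
}
```

With 10 tokens, a size of 4, and an overlap of 1, this yields the spans [0, 4), [3, 7), and [6, 10), so each chunk shares one token with its predecessor.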

Entry added

1875,pg_tiktoken_c,pg_tiktoken_c,RAG,https://github.com/relytcloud/pg_tiktoken_c,Apache-2.0,,1.1,NONE,C,f,t,t,t,f,f,f,,"{17,16,15,14,13}",,,,,,,,,,,,tiktoken tokenizer for PostgreSQL in pure C, 1700x faster than pg_tiktoken (Rust/pgrx),纯C实现的PostgreSQL tiktoken分词器,比Rust版本快1700倍,支持RAG文档切分与Token计数,
  • repo: NONE — source-only install for now; happy to update once packages are available
  • lang: C — pure C, no pgrx/Rust dependency
  • pg_ver: 13-17 — tested on all supported versions

GitHub: https://github.com/relytcloud/pg_tiktoken_c
License: Apache 2.0

…n Rust version

- Pure C implementation of OpenAI tiktoken BPE tokenizer for PostgreSQL
- 1700x faster than pg_tiktoken (Rust/pgrx) for typical inputs
- Encoder cached in TopMemoryContext per backend (no per-call init overhead)
- Supports tiktoken_count(), tiktoken_encode(), chunk_text_table() for RAG
- Compatible with PostgreSQL 13-17, Apache 2.0 license
- GitHub: https://github.com/relytcloud/pg_tiktoken_c