Companion WordPress plugin that adds DuckDB (embedded) and MotherDuck (cloud) as alternative vector stores for MxChat — an open-source, SQL-native replacement for Pinecone.
MxChat is a popular AI-chatbot plugin for WordPress that ships with two storage backends for its vector knowledge base:
- MySQL — embeddings serialized in `LONGTEXT` columns; cosine similarity computed in PHP. Simple but slow past a few thousand entries.
- Pinecone — fast and managed, but a proprietary SaaS with per-record pricing.
This plugin adds a third option:
- DuckDB / MotherDuck — analytical columnar database with native VSS (vector similarity search) extension. Open-source, runs locally or in the cloud, $0 for the embedded mode.
- 🪶 Two backend modes — embedded `.duckdb` file or MotherDuck cloud (via native DuckDB `ATTACH`), switchable at runtime.
- ⚡ HNSW-indexed similarity search via DuckDB's VSS extension (`array_cosine_similarity`) — works on the embedded backend, and via the optional local mirror for MotherDuck installs (see docs/MIRROR.md).
- 🪞 Local mirror for MotherDuck (v0.10.0+) — synchronous write-through to a local `.duckdb` shadow with HNSW. MotherDuck stays the canonical store; reads come from local for HNSW acceleration. Resumable bootstrap, drift detection, automatic drain of failed writes.
- 🔀 Hybrid BM25 + vector retrieval (optional) via DuckDB's FTS extension with min-max-normalised score blending.
- 💨 Query result cache keyed by embedding hash + filter + bot — slashes MotherDuck cost and latency on repeat queries.
- 🎯 Per-source dedup + custom reranker hook so the LLM sees diverse, high-quality context.
- 🗜️ INT8 quantization (experimental, opt-in) — 4× smaller vector storage, < 1 % recall loss on unit-normalised embeddings.
- 🔌 Drop-in for Pinecone — implements the Pinecone wire protocol over REST, so MxChat needs zero modifications to use it.
- 🔐 Per-namespace REST tokens so leaking one bot's API key doesn't compromise others.
- 🪛 Optional upstream patch for direct in-process integration (eliminates one HTTP round-trip; ~12 lines, see patches/).
- 🔁 Four ingestion paths — bulk-sync from MySQL, sync reprocess from WordPress posts, async reprocess via Action Scheduler (survives PHP timeouts on large catalogs), and one-shot Pinecone → DuckDB migration without re-embedding.
- 📦 Parquet export/import — portable backups and seamless moves between embedded ⇄ MotherDuck via DuckDB's native `COPY`.
- 🧰 Auto cascade-delete with nonce-verified handler; orphan compactor cron sweeps stragglers.
- 🩺 `/health` endpoint + rolling p50/p95/p99 latency metrics for external monitors.
- 🛠️ WP-CLI: `wp mxchat-duckdb {test|stats|sync|reprocess|async-reprocess|compact|metrics|cache|export|import|migrate-from-pinecone}`.
- 🧪 CI on every PR — `php -l` matrix (PHP 8.0–8.3), `msgfmt` catalog check, PHPStan, PHPUnit smoke suite.
- 🕒 Hourly WP-cron for incremental sync of new content + daily orphan compaction.
- 🛡️ Per-user-role access control preserved from MxChat (metadata-driven).
- 🌐 i18n-ready — English source strings, French translation shipped, `.pot` template for additional locales.
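
The hybrid retrieval item in the list above blends BM25 and vector scores after min-max normalisation. The Python sketch below illustrates that idea only: the function names, the `alpha` weight, and scoring a document as 0 when it is missing from one result set are all assumptions, not the plugin's actual PHP implementation.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant map becomes all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend(vector_scores, bm25_scores, alpha=0.5):
    """Weighted sum of normalised scores; alpha is a hypothetical vector weight."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    docs = set(v) | set(b)
    # Assumption: a document absent from one result set contributes 0 there.
    blended = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Cosine similarities and raw BM25 scores live on different scales.
ranked = blend({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "c": 3.0, "d": 7.5})
```

Normalising first matters because BM25 scores are unbounded while cosine similarity is not; without it the larger-scale signal would dominate the sum.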
The plugin connects to MxChat via two parallel integration paths (Option A — filter override; Option B — Pinecone wire-protocol proxy). Both are registered unconditionally; whichever's prerequisite is present at runtime wins.
See ARCHITECTURE.md for the full integration flowchart, a sequence diagram of the query lifecycle (cache → vector → BM25 → dedup → rerank → metrics), the file layout, and the design conventions contributors should follow.
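
The first stage of that lifecycle is the query result cache, which the feature list describes as keyed by embedding hash + filter + bot. One way such a key could be derived is sketched below in Python. The hashing scheme, the float32 packing, and the field order are illustrative assumptions; see ARCHITECTURE.md for how the plugin actually does it.

```python
import hashlib
import json
import struct


def cache_key(embedding, metadata_filter, bot_id):
    """Build a deterministic key from the query embedding, filter, and bot."""
    # Pack as little-endian float32 so equal vectors always hash identically.
    packed = struct.pack(f"<{len(embedding)}f", *embedding)
    emb_hash = hashlib.sha256(packed).hexdigest()
    # sort_keys makes the serialised filter independent of dict ordering.
    filter_json = json.dumps(metadata_filter, sort_keys=True, separators=(",", ":"))
    raw = f"{emb_hash}|{filter_json}|{bot_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


k1 = cache_key([0.1, 0.2, 0.3], {"role": "public", "type": "post"}, "bot-1")
k2 = cache_key([0.1, 0.2, 0.3], {"type": "post", "role": "public"}, "bot-1")
k3 = cache_key([0.1, 0.2, 0.3], {"role": "public", "type": "post"}, "bot-2")
```

Here `k1` and `k2` match because only the key order of the filter differs, while `k3` differs because the bot does, so one bot's cached results are never served to another.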
| Component | Version |
|---|---|
| PHP | ≥ 8.0 |
| WordPress | ≥ 6.0 |
| MxChat (`mxchat-basic`) | ≥ 3.2.5 |
| Site protocol | HTTPS (required for Option B; MxChat hardcodes https:// when calling Pinecone) |
Both backends rely on a local DuckDB process — either the PECL `duckdb` PHP extension (preferred, in-process) or the `duckdb` CLI binary (auto-detected in `/usr/local/bin`, `/usr/bin`, `/opt/homebrew/bin`, or set explicitly in plugin settings).
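
The CLI auto-detection described above can be pictured as a search over a short candidate list. The Python sketch below uses the three directories named in this README; the function name, the precedence given to the explicit setting, and the fallback to the default directories when that setting is invalid are assumptions for illustration.

```python
import os

# Directories listed in this README for auto-detection.
DEFAULT_DIRS = ["/usr/local/bin", "/usr/bin", "/opt/homebrew/bin"]


def find_duckdb_cli(explicit_path=None, search_dirs=DEFAULT_DIRS):
    """Return the first executable duckdb binary found, or None."""
    # Assumption: an explicitly configured path is tried before the defaults.
    candidates = [explicit_path] if explicit_path else []
    candidates += [os.path.join(d, "duckdb") for d in search_dirs]
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

A `None` result corresponds to the case where neither the extension nor the CLI is usable and the backend cannot connect.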
For MotherDuck: a token from app.motherduck.com. MotherDuck mode is a thin wrapper around the local DuckDB process — it runs `INSTALL motherduck; LOAD motherduck; ATTACH 'md:<db>?motherduck_token=…'` at connect time. There is no HTTP-only path: SQL is shipped through DuckDB's native protocol. With the CLI fallback, each query re-attaches; for any production traffic, install the PECL extension.
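
To make the CLI re-attach cost concrete, the Python sketch below assembles the statement sequence quoted above and shows why every CLI invocation has to repeat it. The helper names are hypothetical, and the token is interpolated verbatim only for illustration; a real implementation would need to validate or escape both values.

```python
def motherduck_preamble(database, token):
    """Statements run at connect time, in the order this README gives them."""
    # Illustration only: no validation or escaping is performed here.
    return [
        "INSTALL motherduck;",
        "LOAD motherduck;",
        f"ATTACH 'md:{database}?motherduck_token={token}';",
    ]


def cli_script(database, token, query):
    """Each CLI call is a fresh process, so the preamble precedes every query."""
    statements = motherduck_preamble(database, token) + [query.rstrip(";") + ";"]
    return "\n".join(statements)


script = cli_script("my_db", "TOKEN_PLACEHOLDER", "SELECT 1")
```

With the in-process extension the preamble runs once per connection; with the CLI it runs once per query, which is where the extra latency comes from.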
```bash
cd wp-content/plugins/
git clone https://github.com/paulargoud/mxchat-duckdb.git
```

Then activate MxChat DuckDB / MotherDuck in the WordPress plugins screen (after MxChat itself).
- Download the latest `mxchat-duckdb-x.y.z.zip` from the Releases page.
- Plugins → Add New → Upload Plugin → choose the zip.
- Activate.
- Go to MxChat → DuckDB / MotherDuck in the WordPress admin.
- Choose a backend:
  - MotherDuck — paste your token + database name.
  - Embedded — leave the path empty for the default (`wp-content/uploads/mxchat-duckdb-private/store.duckdb`, protected by an auto-generated `.htaccess` + `index.php` + `web.config`).
- Click Test connection to verify.
- Choose an ingestion strategy:
  - Sync MySQL → DuckDB — copies the existing `wp_mxchat_system_prompt_content` table. Use this if MxChat has been running in MySQL mode and the table contains embeddings.
  - Reprocess all posts — walks published WordPress posts/pages and runs them through MxChat's full ingestion pipeline (chunking + embedding + upsert). Recommended for installs that have been on Pinecone-only.
- (Optional) Apply `patches/README.md` to enable the faster Option A integration.
⚠️ Reprocessing calls the embedding API configured in MxChat (OpenAI / Voyage / Gemini), which may incur usage costs. Typical cost: a few cents for 100–500 posts on `text-embedding-3-small`.
| Doc | What's in it |
|---|---|
| ARCHITECTURE.md | How the plugin wires into MxChat (flowchart), the query lifecycle (sequence diagram), file layout, design conventions for contributors. |
| docs/CONFIGURATION.md | Every option in mxchat_duckdb_options, sidecar options, where data is stored, dimension/storage change guards. |
| docs/HOOKS.md | Every filter and action the plugin exposes, with signatures and PHP examples. |
| docs/CLI.md | Full wp mxchat-duckdb reference with sample output. |
| docs/USAGE.md | Howtos: async reprocess, Pinecone migration, Parquet backup/restore, INT8 quantization, /health endpoint, end-to-end verification. |
| docs/MIRROR.md | Local mirror for MotherDuck installs (v0.10.0+): when to enable, status states, troubleshooting, WP-CLI commands, disk + cost considerations. |
| docs/BACKUP.md | Backup + restore workflow (Parquet export, filesystem snapshot, cross-environment moves, disaster-recovery checklist). |
| CHANGELOG.md | Release history. |
| CONTRIBUTING.md | How to file a bug, send a PR, run the test suite. |
- Import-from-Pinecone tool — shipped in v0.4.0 (`wp mxchat-duckdb migrate-from-pinecone`)
- Submit the upstream patch (`mxchat_pre_vector_query` filter, WP-canonical `pre_*` convention) to MxChat
- Migrate Option B users to Option A automatically once the filter ships
- PDF / attachment reprocessing (currently only post types are covered)
- Per-bot configuration UI (multi-bot installs)
- Built-in cross-encoder reranker (Cohere Rerank / BGE-reranker) plugged into the `mxchat_duckdb_rerank_matches` hook
- Native DuckDB extension binding when the PECL extension API stabilizes
- Bench suite comparing query latency: MySQL-PHP vs Pinecone vs DuckDB embedded vs MotherDuck
- Shared hosting: the PECL `duckdb` extension is rarely available; falls back to invoking the CLI via `proc_open()`, which may be disabled by some hosts. CLI mode adds ~50–200 ms of process-spawn latency per query.
- MotherDuck + CLI: each query re-runs `ATTACH 'md:…'`, adding 1–3 s of network handshake. Acceptable for low-traffic admin tasks; install the PECL extension for any production chatbot traffic.
- Embedding dimension must match the model active in MxChat. The settings page shows the detected dimension; the plugin now blocks `embedding_dim` changes when the table already contains vectors — you must wipe and re-sync to switch models.
- Direct SQL writes to `wp_mxchat_system_prompt_content` (outside MxChat's UI) won't propagate to DuckDB until the next incremental cron tick.
- HNSW + multi-tenant `bot_id` filter: DuckDB VSS does not push down arbitrary `WHERE` clauses into the HNSW index, so queries scoped by `bot_id` fall back to a brute-force scan. Single-tenant installs use the index as expected.
- HNSW on MotherDuck cloud: MotherDuck cloud does not currently support the VSS extension (source). Two ways out, depending on the deployment shape:
  - (Recommended for > 100k vectors) Enable the local mirror (v0.10.0+). The plugin maintains a local `.duckdb` shadow with HNSW indexed; MotherDuck stays the canonical write target and reads route to local. See docs/MIRROR.md.
  - Or switch to the embedded backend for a single-server install that doesn't need MotherDuck's multi-server access.

  Without either, queries run as brute-force `array_cosine_similarity` scans — fine under ~100k vectors but slow beyond. The plugin surfaces an admin notice in that combination.
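
The brute-force fallback mentioned in the last two limitations amounts to scoring every stored vector against the query. The pure-Python sketch below shows why its cost grows linearly with table size. The row layout and function names are hypothetical; in the plugin the equivalent work is done in SQL by DuckDB's `array_cosine_similarity`.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(query, rows, k=3, bot_id=None):
    """Score every row that passes the bot filter, then keep the k best."""
    scored = (
        (cosine_similarity(query, vec), row_id)
        for row_id, row_bot, vec in rows
        if bot_id is None or row_bot == bot_id
    )
    return heapq.nlargest(k, scored)


# Hypothetical rows: (id, bot_id, embedding).
rows = [
    ("p1", "bot-a", [1.0, 0.0]),
    ("p2", "bot-a", [0.6, 0.8]),
    ("p3", "bot-b", [0.0, 1.0]),
]
top = brute_force_top_k([1.0, 0.0], rows, k=2, bot_id="bot-a")
```

Every query touches every candidate row, which is acceptable for small tables and is the reason an HNSW index (embedded backend or local mirror) pays off as the table grows.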
See CONTRIBUTING.md for the full guide. TL;DR: PHP 8.0+, run `php -l` on changed files, update CHANGELOG.md under `## [Unreleased]`, run translatable strings through `__()` and re-compile `mxchat-duckdb-fr_FR.mo`.
GPLv2 or later, same as MxChat itself.
- MxChat — the chatbot plugin this companion extends.
- DuckDB and the VSS extension.
- MotherDuck for the hosted DuckDB experience.