Skip to content

PaulArgoud/mxchat-duckdb

MxChat DuckDB / MotherDuck

Plugin version PHP WordPress MxChat DuckDB MotherDuck License: GPL v2+ Status: beta CI

Companion WordPress plugin that adds DuckDB (embedded) and MotherDuck (cloud) as alternative vector stores for MxChat — an open-source, SQL-native replacement for Pinecone.


Why

MxChat is a popular AI-chatbot plugin for WordPress that ships with two storage backends for its vector knowledge base:

  1. MySQL — embeddings serialized in LONGTEXT columns; cosine similarity computed in PHP. Simple but slow past a few thousand entries.
  2. Pinecone — fast and managed, but a proprietary SaaS with per-record pricing.

This plugin adds a third option:

  1. DuckDB / MotherDuck — analytical columnar database with native VSS (vector similarity search) extension. Open-source, runs locally or in the cloud, $0 for the embedded mode.

Features

  • 🪶 Two backend modes — embedded .duckdb file or MotherDuck cloud (via native DuckDB ATTACH), switchable at runtime.
  • HNSW-indexed similarity search via DuckDB's VSS extension (array_cosine_similarity) — works on the embedded backend, and via the optional local mirror for MotherDuck installs (see docs/MIRROR.md).
  • 🪞 Local mirror for MotherDuck (v0.10.0+) — synchronous write-through to a local .duckdb shadow with HNSW. MotherDuck stays the canonical store; reads come from local for HNSW acceleration. Resumable bootstrap, drift detection, automatic drain of failed writes.
  • 🔀 Hybrid BM25 + vector retrieval (optional) via DuckDB's FTS extension with min-max-normalised score blending.
  • 💨 Query result cache keyed by embedding hash + filter + bot — slashes MotherDuck cost and latency on repeat queries.
  • 🎯 Per-source dedup + custom reranker hook so the LLM sees diverse, high-quality context.
  • 🗜️ INT8 quantization (experimental, opt-in) — 4× smaller vector storage, < 1 % recall loss on unit-normalised embeddings.
  • 🔌 Drop-in for Pinecone — implements the Pinecone wire protocol over REST, so MxChat needs zero modifications to use it.
  • 🔐 Per-namespace REST tokens so leaking one bot's API key doesn't compromise others.
  • 🪛 Optional upstream patch for direct in-process integration (eliminates one HTTP round-trip; ~12 lines, see patches/).
  • 🔁 Four ingestion paths — bulk-sync from MySQL, sync reprocess from WordPress posts, async reprocess via Action Scheduler (survives PHP timeouts on large catalogs), and one-shot Pinecone → DuckDB migration without re-embedding.
  • 📦 Parquet export/import — portable backups and seamless moves between embedded ⇄ MotherDuck via DuckDB's native COPY.
  • 🧰 Auto cascade-delete with nonce-verified handler; orphan compactor cron sweeps stragglers.
  • 🩺 /health endpoint + rolling p50/p95/p99 latency metrics for external monitors.
  • 🛠️ WP-CLI: wp mxchat-duckdb {test|stats|sync|reprocess|async-reprocess|compact|metrics|cache|export|import|migrate-from-pinecone}.
  • 🧪 CI on every PRphp -l matrix (PHP 8.0–8.3), msgfmt catalog check, PHPStan, PHPUnit smoke suite.
  • 🕒 Hourly WP-cron for incremental sync of new content + daily orphan compaction.
  • 🛡️ Per-user-role access control preserved from MxChat (metadata-driven).
  • 🌐 i18n-ready — English source strings, French translation shipped, .pot template for additional locales.

Architecture

The plugin connects to MxChat via two parallel integration paths (Option A — filter override; Option B — Pinecone wire-protocol proxy). Both are registered unconditionally; whichever's prerequisite is present at runtime wins.

See ARCHITECTURE.md for the full integration flowchart, a sequence diagram of the query lifecycle (cache → vector → BM25 → dedup → rerank → metrics), the file layout, and the design conventions contributors should follow.

Requirements

Component Version
PHP ≥ 8.0
WordPress ≥ 6.0
MxChat (mxchat-basic) ≥ 3.2.5
Site protocol HTTPS (required for Option B; MxChat hardcodes https:// when calling Pinecone)

Both backends rely on a local DuckDB process — either the PECL duckdb PHP extension (preferred, in-process) or the duckdb CLI binary (auto-detected in /usr/local/bin, /usr/bin, /opt/homebrew/bin, or set explicitly in plugin settings).

For MotherDuck: a token from app.motherduck.com. MotherDuck mode is a thin wrapper around the local DuckDB process — it runs INSTALL motherduck; LOAD motherduck; ATTACH 'md:<db>?motherduck_token=…' at connect time. There is no HTTP-only path: SQL is shipped through DuckDB's native protocol. With CLI fallback, each query re-attaches; for any production traffic, install the PECL extension.

Installation

From source

cd wp-content/plugins/
git clone https://github.com/paulargoud/mxchat-duckdb.git

Then activate MxChat DuckDB / MotherDuck in the WordPress plugins screen (after MxChat itself).

From release zip

  1. Download the latest mxchat-duckdb-x.y.z.zip from the Releases page.
  2. Plugins → Add New → Upload Plugin → choose the zip.
  3. Activate.

Quick start

  1. Go to MxChat → DuckDB / MotherDuck in the WordPress admin.
  2. Choose a backend:
    • MotherDuck — paste your token + database name.
    • Embedded — leave the path empty for the default (wp-content/uploads/mxchat-duckdb-private/store.duckdb, protected by an auto-generated .htaccess + index.php + web.config).
  3. Click Test connection to verify.
  4. Choose an ingestion strategy:
    • Sync MySQL → DuckDB — copies the existing wp_mxchat_system_prompt_content table. Use this if MxChat has been running in MySQL mode and the table contains embeddings.
    • Reprocess all posts — walks published WordPress posts/pages and runs them through MxChat's full ingestion pipeline (chunking + embedding + upsert). Recommended for installs that have been on Pinecone-only.
  5. (Optional) Apply patches/README.md to enable the faster Option A integration.

⚠️ Reprocessing calls the embedding API configured in MxChat (OpenAI / Voyage / Gemini), which may incur usage costs. Typical cost: a few cents for 100–500 posts on text-embedding-3-small.

Documentation

Doc What's in it
ARCHITECTURE.md How the plugin wires into MxChat (flowchart), the query lifecycle (sequence diagram), file layout, design conventions for contributors.
docs/CONFIGURATION.md Every option in mxchat_duckdb_options, sidecar options, where data is stored, dimension/storage change guards.
docs/HOOKS.md Every filter and action the plugin exposes, with signatures and PHP examples.
docs/CLI.md Full wp mxchat-duckdb reference with sample output.
docs/USAGE.md Howtos: async reprocess, Pinecone migration, Parquet backup/restore, INT8 quantization, /health endpoint, end-to-end verification.
docs/MIRROR.md Local mirror for MotherDuck installs (v0.10.0+): when to enable, status states, troubleshooting, WP-CLI commands, disk + cost considerations.
docs/BACKUP.md Backup + restore workflow (Parquet export, filesystem snapshot, cross-environment moves, disaster-recovery checklist).
CHANGELOG.md Release history.
CONTRIBUTING.md How to file a bug, send a PR, run the test suite.

Roadmap

  • Import-from-Pinecone tool — shipped in v0.4.0 (wp mxchat-duckdb migrate-from-pinecone)
  • Submit the upstream patch (mxchat_pre_vector_query filter, WP-canonical pre_* convention) to MxChat
  • Migrate Option B users to Option A automatically once the filter ships
  • PDF / attachment reprocessing (currently only post types are covered)
  • Per-bot configuration UI (multi-bot installs)
  • Built-in cross-encoder reranker (Cohere Rerank / BGE-reranker) plugged into the mxchat_duckdb_rerank_matches hook
  • Native DuckDB extension binding when the PECL extension API stabilizes
  • Bench suite comparing query latency: MySQL-PHP vs Pinecone vs DuckDB embedded vs MotherDuck

Limitations

  • Shared hosting: the PECL duckdb extension is rarely available; falls back to invoking the CLI via proc_open(), which may be disabled by some hosts. CLI mode adds ~50–200 ms of process-spawn latency per query.
  • MotherDuck + CLI: each query re-runs ATTACH 'md:…', adding 1–3 s of network handshake. Acceptable for low-traffic admin tasks; install the PECL extension for any production chatbot traffic.
  • Embedding dimension must match the model active in MxChat. The settings page shows the detected dimension; the plugin now blocks embedding_dim changes when the table already contains vectors — you must wipe and re-sync to switch models.
  • Direct SQL writes to wp_mxchat_system_prompt_content (outside MxChat's UI) won't propagate to DuckDB until the next incremental cron tick.
  • HNSW + multi-tenant bot_id filter: DuckDB VSS does not push down arbitrary WHERE clauses into the HNSW index, so queries scoped by bot_id fall back to a brute-force scan. Single-tenant installs use the index as expected.
  • HNSW on MotherDuck cloud: MotherDuck cloud does not currently support the VSS extension (source). Two ways out, depending on the deployment shape:
    • (Recommended for > 100k vectors) Enable the local mirror (v0.10.0+). The plugin maintains a local .duckdb shadow with HNSW indexed; MotherDuck stays the canonical write target and reads route to local. See docs/MIRROR.md.
    • Or switch to the embedded backend for a single-server install that doesn't need MotherDuck's multi-server access. Without either, queries run as brute-force array_cosine_similarity scans — fine under ~100k vectors but slow beyond. The plugin surfaces an admin notice in that combination.

Contributing

See CONTRIBUTING.md for the full guide. TL;DR: PHP 8.0+, run php -l on changed files, update CHANGELOG.md under ## [Unreleased], run translatable strings through __() and re-compile mxchat-duckdb-fr_FR.mo.

License

GPLv2 or later, same as MxChat itself.

Acknowledgements

About

WordPress companion plugin: DuckDB / MotherDuck as a SQL-native vector store for MxChat (replacement for Pinecone).

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors