
Add changes for compatibility with WASM components and collocated UDF servers #121

Merged: kesmit13 merged 49 commits into main from wasm-compat on Jun 2, 2026
kesmit13 (Collaborator) commented Apr 1, 2026

This PR makes several changes to allow singlestoredb to work in WASM environments. Many of these changes, such as lazy loading of numpy, pandas, polars, and pyarrow, benefit standard installations as well. Others relocate imports that are needed only in certain environments and not within WASM.
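The lazy-loading idea can be sketched roughly as follows. This is only an illustration: `lazy_import` and `get_json` are hypothetical names (the PR mentions a `get_numpy` getter and `singlestoredb/utils/_lazy_import.py`, but their actual code is not shown here), and a stdlib module stands in for numpy so the sketch runs anywhere. Catching `OSError` alongside `ImportError` follows a later commit in this PR.

```python
import functools
import importlib
from types import ModuleType
from typing import Optional


@functools.lru_cache(maxsize=None)
def lazy_import(name: str) -> Optional[ModuleType]:
    """Import `name` on first use and cache the result (hypothetical helper)."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # OSError is caught as well because, per this PR, WASM/WASI
        # environments may not raise ImportError for a missing optional dep.
        return None


def get_json() -> Optional[ModuleType]:
    # Stand-in for a getter such as get_numpy(); nothing is imported until
    # the first call, so importing the package itself stays cheap.
    return lazy_import('json')
```

Callers would check the result for `None` at the point of use rather than failing at package import time.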

A new collocated UDF server implementation is also included that uses a high-performance loop in the C extension to parse and call Python functions on each row. This function is used by both standard collocated servers and WASM-based UDF handlers.


Note

High Risk
New network-facing UDF server with dynamic @@register (exec of user code) and large C-extension protocol changes on the UDF hot path; release workflow changes affect artifact publishing.

Overview
This PR extends WASM/collocated UDF support and adds a plugin-mode UDF server that talks to SingleStore over a Unix socket, aligned with the existing Rust wasm-udf-server protocol.

C extension (accel.c) — ROWDAT_1 gains full decimal, date, time, and datetime encode/decode (with safer memcpy/bounds checks), plus call_function_accel (parse → call Python → serialize in one path). New helpers mmap_read, mmap_write, and recv_exact speed mmap/socket I/O on Unix (stubs elsewhere).

Plugin server (singlestoredb/functions/ext/plugin/) — New python-udf-server CLI (pyproject.toml): loads a plugin module, serves UDFs on a Unix socket with thread or pre-fork process pools, mmap request/response I/O, and control signals (@@health, @@functions, @@register, @@delete) with live reload via a generation-based registry.

Packaging & docs — README/ARCHITECTURE/CONTRIBUTING document plugin mode, pytest/docker extras, and release workflows now tie tag pushes (prerelease tags) to GitHub release create/upload as well as published releases.

Core library — Lazy numpy via get_numpy / dtype map getters; VECTOR() SQL helper enabled; JSON UDF encoding for temporal/decimal types; jwt imported lazily in auth.py; Connection._iquery normalizes more result shapes without requiring pandas at import time; .gitignore tightens root test*.py ignores. resources/build_wasm.sh added for WASI wheel builds.

Reviewed by Cursor Bugbot for commit 25281b4.

Copilot AI left a comment

Pull request overview

This PR introduces WASM-compatibility improvements (primarily by lazy-loading heavyweight optional dependencies and moving environment-specific imports into call sites) and adds a new collocated Python UDF server implementation, including a new C-extension hot path to accelerate rowdat_1 decode → Python call → rowdat_1 encode.

Changes:

  • Added a WIT interface definition and a WASM build helper script for external UDF component workflows.
  • Refactored optional dependency handling (numpy/pandas/polars/pyarrow, IPython, JWT) to be more robust in constrained/WASM-like environments.
  • Added a new collocated UDF server (socket + mmap protocol, thread/process modes, dynamic registration) and a C-extension accelerator entry point (call_function_accel).

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `wit/udf.wit` | Defines the external UDF WIT interface and exported world. |
| `singlestoredb/utils/_lazy_import.py` | Adds cached lazy imports for heavy optional deps. |
| `singlestoredb/utils/dtypes.py` | Converts dtype maps to lazily-evaluated, cached getters. |
| `singlestoredb/utils/results.py` | Switches result formatting to lazy imports + cached type maps. |
| `singlestoredb/utils/events.py` | Broadens IPython import failure handling. |
| `singlestoredb/converters.py` | Uses lazy numpy import in vector converters. |
| `singlestoredb/connection.py` | Adjusts internal result-to-dict conversion to avoid importing pandas. |
| `singlestoredb/mysql/connection.py` | Adds WASM-friendly DEFAULT_USER detection (handles OSError). |
| `singlestoredb/auth.py` | Moves jwt import into call site. |
| `singlestoredb/management/utils.py` | Moves jwt import into call sites for WASM-friendliness. |
| `singlestoredb/management/manager.py` | Moves jwt import into is_jwt call site. |
| `singlestoredb/functions/dtypes.py` | Updates exports to use dtype-map getter functions. |
| `singlestoredb/functions/ext/rowdat_1.py` | Replaces eager dtype maps with lazy getter functions. |
| `singlestoredb/functions/ext/json.py` | Replaces eager dtype maps with lazy getter functions. |
| `singlestoredb/functions/ext/collocated/*` | Adds collocated server, protocol handling, registry, control signals, and WASM adapter. |
| `singlestoredb/tests/test_connection.py` | Makes pandas string dtype assertions version-tolerant. |
| `resources/build_wasm.sh` | Adds a build helper for wasm32-wasip2 wheels. |
| `pyproject.toml` | Adds python-udf-server CLI entry point. |
| `accel.c` | Adds call_function_accel C hot path and exports it from the extension module. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.


kesmit13 added a commit that referenced this pull request Apr 2, 2026
accel.c:
- Replace empty TODO type stubs with NotImplementedError raises
- Add CHECK_REMAINING macro for bounds checking on buffer reads
- Replace unaligned pointer-cast reads with memcpy for WASM/ARM safety
- Fix double-decref in output error paths (set to NULL before goto)
- Fix Py_None reference leak by removing pre-switch INCREF
- Fix MYSQL_TYPE_NULL consuming an extra byte from next column
- Add PyErr_Format in default switch cases
- Add PyErr_Occurred() checks after PyLong/PyFloat conversions

Python:
- Align list/tuple multi-return handling in registry.py with C path
- Add _write_all_fd helper for partial os.write() handling
- Harden handshake recvmsg: name length bound, ancdata validation,
  MSG_CTRUNC check, FD cleanup on error
- Wrap get_context('fork') with platform safety error
- Narrow events.py exception catch to (ImportError, OSError)
- Fix _iquery DataFrame check ordering (check before list())
- Expand setblocking(False) warning comment
- Update WIT and wasm.py docstrings for code parameter
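
The partial-write handling mentioned above could look roughly like this. It is a minimal sketch, not the PR's `_write_all_fd`; the name `write_all_fd` and its signature are assumptions made for illustration.

```python
import os


def write_all_fd(fd: int, data: bytes) -> None:
    """Write all of `data` to `fd`, looping over short writes (sketch)."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write() may accept fewer bytes than requested, for example
        # when a pipe or socket buffer is nearly full.
        written = os.write(fd, view)
        view = view[written:]
```

Slicing a `memoryview` avoids copying the remaining payload on every iteration.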

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kesmit13 and others added 23 commits June 1, 2026 15:39
Extend JSONEncoder.default() to handle datetime, date, timedelta, and
Decimal types that were missing from the JSON format path. These types
were added to the ROWDAT_1 binary format but the JSON encoder only
handled bytes→base64. Without this, json.dumps() raises TypeError when
a UDF returns any of these types via FORMAT JSON.
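
A rough sketch of such an encoder is shown below. The class name `UDFJSONEncoder` and the chosen string representations (ISO 8601 for dates, `str()` for timedelta and Decimal) are assumptions for illustration; the commit message does not state the actual wire format.

```python
import base64
import datetime
import decimal
import json


class UDFJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder covering the types named in the commit."""

    def default(self, obj):
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode('ascii')
        # datetime.datetime is a subclass of datetime.date, so one check
        # covers both; isoformat() output is an assumed representation.
        if isinstance(obj, datetime.date):
            return obj.isoformat()
        if isinstance(obj, (datetime.timedelta, decimal.Decimal)):
            return str(obj)
        # Anything else still raises TypeError, as json.dumps() does by default.
        return super().default(obj)
```

It would be used as `json.dumps(row, cls=UDFJSONEncoder)`.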

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a TimeoutError occurs mid-message, the function removes the socket
timeout to avoid protocol desync. Previously the timeout was never
restored, causing the caller's shutdown-polling loop to block
indefinitely on subsequent reads. Now the original timeout is saved
and restored on both EOF and successful completion.
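
The save-and-restore pattern described here might be sketched as follows. The function name `recv_exact` and its return convention (`None` on EOF) are assumptions; this sketch restores the timeout in a `finally` block, which also covers error paths beyond the EOF and success cases the commit mentions.

```python
import socket
from typing import Optional


def recv_exact(sock: socket.socket, n: int) -> Optional[bytes]:
    """Read exactly n bytes, or return None on EOF (hypothetical sketch)."""
    buf = bytearray()
    original_timeout = sock.gettimeout()
    try:
        while len(buf) < n:
            try:
                chunk = sock.recv(n - len(buf))
            except socket.timeout:
                if not buf:
                    raise  # nothing consumed yet, so the caller can poll again
                # Mid-message: drop the timeout and block for the remainder,
                # otherwise the stream would be left half-read (desync).
                sock.settimeout(None)
                continue
            if not chunk:
                return None  # peer closed the connection
            buf += chunk
        return bytes(buf)
    finally:
        # Restore the caller's timeout so its shutdown-polling loop keeps working.
        sock.settimeout(original_timeout)
```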

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename the `singlestoredb.functions.ext.collocated` package to
`singlestoredb.functions.ext.plugin` and update all related naming:

- CLI args: --extension -> --plugin-name, --extension-path -> --search-path
- Env vars: EXTERNAL_UDF_* -> PLUGIN_*
- Class: FunctionHandler -> Plugin
- WIT: singlestore:udf/function-handler -> singlestore:plugin/plugin
- WIT world: external-udf -> plugin-server
- pyproject.toml entry point updated to new module path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ARCHITECTURE.md with plugin/ subpackage in the package tree,
three-mode execution diagram, Plugin Server CLI section with all 7
arguments and env variables, and Appendix A/D entries. Update
CONTRIBUTING.md with plugin server CLI example for manual testing.
Update README.md UDF bullet to mention plugin-mode deployment.

Replace remaining "collocated" references in plugin/ source files:
logger names (collocated.* → plugin.*), docstrings, and argparse
description. Files outside plugin/ (config.py, asgi.py, mmap.py)
correctly use "collocated" for their own functionality and are
left unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The negated binary blob case list in load_rowdat_1_numpy's sizing pass
omitted -MYSQL_TYPE_BLOB while including the other three blob types.
The data output pass already handled all four. This mismatch caused
columns with negated MYSQL_TYPE_BLOB to fall through the sizing switch
unhandled, leading to incorrect data pointer advancement and corrupted
parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…est coverage

Add missing CHECKSIZE/CHECK_REMAINING guards for MYSQL_TYPE_NULL in both
the C accelerator and Python rowdat_1 paths. Refactor decimal and datetime
unpacking in rowdat_1.py to properly propagate null values. Expand tests
for ext func data parsing, plugin UDF server components, and VECTOR type
assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract shared datetime encode/decode into static inline helpers,
replace repeated 11-case string/binary label blocks with macros,
and replace the numpy pre-scan switch with a type descriptor table.
Reduces ~500 lines of duplication across 5 functions with zero
runtime overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Connection._iquery now applies under2camel to dict results via fix_names,
making the second conversion in ShowAccessor._iquery unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lean up tests

Fix a bug in json.py _dump_vectors where the null mask was never applied
because `m is not None` is always True for boolean mask values. Also add
15 new tests for call_function_accel covering datetime/date/time/decimal
types, error paths, and edge cases. Remove leftover debug pprint, fix
test data inconsistency, and document @@register security boundary.
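
The null-mask bug is easy to reproduce in isolation. The two functions below are hypothetical stand-ins, not the code from `_dump_vectors`, and they assume a mask convention where `True` marks a null entry.

```python
from typing import Any, List, Sequence


def apply_mask_buggy(values: Sequence[Any], mask: Sequence[bool]) -> List[Any]:
    # `m is not None` holds for both True and False, so the mask is never
    # applied and every value passes through unchanged.
    return [v if m is not None else None for v, m in zip(values, mask)]


def apply_mask_fixed(values: Sequence[Any], mask: Sequence[bool]) -> List[Any]:
    # Test the mask value itself (assumed convention: True means null).
    return [None if m else v for v, m in zip(values, mask)]
```

With `values = [1, 2, 3]` and `mask = [False, True, False]`, the buggy version returns all three values while the fixed version nulls out the second one.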

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ocol hardening

Fix PyDict_SetItem key reference leaks in load_rowdat_1_numpy by creating
temporary key objects and decrementing after use (7 call sites on hot path).

Remove NEWDECIMAL (246) from string_types so decimal_types handler is
reachable, returning decimal.Decimal instead of strings. Fix _pack_time
to use integer arithmetic instead of float total_seconds(). Reject
datetime.time UDF annotations with a clear TypeError (timedelta required).
Normalize VECTOR element_type to uppercase before SQL emission.
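
To illustrate why integer arithmetic matters for time values, here is a small sketch. `timedelta_to_microseconds` is a hypothetical helper, not the PR's `_pack_time`; it only shows the days/seconds/microseconds computation that avoids going through a float.

```python
import datetime


def timedelta_to_microseconds(td: datetime.timedelta) -> int:
    """Exact microsecond count using integers only (hypothetical helper)."""
    # timedelta normalizes to (days, seconds, microseconds) with seconds and
    # microseconds non-negative, so this is exact for negative values too.
    return (td.days * 86400 + td.seconds) * 1_000_000 + td.microseconds


# A large duration whose microsecond count exceeds 2**53, the point where
# a float can no longer represent every integer.
big = datetime.timedelta(days=200000, microseconds=1)
exact = timedelta_to_microseconds(big)
via_float = int(big.total_seconds() * 10**6)
```

For durations this large, `via_float` no longer matches `exact`, because `total_seconds()` has to round to the nearest representable float.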

Add recvmsg partial-read check in plugin handshake to prevent protocol
desync. Validate socket path before unlink in _bind_socket to prevent
arbitrary file deletion. Use private module namespace for dynamic UDF
registration instead of __main__.

Broaden lazy import exception handling to catch OSError for WASM/WASI
environments where optional deps may not raise ImportError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Required since Python 3.12 for '#' format codes in PyArg_ParseTuple.
Without this, mmap_write and other C accelerator functions fail with
"PY_SSIZE_T_CLEAN macro must be defined for '#' formats".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _discover_udf_functions() method required user plugin modules to
import the WASM-specific Plugin class for their @udf functions to be
found. This broke native plugin servers where Plugin is irrelevant.
Replace the Plugin identity check with a direct scan for the
_singlestoredb_attrs marker set by @udf.
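
A marker-based scan of this kind could be sketched as below. Only the attribute name `_singlestoredb_attrs` comes from the commit message; the function name `discover_udf_functions` and its return shape are assumptions.

```python
from types import ModuleType
from typing import Callable, Dict

MARKER = '_singlestoredb_attrs'  # attribute set by @udf, per the commit message


def discover_udf_functions(module: ModuleType) -> Dict[str, Callable]:
    """Return callables in `module` that carry the @udf marker (sketch)."""
    found: Dict[str, Callable] = {}
    # Copy the namespace first so a concurrent import cannot change the
    # dict while it is being iterated.
    for name, obj in list(vars(module).items()):
        if callable(obj) and hasattr(obj, MARKER):
            found[name] = obj
    return found
```

Because the check relies only on the marker attribute, a plugin module does not need to import any WASM-specific class for its functions to be found.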

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The pointer-to-object remap switch only handled string/blob types,
so DECIMAL, DATE, TIME, DATETIME, and TIMESTAMP columns returned
raw pointer integers instead of Python objects in numpy arrays.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract FDs from ancdata eagerly before validation so the
try/finally cleanup covers all early-return paths after recvmsg.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of scanning all sys.modules, pass the imported plugin module
directly so discovery is targeted and avoids false positives from
infrastructure modules. Also hardens the sys.modules fallback path
against RuntimeError (dict changed during iteration) and TypeError
from problematic module attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect vectorized functions (numpy, pandas, polars, arrow) via
args_data_format in the function signature and route through the
existing C-accelerated load_rowdat_1_numpy/dump_rowdat_1_numpy
infrastructure instead of the per-row scalar path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable asset generation on test/rc/alpha/beta tag pushes without
publishing to PyPI. Also prevents fusion-docs from triggering on
pre-release tags.

Ref: PR #123

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The accel C extension uses poll.h, sys/mman.h, sys/socket.h, and
unistd.h which are unavailable on Windows. Extend the existing
__wasi__ guards to also exclude _WIN32.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows deletion of dynamically registered functions (via @@register)
while protecting base functions from removal. Includes pipe notification
for process-mode worker re-forking.
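
One way to picture a registry that protects base functions and tracks a generation counter is the sketch below. The class `GenerationRegistry` and all of its methods are hypothetical; the PR's `SharedRegistry` and its pipe notification for process-mode workers are not reproduced here.

```python
import threading
from typing import Callable, Dict, Tuple


class GenerationRegistry:
    """Function table with a generation counter (hypothetical sketch)."""

    def __init__(self, base: Dict[str, Callable]) -> None:
        self._lock = threading.Lock()
        self._base = frozenset(base)      # names loaded at startup
        self._functions = dict(base)
        self._generation = 0

    def register(self, name: str, func: Callable) -> None:
        with self._lock:
            self._functions[name] = func
            self._generation += 1         # lets workers detect the change

    def delete(self, name: str) -> None:
        with self._lock:
            if name in self._base:
                raise PermissionError(f'{name!r} is a base function')
            del self._functions[name]     # KeyError if the name is unknown
            self._generation += 1

    def snapshot(self) -> Tuple[int, Dict[str, Callable]]:
        with self._lock:
            return self._generation, dict(self._functions)
```

A worker can cache the snapshot and refresh it only when the generation number it sees differs from the one it cached.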

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 35 integration tests covering the protocol handshake, UDF request
loop, control signal dispatch (@@register/@@delete), server lifecycle
in thread and process modes, rowdat_1 mixed-type roundtrips, the
call_function dispatch matrix, vectorized UDF dispatch, pipe message
protocol, and SharedRegistry generation caching.

Fix .gitignore to only ignore test*.py at the repo root (not in
subdirectories like singlestoredb/tests/).

Fix accel.c: overflow guard in ucs4_to_utf8, refcount ordering in
read_rowdata_packet, and safer bounds check in CHECKSIZE macro.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- connection.py: coerce scalar rows to 1-tuple in _iquery() numpy path
  so single-column structured ndarrays don't break zip(names, row).

- json.py: use integer arithmetic (days/seconds/microseconds) instead
  of float total_seconds() for timedelta serialization to avoid
  precision loss on large durations.

- registry.py: guard numpy import in _normalize_vector_output() behind
  try/except ImportError so non-numpy environments (WASM) can still
  use vector formats like 'list'.

- server.py: use pre-allocated bytearray in _read_pipe_message() to
  avoid quadratic copying, and enforce a 16 MiB size cap to prevent
  unbounded memory growth from malformed length prefixes.
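
The pre-allocated read with a size cap might look like the following sketch. The 4-byte little-endian length prefix and the names `read_exact` and `read_pipe_message` are assumptions for illustration; only the bytearray pre-allocation and the 16 MiB cap come from the commit message.

```python
import os
import struct

MAX_MESSAGE_SIZE = 16 * 1024 * 1024  # 16 MiB cap from the commit message


def read_exact(fd: int, n: int) -> bytearray:
    """Fill a pre-allocated buffer of n bytes from fd (sketch)."""
    buf = bytearray(n)
    view = memoryview(buf)
    pos = 0
    while pos < n:
        chunk = os.read(fd, n - pos)
        if not chunk:
            raise EOFError('pipe closed mid-message')
        # Copy into place; no repeated `buf += chunk` reallocation.
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return buf


def read_pipe_message(fd: int) -> bytes:
    # Assumed framing: 4-byte little-endian unsigned length, then the payload.
    (length,) = struct.unpack('<I', read_exact(fd, 4))
    if length > MAX_MESSAGE_SIZE:
        raise ValueError(f'message of {length} bytes exceeds the size cap')
    return bytes(read_exact(fd, length))
```

Checking the length before allocating means a malformed prefix is rejected without reserving any memory for the payload.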

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The README was missing documentation for several shipped features:
the pytest plugin fixtures, the python-udf-server CLI with its full
argument set, and the docker/pytest optional dependency extras.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 25281b4.

 pd.Series(
 data, index=index, name=spec[0],
-dtype=PANDAS_TYPE_MAP[spec[1]],
+dtype=get_numpy_type_map()[spec[1]],
Pandas load uses wrong type map

Medium Severity

The load_pandas function now uses get_numpy_type_map() where PANDAS_TYPE_MAP was previously used for the pd.Series dtype argument. PANDAS_TYPE_MAP was a separate mapping from NUMPY_TYPE_MAP (both were distinct exports from utils.dtypes), but no get_pandas_type_map equivalent was created during the lazy-loading refactor. If the pandas map contained pandas-specific types (e.g., nullable integer dtypes or string dtypes), substituting the numpy map silently changes Series dtype behavior.


@kesmit13 kesmit13 merged commit e82a914 into main Jun 2, 2026
12 checks passed
@kesmit13 kesmit13 deleted the wasm-compat branch June 2, 2026 18:28