Add changes for compatibility with WASM components and collocated UDF servers #121
Pull request overview
This PR introduces WASM-compatibility improvements (primarily by lazy-loading heavyweight optional dependencies and moving environment-specific imports into call sites) and adds a new collocated Python UDF server implementation, including a new C-extension hot path to accelerate rowdat_1 decode → Python call → rowdat_1 encode.
Changes:
- Added a WIT interface definition and a WASM build helper script for external UDF component workflows.
- Refactored optional dependency handling (numpy/pandas/polars/pyarrow, IPython, JWT) to be more robust in constrained/WASM-like environments.
- Added a new collocated UDF server (socket + mmap protocol, thread/process modes, dynamic registration) and a C-extension accelerator entry point (call_function_accel).
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| wit/udf.wit | Defines the external UDF WIT interface and exported world. |
| singlestoredb/utils/_lazy_import.py | Adds cached lazy imports for heavy optional deps. |
| singlestoredb/utils/dtypes.py | Converts dtype maps to lazily-evaluated, cached getters. |
| singlestoredb/utils/results.py | Switches result formatting to lazy imports + cached type maps. |
| singlestoredb/utils/events.py | Broadens IPython import failure handling. |
| singlestoredb/converters.py | Uses lazy numpy import in vector converters. |
| singlestoredb/connection.py | Adjusts internal result-to-dict conversion to avoid importing pandas. |
| singlestoredb/mysql/connection.py | Adds WASM-friendly DEFAULT_USER detection (handles OSError). |
| singlestoredb/auth.py | Moves jwt import into call site. |
| singlestoredb/management/utils.py | Moves jwt import into call sites for WASM-friendliness. |
| singlestoredb/management/manager.py | Moves jwt import into is_jwt call site. |
| singlestoredb/functions/dtypes.py | Updates exports to use dtype-map getter functions. |
| singlestoredb/functions/ext/rowdat_1.py | Replaces eager dtype maps with lazy getter functions. |
| singlestoredb/functions/ext/json.py | Replaces eager dtype maps with lazy getter functions. |
| singlestoredb/functions/ext/collocated/* | Adds collocated server, protocol handling, registry, control signals, and WASM adapter. |
| singlestoredb/tests/test_connection.py | Makes pandas string dtype assertions version-tolerant. |
| resources/build_wasm.sh | Adds a build helper for wasm32-wasip2 wheels. |
| pyproject.toml | Adds python-udf-server CLI entry point. |
| accel.c | Adds call_function_accel C hot path and exports it from the extension module. |
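The lazy-import rows in the table above can be illustrated with a minimal sketch. The names `lazy_import` and `has_module` are hypothetical and are not taken from `singlestoredb/utils/_lazy_import.py`; the sketch only assumes that a failed import should be cached as "unavailable" and that constrained runtimes may raise `OSError` instead of `ImportError`.

```python
import functools
import importlib
from typing import Any, Optional


@functools.lru_cache(maxsize=None)
def lazy_import(name: str) -> Optional[Any]:
    """Import `name` on first use and cache the result (None if unavailable)."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # Assumption: WASM/WASI-like runtimes may fail with OSError
        # rather than ImportError when an optional dependency is missing.
        return None


def has_module(name: str) -> bool:
    """Return True if the optional dependency can be imported."""
    return lazy_import(name) is not None
```

Callers would then write something like `np = lazy_import('numpy')` at the call site instead of importing numpy at module load time.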
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.
accel.c:
- Replace empty TODO type stubs with NotImplementedError raises
- Add CHECK_REMAINING macro for bounds checking on buffer reads
- Replace unaligned pointer-cast reads with memcpy for WASM/ARM safety
- Fix double-decref in output error paths (set to NULL before goto)
- Fix Py_None reference leak by removing pre-switch INCREF
- Fix MYSQL_TYPE_NULL consuming an extra byte from next column
- Add PyErr_Format in default switch cases
- Add PyErr_Occurred() checks after PyLong/PyFloat conversions
Python:
- Align list/tuple multi-return handling in registry.py with C path
- Add _write_all_fd helper for partial os.write() handling
- Harden handshake recvmsg: name length bound, ancdata validation,
MSG_CTRUNC check, FD cleanup on error
- Wrap get_context('fork') with platform safety error
- Narrow events.py exception catch to (ImportError, OSError)
- Fix _iquery DataFrame check ordering (check before list())
- Expand setblocking(False) warning comment
- Update WIT and wasm.py docstrings for code parameter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
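The partial-write handling mentioned in the list above could look roughly like the following. This is a sketch of the general technique, not the PR's `_write_all_fd`; the function name `write_all_fd` is hypothetical.

```python
import os


def write_all_fd(fd: int, data: bytes) -> None:
    """Write all of `data` to `fd`, retrying after short writes."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write() may write fewer bytes than requested (e.g. on pipes).
        written = os.write(fd, view)
        view = view[written:]
```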
Extend JSONEncoder.default() to handle datetime, date, timedelta, and Decimal types that were missing from the JSON format path. These types were added to the ROWDAT_1 binary format but the JSON encoder only handled bytes→base64. Without this, json.dumps() raises TypeError when a UDF returns any of these types via FORMAT JSON. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
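A minimal sketch of such an encoder follows. The class name `UDFJSONEncoder`, the helper `_format_timedelta`, and the chosen string representations (space-separated datetimes, `HH:MM:SS.ffffff` durations, decimals as strings) are assumptions for illustration; the PR's actual output formats are not shown in this conversation.

```python
import base64
import datetime
import decimal
import json
from typing import Any


def _format_timedelta(td: datetime.timedelta) -> str:
    """Format a timedelta as [-]HH:MM:SS.ffffff using integer arithmetic."""
    total_us = (td.days * 86400 + td.seconds) * 1_000_000 + td.microseconds
    sign = '-' if total_us < 0 else ''
    secs, us = divmod(abs(total_us), 1_000_000)
    mins, s = divmod(secs, 60)
    h, m = divmod(mins, 60)
    return f'{sign}{h:02d}:{m:02d}:{s:02d}.{us:06d}'


class UDFJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder covering the types named in the commit message."""

    def default(self, obj: Any) -> Any:
        if isinstance(obj, (bytes, bytearray)):
            return base64.b64encode(bytes(obj)).decode('ascii')
        # datetime must be checked before date: datetime subclasses date.
        if isinstance(obj, datetime.datetime):
            return obj.isoformat(sep=' ')
        if isinstance(obj, datetime.date):
            return obj.isoformat()
        if isinstance(obj, datetime.timedelta):
            return _format_timedelta(obj)
        if isinstance(obj, decimal.Decimal):
            return str(obj)  # keep exact digits rather than casting to float
        return super().default(obj)
```

Usage would be `json.dumps(row, cls=UDFJSONEncoder)`; any other unsupported type still raises `TypeError` through the base class.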
When a TimeoutError occurs mid-message, the function removes the socket timeout to avoid protocol desync. Previously the timeout was never restored, causing the caller's shutdown-polling loop to block indefinitely on subsequent reads. Now the original timeout is saved and restored on both EOF and successful completion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
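The save-and-restore pattern described here can be sketched as follows. This is an illustration under assumptions, not the PR's function: the name `recv_exact_keep_timeout` is hypothetical, and returning `None` on EOF is a choice made for the sketch.

```python
import socket
from typing import Optional


def recv_exact_keep_timeout(sock: socket.socket, n: int) -> Optional[bytes]:
    """Read exactly n bytes; return None on EOF. Restores the socket timeout."""
    original_timeout = sock.gettimeout()
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    try:
        while got < n:
            try:
                count = sock.recv_into(view[got:], n - got)
            except socket.timeout:
                if got == 0:
                    # Nothing consumed yet: let the caller poll for shutdown.
                    raise
                # Mid-message: wait without a timeout to avoid protocol desync.
                sock.settimeout(None)
                continue
            if count == 0:
                return None  # peer closed the connection
            got += count
        return bytes(buf)
    finally:
        # Restore on every exit path so later reads can time out again.
        sock.settimeout(original_timeout)
```

Putting the restore in `finally` covers EOF, success, and exceptions alike, so the caller's shutdown-polling loop keeps its timeout.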
Rename the `singlestoredb.functions.ext.collocated` package to `singlestoredb.functions.ext.plugin` and update all related naming:
- CLI args: --extension -> --plugin-name, --extension-path -> --search-path
- Env vars: EXTERNAL_UDF_* -> PLUGIN_*
- Class: FunctionHandler -> Plugin
- WIT: singlestore:udf/function-handler -> singlestore:plugin/plugin
- WIT world: external-udf -> plugin-server
- pyproject.toml entry point updated to new module path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ARCHITECTURE.md with plugin/ subpackage in the package tree, three-mode execution diagram, Plugin Server CLI section with all 7 arguments and env variables, and Appendix A/D entries. Update CONTRIBUTING.md with plugin server CLI example for manual testing. Update README.md UDF bullet to mention plugin-mode deployment. Replace remaining "collocated" references in plugin/ source files: logger names (collocated.* → plugin.*), docstrings, and argparse description. Files outside plugin/ (config.py, asgi.py, mmap.py) correctly use "collocated" for their own functionality and are left unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The negated binary blob case list in load_rowdat_1_numpy's sizing pass omitted -MYSQL_TYPE_BLOB while including the other three blob types. The data output pass already handled all four. This mismatch caused columns with negated MYSQL_TYPE_BLOB to fall through the sizing switch unhandled, leading to incorrect data pointer advancement and corrupted parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…est coverage Add missing CHECKSIZE/CHECK_REMAINING guards for MYSQL_TYPE_NULL in both the C accelerator and Python rowdat_1 paths. Refactor decimal and datetime unpacking in rowdat_1.py to properly propagate null values. Expand tests for ext func data parsing, plugin UDF server components, and VECTOR type assertions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract shared datetime encode/decode into static inline helpers, replace repeated 11-case string/binary label blocks with macros, and replace the numpy pre-scan switch with a type descriptor table. Reduces ~500 lines of duplication across 5 functions with zero runtime overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Connection._iquery now applies under2camel to dict results via fix_names, making the second conversion in ShowAccessor._iquery unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lean up tests Fix a bug in json.py _dump_vectors where the null mask was never applied because `m is not None` is always True for boolean mask values. Also add 15 new tests for call_function_accel covering datetime/date/time/decimal types, error paths, and edge cases. Remove leftover debug pprint, fix test data inconsistency, and document @@register security boundary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
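The null-mask bug is easy to reproduce in isolation. The two functions below are hypothetical stand-ins, not the code from `_dump_vectors`, and they assume that a `True` mask entry marks a NULL value.

```python
from typing import Any, List, Sequence


def apply_mask_buggy(values: Sequence[Any], mask: Sequence[bool]) -> List[Any]:
    # Bug: a boolean is never None, so the mask is effectively ignored.
    return [v if m is not None else None for v, m in zip(values, mask)]


def apply_mask(values: Sequence[Any], mask: Sequence[bool]) -> List[Any]:
    # Fix: test the truth value of each mask entry instead.
    # Assumption for this sketch: True means "this value is NULL".
    return [None if m else v for v, m in zip(values, mask)]
```

With `values = [1, 2, 3]` and `mask = [False, True, False]`, the buggy version returns the values unchanged, while the fixed version nulls out the second element.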
…ocol hardening Fix PyDict_SetItem key reference leaks in load_rowdat_1_numpy by creating temporary key objects and decrementing after use (7 call sites on hot path). Remove NEWDECIMAL (246) from string_types so decimal_types handler is reachable, returning decimal.Decimal instead of strings. Fix _pack_time to use integer arithmetic instead of float total_seconds(). Reject datetime.time UDF annotations with a clear TypeError (timedelta required). Normalize VECTOR element_type to uppercase before SQL emission. Add recvmsg partial-read check in plugin handshake to prevent protocol desync. Validate socket path before unlink in _bind_socket to prevent arbitrary file deletion. Use private module namespace for dynamic UDF registration instead of __main__. Broaden lazy import exception handling to catch OSError for WASM/WASI environments where optional deps may not raise ImportError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
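For the socket-path validation item, one plausible approach is to refuse to unlink anything that is not a socket. The sketch below is an assumption about how such a check could work; `unlink_stale_socket` is a hypothetical name and the PR's `_bind_socket` may validate the path differently.

```python
import os
import stat


def unlink_stale_socket(path: str) -> bool:
    """Remove `path` only if it is a Unix socket; return True if removed."""
    try:
        st = os.lstat(path)  # lstat: do not follow symlinks
    except FileNotFoundError:
        return False
    if not stat.S_ISSOCK(st.st_mode):
        # Refuse to delete regular files, directories, symlinks, etc.
        raise ValueError(f'refusing to unlink non-socket path: {path!r}')
    os.unlink(path)
    return True
```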
Required since Python 3.12 for '#' format codes in PyArg_ParseTuple. Without this, mmap_write and other C accelerator functions fail with "PY_SSIZE_T_CLEAN macro must be defined for '#' formats". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _discover_udf_functions() method required user plugin modules to import the WASM-specific Plugin class for their @udf functions to be found. This broke native plugin servers where Plugin is irrelevant. Replace the Plugin identity check with a direct scan for the _singlestoredb_attrs marker set by @udf. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
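A marker-based scan of this kind might look like the sketch below. Only the attribute name `_singlestoredb_attrs` comes from the commit message; the `udf` decorator here is a stand-in that merely sets the marker, and `discover_udf_functions` is a hypothetical free function rather than the PR's method.

```python
import types
from typing import Any, Callable, Dict

MARKER = '_singlestoredb_attrs'


def udf(func: Callable[..., Any]) -> Callable[..., Any]:
    """Stand-in for the real @udf decorator: only sets the marker attribute."""
    setattr(func, MARKER, {})
    return func


def discover_udf_functions(module: types.ModuleType) -> Dict[str, Callable[..., Any]]:
    """Return every callable in `module` that carries the @udf marker."""
    found: Dict[str, Callable[..., Any]] = {}
    # Copy the items so the scan is not affected by concurrent mutation.
    for name, obj in list(vars(module).items()):
        if callable(obj) and hasattr(obj, MARKER):
            found[name] = obj
    return found
```

Because the check looks only at the marker, a plugin module does not need to import any WASM-specific class for its functions to be found.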
The pointer-to-object remap switch only handled string/blob types, so DECIMAL, DATE, TIME, DATETIME, and TIMESTAMP columns returned raw pointer integers instead of Python objects in numpy arrays. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract FDs from ancdata eagerly before validation so the try/finally cleanup covers all early-return paths after recvmsg. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of scanning all sys.modules, pass the imported plugin module directly so discovery is targeted and avoids false positives from infrastructure modules. Also hardens the sys.modules fallback path against RuntimeError (dict changed during iteration) and TypeError from problematic module attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect vectorized functions (numpy, pandas, polars, arrow) via args_data_format in the function signature and route through the existing C-accelerated load_rowdat_1_numpy/dump_rowdat_1_numpy infrastructure instead of the per-row scalar path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable asset generation on test/rc/alpha/beta tag pushes without publishing to PyPI. Also prevents fusion-docs from triggering on pre-release tags. Ref: PR #123 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The accel C extension uses poll.h, sys/mman.h, sys/socket.h, and unistd.h which are unavailable on Windows. Extend the existing __wasi__ guards to also exclude _WIN32. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows deletion of dynamically registered functions (via @@register) while protecting base functions from removal. Includes pipe notification for process-mode worker re-forking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
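The deletion rule can be sketched with a small in-memory registry. Everything here is hypothetical (class name, method names, the choice of exceptions, and rejecting registrations that shadow a base function); it only illustrates protecting base functions while bumping a generation counter that workers could compare to detect changes.

```python
import threading
from typing import Any, Callable, Dict


class FunctionRegistry:
    """Illustrative registry: base functions are fixed, dynamic ones can change."""

    def __init__(self, base_functions: Dict[str, Callable[..., Any]]) -> None:
        self._lock = threading.Lock()
        self._base = dict(base_functions)
        self._dynamic: Dict[str, Callable[..., Any]] = {}
        self.generation = 0  # incremented on every change

    def register(self, name: str, func: Callable[..., Any]) -> None:
        with self._lock:
            if name in self._base:
                # Assumption: dynamic functions may not shadow base functions.
                raise ValueError(f'{name!r} is a base function')
            self._dynamic[name] = func
            self.generation += 1

    def delete(self, name: str) -> None:
        with self._lock:
            if name in self._base:
                raise PermissionError(f'{name!r} is a base function')
            del self._dynamic[name]  # KeyError if it was never registered
            self.generation += 1

    def get(self, name: str) -> Callable[..., Any]:
        with self._lock:
            if name in self._dynamic:
                return self._dynamic[name]
            return self._base[name]
```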
Add 35 integration tests covering the protocol handshake, UDF request loop, control signal dispatch (@@register/@@delete), server lifecycle in thread and process modes, rowdat_1 mixed-type roundtrips, the call_function dispatch matrix, vectorized UDF dispatch, pipe message protocol, and SharedRegistry generation caching. Fix .gitignore to only ignore test*.py at the repo root (not in subdirectories like singlestoredb/tests/). Fix accel.c: overflow guard in ucs4_to_utf8, refcount ordering in read_rowdata_packet, and safer bounds check in CHECKSIZE macro. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- connection.py: coerce scalar rows to 1-tuple in _iquery() numpy path so single-column structured ndarrays don't break zip(names, row).
- json.py: use integer arithmetic (days/seconds/microseconds) instead of float total_seconds() for timedelta serialization to avoid precision loss on large durations.
- registry.py: guard numpy import in _normalize_vector_output() behind try/except ImportError so non-numpy environments (WASM) can still use vector formats like 'list'.
- server.py: use pre-allocated bytearray in _read_pipe_message() to avoid quadratic copying, and enforce a 16 MiB size cap to prevent unbounded memory growth from malformed length prefixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
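The server.py item can be sketched as a length-prefixed read into a pre-allocated buffer. The wire format is an assumption: this sketch uses a 4-byte little-endian unsigned length prefix, which may not match the PR's pipe protocol, and `read_pipe_message` is a hypothetical name.

```python
import os
import struct

MAX_MESSAGE_SIZE = 16 * 1024 * 1024  # 16 MiB cap, as in the commit message


def _read_exact(fd: int, n: int) -> bytes:
    """Read exactly n bytes from fd into one pre-allocated buffer."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        chunk = os.read(fd, n - got)
        if not chunk:
            raise EOFError('pipe closed before the full message arrived')
        # Copy into place instead of growing a bytes object with +=.
        view[got:got + len(chunk)] = chunk
        got += len(chunk)
    return bytes(buf)


def read_pipe_message(fd: int) -> bytes:
    """Read one length-prefixed message, rejecting oversized lengths."""
    (length,) = struct.unpack('<I', _read_exact(fd, 4))  # assumed prefix format
    if length > MAX_MESSAGE_SIZE:
        raise ValueError(f'message length {length} exceeds {MAX_MESSAGE_SIZE}')
    return _read_exact(fd, length)
```

Checking the length before allocating is what keeps a malformed prefix from triggering a multi-gigabyte allocation.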
The README was missing documentation for several shipped features: the pytest plugin fixtures, the python-udf-server CLI with its full argument set, and the docker/pytest optional dependency extras. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 25281b4.
      pd.Series(
          data, index=index, name=spec[0],
    -     dtype=PANDAS_TYPE_MAP[spec[1]],
    +     dtype=get_numpy_type_map()[spec[1]],
Pandas load uses wrong type map
Medium Severity
The load_pandas function now uses get_numpy_type_map() where PANDAS_TYPE_MAP was previously used for the pd.Series dtype argument. PANDAS_TYPE_MAP was a separate mapping from NUMPY_TYPE_MAP (both were distinct exports from utils.dtypes), but no get_pandas_type_map equivalent was created during the lazy-loading refactor. If the pandas map contained pandas-specific types (e.g., nullable integer dtypes or string dtypes), substituting the numpy map silently changes Series dtype behavior.
This PR makes several changes to allow the singlestoredb package to work in WASM environments. Many of these changes, such as lazy loading of numpy, pandas, polars, and pyarrow, benefit standard installations as well. Others move imports that are needed only in certain non-WASM environments into the call sites that use them.
A new collocated UDF server implementation is also included. It uses a high-performance loop in the C extension to parse input rows and call Python functions on each one. This function is used both by standard collocated servers and by WASM-based UDF handlers.
Note
High Risk
New network-facing UDF server with dynamic `@@register` (exec of user code) and large C-extension protocol changes on the UDF hot path; release workflow changes affect artifact publishing.
Overview
This PR extends WASM/collocated UDF support and adds a plugin-mode UDF server that talks to SingleStore over a Unix socket, aligned with the existing Rust wasm-udf-server protocol.
C extension (`accel.c`): ROWDAT_1 gains full decimal, date, time, and datetime encode/decode (with safer `memcpy`/bounds checks), plus `call_function_accel` (parse → call Python → serialize in one path). New helpers `mmap_read`, `mmap_write`, and `recv_exact` speed mmap/socket I/O on Unix (stubs elsewhere).
Plugin server (`singlestoredb/functions/ext/plugin/`): New `python-udf-server` CLI (`pyproject.toml`): loads a plugin module, serves UDFs on a Unix socket with thread or pre-fork process pools, mmap request/response I/O, and control signals (`@@health`, `@@functions`, `@@register`, `@@delete`) with live reload via a generation-based registry.
Packaging & docs: README/ARCHITECTURE/CONTRIBUTING document plugin mode, pytest/docker extras, and release workflows now tie tag pushes (prerelease tags) to GitHub release create/upload as well as published releases.
Core library: Lazy numpy via `get_numpy` / dtype map getters; `VECTOR()` SQL helper enabled; JSON UDF encoding for temporal/decimal types; `jwt` imported lazily in `auth.py`; `Connection._iquery` normalizes more result shapes without requiring pandas at import time; `.gitignore` tightens root `test*.py` ignores. `resources/build_wasm.sh` added for WASI wheel builds.