fix: serialize tensors with per-channel transport mode in RPC pre-passes#6
fix: serialize tensors with per-channel transport mode in RPC pre-passes#6pollockjj wants to merge 29 commits into
Conversation
…l absent When pyisolate is installed from PyPI (non-editable), _pyisolate_source_path() resolves to site-packages/ which has no pyproject.toml, causing pixi install to fail. Now checks for pyproject.toml existence and falls back to a version-pinned PyPI dependency matching the uv code path pattern.
Remove 3 non-production ][ DIAG: log statements from the RPC deserialization path that printed on every type-tagged object crossing the wire.
Add _internal/pixi_provisioner.py that downloads, caches, and integrity-verifies the pixi binary from prefix-dev GitHub releases. Supports Linux x86_64/aarch64, macOS x86_64/arm64, Windows x86_64. Version pinned to 0.67.0. 11 tests covering platform detection, checksum verification, cache hit, and full download path.
Replace shutil.which("pixi") checks in environment.py and
environment_conda.py with ensure_pixi() from pixi_provisioner.
The conda backend now auto-downloads, caches, and verifies the pixi
binary on first use. Users never need to install pixi manually.
…ilable _deserialize_json_tensor previously returned raw dicts when torch was absent. Now falls back to numpy arrays. Also registers TensorValue deserializer in the child bootstrap for sealed workers without torch.
Add TestPlatformCoverage class with parametrized tests for all 5 platform combinations, Windows binary name (pixi.exe), and URL construction verification against GitHub release naming convention.
Sealed workers never run adapter.register_serializers() because the adapter can't be rehydrated. This caused ndarray results to be wrapped as RemoteObjectHandle instead of crossing the wire as inline data. Register ndarray as a data_type serializer directly in SealedNodeExtension.__init__. The serializer converts ndarray to TensorValue dict format, which the host's existing TensorValue fast-path in rpc_transports.py reconstructs as torch.Tensor.
…worker ndarray fix
Add extra_index_urls field to ExtensionConfig and plumb through install_dependencies() as --extra-index-url args to uv pip install. Enables packages from non-standard PyPI indexes (e.g., fbxsdkpy from gitlab.inria.fr) to resolve during isolated venv provisioning.
There was a problem hiding this comment.
Pull request overview
Fixes a cross-extension isolation bug where outbound torch.Tensor arguments could be pre-serialized using a process-global, mode-bound "Tensor" serializer (last-writer-wins), causing JSON channels to accidentally emit shared-memory TensorRef payloads that torch-free sealed/conda workers can’t decode.
Changes:
- Reorders
_prepare_for_rpc_implto handletorch.Tensorbefore registry serializer lookup, deferring tensor encoding to the per-channel JSON transport mode. - Reorders
_serialize_for_isolation_implsimilarly to prevent registry-pinned tensor modes from leaking across channels. - Adds a regression test covering both pre-pass paths with a registry whose Tensor serializer is pinned to
shared_memory.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_tensor_transport_mode_isolation.py | Adds regression coverage ensuring tensors are deferred to per-channel transport rather than registry mode. |
| pyisolate/_internal/rpc_serialization.py | Moves torch tensor handling ahead of registry serializer resolution in the RPC pre-pass. |
| pyisolate/_internal/model_serialization.py | Moves torch tensor handling ahead of registry serializer resolution in isolation serialization pre-pass. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from __future__ import annotations | ||
|
|
||
| import pytest | ||
|
|
||
|
|
||
| def _shared_memory_registry(): | ||
| """A registry whose Tensor serializer is pinned to shared_memory, simulating a | ||
| co-resident share_torch extension that registered last on the process-global | ||
| SerializerRegistry singleton.""" | ||
| from pyisolate._internal.serialization_registry import SerializerRegistry | ||
| from pyisolate._internal.tensor_serializer import register_tensor_serializer | ||
|
|
||
| registry = SerializerRegistry() | ||
| register_tensor_serializer(registry, mode="shared_memory") | ||
| return registry |
Codex Review -- Round 1 -- KICKOFFRunner: Prompt sent to codex reviewAwaiting codex output. The result will be posted as a separate comment on this PR when codex exits. |
Codex Review -- Round 1 -- RESULTPR: #6 Prompt sent to codex reviewVerbatim codex review outputThis PR was reviewed using The implementation change may be functionally sound, but the newly added test file introduces a mypy error in the repository's required CI type-check job, so the patch is not merge-ready. Review comment:
Codex runner stderr (non-evidence) |
Both _prepare_for_rpc_impl (the RPC argument pre-pass) and
_serialize_for_isolation_impl looked up the process-global SerializerRegistry
"Tensor" serializer before the torch.Tensor branch. That serializer is bound to a
single mode at registration, and register_tensor_serializer runs once per
extension on the global singleton (last-writer-wins).
When a host runs a shared_memory (share_torch=True) extension alongside a json
(sealed/conda) extension, the global Tensor mode can be pinned to shared_memory,
so outbound tensors were encoded as a shared-memory TensorRef even on json
channels. A torch-free sealed worker then raised KeyError('data') decoding the
TensorRef, killing its RPC recv thread ("Socket closed or incomplete length
header").
Handle torch.Tensor before the registry lookup in both pre-passes so tensors are
deferred to the per-channel transport (JSONSocketTransport._json_default), which
already serializes per channel via serialize_tensor(mode=self._tensor_transport).
The global registry mode no longer leaks across channels.
Adds tests/test_tensor_transport_mode_isolation.py covering both pre-passes.
c828a89 to
f389b54
Compare
Codex Review -- Round 2 -- KICKOFFRunner: Prompt sent to codex reviewAwaiting codex output. The result will be posted as a separate comment on this PR when codex exits. |
Codex Review -- Round 2 -- RESULTPR: #6 Prompt sent to codex reviewVerbatim codex review outputThis PR was reviewed using No merge-blocking or advisory findings were identified in the latest PR diff. The earlier mypy issue is already fixed in the current head, and observed CI pytest failures are outside the changed code. Codex runner stderr (non-evidence)Post via
Exit: State
|
0e15c5f to
2fda457
Compare
Summary
Outbound tensors could be encoded with the wrong transport mode when a host runs
multiple isolated extensions with different tensor-transport modes, crashing
torch-free sealed/conda workers mid-execution.
Root cause
_prepare_for_rpc_impl(RPC argument pre-pass) and_serialize_for_isolation_implboth looked up the process-global
SerializerRegistry"Tensor" serializer beforethe
torch.Tensorbranch.register_tensor_serializerbinds a single mode and runsonce per extension on the global singleton (last-writer-wins). When a
shared_memory(
share_torch=True) extension is co-resident with ajson(sealed/conda) extension,the global Tensor mode can be pinned to
shared_memory, so outbound tensors wereemitted as a shared-memory
TensorRefeven on json channels. A torch-free sealedworker then raised
KeyError('data')decoding theTensorRef, killing its RPC recvthread (
Socket closed or incomplete length header).Fix
Handle
torch.Tensorbefore the registry lookup in both pre-passes so tensors deferto the per-channel transport (
JSONSocketTransport._json_default), which alreadyserializes per channel via
serialize_tensor(mode=self._tensor_transport). Theprocess-global registry mode no longer leaks across channels. No public API change;
no behavior change for single-mode hosts.
Verification
tests/test_tensor_transport_mode_isolation.pycovers both pre-passes — fails ondevelop, passes with the fix.rufflint + format clean).share_torch=Trueextension fails before / completes after; downstream isolation suite 111/111 green.Scope
Two-function reorder + one regression test. Internal change in
_internal/.