Add Metal DLPack zero-copy sharing by XXXXRT666 · Pull Request #3531 · ml-explore/mlx

XXXXRT666 · 2026-05-11T18:37:08Z

Proposed changes

This draft adds zero-copy Metal DLPack sharing for MLX arrays and PyTorch MPS tensors.

This PR builds on the merged DLPack import PR #3495 and requires Metal DLPack support in nanobind (wjakob/nanobind#1338).

The main changes are:

Import Metal DLPack arrays by wrapping the underlying Metal buffer instead of copying through CPU.
Export MLX arrays to Metal DLPack using the MLX Metal buffer and DLPack byte_offset.
Add mx.from_dlpack(..., copy=...) controls for Metal DLPack inputs.
Keep mx.array(...) zero-copy for Metal DLPack inputs unless an explicit different dtype is requested.
Document the explicit synchronization requirements between PyTorch MPS and MLX.

The shared lifetime is tied to the exported or imported buffer. Synchronization remains explicit: PyTorch writes require torch.mps.synchronize() before MLX reads, and MLX writes require mx.eval(...) before PyTorch reads.

For MLX arrays exported to PyTorch, later MLX updates may rebind the MLX array to a new buffer while the PyTorch tensor continues to reference the exported buffer.

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

megacpp · 2026-05-13T22:58:31Z

Hi @XXXXRT666 — read through this PR after @awni redirected us here from #3548. The nb::ndarray<nb::ro, nb::c_contig> approach over the in-flight nanobind PR (#1338) is materially cleaner than the manual capsule parsing we had in our downstream PoC, and lifting is_host_accessible() into mlx/allocator.h is the right level of abstraction. Closed the RFC; happy with this being the path forward for #2848.

Wanted to offer some testing help that complements the PyTorch MPS bring-up you have:

We maintain a downstream TileLang fork (https://github.com/DatasunriseOU/tilelang) whose TVM-FFI bridge exports kDLMetal DLPack capsules for tensors backed by id<MTLBuffer>. That gives a non-PyTorch Metal DLPack producer that exercises the same import path you're adding here. Specifically it covers:

kDLMetal producers that do not require the PyTorch-MPS workaround for __dlpack__ — exercises the import path directly.
Round-trip mx.array → DLPack → TVM-FFI Metal kernel → DLPack → mx.array zero-copy.
Custom Metal kernels (via mlx.fast.metal_kernel) consuming an imported mx.array whose underlying MTLBuffer was allocated outside MLX.
storageMode matrix (we hit Shared and Managed; Private is the obvious edge case the spec needs to nail down — your is_host_accessible() decision likely answers this implicitly but worth a sanity check from the producer side).
byte_offset != 0 cases that we'd previously rejected outright in our PoC — your PR seems to handle these via byte_offset-aware import; happy to write a TileLang-side test for it.

If useful, once the PR converges I can:

Pull this branch into our TileLang test matrix and report back on any rough edges (CI on macOS Metal hardware).
Send a minimal standalone repro (no TileLang dependency) for any of the above scenarios if you'd like them as additions to python/tests/test_array.py.
Beta-test the mx.from_dlpack(..., copy=...) semantics against the dtype-mismatch-shares case (002360faa) once the API stabilizes.

Tag me here when you'd like input — no rush, just don't want this to slip past once it's review-ready.

(For the orthogonal mx.empty() piece that was also in our PoC, opened it as a separate issue per @awni's guidance.)

Required by ml-explore/mlx PR ml-explore#3531 (Metal DLPack zero-copy sharing). SHA: 33f52e635db5e6229060481d16a167230a1a474b PR: wjakob/nanobind#1338 Branch: metal-dlpack-cast

McPatate · 2026-05-14T08:33:11Z

This would be super cool if it landed for end to end "0-copy" support in safetensors! I'm working (safetensors/safetensors#767) on adding reading bytes from disk in raw MTLBuffers, which can then be handed to the framework via dlpack with 0-copy. Works well with torch, would be happy to see that land in mlx!

Also, support for byte_offset != 0 would be nice (it is already in the PR, but I'm commenting to flag that it's useful), since it lets us go one step further. Currently the MPS path is pread -> MTLBuffer, which goes through kernel pages before hitting the userspace buffer. Non-zero byte_offset support would let us mmap the file and create MTLBuffers that reference specific slices of the mapping. Pages would then be demand-faulted from disk into the page cache on first access and be directly accessible from userspace, leaving only the disk -> kernel-side copy.
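As a rough CPU-only illustration of the mmap idea described above, the sketch below maps a file and addresses one tensor's bytes through an offset window instead of reading the whole file. It does not create an MTLBuffer or a DLPack capsule. The helper name read_window and the 64-byte file layout are hypothetical and only serve the example.

```python
import mmap
import os
import tempfile


def read_window(path: str, byte_offset: int, nbytes: int) -> bytes:
    """Map a file read-only and copy out only the requested byte window."""
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as mapping:
        # Slicing a memoryview does not copy; `window` aliases the mapped pages,
        # which the OS faults in from disk on first access.
        with memoryview(mapping) as whole, whole[
            byte_offset : byte_offset + nbytes
        ] as window:
            # tobytes() is the only copy, made so the result outlives the mapping.
            return window.tobytes()


# Hypothetical file layout: 64 bytes, with the "tensor" stored at bytes 16..23.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(64)))
    path = tmp.name

payload = read_window(path, byte_offset=16, nbytes=8)
os.remove(path)
print(list(payload))
```

In a real Metal pipeline the window would stay a view and be described to the consumer through the DLPack byte_offset field rather than copied out.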

Quick question on the dl_tensor.data convention, torch's mps treats it as id<MTLBuffer>, passing the contents segfaults. Curious to know which direction MLX will be taking, as it impacts us downstream!

XXXXRT666 · 2026-05-14T12:46:46Z

> Quick question on the dl_tensor.data convention, torch's mps treats it as id<MTLBuffer>, passing the contents segfaults. Curious to know which direction MLX will be taking, as it impacts us downstream!

From the DLPack C API documentation (https://dmlc.github.io/dlpack/latest/c_api.html#c.DLTensor.data):

> The data pointer points to the allocated data. This will be CUDA device pointer, cl_mem handle in OpenCL, or id<MTLBuffer> for Metal.

XXXXRT666 · 2026-05-22T08:10:07Z

One API question: should mx.array(...) always copy DLPack inputs, and should zero-copy / copy control live in mx.asarray(..., copy=...) instead?

That would match the mental model used by NumPy/PyTorch more closely: array creates a new array, while asarray may avoid a copy depending on copy. In that design, mx.from_dlpack(..., copy=...) could remain the explicit DLPack entry point, while mx.array(torch_mps_tensor) would not unexpectedly share the underlying Metal buffer by default.

zcbenz · 2026-05-23T00:25:29Z

+  auto copied_shape = a.shape(); // |a| will be moved
+  auto dtype = a.dtype();
+  return array(
+      std::move(copied_shape),


Also, there is no need to use std::move here since shapes are no longer heap allocated. We don't migrate the old code, but we try to use the new pattern in new code.

Removed now

I think this should be handled explicitly rather than relying on argument evaluation order. Since a.shape() and {std::move(a)} are evaluated as separate function arguments, C++ may evaluate {std::move(a)} first. In that case, a.shape() would be called after a has been moved from.

Linux and Windows CPU CI failed because of this

zcbenz · 2026-05-23T00:34:51Z

> One API question: should mx.array(...) always copy DLPack inputs, and should zero-copy / copy control live in mx.asarray(..., copy=...) instead?
>
> That would match the mental model used by NumPy/PyTorch more closely: array creates a new array, while asarray may avoid a copy depending on copy. In that design, mx.from_dlpack(..., copy=...) could remain the explicit DLPack entry point, while mx.array(torch_mps_tensor) would not unexpectedly share the underlying Metal buffer by default.

I think this is very good design.

XXXXRT666 · 2026-05-23T08:18:50Z

mlx 0.31.2

mlx_to_torch  (median µs / call, lower is better)
  shape               current → mps   dlpack → mps   dlpack stay-on-cpu
  16K  f32                      378            191                   16
  1M   f32                      404            402                   17
  16M  f32 (64MB)              2717           2501                   16
  1M  bf16                      276 n/a (TypeError)      n/a (TypeError)
  16M bf16 (32MB)              1302 n/a (TypeError)      n/a (TypeError)

torch_to_mlx  (median µs / call, lower is better)
  shape                current (.cpu+numpy)      dlpack (from MPS)
  16K  f32                              175                    172
  1M   f32                              691                    666
  16M  f32 (64MB)                      8216                   8470
  1M  bf16                              522                    422
  16M bf16 (32MB)                      2023                   2037

mlx 0.32.0.dev20260523+4e8decde9

mlx_to_torch  (median µs / call, lower is better)
  shape               current → mps   dlpack → mps   dlpack stay-on-cpu
  16K  f32                      237             18                   15
  1M   f32                      371             16                   17
  16M  f32 (64MB)              2688             15                   15
  1M  bf16                      279             19                   16
  16M bf16 (32MB)              1340             16                   14

torch_to_mlx  (median µs / call, lower is better)
  shape                current (.cpu+numpy)      dlpack (from MPS)
  16K  f32                              189                     16
  1M   f32                              676                     16
  16M  f32 (64MB)                      8806                     17
  1M  bf16                              484                     16
  16M bf16 (32MB)                      2106                     16

benchmark function

import mlx.core as mx
import torch


# --- mlx -> torch candidates ---------------------------------------------------


def mlx_to_torch_current(arr: mx.array, device: torch.device) -> torch.Tensor:
    arr = mx.contiguous(arr)
    mx.eval(arr)
    buf = memoryview(arr)
    dtype_map = {
        mx.float32: torch.float32,
        mx.float16: torch.float16,
        mx.bfloat16: torch.bfloat16,
    }
    t = torch.frombuffer(buf, dtype=dtype_map[arr.dtype]).reshape(arr.shape)
    if device.type == "mps":
        t = t.to(device)
    return t


def mlx_to_torch_dlpack_mps(arr: mx.array, device: torch.device) -> torch.Tensor:
    mx.eval(arr)
    t = torch.from_dlpack(arr)
    if device.type == "mps":
        t = t.to(device)
    return t


def mlx_to_torch_dlpack_cpu(arr: mx.array, device: torch.device) -> torch.Tensor:
    """Force a CPU-typed capsule via `dl_device=(kDLCPU, 0)` (Phase 2+).
    Falls back to the no-kwarg form for builds that don't accept it."""
    mx.eval(arr)
    try:
        cap = arr.__dlpack__(dl_device=(1, 0))
    except TypeError:
        # Older builds: zero-arg lambda. Capsule is already kDLCPU there.
        cap = arr.__dlpack__()
    return torch.from_dlpack(cap)


# --- torch -> mlx candidates ---------------------------------------------------


def torch_to_mlx_current(t: torch.Tensor) -> mx.array:
    if t.device.type != "cpu":
        t = t.cpu()
    t = t.detach()
    if t.dtype == torch.bfloat16:
        return mx.array(t)
    return mx.array(t.numpy())


def torch_to_mlx_dlpack(t: torch.Tensor) -> mx.array:
    """Use mx.from_dlpack when the API exists; old MLX falls back to CPU copy."""
    if hasattr(mx, "from_dlpack"):
        return mx.from_dlpack(t)
    return mx.array(t.detach().cpu())
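
The tables above report a median time in microseconds per call, but the timing loop itself was not posted. A minimal sketch of such a harness, using only the standard library, might look like the following. The name median_us_per_call and the warmup and repeat counts are assumptions, not the values used for the numbers above.

```python
import statistics
import time
from typing import Callable


def median_us_per_call(
    fn: Callable[[], object], warmup: int = 3, repeats: int = 20
) -> float:
    """Return the median wall-clock time of fn() in microseconds."""
    # Warm-up calls keep one-time costs (imports, caches) out of the samples.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Stand-in workload; a real run would wrap one of the conversion candidates.
elapsed_us = median_us_per_call(lambda: sum(range(1_000)))
print(f"{elapsed_us:.1f} us / call")
```

For GPU-backed conversions, the callable passed in would also need to include whatever synchronization makes the result observable, otherwise the loop only measures dispatch time.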

XXXXRT666 · 2026-05-23T08:38:06Z

> I think this is very good design.

I updated the PR to follow this design: mx.array(...) now copies DLPack inputs, while mx.asarray(..., copy=...) and mx.from_dlpack(..., copy=...) provide the copy-control paths.

zcbenz

This basically looks good to me, thanks for the nice work!

XXXXRT666 · 2026-05-24T10:37:00Z

I just realized that nanobind does not handle c_contig arrays very well. For torch.Tensor, it calls contiguous(), but it does not call torch.accelerator.synchronize() afterwards.

zcbenz · 2026-05-26T00:55:01Z

+          auto src = static_cast<const SrcT*>(nd_array.data());
+          auto dst = out.data<DstT>();
+          for (size_t i = 0; i < out.size(); ++i) {
+            dst[i] = static_cast<DstT>(src[strided_offset(i, shape, strides)]);


This is going to be really slow. Would it be practical to preserve the original strides and just do a memcpy?
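
For readers who have not seen the helper, the per-element loop in the snippet above relies on mapping a row-major linear index to a position in strided storage. The PR's actual strided_offset is not shown in this thread, so the following is only a Python sketch of that mapping, assuming strides are expressed in elements (as in DLPack) and are non-negative.

```python
from typing import Sequence


def strided_offset(index: int, shape: Sequence[int], strides: Sequence[int]) -> int:
    """Map a row-major linear index to an element offset in strided storage."""
    offset = 0
    # Peel coordinates off from the innermost dimension outward.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        index, coord = divmod(index, dim)
        offset += coord * stride
    return offset


# A 2x3 array stored column-major (strides 1 and 2, in elements).
shape, strides = (2, 3), (1, 2)
offsets = [strided_offset(i, shape, strides) for i in range(6)]
print(offsets)
```

Each element costs one divmod per dimension, which is why a single bulk copy of the storage is attractive whenever the strides can simply be preserved.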

out.set_data(
    mx::allocator::malloc(data_size * itemsize), data_size, strides);
std::copy(src, src + data_size, dst);

Yes, I checked that PyTorch preserves the strides provided by DLPack on import, including for tensors that are not non_overlapping_and_dense. Strides only change if PyTorch performs a copy afterwards, for example when torch.as_tensor needs to change device or dtype, or when using torch.tensor, which always copies. In those cases, tensors that are not non_overlapping_and_dense are materialized into a compact layout.

I think we do not need to do that in MLX, and can simply preserve the original strides.

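A non_overlapping_and_dense tensor is one whose elements fill a block of memory with no gaps and with no two indices aliasing the same element. Counterexamples are x[::2] (gaps between elements) and torch.tensor([1]).expand(3) (all indices overlap on one element).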
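
The property can be checked from shape and strides alone. The sketch below follows the usual idea (sort dimensions by stride and require each stride to equal the product of the smaller dimensions' sizes). It is an illustration under the assumption of non-negative element strides, not PyTorch's or MLX's actual implementation.

```python
from typing import Sequence


def is_non_overlapping_and_dense(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Return True if the layout covers a gap-free block with no aliasing."""
    # Size-1 dimensions never move the offset, so their strides are irrelevant.
    dims = sorted((stride, size) for size, stride in zip(shape, strides) if size != 1)
    expected = 1
    for stride, size in dims:
        if stride != expected:
            return False
        expected *= size
    return True


print(is_non_overlapping_and_dense((2, 3), (3, 1)))  # row-major
print(is_non_overlapping_and_dense((3, 2), (1, 3)))  # transposed view
print(is_non_overlapping_and_dense((4,), (2,)))      # like x[::2]
print(is_non_overlapping_and_dense((3,), (0,)))      # like expand(3)
```

Note that a transposed view passes the check: the layout is dense even though it is not row-contiguous, which is the case where preserving strides avoids a gather.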

zcbenz · 2026-05-26T00:59:09Z

+  if (copy && !import_flags.row_contiguous) {
+    // Force the copy primitive to materialize the virtual strided input into a
+    // row-contiguous output instead of preserving a dense non-row layout.
+    import_flags.contiguous = false;


Can you elaborate on this? If the input is truly contiguous, we shouldn't need this for copying; otherwise, the flag was set for a non-contiguous input.

This makes the astype copy force the array to be row_contiguous. However, I think this is no longer necessary if we preserve the strides and do not require imported arrays to always be row_contiguous.

XXXXRT666 · 2026-05-26T09:37:30Z

CPU imports now copy the underlying storage span and preserve the DLPack strides.

For Metal:

copy=False preserves the DLPack strides by wrapping the original buffer.
copy=True preserves strides for non_overlapping_and_dense layouts, since those stay on the vector copy path.
Layouts that are not non_overlapping_and_dense may still go through the general copy path and be materialized to a compact layout.
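
To make "copy the underlying storage span" concrete: preserving strides on a copy means copying every element between the first and the last one the view can reach, including any gaps. A minimal sketch of that size computation follows. The function name storage_span is hypothetical, and non-negative element strides are assumed.

```python
from typing import Sequence


def storage_span(shape: Sequence[int], strides: Sequence[int]) -> int:
    """Number of elements from the first to the last reachable element, inclusive."""
    if any(size == 0 for size in shape):
        return 0  # an empty array touches no storage
    # The last reachable element sits at the sum of (size - 1) * stride.
    return 1 + sum((size - 1) * stride for size, stride in zip(shape, strides))


print(storage_span((2, 3), (3, 1)))  # row-major 2x3
print(storage_span((4,), (2,)))      # every other element of an 8-element buffer
print(storage_span((3,), (0,)))      # broadcast of a single element
```

Multiplying the span by the item size gives the number of bytes to copy. For a dense layout the span equals the element count, so nothing extra is copied.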

XXXXRT666 · 2026-05-26T09:57:59Z

The earlier CI failures came from three independent issues.

First, the new array::set_data(..., offset, deleter) signature accidentally broke existing CUDA call sites that still used the old argument order, causing CUDA compilation failures.

Second, astype had a use-after-move bug: the array was moved into the input list while its shape was also read in the same expression. Because C++ leaves the evaluation order of function arguments unspecified, this crashed on some platforms.

Third, the macOS Metal validation failure came from a test that mutated the original PyTorch MPS tensor after exporting/importing it through DLPack. I think this is a PyTorch problem, since the following command triggers the same error without MLX:

METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 python -c 'import torch; x=torch.tensor([1.0,2.0,3.0], device="mps"); x+=1; torch.mps.synchronize(); print(x.cpu())'

zcbenz mentioned this pull request May 13, 2026

RFC: DLPack consumer support for MLX arrays #3548

Closed

megacpp mentioned this pull request May 13, 2026

[ops] Add mx.empty() — uninitialized array allocation #3549

Closed

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 002360f to 4e16f1d Compare May 14, 2026 04:39

megacpp pushed a commit to DatasunriseOU/mlx that referenced this pull request May 14, 2026

Merge ml-explore#3531 (Metal DLPack zero-copy, rebased)

0b1f90b

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 4e16f1d to a17cd99 Compare May 19, 2026 07:44

XXXXRT666 marked this pull request as ready for review May 19, 2026 08:40

zcbenz reviewed May 21, 2026

Comment thread docs/src/usage/numpy.rst Outdated

Comment thread mlx/backend/cuda/allocator.cpp

Comment thread mlx/backend/metal/allocator.cpp

Comment thread CMakeLists.txt

Comment thread python/src/convert.cpp

zcbenz mentioned this pull request May 21, 2026

[Python] Fix mx.array DLPack dispatch #3476

Closed

zcbenz reviewed May 22, 2026

zcbenz mentioned this pull request May 22, 2026

Expose from_dlpack on __array_namespace__() #3579

Closed

zcbenz reviewed May 23, 2026

XXXXRT666 added 10 commits May 23, 2026 16:38

Support Metal DLPack zero-copy import

7af5b6a

Add from_dlpack copy controls

bbffe6a

Support Metal DLPack zero-copy sharing

11cda58

Share DLPack arrays when dtype matches

361143f

Clarify MPS DLPack host access test

206dce4

Pin nanobind with Metal DLPack support

a7c39c7

Use host accessibility check for Metal raw pointer

dc90eba

Add array copy for shared DLPack buffers

2dea5e9

Keep FFT torch baseline on CPU

ef12528

Handle private Metal DLPack buffers with copies

9400ca4

XXXXRT666 added 11 commits May 23, 2026 16:38

Leave nanobind shallow clone note

7b7b4d0

Reduce Metal DLPack test redundancy

310999b

Support DLPack copy export

da3559a

Copy private Metal DLPack buffers on import

3aaf373

Clean up Metal DLPack owner handling

05a073e

Clarify DLPack conversion dtype names

6cab38d

Clean up DLPack conversion paths

f88366d

Revert "Add array copy for shared DLPack buffers"

07784c2

This reverts commit eb74695.

Address DLPack review cleanup

09fa65d

Address DLPack buffer reuse review

f5bab00

Add copy control to asarray

a4378cf

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 0607c24 to a4378cf Compare May 23, 2026 08:43

zcbenz reviewed May 24, 2026

Address DLPack review feedback

7e87757

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 47326da to 4eaea96 Compare May 24, 2026 19:31

zcbenz reviewed May 25, 2026

Comment thread python/src/convert.cpp Outdated

Comment thread python/src/convert.cpp Outdated

Comment thread python/src/convert.cpp Outdated

Support strided DLPack arrays

a7c4a44

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 4eaea96 to a7c4a44 Compare May 25, 2026 04:49

Make copied DLPack imports row-contiguous

30e1484

zcbenz reviewed May 26, 2026

XXXXRT666 added 2 commits May 26, 2026 14:45

Fix DLPack copy build regressions

744e638

Avoid PyTorch MPS scalar updates in DLPack tests

f733754

XXXXRT666 force-pushed the metal-dlpack-zero-copy-draft branch from 210c578 to f733754 Compare May 26, 2026 08:05

Preserve DLPack storage layout on import

0c842fe

Reduce redundant code

c1ea3f3
