[ops] Add mx.empty() — uninitialized array allocation #3549

@megacpp

Summary

MLX is missing a primitive for allocating an uninitialized array, equivalent to numpy.empty / torch.empty / jnp.empty. This is useful when a buffer will be fully overwritten by a subsequent kernel — the implicit zero-fill of mx.zeros is wasted work in that case.

Adding mx.empty(shape, dtype=..., stream=...) would close that gap.

Motivation

The concrete use case we hit: when a TileLang Metal kernel produces an output tensor, the host-side allocation only needs the right shape, dtype, and storage, because the kernel fully overwrites the contents. With only mx.zeros available today, we pay for a memset to zero before the kernel runs, and the kernel then immediately overwrites every byte. For larger output tensors (e.g. attention outputs in a transformer block) the wasted zero-fill measurably hurts throughput.

The same pattern shows up any time an MLX array is used as a write-only output buffer of an external kernel (a custom Metal op, a DLPack-imported tensor about to be filled in place, etc.).

PyTorch / NumPy / JAX all expose this primitive (torch.empty, numpy.empty, jnp.empty) for the same reason.
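Until such a primitive exists, calling code can hide the allocation behind one helper so that the zero-fill disappears later without touching call sites. The following is a minimal sketch, not part of the prototype: `alloc_output` is a hypothetical name, the array module is passed in as `xp`, and the `empty(shape, dtype=...)` call assumes the signature proposed in this issue.

```python
def alloc_output(xp, shape, dtype):
    """Allocate a buffer that the caller promises to overwrite completely.

    `xp` is an array module such as `mlx.core` or `numpy`. If it exposes
    `empty` (NumPy does; MLX would only if this proposal lands), the fill is
    skipped. Otherwise fall back to `zeros` and pay for the memset.
    """
    allocate = getattr(xp, "empty", None)
    if allocate is None:
        allocate = xp.zeros  # today's MLX path: correct, but zero-fills first
    return allocate(shape, dtype=dtype)
```

The fallback keeps behavior correct on current MLX, since a zero-filled buffer is a valid (if wasteful) stand-in for an uninitialized one.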

Proposed API

mx.empty(shape, dtype=mx.float32, stream=None)

Semantics:

  • Allocates an array of the given shape and dtype on the active device.
  • Does not initialize the contents — the caller is expected to write into it before reading.
  • Reuses MLX's existing allocator and dtype rules, including the existing GPU float64 restriction.
  • Rejects negative dimensions with the standard MLX shape-validation error.
  • Optional stream= argument to match the rest of the MLX ops surface.
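The argument checks in the list above can be modeled in a few lines of plain Python. This is an illustrative sketch only, not MLX's implementation: `check_empty_request` is a hypothetical name, dtype and device are represented as plain strings, and `ValueError` stands in for whatever error MLX's shape validation actually raises.

```python
from math import prod


def check_empty_request(shape, dtype="float32", device="gpu"):
    """Apply the shape and dtype checks listed above; return the element count.

    Hypothetical model: no memory is allocated, and the string-based dtype
    and device arguments are stand-ins for MLX's real types.
    """
    dims = (shape,) if isinstance(shape, int) else tuple(shape)
    for d in dims:
        if d < 0:
            raise ValueError(f"negative dimension {d} in shape {dims}")
    if dtype == "float64" and device == "gpu":
        raise ValueError("float64 is not supported on the GPU")
    # A zero-sized dimension is legal and simply yields an array with no elements.
    return prod(dims)
```

Everything after these checks is the part that differs from mx.zeros: the buffer is handed back without a fill.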

This is intentionally a thin wrapper around the existing allocation path — no new buffer-management complexity, just skipping the fill.

Prototype

We have a working implementation in our downstream fork:

  • DatasunriseOU@4acd37a: "Add uninitialized array allocation"
  • Diff: 60 LOC across 4 files: mlx/ops.cpp, mlx/ops.h, python/src/ops.cpp, python/tests/test_ops.py.

The prototype exposes the API exactly as proposed above. Tests cover default dtype, explicit dtype, negative-shape rejection, and the GPU float64 rejection path.

What we're offering

If maintainers are interested, we can rebase the prototype on current ml-explore/mlx@main and open a PR. The patch is small and independent of the DLPack work in #3531 — no shared surface, no ordering requirement.

If the team would prefer a slightly different shape (e.g. dtype as the first positional argument, or a different stream= default), happy to adjust before opening the PR.

Notes

  • One open design question: in debug builds, should mx.empty fill with NaN / sentinel values to surface uninitialized-read bugs in user code? Neither PyTorch nor NumPy does this by default. Our prototype follows the same convention (raw allocation, no debug-fill). Flagging it here in case MLX has a different preference.
  • This issue is intentionally narrow, per the maintainer guidance on #3548 (RFC: DLPack consumer support for MLX arrays). DLPack consumer work is being handled in #3531 (Add Metal DLPack zero-copy sharing), and this is a small orthogonal piece that came out of the same PoC.
