Skip to content

[Type] Add unpacked form for qd.vector for indexed register access#718

Merged
hughperkins merged 57 commits into
mainfrom
hp/qd-unpacked-layout
Jun 3, 2026
Merged

[Type] Add unpacked form for qd.vector for indexed register access#718
hughperkins merged 57 commits into
mainfrom
hp/qd-unpacked-layout

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

@hughperkins hughperkins commented Jun 2, 2026

Summary

Adds an unpacked=True per-thread storage-layout kwarg to qd.types.vector for use on @qd.dataclass field annotations:

@qd.dataclass
class Tile:
    r: qd.types.vector(32, qd.f32, unpacked=True)

The value type is a vector. The kwarg declares that storage is unpacked: N independent scalar slots (one alloca each) instead of one packed slot. The optimiser can then promote each slot to a register independently and spill only the ones it has to, instead of having to commit to keeping or spilling the whole group as a unit. The motivation is high register pressure -- e.g. several concurrent 32x32 tiles in a Cholesky + triangular solve -- where LLVM's SROA + mem2reg bails out on a packed alloca and the whole vector ends up in local memory.

Today only indexed access (obj.r[i] for python-int / qd.static-resolved i) is exercised on an unpacked field; vector arithmetic / reductions / pass-by-value / runtime indexing / shared-memory copy as a unit on unpacked fields are planned follow-ups -- they only need an AST-side materialisation of the field as a vector value, not new types.

What's added

  • API surface: unpacked: bool = False kwarg on qd.types.vector(...) and VectorType.__init__. Plain qd.types.vector(N, dtype) is unchanged.
  • Field expansion: when a @qd.dataclass field is annotated with a vector type whose _is_unpacked flag is set, StructType.__init__ expands it into N synthetic scalar members _r0.._r{N-1}.
  • AST rewriting: ASTTransformer.build_Attribute recognises obj.r as an unpacked-group access via a _qd_unpacked_groups tag on the struct, and build_Subscript resolves obj.r[i] to a direct reference to the synthetic _r{i} field.
  • Metadata propagation: the _qd_unpacked_groups tag rides on every Struct / StructField instance the type produces -- through nested cast(), filled_with_scalar(), and per-index impl.subscript paths -- so outer.inner.r[i] and tile_field[n].r[i] (including nested) all resolve correctly.
  • Misuse guards: a vector type declared with unpacked=True refuses to be instantiated as a value (T(...), T.field(...), T.ndarray(...)) outside a @qd.dataclass field annotation, with an error pointing at the dataclass-annotation usage. Synthetic-field-name collisions (a user-declared field shadowing a future _r{i}, or vice versa) raise QuadrantsSyntaxError at class-decoration time.
  • User-guide doc: docs/source/user_guide/unpacked.md covers motivation, usage, constraints, the layout-vs-LLVM relationship, and a workflow for confirming spills (PTX dump + ptxas -v, ld.local grep, Nsight Compute, post-optimisation LLVM IR).

Test plan

  • pre-commit run -a clean (black / clang-format / trim-whitespace / ruff / pylint).
  • tests/python/test_unpacked.py: 13/13 pass on CUDA (RTX PRO 6000 Blackwell) -- 8 kernel tests cover the full AST-rewrite + NVPTX codegen path; 5 cover the python-side misuse / collision guards.
  • tests/python/test_api.py: 18/18 pass on CUDA -- confirms public-surface golden is unchanged (the kwarg form adds no new top-level symbols).

Follow-ups (not in this PR)

  • AST-side materialisation of obj.r as a vector value, so vector arithmetic / .norm() / runtime indexing / pass-by-value across @qd.func boundaries / shared-memory copy as a unit all start working on unpacked-layout fields.
  • Kernel-local unpacked allocations (i.e. usable outside of @qd.dataclass).

…ields

Adds a new ``qd.field_array(N, dtype)`` annotation for ``@qd.dataclass`` that
exposes a logical N-element array as ``obj.r[i]`` while storing N individually-
named synthetic scalar fields (``_r0..._r{N-1}``) under the hood. For python-int
indices (including ``qd.static(range(N))``-unrolled loop variables), the AST
transformer rewrites ``obj.r[i]`` directly to ``obj._r{i}``, so the generated
LLVM IR / PTX is byte-identical to a hand-rolled named-field struct.

Motivation: today's idiomatic ``r: qd.types.vector(N, dtype)`` group field
leaves an alloca that LLVM SROA can't decompose once register pressure crosses
a threshold (e.g. two concurrent tiles in a Cholesky+TRSM kernel), causing
runtime regressions via local-memory spills. The named-field cascade pattern
avoids the spill but balloons source size (32-way ``if k == N: self.rN = val``
write cascades duplicated at every callsite). ``field_array`` collapses those
cascades to one AST node per callsite while preserving the named-field IR.

Changes:
- ``lang/struct.py``: ``FieldArray`` type wrapper, ``field_array(count, dtype)``
  constructor, expansion in ``StructType.__init__`` (synthetic field names plus
  ``_field_groups`` metadata), propagation in ``StructType.__call__``,
  ``_FieldArrayRef`` transient proxy.
- ``lang/impl.py``: preserve ``_qd_field_groups`` across the ``Struct`` rewrap
  in ``expr_init``.
- ``lang/ast/ast_transformer.py``: ``build_Attribute`` returns a
  ``_FieldArrayRef`` for group access; ``build_Subscript`` resolves it to a
  direct field reference for python-int indices.
- ``tests/python/test_field_array.py``: 5 tests covering construction, static
  python-int index, qd.static loop-var index, runtime-index rejection (clear
  error), and static-index OOB rejection.

Runtime-int indexing is intentionally rejected with a friendly error pointing
at ``qd.static``; existing cascade helpers continue to handle the runtime case
by spelling out the ``_rN`` fields directly. Adding runtime-int support is a
small follow-up.

Verified on a field_array port of genesis ``_tile32.py``: PTX byte-identical to
the named-field S1 baseline (modulo the per-session-nonce comment) on both
``chol_kernel`` and ``chol_trsm_kernel``; zero local-memory spills (S1: 0/0,
FA: 0/0, F4-A vector-field variant: 42/97); 25% compile-time reduction on the
single-tile harness (5.60s -> 4.19s, 3-run mean). Source dropped from 1068 to
515 lines (-52%). Full writeup in perso_hugh/doc/qd_field_array_2026may23.md.

All 201 tests in test_py_dataclass.py + test_complex_struct.py + test_struct.py
continue to pass; the 5 new tests pass in 1.76s total.
Following review of PR #712, rename the indexed-fields-on-@qd.dataclass primitive from
`field_array` to `register_array` to better reflect its intent. The mechanism declares N
independently-allocated scalar struct members so SROA + mem2reg can register-promote each
slot independently, side-stepping the SROA-bailout that affects packed
`qd.types.vector(N, dtype)` storage under register pressure.

Renames:
  - Public:  qd.field_array(N, dtype)    -> qd.register_array(N, dtype)
             FieldArray                  -> RegisterArray
  - Private: _FieldArrayRef              -> _RegisterArrayRef
             _qd_field_groups (struct)   -> _qd_register_groups
             _field_groups   (type)      -> _register_groups
             _expand_field_array_naming  -> _expand_register_array_naming
             _qd_is_field_array_ref      -> _qd_is_register_array_ref
  - Tests:   tests/python/test_field_array.py -> test_register_array.py

The internal `_qd_field_for(i)` helper on the proxy is kept (its return value is a
struct *field*, which is still the correct domain term).

No behavioural change; the 5 pytest cases (construction, static-int subscript, qd.static
loop subscript, runtime-index rejection, static OOB rejection) all pass under the new name.
…ected list

test_api.py asserts the public surface of `qd` against a sorted golden list. The
register_array PR added two new public names (the constructor `register_array` and the
type wrapper `RegisterArray`); add them to the expected list so test_api[arch=*-quadrants]
passes again on macOS / CI.
…dule

Addresses the large-file lint on PR #712: the four register_array definitions
(RegisterArray, register_array, _expand_register_array_naming, _RegisterArrayRef) are
self-contained (only depend on numpy + QuadrantsSyntaxError) and have zero coupling to
struct.py internals, so they belong in their own module like field.py / matrix.py.

  - python/quadrants/lang/register_array.py: new module with the four definitions.
  - python/quadrants/lang/struct.py: drop the inline definitions, import from the new
    module. `__all__` still re-exports the public names so `from quadrants.lang.struct
    import *` (run via `quadrants.lang.__init__`) keeps surfacing them as `qd.register_array`
    and `qd.RegisterArray`.
  - python/quadrants/lang/ast/ast_transformer.py: import `_RegisterArrayRef` directly
    from the new module instead of via struct.py, avoiding a re-export hop.

No behavioural change; 224 tests + 9 xfail across test_register_array.py, test_api.py,
test_py_dataclass.py, test_complex_struct.py, test_struct.py all pass.
…ention

Run find_underwrapped.py over the rename diff and tighten every prose run that was wrapped
at the AI-default ~85-95c down to the project's 120c target. Pure prose-layout change --
no semantic edits. Only legitimate skips per the skill are the two doctest code blocks
inside the `Example::` docstring in register_array.py (21c and 76c runs, deliberately
formatted code samples).

Files touched: register_array.py, struct.py, impl.py, ast/ast_transformer.py,
tests/python/test_register_array.py. Tests still 23/23.
…tring concat)

  - ast_transformer.py: ruff/isort: register_array import sorts before snode in the
    quadrants.lang.* group.
  - struct.py: ruff/isort: _RegisterArrayRef sorts after _expand_register_array_naming
    inside the multi-line import.
  - register_array.py: black joins two short f-string-continuation literals onto one
    120c-safe line (in the OOB error path and __repr__).
  - test_register_array.py: black inserts a blank line between the docstring and the
    nested @qd.dataclass.

All hooks (black, clang-format, trailing-whitespace, eof, ruff, pylint) now pass.
Per-PR feedback request: a user-facing page covering the problem register_array solves
(SROA bailout on packed qd.types.vector under register pressure), how to use it
(@qd.dataclass annotation + python-int / qd.static indices), and its constraints (static
indices only, no vector arithmetic, count fixed at struct-definition time, naming
collisions to watch out for).

  - docs/source/user_guide/register_array.md: new doc page.
  - docs/source/user_guide/index.md: link the new page from Core concepts, next to
    compound_types.
  - docs/source/user_guide/compound_types.md: cross-link from the qd.dataclass overview.
Pyright on PR #712 fails with:

  python/quadrants/lang/impl.py:111:24 - error: Cannot assign to attribute "_qd_register_groups" for
  class "Struct". Attribute "_qd_register_groups" is unknown (reportAttributeAccessIssue)

`Struct` doesn't statically declare the metadata attr (it's an opt-in per-instance tag set
only when the dataclass has at least one register_array group). Use `setattr()` instead of
direct attribute assignment so pyright doesn't insist on a class-level declaration; the
runtime semantics are identical.

Verified locally: 0 errors, 9 warnings -- all 9 warnings pre-existing and unrelated to this PR.
…dback)

"register_array" reads as imperative ("register the array") and overcommits the type to
an implementation effect (register residency). "unpacked_array" describes the actual
layout the annotation produces -- each slot in its own `alloca` -- and contrasts cleanly
with the implicit packed-vector default (`qd.types.vector(N)`). Whether the slots end up
in registers is the optimiser's call; `unpacked_array` removes the packed-storage obstacle.

Renames:
  - Public:  qd.register_array(N, dtype)    -> qd.unpacked_array(N, dtype)
             RegisterArray                  -> UnpackedArray
  - Private: _RegisterArrayRef              -> _UnpackedArrayRef
             _qd_register_groups            -> _qd_unpacked_groups
             _register_groups (StructType)  -> _unpacked_groups
             _expand_register_array_naming  -> _expand_unpacked_array_naming
             _qd_is_register_array_ref      -> _qd_is_unpacked_array_ref
  - Module:  python/quadrants/lang/register_array.py
             -> python/quadrants/lang/unpacked_array.py
  - Tests:   tests/python/test_register_array.py
             -> tests/python/test_unpacked_array.py
  - Docs:    docs/source/user_guide/register_array.md
             -> docs/source/user_guide/unpacked_array.md (also reworded around the
             "packed vs unpacked layout" framing)
  - test_api.py expected list updated (UnpackedArray + unpacked_array in alphabetical
    position).

No behavioural change; all checks pass locally:
  - 224 passed + 9 xfailed across test_unpacked_array, test_api, test_py_dataclass,
    test_complex_struct, test_struct.
  - pre-commit run -a: black, clang-format, trailing-whitespace, eof, ruff, pylint all
    pass.
  - pyright on the touched files: 0 errors, 0 warnings.
User-facing language fix. 'Trace time' is internal Quadrants jargon for the python-side
AST walk; from the user's seat the relevant boundary is the compile/run split, so use
'compile time' in the doc and test docstring.

  - docs/source/user_guide/unpacked_array.md: 'AST-build time' / 'trace time' in the
    constraints section -> 'compile time'.
  - tests/python/test_unpacked_array.py: OOB-static-index test docstring updated to
    match.

No behavioural change; pure prose.
…call__

Previous comment said "propagate the register-array group metadata onto the Struct
instance ... on the per-trace expression" - both terms are stale / unclear:

  - "register-array" was left behind by the qd.register_array -> qd.unpacked_array
    rename (which matched identifiers, not free-form prose).
  - "per-trace expression" is internal jargon; what's actually happening is that
    StructType.__call__ runs on every kernel trace and builds a fresh `Struct` object
    representing the `Tile()` instantiation in the kernel's IR, and we tag THAT
    expression-object (vs the class-level StructType) because ASTTransformer.build_Attribute
    walks the instance.

Reword the comment to spell that out without the jargon. No behavioural change.
Pure rename of the class + internal refs + module, in preparation for a follow-up
that switches construction to the subscript form ``UnpackedVector[dtype, count]``
and moves the export to ``qd.types``.

- ``quadrants/lang/unpacked_array.py`` -> ``unpacked_vector.py``
- ``UnpackedArray`` -> ``UnpackedVector``
- ``_UnpackedArrayRef`` -> ``_UnpackedVectorRef``
- ``_expand_unpacked_array_naming`` -> ``_expand_unpacked_vector_naming``
- ``_qd_is_unpacked_array_ref`` -> ``_qd_is_unpacked_vector_ref``

The user-facing call-form ``qd.unpacked_array(N, dtype)`` factory still works
(returns an ``UnpackedVector`` now). The struct-side metadata key
``_qd_unpacked_groups`` keeps its name -- it was already generic.

CPU smoke test of ``@qd.dataclass`` with ``r: qd.unpacked_array(4, qd.f32)`` +
``t.r[i] = ...`` inside a kernel passes.
…nt]`` + drop call factory

Replace the ``qd.unpacked_array(N, dtype)`` factory function with the subscript
form ``qd.UnpackedVector[dtype, count]``:

- Adds ``UnpackedVector.__class_getitem__((dtype, count))`` returning the marker.
- Drops the ``unpacked_array(count, dtype)`` factory function and its export.
- Adds misuse guards on ``UnpackedVector`` instances: ``__getitem__`` and
  ``__call__`` both raise ``QuadrantsSyntaxError`` with a "use @qd.dataclass"
  pointer. Catches the common path where a user wraps the class in
  ``@dataclasses.dataclass`` (silent today, then a cryptic ``TypeError``).
- Catches bad subscript shapes early: missing count, dtype/count reversed.

Rationale: the subscript spelling makes the annotation visually read as a type
("this is a type", same shape as ``NDArray[dtype, ndim]``), not a value-construct,
which heads off the "I called it, where's my container?" confusion. The
sibling-asymmetry with ``qd.types.vector(N, dtype)`` is the only real cost; the
docs table now reflects it.

Argument order ``[dtype, count]`` matches ``NDArray[dtype, ndim]``.

- ``test_unpacked_array.py`` -> ``test_unpacked_vector.py`` with new spelling
  + new ``test_unpacked_vector_marker_subscript_rejected`` /
  ``_marker_call_rejected`` / ``_bad_subscript_arity`` tests for the guards.
- ``docs/source/user_guide/unpacked_array.md`` -> ``unpacked_vector.md``,
  rewritten to lead with the subscript form. Adds a "Common pitfalls" section
  covering the ``@qd.dataclass`` requirement and the subscript-not-call rule.
- Cross-links in ``index.md`` and ``compound_types.md`` updated.
- ``test_api.py`` golden list: drop the ``unpacked_array`` function entry.

CPU smoke + python-side misuse-guard tests pass.
The previous step left ``UnpackedVector`` exposed at both ``qd.UnpackedVector``
(via the ``from quadrants.lang import *`` star import in ``qd/__init__.py``) and
nothing in ``qd.types``. Bring the export in line with sibling annotations:

- ``qd.types.vector(N, dtype)``         -- packed sibling, lives in ``qd.types``
- ``qd.types.matrix(M, N, dtype)``      -- packed matrix, lives in ``qd.types``
- ``qd.types.NDArray[dtype, ndim]``     -- tensor handle, lives in ``qd.types``

Now ``qd.types.UnpackedVector[dtype, count]`` joins the same namespace.

- Drop ``UnpackedVector`` from ``struct.py``'s ``__all__`` so it no longer leaks
  into the top-level ``qd.*`` star import.
- Re-export ``UnpackedVector`` from ``quadrants/types/__init__.py``. The import
  sits at the bottom of the file because pulling in ``quadrants.lang.unpacked_vector``
  indirectly triggers ``quadrants.lang.util``, which imports ``Template`` from
  ``quadrants.types`` -- the late position lets ``Template`` finish initialising
  before that re-entry happens. Explained in an inline comment.
- ``test_api.py``: drop the top-level ``UnpackedVector`` entry from ``user_api[qd]``.
- ``test_unpacked_vector.py``: switch all use-sites to ``qd.types.UnpackedVector``.
- User-guide doc: switch all examples + table entry to ``qd.types.UnpackedVector``.

CPU smoke test verifies ``qd.UnpackedVector`` is gone, ``qd.types.UnpackedVector``
is the canonical export, and end-to-end kernel construction + read-back still works.
- ``quadrants/types/__init__.py``: add file-level ``# ruff: noqa: I001`` so the
  load-bearing import order (UnpackedVector at the bottom, to break a circular
  import with ``quadrants.lang.util``) survives ruff's import-sort hook.
- ``quadrants/lang/unpacked_vector.py``: black collapsed two multi-line f-string
  concatenations to single lines (cosmetic).
…or spills"

The "When to reach for it" section mentions ``ptxas`` spill counters and
``ld.local`` / ``st.local`` markers as signals, but the doc previously did not
explain how to obtain them. Quadrants compiles LLVM -> PTX directly via the LLVM
NVPTX target without invoking ``nvcc`` / ``ptxas``, so the standard
``ptxas --verbose`` output is not produced inline -- users have to dump the PTX
and run ``ptxas -v`` on it themselves.

- Add an ``# Advanced`` section near the bottom with ``## How to check for
  spills``, covering three workflows: (1) dump PTX via ``print_kernel_asm=True``
  + offline ``ptxas -v``, (2) grep the PTX for ``ld.local`` / ``st.local``,
  (3) ``ncu --section MemoryWorkloadAnalysis`` for runtime counters.
- Plus a note on ``print_kernel_llvm_ir_optimized=True`` for the post-opt
  LLVM IR, and the offline-cache gotcha that silently suppresses dumps on
  cached kernels.
- Forward-link from the existing spill-signal sentence in the "When to reach
  for it" section.
When a ``@qd.dataclass`` declares both an ``UnpackedVector`` group ``r`` and a
field whose name matches one of the synthetic expansions (e.g. ``_r0``), the
previous code silently overwrote one with the other depending on annotation
order, while ``_unpacked_groups['r']`` still claimed ``r[0]`` mapped to ``_r0``.
That meant ``s.r[0]`` accessed the wrong storage / dtype instead of producing
the duplicate-member error the docs promise.

- ``StructType.__init__``: check ``sub in self.members`` before inserting each
  synthetic field generated from an ``UnpackedVector`` group, and check ``k in
  self.members`` for each subsequent member declaration, raising a clear
  ``QuadrantsSyntaxError`` either way. Covers both annotation orderings.
- Tests: ``test_unpacked_vector_collision_with_earlier_field`` (``_r0`` declared
  before ``r: UnpackedVector[...]``) and
  ``test_unpacked_vector_collision_with_later_field`` (``_r2`` declared after).

Addresses codex review comment r3314094246.
…rewrap (codex P2)

When an ``@qd.dataclass`` with an ``UnpackedVector`` member is itself nested
inside an outer ``@qd.dataclass``, ``o.inner.r[i]`` failed at AST-build time
with ``AttributeError: 'StructN' object has no attribute 'r'``. The inner
struct's ``_qd_unpacked_groups`` tag was being stripped during the
``expr_init`` rewrap path because:

1. ``StructType.cast()`` produced a fresh ``Struct`` without re-attaching the
   tag, and outer-struct construction calls ``inner_dtype.cast(...)`` for each
   nested member.
2. ``expr_init`` used ``rhs.to_dict(...)`` to build the rewrap input, which
   recursively flattens nested ``Struct`` instances to plain ``dict`` -- so the
   per-instance tag never survived to the new inner ``Struct``.

Fixes:

- ``StructType.cast()`` and ``filled_with_scalar()`` now route through a new
  ``_attach_unpacked_groups()`` helper that tags the freshly-built struct with
  ``self._unpacked_groups`` whenever the type declared at least one group.
  ``StructType.__call__`` no longer needs its own tagging block -- ``cast`` is
  now the single tagging point, and the recursion through nested ``cast`` calls
  propagates the tag to every level.
- ``impl.expr_init`` builds the rewrap dict directly from ``rhs.__entries``
  (preserving nested ``Struct`` instances) instead of via ``to_dict``. Each
  entry is then re-run through ``expr_init`` by ``Struct.__init__``, so nested
  tags recurse correctly.

Test: ``test_unpacked_vector_nested_in_outer_dataclass`` constructs
``Outer{inner: Inner{r: UnpackedVector[f32, 4]}, scale}``, writes ``inner.r[i]``
via a ``qd.static`` loop, reads it back, and checks ``inner.r[2] * scale``.

Addresses codex review comment r3314094264.
…bscript (codex P2)

Calling ``Tile.field(shape=...)`` on a dataclass that declares
``r: UnpackedVector[dtype, N]`` produces a ``StructField`` whose entries are
the *expanded* synthetic fields ``_r0.._r{N-1}``. The group dictionary was
never copied onto the ``StructField`` (or the per-index ``_IntermediateStruct``
returned by ``impl.subscript``), so ``f[i].r[k]`` inside a kernel fell through
to a plain attribute lookup on the intermediate struct, producing
``AttributeError: 'Struct...' object has no attribute 'r'``.

Fixes:

- ``StructType.field()`` tags the freshly-built ``StructField`` with
  ``self._unpacked_groups``. Nested struct fields are handled by the recursion
  in ``Struct.field`` (which calls ``dtype.field(...)`` for each ``StructType``
  member).
- ``impl.subscript``'s ``StructField`` branch copies the tag onto the
  ``_IntermediateStruct`` it builds for ``f[i]``. Recursion through nested
  ``subscript`` calls handles ``outer_field[i].inner.r[k]``.

Tests:

- ``test_unpacked_vector_struct_field_subscript``: ``Tile.field(shape=(2,))``;
  per-element write+read via ``f[n].r[i]``.
- ``test_unpacked_vector_struct_field_subscript_nested``:
  ``Outer.field(shape=(1,))`` with ``Inner`` carrying the UnpackedVector;
  ``outer_field[0].inner.r[i]`` round-trip.

Addresses codex review comment r3314094263.
Reframes the user guide around the simpler ``registers vs local memory``
mental model that the reviewer asked for, and consolidates the duplicated
``Common Pitfalls`` + ``Constraints and limitations`` sections.

- Intro/motivation now leads with all-or-nothing residency for the packed
  vector vs. per-slot residency for ``UnpackedVector``. The SROA / ``alloca``
  / ``mem2reg`` / ``ld.local`` jargon moves out of the top section and into a
  new ``How the layout works at the LLVM level`` subsection under
  ``# Advanced``, where readers who want it can still find it.
- The ``so what?`` paragraph after the hand-rolled ``r0..r31`` example now
  spells out the costs: ugly, error-prone, slow to type-check and compile.
- Removes the low-level ``IR / PTX matches exactly`` framing at L51 in favour
  of the user-facing ``best of both worlds: clean syntax and per-slot register
  residency``.
- ``Common Pitfalls`` and ``Constraints and limitations`` merged into one
  ``Constraints and pitfalls`` section, removing the duplicate ``static
  indices`` bullet flagged at L120 and the ``count is fixed at struct-
  definition time`` bullet (which is true of every Quadrants annotation, so
  isn't UnpackedVector-specific). The ``static-indices`` requirement now
  reads as a design rationale rather than a limitation. The naming-collision
  bullet is updated to say collisions are *rejected* (true since the earlier
  ``synthetic field-name collisions`` commit), rather than ``the user should
  avoid them``.
- ``small`` is now defined relative to the per-thread register budget (a few
  dozen slots on current NVIDIA GPUs; 255 32-bit registers per thread max,
  shared with all live state). The ``measured or strongly suspect`` bullet
  forward-links straight to ``How to check for spills``.

Addresses codex / reviewer comments on motivation (L17, L30, L51),
duplication (L120), ``small`` (L105), and ``measured / suspect`` (L107).
Replaces non-ASCII typography in ``docs/source/user_guide/unpacked_vector.md``
with plain-ASCII equivalents so the file is portable across diff viewers,
grep tooling, and downstream renderers:

- ``U+2026`` (ellipsis) ``...`` -> ``...``  (already removed in the earlier
  rewrite, but adding to the list for completeness)
- ``U+2014`` (em-dash) ``--`` -> ``--``
- ``U+2192`` (rightwards arrow) ``->`` -> ``->``

Other docs in ``docs/source/user_guide/`` still contain em-dashes etc., but
those weren't part of the PR review and are left alone here.
The ``Static out-of-bounds is rejected at compile time`` bullet didn't add
anything for the reader: it described an obvious consequence of "static
indices only" + "count is N", with a quoted error message that's not useful
in a high-level overview. Drop it.
The ``Relationship to other annotations`` table is in the main user-facing
section, but the original ``storage layout`` column described each row as a
form of ``alloca`` -- a term the rewrite explicitly moved into the
``# Advanced / How the layout works at the LLVM level`` subsection. Restate
the same distinction using the ``slot`` vocabulary from the rewritten
motivation:

- ``qd.f32`` (per-field):       ``one independent slot per field``
- ``qd.types.vector(N, dtype)``: ``one slot holding N packed scalars``
- ``qd.types.UnpackedVector``:   ``N independent slots``

The Advanced section keeps the ``alloca`` / ``mem2reg`` / ``SROA`` framing
for readers who want the LLVM-level detail.
… namespace

``quadrants/lang/__init__.py`` builds ``__all__`` dynamically from ``dir()``
minus a hardcoded blocklist of submodule names (``struct``, ``matrix``,
``misc``, ...). The ``unpacked_vector`` submodule was missing from that
blocklist, so once ``quadrants.lang.struct`` did its
``from quadrants.lang.unpacked_vector import ...``, the submodule got
registered on the ``quadrants.lang`` package, picked up by ``__all__``, and
re-exported through ``from quadrants.lang import *`` in the top-level
``quadrants/__init__.py`` -- adding ``qd.unpacked_vector`` to the public
surface that ``test_api.py`` golden-asserts.

The intended public surface is ``qd.types.UnpackedVector`` only. Add
``unpacked_vector`` to the lang blocklist so the submodule stays private,
matching the treatment of ``struct``, ``matrix``, etc.

Fixes ``test_api[arch=arm64-quadrants]`` (and all other arches; the failure
was reproducible on any platform that ran the test).
Extracts three pieces of ``UnpackedVector``-specific logic from
``StructType`` into free functions in ``quadrants/lang/unpacked_vector.py``
(the existing feature module). No behaviour change, all existing tests
continue to pass.

Moved:

- ``expand_unpacked_vector_into(group, marker, members, groups, elements,
  cook_dtype)``: the per-member loop body that expands an
  ``UnpackedVector[dtype, N]`` annotation into N synthetic scalar fields,
  including the synthetic-vs-existing collision check. ``cook_dtype`` is
  injected to avoid the import cycle ``struct -> unpacked_vector ->
  lang.util``.
- ``_check_no_unpacked_collision(name, members)``: the symmetric guard for
  user-declared members whose name matches an already-expanded synthetic
  field. Three identical inline copies in the three non-UnpackedVector
  branches of ``StructType.__init__`` collapse to one helper call each.
- ``attach_unpacked_groups(target, groups)``: the no-op-if-empty tagging
  used by ``cast``, ``filled_with_scalar``, and ``field`` to ride the group
  dict on freshly-built ``Struct`` / ``StructField`` objects so the AST
  transformer can see it. Replaces the prior ``_attach_unpacked_groups``
  method on ``StructType``.

``struct.py`` shrinks by ~30 lines net. ``UnpackedVector`` feature surface
now lives entirely in ``unpacked_vector.py``, with ``StructType`` only
calling into the helpers from its existing per-member loop.

Drive-by: drop two no-longer-needed re-exports from ``struct.py``
(``_expand_unpacked_vector_naming``, ``_UnpackedVectorRef``); both
consumers already import them directly from ``unpacked_vector``.

Addresses the CI "feature code accreted in a hot file" advisory check.
@hughperkins hughperkins changed the title [Type] Add unpacked qd.vector [Type] Add unpacked form for qd.vector for indexed register access Jun 3, 2026
# Conflicts:
#	docs/source/user_guide/compound_types.md
…packed decision guide

Fold the standalone ``user_guide/unpacked.md`` page into ``user_guide/matrix_vector.md`` as the natural home for
documentation on the ``qd.types.vector`` factory. Add a "When to choose ``unpacked=True`` vs ``unpacked=False``"
decision table that summarises which option applies given a use case, with the default biased toward
``unpacked=False`` (plain vector). The motivation, constraints, mixing example, LLVM-level explanation, and
spill-detection workflow all carry over to the new section.

Also:
- Drop ``unpacked`` from the user-guide TOC (``index.md``).
- Update the cross-link in ``compound_types.md`` to point at the new section anchor.
- Update ``:doc:`` pointers in ``_unpacked.py`` and ``compound_types.py`` to reference ``matrix_vector`` instead of
  the now-removed ``unpacked`` page.

No behavioural changes -- documentation only.
…lot" / "packed group"

Per feedback, ``alloca`` is an advanced LLVM concept that user-guide readers won't necessarily know about. Rephrase
the four occurrences in ``matrix_vector.md`` to use more accessible vocabulary ("a single group of N scalars", "N
independent scalar slots", "stack-allocated group", "unpromoted stack slots in the IR"). No content removed; only
terminology adjusted.
…tail, not user-facing

The user only needs to know that ``obj.r[i]`` is *lowered* to a direct reference to the i-th synthetic member -- the
particular pass that does the rewriting (the AST transformer) is an internal compiler detail. Switch to passive
phrasing to remove the implementation reference while keeping the user-facing guarantee about lowering.
…iser"

Two SROA mentions in the LLVM-level Advanced section rephrased to "LLVM's optimiser" / "the optimiser". The
particular pass name is an implementation detail; what the reader needs to know is "the compiler tries to break the
packed group apart but bails out under pressure", which the new phrasing communicates directly.
… about "only some lanes are hot"

The previous wording said unpacked=True helps "when only some of its lanes are actually hot". That isn't the real
problem in the motivating tile case, where *all* lanes are hot. The actual issue is the all-or-nothing constraint on a
packed group: the compiler has to commit to keeping the whole vector in registers or spill it as a unit. Splitting the
field into per-slot storage lets the compiler keep as many slots as actually fit and spill only the remainder, even
when every slot is read often.
…g for register residency

Previous wording said the packed vector "is committed-to or spilled as a single unit" without explaining why that's
the case. Make the mechanism explicit: all N elements share one piece of stack memory, and the optimiser's
promote-to-registers-or-leave-in-memory decision is made once for the whole allocation. The unpacked layout splits the
field into N independent pieces of memory up front, so the optimiser gets N independent decisions and can keep as
many elements in registers as actually fit.
…raming sentence; the mechanism explanation that follows now stands on its own
The previous wording cited the 255 architectural cap as if it were the practical budget. In practice 255 is rarely
usable because occupancy drops sharply as registers/thread approaches that number; production kernels usually run with
around 32-64 registers/thread. Note the cap, explain why it isn't reachable in practice, and give the realistic
budget so a reader sizing an unpacked group has a number they can actually use.
- ``SSA values`` -> "per-slot register-resident values" (no need for the SSA term).
- ``mem2reg`` -> "the optimiser" (the specific LLVM pass isn't user-relevant).
- Drop the trailing "each access then becomes a ld.local / st.local" sentence in the LLVM-level paragraph; the
  preceding "spills to local memory as a unit" already conveys the consequence without the PTX mnemonics.
- Heading and prose for the PTX-inspection workflow: rename "Inspect the PTX directly for ld.local / st.local" to
  "Inspect the PTX directly for local-memory loads / stores", and add a one-sentence definition of the PTX mnemonics
  before the literal grep example uses them. Likewise for the shared-memory mnemonics in the trailing caveat.
…tionable; define alloca at first use

The previous "Look for unpromoted stack slots in the IR" instruction wasn't actionable -- the reader wouldn't know
what to grep for. Reintroduce ``alloca`` here, define what it is (every per-function stack allocation appears as one)
and why a surviving ``alloca`` matters (the optimiser couldn't promote it to a register, so it will become PTX local
memory), and add the literal grep command alongside the workflow above for ld.local / st.local.
…n origin/main)

origin/main renamed ``user_guide/tile16.md`` to ``user_guide/tile.md`` (as part of the tile16/tile32 -> tile
consolidation that also touched the test files). The remaining link in ``compound_types.md`` still pointed at the old
filename and was breaking the markdown-link-check CI job on this PR. Update to ``tile.md``.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

Comment thread docs/source/user_guide/matrix_vector.md Outdated
Comment on lines +119 to +125
| If you... | Use |
|----------------------------------------------------------------------------------------------------------|----------------------------------|
| want vector arithmetic (`a + b`, `a.dot(b)`, `a.norm()`, swizzle, broadcast) at the use site | `unpacked=False` (the default) |
| need to index with a *runtime* integer (`v[k]` where `k` isn't a python-int) | `unpacked=False` |
| need to pass the vector across a `@qd.func` boundary, copy it into shared memory, or instantiate it as a kernel-local value | `unpacked=False` |
| are storing the field in a smallish struct, register pressure is low, and the kernel works as-is | `unpacked=False` |
| only ever access the field as `obj.r[i]` with python-int / `qd.static`-resolved `i` (unrolled inner loops), the group is small relative to the per-thread register budget, *and* the packed version is spilling under register pressure | `unpacked=True` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is the correct framing. basically, unpacked=True is strictly more limited on all axes compared to unpacked=False, so basically it is only relevant in one specific case, to avoid spilling under register pressure. If this happens, the way to go may be to adapt the code so that using unpacked=True because possible, at the cost of increasing the code complexity for extra perf boost. Which adaptation the code may require is exactly what you mention in this list.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow

  • why do you feel using unpacked=True incresaes code complexity?
  • why do you feel this framing is not correct? How aer you thinking of framing it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What increases the code complexity is doing all the necessary modification to accommodate the requirements for being able to use unpacked=True.

To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True and doing all the necessary modifications to comply with the requirements to mentioned in this table, which is likely to increase code complexity somehow.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empirically, in the one case we use them currently, i.e. tiles, we don't need any of the things in the table. No increased complexity. This is why I feel the proposed framing is not clearly accurate.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with "To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True". But I dont agree with "doing all the necessary modifications". They aren't really 'modifications', it's more like, if you need to do any of these things, then you cannot use unpacked I think?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note:

  • I think "need to pass the vector across a @qd.func boundary," is false, since it's inlined anyway.
  • the arithmetic bit is a bit of a 'straw man', since we could implment arithmetic on unpacked; we just dont have any such requirement currently
  • "are storing the field in a smallish struct, register pressure is low, and the kernel works as-is" isnt really a cosntraint, doesnt imply increasing code complexity

The only thing we are left with really is:

  • "need to index with a runtime integer (v[k] where k isn't a python-int".

This isn't really a 'complexity' issue. If you need this, you cannot use unpacked.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main trade-off might be compile time versus runtime. We have three optoins:

  • use vector => fastest compile time, slowest runtime
  • use hand coded scalars => slowest compile time, fastest runtime
  • use unpacked => medium compile time, also fastest runtime

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, hand coded scalars is a maintenance nightmare...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with all these information, it draws a pretty faithful picture of the situation. Would you mind structure this and add it to the documentation?

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

…for packed vs unpacked

Adds a feature-matrix table to the "Storage layout: packed vs unpacked vectors" section, complementing the existing
"When to choose" decision table. The matrix covers:
- The supported access patterns (indexed reads/writes; static vs runtime indices).
- The vector-value operations that work on packed and not on unpacked (arithmetic, reductions, swizzle, cross-func
  pass-by-value, shared-mem copy).
- The constructor / kernel-arg / .field() / .ndarray() paths that raise on unpacked.
- **Compile-time speed**: faster on packed, slower on unpacked (N synthetic members and more IR per field).
- **Runtime speed**: same as packed under low register pressure; faster than packed under register pressure (only
  slots that don't fit spill, rather than the whole vector).
@duburcqa
Copy link
Copy Markdown
Contributor

duburcqa commented Jun 3, 2026

ok to merge

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

@hughperkins hughperkins merged commit 57ad234 into main Jun 3, 2026
56 checks passed
@hughperkins hughperkins deleted the hp/qd-unpacked-layout branch June 3, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants