[Type] Add unpacked form for qd.vector for indexed register access by hughperkins · Pull Request #718 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-02T21:05:39Z

Summary

Adds an unpacked=True per-thread storage-layout kwarg to qd.types.vector for use on @qd.dataclass field annotations:

@qd.dataclass
class Tile:
    r: qd.types.vector(32, qd.f32, unpacked=True)

The value type is a vector. The kwarg declares that storage is unpacked: N independent scalar slots (one alloca each) instead of one packed slot. The optimiser can then promote each slot to a register independently and spill only the ones it has to, instead of having to commit to keeping or spilling the whole group as a unit. The motivation is high register pressure -- e.g. several concurrent 32x32 tiles in a Cholesky + triangular solve -- where LLVM's SROA + mem2reg bails out on a packed alloca and the whole vector ends up in local memory.

Today only indexed access (obj.r[i] for python-int / qd.static-resolved i) is exercised on an unpacked field; vector arithmetic / reductions / pass-by-value / runtime indexing / shared-memory copy as a unit on unpacked fields are planned follow-ups -- they only need an AST-side materialisation of the field as a vector value, not new types.

What's added

API surface: unpacked: bool = False kwarg on qd.types.vector(...) and VectorType.__init__. Plain qd.types.vector(N, dtype) is unchanged.
Field expansion: when a @qd.dataclass field is annotated with a vector type whose _is_unpacked flag is set, StructType.__init__ expands it into N synthetic scalar members _r0.._r{N-1}.
AST rewriting: ASTTransformer.build_Attribute recognises obj.r as an unpacked-group access via a _qd_unpacked_groups tag on the struct, and build_Subscript resolves obj.r[i] to a direct reference to the synthetic _r{i} field.
Metadata propagation: the _qd_unpacked_groups tag rides on every Struct / StructField instance the type produces -- through nested cast(), filled_with_scalar(), and per-index impl.subscript paths -- so outer.inner.r[i] and tile_field[n].r[i] (including nested) all resolve correctly.
Misuse guards: a vector type declared with unpacked=True refuses to be instantiated as a value (T(...), T.field(...), T.ndarray(...)) outside a @qd.dataclass field annotation, with an error pointing at the dataclass-annotation usage. Synthetic-field-name collisions (a user-declared field shadowing a future _r{i}, or vice versa) raise QuadrantsSyntaxError at class-decoration time.
User-guide doc: docs/source/user_guide/unpacked.md covers motivation, usage, constraints, the layout-vs-LLVM relationship, and a workflow for confirming spills (PTX dump + ptxas -v, ld.local grep, Nsight Compute, post-optimisation LLVM IR).

Test plan

pre-commit run -a clean (black / clang-format / trim-whitespace / ruff / pylint).
tests/python/test_unpacked.py: 13/13 pass on CUDA (RTX PRO 6000 Blackwell) -- 8 kernel tests cover the full AST-rewrite + NVPTX codegen path; 5 cover the python-side misuse / collision guards.
tests/python/test_api.py: 18/18 pass on CUDA -- confirms public-surface golden is unchanged (the kwarg form adds no new top-level symbols).

Follow-ups (not in this PR)

AST-side materialisation of obj.r as a vector value, so vector arithmetic / .norm() / runtime indexing / pass-by-value across @qd.func boundaries / shared-memory copy as a unit all start working on unpacked-layout fields.
Kernel-local unpacked allocations (i.e. usable outside of @qd.dataclass).

…ields Adds a new ``qd.field_array(N, dtype)`` annotation for ``@qd.dataclass`` that exposes a logical N-element array as ``obj.r[i]`` while storing N individually- named synthetic scalar fields (``_r0..._r{N-1}``) under the hood. For python-int indices (including ``qd.static(range(N))``-unrolled loop variables), the AST transformer rewrites ``obj.r[i]`` directly to ``obj._r{i}``, so the generated LLVM IR / PTX is byte-identical to a hand-rolled named-field struct. Motivation: today's idiomatic ``r: qd.types.vector(N, dtype)`` group field leaves an alloca that LLVM SROA can't decompose once register pressure crosses a threshold (e.g. two concurrent tiles in a Cholesky+TRSM kernel), causing runtime regressions via local-memory spills. The named-field cascade pattern avoids the spill but balloons source size (32-way ``if k == N: self.rN = val`` write cascades duplicated at every callsite). ``field_array`` collapses those cascades to one AST node per callsite while preserving the named-field IR. Changes: - ``lang/struct.py``: ``FieldArray`` type wrapper, ``field_array(count, dtype)`` constructor, expansion in ``StructType.__init__`` (synthetic field names plus ``_field_groups`` metadata), propagation in ``StructType.__call__``, ``_FieldArrayRef`` transient proxy. - ``lang/impl.py``: preserve ``_qd_field_groups`` across the ``Struct`` rewrap in ``expr_init``. - ``lang/ast/ast_transformer.py``: ``build_Attribute`` returns a ``_FieldArrayRef`` for group access; ``build_Subscript`` resolves it to a direct field reference for python-int indices. - ``tests/python/test_field_array.py``: 5 tests covering construction, static python-int index, qd.static loop-var index, runtime-index rejection (clear error), and static-index OOB rejection. Runtime-int indexing is intentionally rejected with a friendly error pointing at ``qd.static``; existing cascade helpers continue to handle the runtime case by spelling out the ``_rN`` fields directly. Adding runtime-int support is a small follow-up. Verified on a field_array port of genesis ``_tile32.py``: PTX byte-identical to the named-field S1 baseline (modulo the per-session-nonce comment) on both ``chol_kernel`` and ``chol_trsm_kernel``; zero local-memory spills (S1: 0/0, FA: 0/0, F4-A vector-field variant: 42/97); 25% compile-time reduction on the single-tile harness (5.60s -> 4.19s, 3-run mean). Source dropped from 1068 to 515 lines (-52%). Full writeup in perso_hugh/doc/qd_field_array_2026may23.md. All 201 tests in test_py_dataclass.py + test_complex_struct.py + test_struct.py continue to pass; the 5 new tests pass in 1.76s total.

Following review of PR #712, rename the indexed-fields-on-@qd.dataclass primitive from `field_array` to `register_array` to better reflect its intent. The mechanism declares N independently-allocated scalar struct members so SROA + mem2reg can register-promote each slot independently, side-stepping the SROA-bailout that affects packed `qd.types.vector(N, dtype)` storage under register pressure. Renames: - Public: qd.field_array(N, dtype) -> qd.register_array(N, dtype) FieldArray -> RegisterArray - Private: _FieldArrayRef -> _RegisterArrayRef _qd_field_groups (struct) -> _qd_register_groups _field_groups (type) -> _register_groups _expand_field_array_naming -> _expand_register_array_naming _qd_is_field_array_ref -> _qd_is_register_array_ref - Tests: tests/python/test_field_array.py -> test_register_array.py The internal `_qd_field_for(i)` helper on the proxy is kept (its return value is a struct *field*, which is still the correct domain term). No behavioural change; the 5 pytest cases (construction, static-int subscript, qd.static loop subscript, runtime-index rejection, static OOB rejection) all pass under the new name.

…ected list test_api.py asserts the public surface of `qd` against a sorted golden list. The register_array PR added two new public names (the constructor `register_array` and the type wrapper `RegisterArray`); add them to the expected list so test_api[arch=*-quadrants] passes again on macOS / CI.

…dule Addresses the large-file lint on PR #712: the four register_array definitions (RegisterArray, register_array, _expand_register_array_naming, _RegisterArrayRef) are self-contained (only depend on numpy + QuadrantsSyntaxError) and have zero coupling to struct.py internals, so they belong in their own module like field.py / matrix.py. - python/quadrants/lang/register_array.py: new module with the four definitions. - python/quadrants/lang/struct.py: drop the inline definitions, import from the new module. `__all__` still re-exports the public names so `from quadrants.lang.struct import *` (run via `quadrants.lang.__init__`) keeps surfacing them as `qd.register_array` and `qd.RegisterArray`. - python/quadrants/lang/ast/ast_transformer.py: import `_RegisterArrayRef` directly from the new module instead of via struct.py, avoiding a re-export hop. No behavioural change; 224 tests + 9 xfail across test_register_array.py, test_api.py, test_py_dataclass.py, test_complex_struct.py, test_struct.py all pass.

…ention Run find_underwrapped.py over the rename diff and tighten every prose run that was wrapped at the AI-default ~85-95c down to the project's 120c target. Pure prose-layout change -- no semantic edits. Only legitimate skips per the skill are the two doctest code blocks inside the `Example::` docstring in register_array.py (21c and 76c runs, deliberately formatted code samples). Files touched: register_array.py, struct.py, impl.py, ast/ast_transformer.py, tests/python/test_register_array.py. Tests still 23/23.

…tring concat) - ast_transformer.py: ruff/isort: register_array import sorts before snode in the quadrants.lang.* group. - struct.py: ruff/isort: _RegisterArrayRef sorts after _expand_register_array_naming inside the multi-line import. - register_array.py: black joins two short f-string-continuation literals onto one 120c-safe line (in the OOB error path and __repr__). - test_register_array.py: black inserts a blank line between the docstring and the nested @qd.dataclass. All hooks (black, clang-format, trailing-whitespace, eof, ruff, pylint) now pass.

Per-PR feedback request: a user-facing page covering the problem register_array solves (SROA bailout on packed qd.types.vector under register pressure), how to use it (@qd.dataclass annotation + python-int / qd.static indices), and its constraints (static indices only, no vector arithmetic, count fixed at struct-definition time, naming collisions to watch out for). - docs/source/user_guide/register_array.md: new doc page. - docs/source/user_guide/index.md: link the new page from Core concepts, next to compound_types. - docs/source/user_guide/compound_types.md: cross-link from the qd.dataclass overview.

Pyright on PR #712 fails with: python/quadrants/lang/impl.py:111:24 - error: Cannot assign to attribute "_qd_register_groups" for class "Struct". Attribute "_qd_register_groups" is unknown (reportAttributeAccessIssue) `Struct` doesn't statically declare the metadata attr (it's an opt-in per-instance tag set only when the dataclass has at least one register_array group). Use `setattr()` instead of direct attribute assignment so pyright doesn't insist on a class-level declaration; the runtime semantics are identical. Verified locally: 0 errors, 9 warnings -- all 9 warnings pre-existing and unrelated to this PR.

…dback) "register_array" reads as imperative ("register the array") and overcommits the type to an implementation effect (register residency). "unpacked_array" describes the actual layout the annotation produces -- each slot in its own `alloca` -- and contrasts cleanly with the implicit packed-vector default (`qd.types.vector(N)`). Whether the slots end up in registers is the optimiser's call; `unpacked_array` removes the packed-storage obstacle. Renames: - Public: qd.register_array(N, dtype) -> qd.unpacked_array(N, dtype) RegisterArray -> UnpackedArray - Private: _RegisterArrayRef -> _UnpackedArrayRef _qd_register_groups -> _qd_unpacked_groups _register_groups (StructType) -> _unpacked_groups _expand_register_array_naming -> _expand_unpacked_array_naming _qd_is_register_array_ref -> _qd_is_unpacked_array_ref - Module: python/quadrants/lang/register_array.py -> python/quadrants/lang/unpacked_array.py - Tests: tests/python/test_register_array.py -> tests/python/test_unpacked_array.py - Docs: docs/source/user_guide/register_array.md -> docs/source/user_guide/unpacked_array.md (also reworded around the "packed vs unpacked layout" framing) - test_api.py expected list updated (UnpackedArray + unpacked_array in alphabetical position). No behavioural change; all checks pass locally: - 224 passed + 9 xfailed across test_unpacked_array, test_api, test_py_dataclass, test_complex_struct, test_struct. - pre-commit run -a: black, clang-format, trailing-whitespace, eof, ruff, pylint all pass. - pyright on the touched files: 0 errors, 0 warnings.

User-facing language fix. 'Trace time' is internal Quadrants jargon for the python-side AST walk; from the user's seat the relevant boundary is the compile/run split, so use 'compile time' in the doc and test docstring. - docs/source/user_guide/unpacked_array.md: 'AST-build time' / 'trace time' in the constraints section -> 'compile time'. - tests/python/test_unpacked_array.py: OOB-static-index test docstring updated to match. No behavioural change; pure prose.

…call__ Previous comment said "propagate the register-array group metadata onto the Struct instance ... on the per-trace expression" - both terms are stale / unclear: - "register-array" was left behind by the qd.register_array -> qd.unpacked_array rename (which matched identifiers, not free-form prose). - "per-trace expression" is internal jargon; what's actually happening is that StructType.__call__ runs on every kernel trace and builds a fresh `Struct` object representing the `Tile()` instantiation in the kernel's IR, and we tag THAT expression-object (vs the class-level StructType) because ASTTransformer.build_Attribute walks the instance. Reword the comment to spell that out without the jargon. No behavioural change.

Pure rename of the class + internal refs + module, in preparation for a follow-up that switches construction to the subscript form ``UnpackedVector[dtype, count]`` and moves the export to ``qd.types``. - ``quadrants/lang/unpacked_array.py`` -> ``unpacked_vector.py`` - ``UnpackedArray`` -> ``UnpackedVector`` - ``_UnpackedArrayRef`` -> ``_UnpackedVectorRef`` - ``_expand_unpacked_array_naming`` -> ``_expand_unpacked_vector_naming`` - ``_qd_is_unpacked_array_ref`` -> ``_qd_is_unpacked_vector_ref`` The user-facing call-form ``qd.unpacked_array(N, dtype)`` factory still works (returns an ``UnpackedVector`` now). The struct-side metadata key ``_qd_unpacked_groups`` keeps its name -- it was already generic. CPU smoke test of ``@qd.dataclass`` with ``r: qd.unpacked_array(4, qd.f32)`` + ``t.r[i] = ...`` inside a kernel passes.

…nt]`` + drop call factory Replace the ``qd.unpacked_array(N, dtype)`` factory function with the subscript form ``qd.UnpackedVector[dtype, count]``: - Adds ``UnpackedVector.__class_getitem__((dtype, count))`` returning the marker. - Drops the ``unpacked_array(count, dtype)`` factory function and its export. - Adds misuse guards on ``UnpackedVector`` instances: ``__getitem__`` and ``__call__`` both raise ``QuadrantsSyntaxError`` with a "use @qd.dataclass" pointer. Catches the common path where a user wraps the class in ``@dataclasses.dataclass`` (silent today, then a cryptic ``TypeError``). - Catches bad subscript shapes early: missing count, dtype/count reversed. Rationale: the subscript spelling makes the annotation visually read as a type ("this is a type", same shape as ``NDArray[dtype, ndim]``), not a value-construct, which heads off the "I called it, where's my container?" confusion. The sibling-asymmetry with ``qd.types.vector(N, dtype)`` is the only real cost; the docs table now reflects it. Argument order ``[dtype, count]`` matches ``NDArray[dtype, ndim]``. - ``test_unpacked_array.py`` -> ``test_unpacked_vector.py`` with new spelling + new ``test_unpacked_vector_marker_subscript_rejected`` / ``_marker_call_rejected`` / ``_bad_subscript_arity`` tests for the guards. - ``docs/source/user_guide/unpacked_array.md`` -> ``unpacked_vector.md``, rewritten to lead with the subscript form. Adds a "Common pitfalls" section covering the ``@qd.dataclass`` requirement and the subscript-not-call rule. - Cross-links in ``index.md`` and ``compound_types.md`` updated. - ``test_api.py`` golden list: drop the ``unpacked_array`` function entry. CPU smoke + python-side misuse-guard tests pass.

The previous step left ``UnpackedVector`` exposed at both ``qd.UnpackedVector`` (via the ``from quadrants.lang import *`` star import in ``qd/__init__.py``) and nothing in ``qd.types``. Bring the export in line with sibling annotations: - ``qd.types.vector(N, dtype)`` -- packed sibling, lives in ``qd.types`` - ``qd.types.matrix(M, N, dtype)`` -- packed matrix, lives in ``qd.types`` - ``qd.types.NDArray[dtype, ndim]`` -- tensor handle, lives in ``qd.types`` Now ``qd.types.UnpackedVector[dtype, count]`` joins the same namespace. - Drop ``UnpackedVector`` from ``struct.py``'s ``__all__`` so it no longer leaks into the top-level ``qd.*`` star import. - Re-export ``UnpackedVector`` from ``quadrants/types/__init__.py``. The import sits at the bottom of the file because pulling in ``quadrants.lang.unpacked_vector`` indirectly triggers ``quadrants.lang.util``, which imports ``Template`` from ``quadrants.types`` -- the late position lets ``Template`` finish initialising before that re-entry happens. Explained in an inline comment. - ``test_api.py``: drop the top-level ``UnpackedVector`` entry from ``user_api[qd]``. - ``test_unpacked_vector.py``: switch all use-sites to ``qd.types.UnpackedVector``. - User-guide doc: switch all examples + table entry to ``qd.types.UnpackedVector``. CPU smoke test verifies ``qd.UnpackedVector`` is gone, ``qd.types.UnpackedVector`` is the canonical export, and end-to-end kernel construction + read-back still works.

- ``quadrants/types/__init__.py``: add file-level ``# ruff: noqa: I001`` so the load-bearing import order (UnpackedVector at the bottom, to break a circular import with ``quadrants.lang.util``) survives ruff's import-sort hook. - ``quadrants/lang/unpacked_vector.py``: black collapsed two multi-line f-string concatenations to single lines (cosmetic).

…or spills" The "When to reach for it" section mentions ``ptxas`` spill counters and ``ld.local`` / ``st.local`` markers as signals, but the doc previously did not explain how to obtain them. Quadrants compiles LLVM -> PTX directly via the LLVM NVPTX target without invoking ``nvcc`` / ``ptxas``, so the standard ``ptxas --verbose`` output is not produced inline -- users have to dump the PTX and run ``ptxas -v`` on it themselves. - Add an ``# Advanced`` section near the bottom with ``## How to check for spills``, covering three workflows: (1) dump PTX via ``print_kernel_asm=True`` + offline ``ptxas -v``, (2) grep the PTX for ``ld.local`` / ``st.local``, (3) ``ncu --section MemoryWorkloadAnalysis`` for runtime counters. - Plus a note on ``print_kernel_llvm_ir_optimized=True`` for the post-opt LLVM IR, and the offline-cache gotcha that silently suppresses dumps on cached kernels. - Forward-link from the existing spill-signal sentence in the "When to reach for it" section.

When a ``@qd.dataclass`` declares both an ``UnpackedVector`` group ``r`` and a field whose name matches one of the synthetic expansions (e.g. ``_r0``), the previous code silently overwrote one with the other depending on annotation order, while ``_unpacked_groups['r']`` still claimed ``r[0]`` mapped to ``_r0``. That meant ``s.r[0]`` accessed the wrong storage / dtype instead of producing the duplicate-member error the docs promise. - ``StructType.__init__``: check ``sub in self.members`` before inserting each synthetic field generated from an ``UnpackedVector`` group, and check ``k in self.members`` for each subsequent member declaration, raising a clear ``QuadrantsSyntaxError`` either way. Covers both annotation orderings. - Tests: ``test_unpacked_vector_collision_with_earlier_field`` (``_r0`` declared before ``r: UnpackedVector[...]``) and ``test_unpacked_vector_collision_with_later_field`` (``_r2`` declared after). Addresses codex review comment r3314094246.

…rewrap (codex P2) When an ``@qd.dataclass`` with an ``UnpackedVector`` member is itself nested inside an outer ``@qd.dataclass``, ``o.inner.r[i]`` failed at AST-build time with ``AttributeError: 'StructN' object has no attribute 'r'``. The inner struct's ``_qd_unpacked_groups`` tag was being stripped during the ``expr_init`` rewrap path because: 1. ``StructType.cast()`` produced a fresh ``Struct`` without re-attaching the tag, and outer-struct construction calls ``inner_dtype.cast(...)`` for each nested member. 2. ``expr_init`` used ``rhs.to_dict(...)`` to build the rewrap input, which recursively flattens nested ``Struct`` instances to plain ``dict`` -- so the per-instance tag never survived to the new inner ``Struct``. Fixes: - ``StructType.cast()`` and ``filled_with_scalar()`` now route through a new ``_attach_unpacked_groups()`` helper that tags the freshly-built struct with ``self._unpacked_groups`` whenever the type declared at least one group. ``StructType.__call__`` no longer needs its own tagging block -- ``cast`` is now the single tagging point, and the recursion through nested ``cast`` calls propagates the tag to every level. - ``impl.expr_init`` builds the rewrap dict directly from ``rhs.__entries`` (preserving nested ``Struct`` instances) instead of via ``to_dict``. Each entry is then re-run through ``expr_init`` by ``Struct.__init__``, so nested tags recurse correctly. Test: ``test_unpacked_vector_nested_in_outer_dataclass`` constructs ``Outer{inner: Inner{r: UnpackedVector[f32, 4]}, scale}``, writes ``inner.r[i]`` via a ``qd.static`` loop, reads it back, and checks ``inner.r[2] * scale``. Addresses codex review comment r3314094264.

…bscript (codex P2) Calling ``Tile.field(shape=...)`` on a dataclass that declares ``r: UnpackedVector[dtype, N]`` produces a ``StructField`` whose entries are the *expanded* synthetic fields ``_r0.._r{N-1}``. The group dictionary was never copied onto the ``StructField`` (or the per-index ``_IntermediateStruct`` returned by ``impl.subscript``), so ``f[i].r[k]`` inside a kernel fell through to a plain attribute lookup on the intermediate struct, producing ``AttributeError: 'Struct...' object has no attribute 'r'``. Fixes: - ``StructType.field()`` tags the freshly-built ``StructField`` with ``self._unpacked_groups``. Nested struct fields are handled by the recursion in ``Struct.field`` (which calls ``dtype.field(...)`` for each ``StructType`` member). - ``impl.subscript``'s ``StructField`` branch copies the tag onto the ``_IntermediateStruct`` it builds for ``f[i]``. Recursion through nested ``subscript`` calls handles ``outer_field[i].inner.r[k]``. Tests: - ``test_unpacked_vector_struct_field_subscript``: ``Tile.field(shape=(2,))``; per-element write+read via ``f[n].r[i]``. - ``test_unpacked_vector_struct_field_subscript_nested``: ``Outer.field(shape=(1,))`` with ``Inner`` carrying the UnpackedVector; ``outer_field[0].inner.r[i]`` round-trip. Addresses codex review comment r3314094263.

Reframes the user guide around the simpler ``registers vs local memory`` mental model that the reviewer asked for, and consolidates the duplicated ``Common Pitfalls`` + ``Constraints and limitations`` sections. - Intro/motivation now leads with all-or-nothing residency for the packed vector vs. per-slot residency for ``UnpackedVector``. The SROA / ``alloca`` / ``mem2reg`` / ``ld.local`` jargon moves out of the top section and into a new ``How the layout works at the LLVM level`` subsection under ``# Advanced``, where readers who want it can still find it. - The ``so what?`` paragraph after the hand-rolled ``r0..r31`` example now spells out the costs: ugly, error-prone, slow to type-check and compile. - Removes the low-level ``IR / PTX matches exactly`` framing at L51 in favour of the user-facing ``best of both worlds: clean syntax and per-slot register residency``. - ``Common Pitfalls`` and ``Constraints and limitations`` merged into one ``Constraints and pitfalls`` section, removing the duplicate ``static indices`` bullet flagged at L120 and the ``count is fixed at struct- definition time`` bullet (which is true of every Quadrants annotation, so isn't UnpackedVector-specific). The ``static-indices`` requirement now reads as a design rationale rather than a limitation. The naming-collision bullet is updated to say collisions are *rejected* (true since the earlier ``synthetic field-name collisions`` commit), rather than ``the user should avoid them``. - ``small`` is now defined relative to the per-thread register budget (a few dozen slots on current NVIDIA GPUs; 255 32-bit registers per thread max, shared with all live state). The ``measured or strongly suspect`` bullet forward-links straight to ``How to check for spills``. Addresses codex / reviewer comments on motivation (L17, L30, L51), duplication (L120), ``small`` (L105), and ``measured / suspect`` (L107).

Replaces non-ASCII typography in ``docs/source/user_guide/unpacked_vector.md`` with plain-ASCII equivalents so the file is portable across diff viewers, grep tooling, and downstream renderers: - ``U+2026`` (ellipsis) ``...`` -> ``...`` (already removed in the earlier rewrite, but adding to the list for completeness) - ``U+2014`` (em-dash) ``--`` -> ``--`` - ``U+2192`` (rightwards arrow) ``->`` -> ``->`` Other docs in ``docs/source/user_guide/`` still contain em-dashes etc., but those weren't part of the PR review and are left alone here.

The ``Static out-of-bounds is rejected at compile time`` bullet didn't add anything for the reader: it described an obvious consequence of "static indices only" + "count is N", with a quoted error message that's not useful in a high-level overview. Drop it.

The ``Relationship to other annotations`` table is in the main user-facing section, but the original ``storage layout`` column described each row as a form of ``alloca`` -- a term the rewrite explicitly moved into the ``# Advanced / How the layout works at the LLVM level`` subsection. Restate the same distinction using the ``slot`` vocabulary from the rewritten motivation: - ``qd.f32`` (per-field): ``one independent slot per field`` - ``qd.types.vector(N, dtype)``: ``one slot holding N packed scalars`` - ``qd.types.UnpackedVector``: ``N independent slots`` The Advanced section keeps the ``alloca`` / ``mem2reg`` / ``SROA`` framing for readers who want the LLVM-level detail.

… namespace ``quadrants/lang/__init__.py`` builds ``__all__`` dynamically from ``dir()`` minus a hardcoded blocklist of submodule names (``struct``, ``matrix``, ``misc``, ...). The ``unpacked_vector`` submodule was missing from that blocklist, so once ``quadrants.lang.struct`` did its ``from quadrants.lang.unpacked_vector import ...``, the submodule got registered on the ``quadrants.lang`` package, picked up by ``__all__``, and re-exported through ``from quadrants.lang import *`` in the top-level ``quadrants/__init__.py`` -- adding ``qd.unpacked_vector`` to the public surface that ``test_api.py`` golden-asserts. The intended public surface is ``qd.types.UnpackedVector`` only. Add ``unpacked_vector`` to the lang blocklist so the submodule stays private, matching the treatment of ``struct``, ``matrix``, etc. Fixes ``test_api[arch=arm64-quadrants]`` (and all other arches; the failure was reproducible on any platform that ran the test).

…taichi into hp/qd-field-array

Extracts three pieces of ``UnpackedVector``-specific logic from ``StructType`` into free functions in ``quadrants/lang/unpacked_vector.py`` (the existing feature module). No behaviour change, all existing tests continue to pass. Moved: - ``expand_unpacked_vector_into(group, marker, members, groups, elements, cook_dtype)``: the per-member loop body that expands an ``UnpackedVector[dtype, N]`` annotation into N synthetic scalar fields, including the synthetic-vs-existing collision check. ``cook_dtype`` is injected to avoid the import cycle ``struct -> unpacked_vector -> lang.util``. - ``_check_no_unpacked_collision(name, members)``: the symmetric guard for user-declared members whose name matches an already-expanded synthetic field. Three identical inline copies in the three non-UnpackedVector branches of ``StructType.__init__`` collapse to one helper call each. - ``attach_unpacked_groups(target, groups)``: the no-op-if-empty tagging used by ``cast``, ``filled_with_scalar``, and ``field`` to ride the group dict on freshly-built ``Struct`` / ``StructField`` objects so the AST transformer can see it. Replaces the prior ``_attach_unpacked_groups`` method on ``StructType``. ``struct.py`` shrinks by ~30 lines net. ``UnpackedVector`` feature surface now lives entirely in ``unpacked_vector.py``, with ``StructType`` only calling into the helpers from its existing per-member loop. Drive-by: drop two no-longer-needed re-exports from ``struct.py`` (``_expand_unpacked_vector_naming``, ``_UnpackedVectorRef``); both consumers already import them directly from ``unpacked_vector``. Addresses the CI "feature code accreted in a hot file" advisory check.

…taichi into hp/qd-field-array

# Conflicts: # docs/source/user_guide/compound_types.md

…packed decision guide Fold the standalone ``user_guide/unpacked.md`` page into ``user_guide/matrix_vector.md`` as the natural home for documentation on the ``qd.types.vector`` factory. Add a "When to choose ``unpacked=True`` vs ``unpacked=False``" decision table that summarises which option applies given a use case, with the default biased toward ``unpacked=False`` (plain vector). The motivation, constraints, mixing example, LLVM-level explanation, and spill-detection workflow all carry over to the new section. Also: - Drop ``unpacked`` from the user-guide TOC (``index.md``). - Update the cross-link in ``compound_types.md`` to point at the new section anchor. - Update ``:doc:`` pointers in ``_unpacked.py`` and ``compound_types.py`` to reference ``matrix_vector`` instead of the now-removed ``unpacked`` page. No behavioural changes -- documentation only.

…lot" / "packed group" Per feedback, ``alloca`` is an advanced LLVM concept that user-guide readers won't necessarily know about. Rephrase the four occurrences in ``matrix_vector.md`` to use more accessible vocabulary ("a single group of N scalars", "N independent scalar slots", "stack-allocated group", "unpromoted stack slots in the IR"). No content removed; only terminology adjusted.

…tail, not user-facing The user only needs to know that ``obj.r[i]`` is *lowered* to a direct reference to the i-th synthetic member -- the particular pass that does the rewriting (the AST transformer) is an internal compiler detail. Switch to passive phrasing to remove the implementation reference while keeping the user-facing guarantee about lowering.

…iser" Two SROA mentions in the LLVM-level Advanced section rephrased to "LLVM's optimiser" / "the optimiser". The particular pass name is an implementation detail; what the reader needs to know is "the compiler tries to break the packed group apart but bails out under pressure", which the new phrasing communicates directly.

…ceding example

…cient register storage"

… about "only some lanes are hot" The previous wording said unpacked=True helps "when only some of its lanes are actually hot". That isn't the real problem in the motivating tile case, where *all* lanes are hot. The actual issue is the all-or-nothing constraint on a packed group: the compiler has to commit to keeping the whole vector in registers or spill it as a unit. Splitting the field into per-slot storage lets the compiler keep as many slots as actually fit and spill only the remainder, even when every slot is read often.

…g for register residency Previous wording said the packed vector "is committed-to or spilled as a single unit" without explaining why that's the case. Make the mechanism explicit: all N elements share one piece of stack memory, and the optimiser's promote-to-registers-or-leave-in-memory decision is made once for the whole allocation. The unpacked layout splits the field into N independent pieces of memory up front, so the optimiser gets N independent decisions and can keep as many elements in registers as actually fit.

…raming sentence; the mechanism explanation that follows now stands on its own

… plain ..." advisory sentence

The previous wording cited the 255 architectural cap as if it were the practical budget. In practice 255 is rarely usable because occupancy drops sharply as registers/thread approaches that number; production kernels usually run with around 32-64 registers/thread. Note the cap, explain why it isn't reachable in practice, and give the realistic budget so a reader sizing an unpacked group has a number they can actually use.

- ``SSA values`` -> "per-slot register-resident values" (no need for the SSA term). - ``mem2reg`` -> "the optimiser" (the specific LLVM pass isn't user-relevant). - Drop the trailing "each access then becomes a ld.local / st.local" sentence in the LLVM-level paragraph; the preceding "spills to local memory as a unit" already conveys the consequence without the PTX mnemonics. - Heading and prose for the PTX-inspection workflow: rename "Inspect the PTX directly for ld.local / st.local" to "Inspect the PTX directly for local-memory loads / stores", and add a one-sentence definition of the PTX mnemonics before the literal grep example uses them. Likewise for the shared-memory mnemonics in the trailing caveat.

…tionable; define alloca at first use The previous "Look for unpromoted stack slots in the IR" instruction wasn't actionable -- the reader wouldn't know what to grep for. Reintroduce ``alloca`` here, define what it is (every per-function stack allocation appears as one) and why a surviving ``alloca`` matters (the optimiser couldn't promote it to a register, so it will become PTX local memory), and add the literal grep command alongside the workflow above for ld.local / st.local.

…n origin/main) origin/main renamed ``user_guide/tile16.md`` to ``user_guide/tile.md`` (as part of the tile16/tile32 -> tile consolidation that also touched the test files). The remaining link in ``compound_types.md`` still pointed at the old filename and was breaking the markdown-link-check CI job on this PR. Update to ``tile.md``.

github-actions · 2026-06-03T13:38:39Z

Total: 7 file(s) changed, +301 -7 code lines.

duburcqa · 2026-06-03T13:43:14Z

+| If you...                                                                                                | Use                              |
+|----------------------------------------------------------------------------------------------------------|----------------------------------|
+| want vector arithmetic (`a + b`, `a.dot(b)`, `a.norm()`, swizzle, broadcast) at the use site             | `unpacked=False` (the default)   |
+| need to index with a *runtime* integer (`v[k]` where `k` isn't a python-int)                             | `unpacked=False`                 |
+| need to pass the vector across a `@qd.func` boundary, copy it into shared memory, or instantiate it as a kernel-local value | `unpacked=False`                 |
+| are storing the field in a smallish struct, register pressure is low, and the kernel works as-is        | `unpacked=False`                 |
+| only ever access the field as `obj.r[i]` with python-int / `qd.static`-resolved `i` (unrolled inner loops), the group is small relative to the per-thread register budget, *and* the packed version is spilling under register pressure | `unpacked=True`                  |


I'm not sure it is the correct framing. basically, unpacked=True is strictly more limited on all axes compared to unpacked=False, so basically it is only relevant in one specific case, to avoid spilling under register pressure. If this happens, the way to go may be to adapt the code so that using unpacked=True because possible, at the cost of increasing the code complexity for extra perf boost. Which adaptation the code may require is exactly what you mention in this list.

Not sure I follow

why do you feel using unpacked=True incresaes code complexity?

why do you feel this framing is not correct? How aer you thinking of framing it?

What increases the code complexity is doing all the necessary modification to accommodate the requirements for being able to use unpacked=True.

To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True and doing all the necessary modifications to comply with the requirements to mentioned in this table, which is likely to increase code complexity somehow.

empirically, in the one case we use them currently, i.e. tiles, we don't need any of the things in the table. No increased complexity. This is why I feel the proposed framing is not clearly accurate.

I'm fine with "To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True". But I dont agree with "doing all the necessary modifications". They aren't really 'modifications', it's more like, if you need to do any of these things, then you cannot use unpacked I think?

Note:

I think "need to pass the vector across a @qd.func boundary," is false, since it's inlined anyway.

the arithmetic bit is a bit of a 'straw man', since we could implment arithmetic on unpacked; we just dont have any such requirement currently

"are storing the field in a smallish struct, register pressure is low, and the kernel works as-is" isnt really a cosntraint, doesnt imply increasing code complexity

The only thing we are left with really is:

"need to index with a runtime integer (v[k] where k isn't a python-int".

This isn't really a 'complexity' issue. If you need this, you cannot use unpacked.

I think the main trade-off might be compile time versus runtime. We have three optoins:

use vector => fastest compile time, slowest runtime

use hand coded scalars => slowest compile time, fastest runtime

use unpacked => medium compile time, also fastest runtime

(also, hand coded scalars is a maintenance nightmare...)

I think with all these information, it draws a pretty faithful picture of the situation. Would you mind structure this and add it to the documentation?

github-actions · 2026-06-03T14:39:42Z

Diff coverage: 87% · 290 lines, 39 missing

…for packed vs unpacked Adds a feature-matrix table to the "Storage layout: packed vs unpacked vectors" section, complementing the existing "When to choose" decision table. The matrix covers: - The supported access patterns (indexed reads/writes; static vs runtime indices). - The vector-value operations that work on packed and not on unpacked (arithmetic, reductions, swizzle, cross-func pass-by-value, shared-mem copy). - The constructor / kernel-arg / .field() / .ndarray() paths that raise on unpacked. - **Compile-time speed**: faster on packed, slower on unpacked (N synthetic members and more IR per field). - **Runtime speed**: same as packed under low register pressure; faster than packed under register pressure (only slots that don't fit spill, rather than the whole vector).

…feature-matrix row

… choose ..."

…ature-matrix header

duburcqa · 2026-06-03T16:05:47Z

ok to merge

github-actions · 2026-06-03T16:41:34Z

Total: 7 file(s) changed, +301 -7 code lines.

github-actions · 2026-06-03T17:40:44Z

Diff coverage: 87% · 290 lines, 39 missing

hughperkins added 30 commits May 23, 2026 02:55

Merge branch 'main' into hp/qd-field-array

0a353cf

Merge branch 'hp/qd-field-array' of github.com:Genesis-Embodied-AI/gs…

9bd1b9c

…taichi into hp/qd-field-array

Merge branch 'main' into hp/qd-field-array

e46c168

Merge branch 'hp/qd-field-array' of github.com:Genesis-Embodied-AI/gs…

21424cd

…taichi into hp/qd-field-array

simplify lnaguage at end of compound type

3726e61

hughperkins changed the title ~~[Type] Add unpacked qd.vector~~ [Type] Add unpacked form for qd.vector for indexed register access Jun 3, 2026

hughperkins added 15 commits June 3, 2026 05:14

Merge remote-tracking branch 'origin/main' into hp/qd-unpacked-layout

88892a3

# Conflicts: # docs/source/user_guide/compound_types.md

docs(matrix_vector): clarify that 65 scalar members refers to the pre…

b1b494a

…ceding example

docs(matrix_vector): "if registers are tight" -> "if there is insuffi…

2eb09f8

…cient register storage"

docs(matrix_vector): drop the "targeted fix for a specific problem" f…

6919093

…raming sentence; the mechanism explanation that follows now stands on its own

docs(matrix_vector): drop "if you're not measuring spills, default to…

f950a2d

… plain ..." advisory sentence

duburcqa reviewed Jun 3, 2026

View reviewed changes

hughperkins added 5 commits June 3, 2026 08:33

docs(matrix_vector): merge vector-arithmetic and reductions into one …

a9ff432

…feature-matrix row

docs(matrix_vector): move "What works on each layout" before "When to…

ec74a23

… choose ..."

docs(matrix_vector): use short "packed" / "unpacked" labels in the fe…

b38d249

…ature-matrix header

update how to choose section

f6c8c40

Merge branch 'main' into hp/qd-unpacked-layout

a748f5a

hughperkins merged commit 57ad234 into main Jun 3, 2026
56 checks passed

hughperkins deleted the hp/qd-unpacked-layout branch June 3, 2026 18:23

Conversation

hughperkins commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's added

Test plan

Follow-ups (not in this PR)

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

duburcqa commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hughperkins commented Jun 2, 2026 •

edited

Loading