[Type] Add unpacked form for qd.vector for indexed register access#718
Conversation
…ields
Adds a new ``qd.field_array(N, dtype)`` annotation for ``@qd.dataclass`` that
exposes a logical N-element array as ``obj.r[i]`` while storing N individually-
named synthetic scalar fields (``_r0..._r{N-1}``) under the hood. For python-int
indices (including ``qd.static(range(N))``-unrolled loop variables), the AST
transformer rewrites ``obj.r[i]`` directly to ``obj._r{i}``, so the generated
LLVM IR / PTX is byte-identical to a hand-rolled named-field struct.
Motivation: today's idiomatic ``r: qd.types.vector(N, dtype)`` group field
leaves an alloca that LLVM SROA can't decompose once register pressure crosses
a threshold (e.g. two concurrent tiles in a Cholesky+TRSM kernel), causing
runtime regressions via local-memory spills. The named-field cascade pattern
avoids the spill but balloons source size (32-way ``if k == N: self.rN = val``
write cascades duplicated at every callsite). ``field_array`` collapses those
cascades to one AST node per callsite while preserving the named-field IR.
Changes:
- ``lang/struct.py``: ``FieldArray`` type wrapper, ``field_array(count, dtype)``
constructor, expansion in ``StructType.__init__`` (synthetic field names plus
``_field_groups`` metadata), propagation in ``StructType.__call__``,
``_FieldArrayRef`` transient proxy.
- ``lang/impl.py``: preserve ``_qd_field_groups`` across the ``Struct`` rewrap
in ``expr_init``.
- ``lang/ast/ast_transformer.py``: ``build_Attribute`` returns a
``_FieldArrayRef`` for group access; ``build_Subscript`` resolves it to a
direct field reference for python-int indices.
- ``tests/python/test_field_array.py``: 5 tests covering construction, static
python-int index, qd.static loop-var index, runtime-index rejection (clear
error), and static-index OOB rejection.
Runtime-int indexing is intentionally rejected with a friendly error pointing
at ``qd.static``; existing cascade helpers continue to handle the runtime case
by spelling out the ``_rN`` fields directly. Adding runtime-int support is a
small follow-up.
Verified on a field_array port of genesis ``_tile32.py``: PTX byte-identical to
the named-field S1 baseline (modulo the per-session-nonce comment) on both
``chol_kernel`` and ``chol_trsm_kernel``; zero local-memory spills (S1: 0/0,
FA: 0/0, F4-A vector-field variant: 42/97); 25% compile-time reduction on the
single-tile harness (5.60s -> 4.19s, 3-run mean). Source dropped from 1068 to
515 lines (-52%). Full writeup in perso_hugh/doc/qd_field_array_2026may23.md.
All 201 tests in test_py_dataclass.py + test_complex_struct.py + test_struct.py
continue to pass; the 5 new tests pass in 1.76s total.
Following review of PR #712, rename the indexed-fields-on-@qd.dataclass primitive from `field_array` to `register_array` to better reflect its intent. The mechanism declares N independently-allocated scalar struct members so SROA + mem2reg can register-promote each slot independently, side-stepping the SROA-bailout that affects packed `qd.types.vector(N, dtype)` storage under register pressure. Renames: - Public: qd.field_array(N, dtype) -> qd.register_array(N, dtype) FieldArray -> RegisterArray - Private: _FieldArrayRef -> _RegisterArrayRef _qd_field_groups (struct) -> _qd_register_groups _field_groups (type) -> _register_groups _expand_field_array_naming -> _expand_register_array_naming _qd_is_field_array_ref -> _qd_is_register_array_ref - Tests: tests/python/test_field_array.py -> test_register_array.py The internal `_qd_field_for(i)` helper on the proxy is kept (its return value is a struct *field*, which is still the correct domain term). No behavioural change; the 5 pytest cases (construction, static-int subscript, qd.static loop subscript, runtime-index rejection, static OOB rejection) all pass under the new name.
…ected list test_api.py asserts the public surface of `qd` against a sorted golden list. The register_array PR added two new public names (the constructor `register_array` and the type wrapper `RegisterArray`); add them to the expected list so test_api[arch=*-quadrants] passes again on macOS / CI.
…dule Addresses the large-file lint on PR #712: the four register_array definitions (RegisterArray, register_array, _expand_register_array_naming, _RegisterArrayRef) are self-contained (only depend on numpy + QuadrantsSyntaxError) and have zero coupling to struct.py internals, so they belong in their own module like field.py / matrix.py. - python/quadrants/lang/register_array.py: new module with the four definitions. - python/quadrants/lang/struct.py: drop the inline definitions, import from the new module. `__all__` still re-exports the public names so `from quadrants.lang.struct import *` (run via `quadrants.lang.__init__`) keeps surfacing them as `qd.register_array` and `qd.RegisterArray`. - python/quadrants/lang/ast/ast_transformer.py: import `_RegisterArrayRef` directly from the new module instead of via struct.py, avoiding a re-export hop. No behavioural change; 224 tests + 9 xfail across test_register_array.py, test_api.py, test_py_dataclass.py, test_complex_struct.py, test_struct.py all pass.
…ention Run find_underwrapped.py over the rename diff and tighten every prose run that was wrapped at the AI-default ~85-95c down to the project's 120c target. Pure prose-layout change -- no semantic edits. Only legitimate skips per the skill are the two doctest code blocks inside the `Example::` docstring in register_array.py (21c and 76c runs, deliberately formatted code samples). Files touched: register_array.py, struct.py, impl.py, ast/ast_transformer.py, tests/python/test_register_array.py. Tests still 23/23.
…tring concat)
- ast_transformer.py: ruff/isort: register_array import sorts before snode in the
quadrants.lang.* group.
- struct.py: ruff/isort: _RegisterArrayRef sorts after _expand_register_array_naming
inside the multi-line import.
- register_array.py: black joins two short f-string-continuation literals onto one
120c-safe line (in the OOB error path and __repr__).
- test_register_array.py: black inserts a blank line between the docstring and the
nested @qd.dataclass.
All hooks (black, clang-format, trailing-whitespace, eof, ruff, pylint) now pass.
Per-PR feedback request: a user-facing page covering the problem register_array solves
(SROA bailout on packed qd.types.vector under register pressure), how to use it
(@qd.dataclass annotation + python-int / qd.static indices), and its constraints (static
indices only, no vector arithmetic, count fixed at struct-definition time, naming
collisions to watch out for).
- docs/source/user_guide/register_array.md: new doc page.
- docs/source/user_guide/index.md: link the new page from Core concepts, next to
compound_types.
- docs/source/user_guide/compound_types.md: cross-link from the qd.dataclass overview.
Pyright on PR #712 fails with: python/quadrants/lang/impl.py:111:24 - error: Cannot assign to attribute "_qd_register_groups" for class "Struct". Attribute "_qd_register_groups" is unknown (reportAttributeAccessIssue) `Struct` doesn't statically declare the metadata attr (it's an opt-in per-instance tag set only when the dataclass has at least one register_array group). Use `setattr()` instead of direct attribute assignment so pyright doesn't insist on a class-level declaration; the runtime semantics are identical. Verified locally: 0 errors, 9 warnings -- all 9 warnings pre-existing and unrelated to this PR.
…dback)
"register_array" reads as imperative ("register the array") and overcommits the type to
an implementation effect (register residency). "unpacked_array" describes the actual
layout the annotation produces -- each slot in its own `alloca` -- and contrasts cleanly
with the implicit packed-vector default (`qd.types.vector(N)`). Whether the slots end up
in registers is the optimiser's call; `unpacked_array` removes the packed-storage obstacle.
Renames:
- Public: qd.register_array(N, dtype) -> qd.unpacked_array(N, dtype)
RegisterArray -> UnpackedArray
- Private: _RegisterArrayRef -> _UnpackedArrayRef
_qd_register_groups -> _qd_unpacked_groups
_register_groups (StructType) -> _unpacked_groups
_expand_register_array_naming -> _expand_unpacked_array_naming
_qd_is_register_array_ref -> _qd_is_unpacked_array_ref
- Module: python/quadrants/lang/register_array.py
-> python/quadrants/lang/unpacked_array.py
- Tests: tests/python/test_register_array.py
-> tests/python/test_unpacked_array.py
- Docs: docs/source/user_guide/register_array.md
-> docs/source/user_guide/unpacked_array.md (also reworded around the
"packed vs unpacked layout" framing)
- test_api.py expected list updated (UnpackedArray + unpacked_array in alphabetical
position).
No behavioural change; all checks pass locally:
- 224 passed + 9 xfailed across test_unpacked_array, test_api, test_py_dataclass,
test_complex_struct, test_struct.
- pre-commit run -a: black, clang-format, trailing-whitespace, eof, ruff, pylint all
pass.
- pyright on the touched files: 0 errors, 0 warnings.
User-facing language fix. 'Trace time' is internal Quadrants jargon for the python-side
AST walk; from the user's seat the relevant boundary is the compile/run split, so use
'compile time' in the doc and test docstring.
- docs/source/user_guide/unpacked_array.md: 'AST-build time' / 'trace time' in the
constraints section -> 'compile time'.
- tests/python/test_unpacked_array.py: OOB-static-index test docstring updated to
match.
No behavioural change; pure prose.
…call__
Previous comment said "propagate the register-array group metadata onto the Struct
instance ... on the per-trace expression" - both terms are stale / unclear:
- "register-array" was left behind by the qd.register_array -> qd.unpacked_array
rename (which matched identifiers, not free-form prose).
- "per-trace expression" is internal jargon; what's actually happening is that
StructType.__call__ runs on every kernel trace and builds a fresh `Struct` object
representing the `Tile()` instantiation in the kernel's IR, and we tag THAT
expression-object (vs the class-level StructType) because ASTTransformer.build_Attribute
walks the instance.
Reword the comment to spell that out without the jargon. No behavioural change.
Pure rename of the class + internal refs + module, in preparation for a follow-up that switches construction to the subscript form ``UnpackedVector[dtype, count]`` and moves the export to ``qd.types``. - ``quadrants/lang/unpacked_array.py`` -> ``unpacked_vector.py`` - ``UnpackedArray`` -> ``UnpackedVector`` - ``_UnpackedArrayRef`` -> ``_UnpackedVectorRef`` - ``_expand_unpacked_array_naming`` -> ``_expand_unpacked_vector_naming`` - ``_qd_is_unpacked_array_ref`` -> ``_qd_is_unpacked_vector_ref`` The user-facing call-form ``qd.unpacked_array(N, dtype)`` factory still works (returns an ``UnpackedVector`` now). The struct-side metadata key ``_qd_unpacked_groups`` keeps its name -- it was already generic. CPU smoke test of ``@qd.dataclass`` with ``r: qd.unpacked_array(4, qd.f32)`` + ``t.r[i] = ...`` inside a kernel passes.
…nt]`` + drop call factory
Replace the ``qd.unpacked_array(N, dtype)`` factory function with the subscript
form ``qd.UnpackedVector[dtype, count]``:
- Adds ``UnpackedVector.__class_getitem__((dtype, count))`` returning the marker.
- Drops the ``unpacked_array(count, dtype)`` factory function and its export.
- Adds misuse guards on ``UnpackedVector`` instances: ``__getitem__`` and
``__call__`` both raise ``QuadrantsSyntaxError`` with a "use @qd.dataclass"
pointer. Catches the common path where a user wraps the class in
``@dataclasses.dataclass`` (silent today, then a cryptic ``TypeError``).
- Catches bad subscript shapes early: missing count, dtype/count reversed.
Rationale: the subscript spelling makes the annotation visually read as a type
("this is a type", same shape as ``NDArray[dtype, ndim]``), not a value-construct,
which heads off the "I called it, where's my container?" confusion. The
sibling-asymmetry with ``qd.types.vector(N, dtype)`` is the only real cost; the
docs table now reflects it.
Argument order ``[dtype, count]`` matches ``NDArray[dtype, ndim]``.
- ``test_unpacked_array.py`` -> ``test_unpacked_vector.py`` with new spelling
+ new ``test_unpacked_vector_marker_subscript_rejected`` /
``_marker_call_rejected`` / ``_bad_subscript_arity`` tests for the guards.
- ``docs/source/user_guide/unpacked_array.md`` -> ``unpacked_vector.md``,
rewritten to lead with the subscript form. Adds a "Common pitfalls" section
covering the ``@qd.dataclass`` requirement and the subscript-not-call rule.
- Cross-links in ``index.md`` and ``compound_types.md`` updated.
- ``test_api.py`` golden list: drop the ``unpacked_array`` function entry.
CPU smoke + python-side misuse-guard tests pass.
The previous step left ``UnpackedVector`` exposed at both ``qd.UnpackedVector`` (via the ``from quadrants.lang import *`` star import in ``qd/__init__.py``) and nothing in ``qd.types``. Bring the export in line with sibling annotations: - ``qd.types.vector(N, dtype)`` -- packed sibling, lives in ``qd.types`` - ``qd.types.matrix(M, N, dtype)`` -- packed matrix, lives in ``qd.types`` - ``qd.types.NDArray[dtype, ndim]`` -- tensor handle, lives in ``qd.types`` Now ``qd.types.UnpackedVector[dtype, count]`` joins the same namespace. - Drop ``UnpackedVector`` from ``struct.py``'s ``__all__`` so it no longer leaks into the top-level ``qd.*`` star import. - Re-export ``UnpackedVector`` from ``quadrants/types/__init__.py``. The import sits at the bottom of the file because pulling in ``quadrants.lang.unpacked_vector`` indirectly triggers ``quadrants.lang.util``, which imports ``Template`` from ``quadrants.types`` -- the late position lets ``Template`` finish initialising before that re-entry happens. Explained in an inline comment. - ``test_api.py``: drop the top-level ``UnpackedVector`` entry from ``user_api[qd]``. - ``test_unpacked_vector.py``: switch all use-sites to ``qd.types.UnpackedVector``. - User-guide doc: switch all examples + table entry to ``qd.types.UnpackedVector``. CPU smoke test verifies ``qd.UnpackedVector`` is gone, ``qd.types.UnpackedVector`` is the canonical export, and end-to-end kernel construction + read-back still works.
- ``quadrants/types/__init__.py``: add file-level ``# ruff: noqa: I001`` so the load-bearing import order (UnpackedVector at the bottom, to break a circular import with ``quadrants.lang.util``) survives ruff's import-sort hook. - ``quadrants/lang/unpacked_vector.py``: black collapsed two multi-line f-string concatenations to single lines (cosmetic).
…or spills" The "When to reach for it" section mentions ``ptxas`` spill counters and ``ld.local`` / ``st.local`` markers as signals, but the doc previously did not explain how to obtain them. Quadrants compiles LLVM -> PTX directly via the LLVM NVPTX target without invoking ``nvcc`` / ``ptxas``, so the standard ``ptxas --verbose`` output is not produced inline -- users have to dump the PTX and run ``ptxas -v`` on it themselves. - Add an ``# Advanced`` section near the bottom with ``## How to check for spills``, covering three workflows: (1) dump PTX via ``print_kernel_asm=True`` + offline ``ptxas -v``, (2) grep the PTX for ``ld.local`` / ``st.local``, (3) ``ncu --section MemoryWorkloadAnalysis`` for runtime counters. - Plus a note on ``print_kernel_llvm_ir_optimized=True`` for the post-opt LLVM IR, and the offline-cache gotcha that silently suppresses dumps on cached kernels. - Forward-link from the existing spill-signal sentence in the "When to reach for it" section.
When a ``@qd.dataclass`` declares both an ``UnpackedVector`` group ``r`` and a field whose name matches one of the synthetic expansions (e.g. ``_r0``), the previous code silently overwrote one with the other depending on annotation order, while ``_unpacked_groups['r']`` still claimed ``r[0]`` mapped to ``_r0``. That meant ``s.r[0]`` accessed the wrong storage / dtype instead of producing the duplicate-member error the docs promise. - ``StructType.__init__``: check ``sub in self.members`` before inserting each synthetic field generated from an ``UnpackedVector`` group, and check ``k in self.members`` for each subsequent member declaration, raising a clear ``QuadrantsSyntaxError`` either way. Covers both annotation orderings. - Tests: ``test_unpacked_vector_collision_with_earlier_field`` (``_r0`` declared before ``r: UnpackedVector[...]``) and ``test_unpacked_vector_collision_with_later_field`` (``_r2`` declared after). Addresses codex review comment r3314094246.
…rewrap (codex P2)
When an ``@qd.dataclass`` with an ``UnpackedVector`` member is itself nested
inside an outer ``@qd.dataclass``, ``o.inner.r[i]`` failed at AST-build time
with ``AttributeError: 'StructN' object has no attribute 'r'``. The inner
struct's ``_qd_unpacked_groups`` tag was being stripped during the
``expr_init`` rewrap path because:
1. ``StructType.cast()`` produced a fresh ``Struct`` without re-attaching the
tag, and outer-struct construction calls ``inner_dtype.cast(...)`` for each
nested member.
2. ``expr_init`` used ``rhs.to_dict(...)`` to build the rewrap input, which
recursively flattens nested ``Struct`` instances to plain ``dict`` -- so the
per-instance tag never survived to the new inner ``Struct``.
Fixes:
- ``StructType.cast()`` and ``filled_with_scalar()`` now route through a new
``_attach_unpacked_groups()`` helper that tags the freshly-built struct with
``self._unpacked_groups`` whenever the type declared at least one group.
``StructType.__call__`` no longer needs its own tagging block -- ``cast`` is
now the single tagging point, and the recursion through nested ``cast`` calls
propagates the tag to every level.
- ``impl.expr_init`` builds the rewrap dict directly from ``rhs.__entries``
(preserving nested ``Struct`` instances) instead of via ``to_dict``. Each
entry is then re-run through ``expr_init`` by ``Struct.__init__``, so nested
tags recurse correctly.
Test: ``test_unpacked_vector_nested_in_outer_dataclass`` constructs
``Outer{inner: Inner{r: UnpackedVector[f32, 4]}, scale}``, writes ``inner.r[i]``
via a ``qd.static`` loop, reads it back, and checks ``inner.r[2] * scale``.
Addresses codex review comment r3314094264.
…bscript (codex P2)
Calling ``Tile.field(shape=...)`` on a dataclass that declares
``r: UnpackedVector[dtype, N]`` produces a ``StructField`` whose entries are
the *expanded* synthetic fields ``_r0.._r{N-1}``. The group dictionary was
never copied onto the ``StructField`` (or the per-index ``_IntermediateStruct``
returned by ``impl.subscript``), so ``f[i].r[k]`` inside a kernel fell through
to a plain attribute lookup on the intermediate struct, producing
``AttributeError: 'Struct...' object has no attribute 'r'``.
Fixes:
- ``StructType.field()`` tags the freshly-built ``StructField`` with
``self._unpacked_groups``. Nested struct fields are handled by the recursion
in ``Struct.field`` (which calls ``dtype.field(...)`` for each ``StructType``
member).
- ``impl.subscript``'s ``StructField`` branch copies the tag onto the
``_IntermediateStruct`` it builds for ``f[i]``. Recursion through nested
``subscript`` calls handles ``outer_field[i].inner.r[k]``.
Tests:
- ``test_unpacked_vector_struct_field_subscript``: ``Tile.field(shape=(2,))``;
per-element write+read via ``f[n].r[i]``.
- ``test_unpacked_vector_struct_field_subscript_nested``:
``Outer.field(shape=(1,))`` with ``Inner`` carrying the UnpackedVector;
``outer_field[0].inner.r[i]`` round-trip.
Addresses codex review comment r3314094263.
Reframes the user guide around the simpler ``registers vs local memory`` mental model that the reviewer asked for, and consolidates the duplicated ``Common Pitfalls`` + ``Constraints and limitations`` sections. - Intro/motivation now leads with all-or-nothing residency for the packed vector vs. per-slot residency for ``UnpackedVector``. The SROA / ``alloca`` / ``mem2reg`` / ``ld.local`` jargon moves out of the top section and into a new ``How the layout works at the LLVM level`` subsection under ``# Advanced``, where readers who want it can still find it. - The ``so what?`` paragraph after the hand-rolled ``r0..r31`` example now spells out the costs: ugly, error-prone, slow to type-check and compile. - Removes the low-level ``IR / PTX matches exactly`` framing at L51 in favour of the user-facing ``best of both worlds: clean syntax and per-slot register residency``. - ``Common Pitfalls`` and ``Constraints and limitations`` merged into one ``Constraints and pitfalls`` section, removing the duplicate ``static indices`` bullet flagged at L120 and the ``count is fixed at struct- definition time`` bullet (which is true of every Quadrants annotation, so isn't UnpackedVector-specific). The ``static-indices`` requirement now reads as a design rationale rather than a limitation. The naming-collision bullet is updated to say collisions are *rejected* (true since the earlier ``synthetic field-name collisions`` commit), rather than ``the user should avoid them``. - ``small`` is now defined relative to the per-thread register budget (a few dozen slots on current NVIDIA GPUs; 255 32-bit registers per thread max, shared with all live state). The ``measured or strongly suspect`` bullet forward-links straight to ``How to check for spills``. Addresses codex / reviewer comments on motivation (L17, L30, L51), duplication (L120), ``small`` (L105), and ``measured / suspect`` (L107).
Replaces non-ASCII typography in ``docs/source/user_guide/unpacked_vector.md`` with plain-ASCII equivalents so the file is portable across diff viewers, grep tooling, and downstream renderers: - ``U+2026`` (ellipsis) ``...`` -> ``...`` (already removed in the earlier rewrite, but adding to the list for completeness) - ``U+2014`` (em-dash) ``--`` -> ``--`` - ``U+2192`` (rightwards arrow) ``->`` -> ``->`` Other docs in ``docs/source/user_guide/`` still contain em-dashes etc., but those weren't part of the PR review and are left alone here.
The ``Static out-of-bounds is rejected at compile time`` bullet didn't add anything for the reader: it described an obvious consequence of "static indices only" + "count is N", with a quoted error message that's not useful in a high-level overview. Drop it.
The ``Relationship to other annotations`` table is in the main user-facing section, but the original ``storage layout`` column described each row as a form of ``alloca`` -- a term the rewrite explicitly moved into the ``# Advanced / How the layout works at the LLVM level`` subsection. Restate the same distinction using the ``slot`` vocabulary from the rewritten motivation: - ``qd.f32`` (per-field): ``one independent slot per field`` - ``qd.types.vector(N, dtype)``: ``one slot holding N packed scalars`` - ``qd.types.UnpackedVector``: ``N independent slots`` The Advanced section keeps the ``alloca`` / ``mem2reg`` / ``SROA`` framing for readers who want the LLVM-level detail.
… namespace ``quadrants/lang/__init__.py`` builds ``__all__`` dynamically from ``dir()`` minus a hardcoded blocklist of submodule names (``struct``, ``matrix``, ``misc``, ...). The ``unpacked_vector`` submodule was missing from that blocklist, so once ``quadrants.lang.struct`` did its ``from quadrants.lang.unpacked_vector import ...``, the submodule got registered on the ``quadrants.lang`` package, picked up by ``__all__``, and re-exported through ``from quadrants.lang import *`` in the top-level ``quadrants/__init__.py`` -- adding ``qd.unpacked_vector`` to the public surface that ``test_api.py`` golden-asserts. The intended public surface is ``qd.types.UnpackedVector`` only. Add ``unpacked_vector`` to the lang blocklist so the submodule stays private, matching the treatment of ``struct``, ``matrix``, etc. Fixes ``test_api[arch=arm64-quadrants]`` (and all other arches; the failure was reproducible on any platform that ran the test).
…taichi into hp/qd-field-array
Extracts three pieces of ``UnpackedVector``-specific logic from ``StructType`` into free functions in ``quadrants/lang/unpacked_vector.py`` (the existing feature module). No behaviour change, all existing tests continue to pass. Moved: - ``expand_unpacked_vector_into(group, marker, members, groups, elements, cook_dtype)``: the per-member loop body that expands an ``UnpackedVector[dtype, N]`` annotation into N synthetic scalar fields, including the synthetic-vs-existing collision check. ``cook_dtype`` is injected to avoid the import cycle ``struct -> unpacked_vector -> lang.util``. - ``_check_no_unpacked_collision(name, members)``: the symmetric guard for user-declared members whose name matches an already-expanded synthetic field. Three identical inline copies in the three non-UnpackedVector branches of ``StructType.__init__`` collapse to one helper call each. - ``attach_unpacked_groups(target, groups)``: the no-op-if-empty tagging used by ``cast``, ``filled_with_scalar``, and ``field`` to ride the group dict on freshly-built ``Struct`` / ``StructField`` objects so the AST transformer can see it. Replaces the prior ``_attach_unpacked_groups`` method on ``StructType``. ``struct.py`` shrinks by ~30 lines net. ``UnpackedVector`` feature surface now lives entirely in ``unpacked_vector.py``, with ``StructType`` only calling into the helpers from its existing per-member loop. Drive-by: drop two no-longer-needed re-exports from ``struct.py`` (``_expand_unpacked_vector_naming``, ``_UnpackedVectorRef``); both consumers already import them directly from ``unpacked_vector``. Addresses the CI "feature code accreted in a hot file" advisory check.
…taichi into hp/qd-field-array
# Conflicts: # docs/source/user_guide/compound_types.md
…packed decision guide Fold the standalone ``user_guide/unpacked.md`` page into ``user_guide/matrix_vector.md`` as the natural home for documentation on the ``qd.types.vector`` factory. Add a "When to choose ``unpacked=True`` vs ``unpacked=False``" decision table that summarises which option applies given a use case, with the default biased toward ``unpacked=False`` (plain vector). The motivation, constraints, mixing example, LLVM-level explanation, and spill-detection workflow all carry over to the new section. Also: - Drop ``unpacked`` from the user-guide TOC (``index.md``). - Update the cross-link in ``compound_types.md`` to point at the new section anchor. - Update ``:doc:`` pointers in ``_unpacked.py`` and ``compound_types.py`` to reference ``matrix_vector`` instead of the now-removed ``unpacked`` page. No behavioural changes -- documentation only.
…lot" / "packed group"
Per feedback, ``alloca`` is an advanced LLVM concept that user-guide readers won't necessarily know about. Rephrase
the four occurrences in ``matrix_vector.md`` to use more accessible vocabulary ("a single group of N scalars", "N
independent scalar slots", "stack-allocated group", "unpromoted stack slots in the IR"). No content removed; only
terminology adjusted.
…tail, not user-facing The user only needs to know that ``obj.r[i]`` is *lowered* to a direct reference to the i-th synthetic member -- the particular pass that does the rewriting (the AST transformer) is an internal compiler detail. Switch to passive phrasing to remove the implementation reference while keeping the user-facing guarantee about lowering.
…iser" Two SROA mentions in the LLVM-level Advanced section rephrased to "LLVM's optimiser" / "the optimiser". The particular pass name is an implementation detail; what the reader needs to know is "the compiler tries to break the packed group apart but bails out under pressure", which the new phrasing communicates directly.
…cient register storage"
… about "only some lanes are hot" The previous wording said unpacked=True helps "when only some of its lanes are actually hot". That isn't the real problem in the motivating tile case, where *all* lanes are hot. The actual issue is the all-or-nothing constraint on a packed group: the compiler has to commit to keeping the whole vector in registers or spill it as a unit. Splitting the field into per-slot storage lets the compiler keep as many slots as actually fit and spill only the remainder, even when every slot is read often.
…g for register residency Previous wording said the packed vector "is committed-to or spilled as a single unit" without explaining why that's the case. Make the mechanism explicit: all N elements share one piece of stack memory, and the optimiser's promote-to-registers-or-leave-in-memory decision is made once for the whole allocation. The unpacked layout splits the field into N independent pieces of memory up front, so the optimiser gets N independent decisions and can keep as many elements in registers as actually fit.
…raming sentence; the mechanism explanation that follows now stands on its own
… plain ..." advisory sentence
The previous wording cited the 255 architectural cap as if it were the practical budget. In practice 255 is rarely usable because occupancy drops sharply as registers/thread approaches that number; production kernels usually run with around 32-64 registers/thread. Note the cap, explain why it isn't reachable in practice, and give the realistic budget so a reader sizing an unpacked group has a number they can actually use.
- ``SSA values`` -> "per-slot register-resident values" (no need for the SSA term). - ``mem2reg`` -> "the optimiser" (the specific LLVM pass isn't user-relevant). - Drop the trailing "each access then becomes a ld.local / st.local" sentence in the LLVM-level paragraph; the preceding "spills to local memory as a unit" already conveys the consequence without the PTX mnemonics. - Heading and prose for the PTX-inspection workflow: rename "Inspect the PTX directly for ld.local / st.local" to "Inspect the PTX directly for local-memory loads / stores", and add a one-sentence definition of the PTX mnemonics before the literal grep example uses them. Likewise for the shared-memory mnemonics in the trailing caveat.
…tionable; define alloca at first use The previous "Look for unpromoted stack slots in the IR" instruction wasn't actionable -- the reader wouldn't know what to grep for. Reintroduce ``alloca`` here, define what it is (every per-function stack allocation appears as one) and why a surviving ``alloca`` matters (the optimiser couldn't promote it to a register, so it will become PTX local memory), and add the literal grep command alongside the workflow above for ld.local / st.local.
…n origin/main) origin/main renamed ``user_guide/tile16.md`` to ``user_guide/tile.md`` (as part of the tile16/tile32 -> tile consolidation that also touched the test files). The remaining link in ``compound_types.md`` still pointed at the old filename and was breaking the markdown-link-check CI job on this PR. Update to ``tile.md``.
| | If you... | Use | | ||
| |----------------------------------------------------------------------------------------------------------|----------------------------------| | ||
| | want vector arithmetic (`a + b`, `a.dot(b)`, `a.norm()`, swizzle, broadcast) at the use site | `unpacked=False` (the default) | | ||
| | need to index with a *runtime* integer (`v[k]` where `k` isn't a python-int) | `unpacked=False` | | ||
| | need to pass the vector across a `@qd.func` boundary, copy it into shared memory, or instantiate it as a kernel-local value | `unpacked=False` | | ||
| | are storing the field in a smallish struct, register pressure is low, and the kernel works as-is | `unpacked=False` | | ||
| | only ever access the field as `obj.r[i]` with python-int / `qd.static`-resolved `i` (unrolled inner loops), the group is small relative to the per-thread register budget, *and* the packed version is spilling under register pressure | `unpacked=True` | |
There was a problem hiding this comment.
I'm not sure it is the correct framing. basically, unpacked=True is strictly more limited on all axes compared to unpacked=False, so basically it is only relevant in one specific case, to avoid spilling under register pressure. If this happens, the way to go may be to adapt the code so that using unpacked=True because possible, at the cost of increasing the code complexity for extra perf boost. Which adaptation the code may require is exactly what you mention in this list.
There was a problem hiding this comment.
Not sure I follow
- why do you feel using unpacked=True incresaes code complexity?
- why do you feel this framing is not correct? How aer you thinking of framing it?
There was a problem hiding this comment.
What increases the code complexity is doing all the necessary modification to accommodate the requirements for being able to use unpacked=True.
To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True and doing all the necessary modifications to comply with the requirements to mentioned in this table, which is likely to increase code complexity somehow.
There was a problem hiding this comment.
empirically, in the one case we use them currently, i.e. tiles, we don't need any of the things in the table. No increased complexity. This is why I feel the proposed framing is not clearly accurate.
There was a problem hiding this comment.
I'm fine with "To frame it differently: always use unpacked=False, but if for some reason to need to squeeze more performance and you discover that you have a spilling issue to due register pressure, then you may consider switching to unpacked=True". But I dont agree with "doing all the necessary modifications". They aren't really 'modifications', it's more like, if you need to do any of these things, then you cannot use unpacked I think?
There was a problem hiding this comment.
Note:
- I think "need to pass the vector across a
@qd.funcboundary," is false, since it's inlined anyway. - the arithmetic bit is a bit of a 'straw man', since we could implment arithmetic on unpacked; we just dont have any such requirement currently
- "are storing the field in a smallish struct, register pressure is low, and the kernel works as-is" isnt really a cosntraint, doesnt imply increasing code complexity
The only thing we are left with really is:
- "need to index with a runtime integer (
v[k]wherekisn't a python-int".
This isn't really a 'complexity' issue. If you need this, you cannot use unpacked.
There was a problem hiding this comment.
I think the main trade-off might be compile time versus runtime. We have three optoins:
- use vector => fastest compile time, slowest runtime
- use hand coded scalars => slowest compile time, fastest runtime
- use unpacked => medium compile time, also fastest runtime
There was a problem hiding this comment.
(also, hand coded scalars is a maintenance nightmare...)
There was a problem hiding this comment.
I think with all these information, it draws a pretty faithful picture of the situation. Would you mind structure this and add it to the documentation?
…for packed vs unpacked Adds a feature-matrix table to the "Storage layout: packed vs unpacked vectors" section, complementing the existing "When to choose" decision table. The matrix covers: - The supported access patterns (indexed reads/writes; static vs runtime indices). - The vector-value operations that work on packed and not on unpacked (arithmetic, reductions, swizzle, cross-func pass-by-value, shared-mem copy). - The constructor / kernel-arg / .field() / .ndarray() paths that raise on unpacked. - **Compile-time speed**: faster on packed, slower on unpacked (N synthetic members and more IR per field). - **Runtime speed**: same as packed under low register pressure; faster than packed under register pressure (only slots that don't fit spill, rather than the whole vector).
…feature-matrix row
…ature-matrix header
|
ok to merge |
Summary
Adds an
unpacked=Trueper-thread storage-layout kwarg toqd.types.vectorfor use on@qd.dataclassfield annotations:The value type is a vector. The kwarg declares that storage is unpacked:
Nindependent scalar slots (oneallocaeach) instead of one packed slot. The optimiser can then promote each slot to a register independently and spill only the ones it has to, instead of having to commit to keeping or spilling the whole group as a unit. The motivation is high register pressure -- e.g. several concurrent 32x32 tiles in a Cholesky + triangular solve -- where LLVM's SROA +mem2regbails out on a packed alloca and the whole vector ends up in local memory.Today only indexed access (
obj.r[i]for python-int /qd.static-resolvedi) is exercised on an unpacked field; vector arithmetic / reductions / pass-by-value / runtime indexing / shared-memory copy as a unit on unpacked fields are planned follow-ups -- they only need an AST-side materialisation of the field as a vector value, not new types.What's added
unpacked: bool = Falsekwarg onqd.types.vector(...)andVectorType.__init__. Plainqd.types.vector(N, dtype)is unchanged.@qd.dataclassfield is annotated with a vector type whose_is_unpackedflag is set,StructType.__init__expands it intoNsynthetic scalar members_r0.._r{N-1}.ASTTransformer.build_Attributerecognisesobj.ras an unpacked-group access via a_qd_unpacked_groupstag on the struct, andbuild_Subscriptresolvesobj.r[i]to a direct reference to the synthetic_r{i}field._qd_unpacked_groupstag rides on everyStruct/StructFieldinstance the type produces -- through nestedcast(),filled_with_scalar(), and per-indeximpl.subscriptpaths -- soouter.inner.r[i]andtile_field[n].r[i](including nested) all resolve correctly.unpacked=Truerefuses to be instantiated as a value (T(...),T.field(...),T.ndarray(...)) outside a@qd.dataclassfield annotation, with an error pointing at the dataclass-annotation usage. Synthetic-field-name collisions (a user-declared field shadowing a future_r{i}, or vice versa) raiseQuadrantsSyntaxErrorat class-decoration time.docs/source/user_guide/unpacked.mdcovers motivation, usage, constraints, the layout-vs-LLVM relationship, and a workflow for confirming spills (PTX dump +ptxas -v,ld.localgrep, Nsight Compute, post-optimisation LLVM IR).Test plan
pre-commit run -aclean (black / clang-format / trim-whitespace / ruff / pylint).tests/python/test_unpacked.py: 13/13 pass on CUDA (RTX PRO 6000 Blackwell) -- 8 kernel tests cover the full AST-rewrite + NVPTX codegen path; 5 cover the python-side misuse / collision guards.tests/python/test_api.py: 18/18 pass on CUDA -- confirms public-surface golden is unchanged (the kwarg form adds no new top-level symbols).Follow-ups (not in this PR)
obj.ras a vector value, so vector arithmetic /.norm()/ runtime indexing / pass-by-value across@qd.funcboundaries / shared-memory copy as a unit all start working on unpacked-layout fields.@qd.dataclass).