From 5c7d66db076e24bcc44e9e47f2a52de9c902a0e7 Mon Sep 17 00:00:00 2001
From: chris-colinsky <chris@lunarcommand.xyz>
Date: Tue, 9 Jun 2026 16:54:50 -0700
Subject: [PATCH 1/2] Implement wire-byte stability (proposal 0047)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add intra-impl wire-byte stability to the OpenAI provider so
equivalent OA inputs produce byte-identical wire output regardless
of dict insertion order. A new ``_canonicalize_dict_keys`` helper
recursively sorts dict keys at every nesting level while preserving
caller-supplied array ordering (the spec's split: object keys are
sorted, array order is caller-controlled).

The helper applies at four user-supplied-dict boundaries: tool
definitions (the ``function`` record top-level plus the parameters
JSON Schema), ``response_format.json_schema.schema``, RuntimeConfig
extras, and the JSON encoding of ``tool_call.arguments``. A top-
level belt-and-suspenders pass over the assembled body catches
anything the per-field passes miss.
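
For reviewers, a minimal standalone sketch of the canonicalization
contract described above. The ``encode`` function is a hypothetical
stand-in for the HTTP client's JSON encoding step and is not part of
this patch; ``canonicalize_dict_keys`` mirrors the private helper
added here but is written without the type casts.

```python
import json
from typing import Any


def canonicalize_dict_keys(value: Any) -> Any:
    # Sort dict keys at every nesting level; leave list order untouched.
    if isinstance(value, dict):
        return {k: canonicalize_dict_keys(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [canonicalize_dict_keys(v) for v in value]
    return value


def encode(body: dict[str, Any]) -> bytes:
    # Hypothetical stand-in for the HTTP client's JSON encoding step.
    canonical = canonicalize_dict_keys(body)
    return json.dumps(canonical, separators=(",", ":")).encode("utf-8")


params_a = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["query"],
}
params_b = {
    "required": ["query"],
    "properties": {"limit": {"type": "integer"}, "query": {"type": "string"}},
    "type": "object",
}

# Same schema, different insertion order: identical bytes once canonicalized.
print(encode(params_a) == encode(params_b))  # True
# Array order is caller-controlled, so reordered lists still differ.
print(encode({"stop": ["foo", "bar"]}) == encode({"stop": ["bar", "foo"]}))  # False
```

Without the canonicalization step the two ``params`` dicts would
serialize to different bytes, which is the drift this change removes.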

Closes proposal 0047 end-to-end. Pieces 1 and 2 (``Response.usage``
cache fields sourced from ``prompt_tokens_details``, and the OTel
observer emitting the cache attributes) landed in v0.12.0; this is
piece 3.
Prompt-management §13 cross-variable substring stability is
satisfied by the existing Jinja2 strict-undefined render path on
both TextPrompt and ChatPrompt; pinned by new tests.

A new ``docs/concepts/prompts.md`` section explains Automatic Prefix
Caching (APC), what OA handles for users (wire-byte canonicalization,
deterministic rendering), what users own (the spec's five informative
authoring patterns), and a vLLM debugging callout for the case where
the cache attribute does not appear (server-side
``--enable-prefix-caching`` plus ``--enable-prompt-tokens-details``).

Scope is the Chat Completions endpoint only. The OpenAI Responses
API endpoint (no python consumer today) and the Anthropic / Gemini
wire-format mappings (neither provider is implemented in python
today) are deferred.

Behavior change worth flagging: ``tool_call.arguments`` JSON
encoding now uses ``sort_keys=True``. Functionally equivalent
(parses to the same dict) but byte-different from the previous
insertion-order encoding.
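
A small illustration of that behavior change, using only the stdlib
``json`` module. The ``arguments`` dict is a made-up example, not
taken from the test suite.

```python
import json

# Hypothetical tool-call arguments, inserted in non-alphabetical order.
arguments = {"query": "x", "limit": 5}

# Previous encoding: compact separators, insertion order.
before = json.dumps(arguments, separators=(",", ":"))
# New encoding: compact separators, keys sorted.
after = json.dumps(arguments, separators=(",", ":"), sort_keys=True)

print(before)  # {"query":"x","limit":5}
print(after)   # {"limit":5,"query":"x"}

# The strings differ byte-for-byte but parse to the same dict.
print(before != after)                          # True
print(json.loads(before) == json.loads(after))  # True
```

Anything that compares or hashes the encoded ``arguments`` string
(snapshot tests, recorded fixtures) may need regenerating.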
---
 CHANGELOG.md                             |   1 +
 conformance.toml                         |  31 +-
 docs/concepts/prompts.md                 |  67 ++++
 src/openarmature/llm/providers/openai.py |  77 +++-
 tests/unit/test_llm_provider.py          | 426 ++++++++++++++++++++++-
 tests/unit/test_prompts.py               |  95 +++++
 6 files changed, 681 insertions(+), 16 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0aac2b4..43bebee 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Added
 
+- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs: equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only; the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
 - **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.
 
 ### Changed
diff --git a/conformance.toml b/conformance.toml
index da4501d..c6b1914 100644
--- a/conformance.toml
+++ b/conformance.toml
@@ -266,11 +266,34 @@ status = "implemented"
 since = "0.11.0"
 
 # Spec v0.39.0 (proposal 0047).  Implicit prefix-cache wire-byte
-# stability.  Cross-provider invariant requiring intra-impl byte
-# equality across calls with equivalent inputs.  Queued for v0.13.0
-# alongside 0049 (LLM provider hardening + typed event batch).
+# stability.  Cross-capability proposal completed in v0.13.0; it
+# landed in three pieces: (1) ``Response.usage`` cache-stat fields
+# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
+# OpenAI ``prompt_tokens_details`` payload, with conditional emission
+# preserved (absent-vs-zero distinction stays observable) — landed
+# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
+# (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
+# (and optional ``openarmature.llm.cache_creation.input_tokens``)
+# when the corresponding usage field is populated — also v0.12.0;
+# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
+# adapter — landed here. The canonicalizer recursively sorts dict
+# keys at every nesting level while preserving caller-supplied
+# array order, applied at the four user-input boundaries
+# (``tool.parameters`` / ``tool.function`` record top-level per
+# spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
+# extras, ``tool_call.arguments`` JSON encoding) plus a top-level
+# belt-and-suspenders pass over the assembled request body.  Scope
+# is the Chat Completions endpoint only; the OpenAI Responses API
+# endpoint is deferred to a future cycle (no python consumer
+# today).  Prompt-management §13 cross-variable substring stability
+# is satisfied by the existing Jinja2 ``StrictUndefined`` render
+# path; pinned by ``tests/unit/test_prompts.py::
+# test_cross_variable_substring_stability``.  Anthropic / Gemini
+# wire-byte conformance fixtures stay deferred — neither provider
+# is implemented in python today.
 [proposals."0047"]
-status = "not-yet"
+status = "implemented"
+since = "0.13.0"
 
 # Spec v0.40.0 (proposal 0048).  Read-symmetric invocation metadata.
 # Adds ``get_invocation_metadata()`` symmetric to the existing
diff --git a/docs/concepts/prompts.md b/docs/concepts/prompts.md
index bb99691..08f87b0 100644
--- a/docs/concepts/prompts.md
+++ b/docs/concepts/prompts.md
@@ -365,6 +365,73 @@ The filesystem backend layout is
 `<root>/<label>/<name>.j2`; for the example above,
 `./prompts/production/greeting.j2`.
 
+## Prefix-cache friendly authoring (APC)
+
+Inference engines that implement Automatic Prefix Caching
+(vLLM with `--enable-prefix-caching`, OpenAI's hosted prompt
+caching, llama.cpp's prefix reuse, others) skip recomputing
+attention for token prefixes they have already processed in
+a recent request. The cache hit is decided by **byte equality**
+of the prefix. A single reordered key, a shuffled tool
+definition, or a timestamp embedded in the system prompt
+invalidates the cache and re-runs full attention from the
+first changed byte.
+
+OpenArmature handles the wire-byte half of this contract for
+you. The OpenAI provider canonicalizes every user-supplied dict
+on the wire — tool parameter schemas, response-format schemas,
+`RuntimeConfig` extras, tool-call arguments — so equivalent OA
+inputs produce byte-identical wire output regardless of dict
+insertion order. Prompt rendering is deterministic by
+construction: same `Prompt` plus same variables produces
+byte-identical `PromptResult.messages` (spec
+[prompt-management §13](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/prompt-management/spec.md#13-determinism)).
+
+Authoring discipline that maximizes APC hit rates is
+out of OA's hands — it's about how you structure the prompts.
+The spec's [llm-provider §14 *APC-friendly authoring
+guidance*](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/llm-provider/spec.md#14-apc-friendly-authoring-guidance-informative)
+lists five informative patterns, summarized here:
+
+1. **Place variables and chat history at the end of templates.**
+   Stable static prefix at the front maximizes cacheable bytes.
+2. **No timestamps, UUIDs, or other nondeterministic values
+   in static segments.** They poison the cache prefix on every
+   request.
+3. **Stable few-shot ordering.** Pick once, reuse across
+   requests; don't shuffle.
+4. **Sort retrieval results before injecting** when the
+   downstream consumer doesn't care about order.
+5. **Cache-friendly tool ordering.** Define tools in a stable
+   order across calls.
+
+### Debugging "the cache attribute isn't showing up"
+
+When the OTel observer is running but
+`openarmature.llm.cache_read.input_tokens` doesn't appear on
+your `openarmature.llm.complete` spans, the cause is almost
+always server-side: the inference engine either isn't
+configured to surface cache stats, or isn't running with prefix
+caching enabled at all.
+
+- **vLLM**: launch with `--enable-prefix-caching` AND
+  `--enable-prompt-tokens-details`. The first turns APC on;
+  the second tells vLLM to populate
+  `usage.prompt_tokens_details.cached_tokens` on the wire
+  response. Both flags are required for the attribute to
+  surface.
+- **OpenAI hosted (Chat Completions)**: prompt caching is
+  on automatically for prompts ≥1024 tokens; the
+  `prompt_tokens_details.cached_tokens` field appears on
+  qualifying responses without configuration.
+
+OA's role is to source the field when present (provider-side)
+and emit the attribute when populated (observer-side); without
+the upstream signal, neither happens — and that's the right
+behavior (per the spec's absent-vs-zero distinction, an absent
+attribute means "the provider didn't report," not "zero
+hits").
+
 ## What's out of scope (for now)
 
 - **Specific vendor backends**: Langfuse, PromptLayer, etc.,
diff --git a/src/openarmature/llm/providers/openai.py b/src/openarmature/llm/providers/openai.py
index 4e8b9f5..0e68b95 100644
--- a/src/openarmature/llm/providers/openai.py
+++ b/src/openarmature/llm/providers/openai.py
@@ -714,18 +714,24 @@ def _build_request_body(
                 body["stop"] = config.stop_sequences
             # Pass-through any provider-specific extras (extra="allow"
             # on RuntimeConfig); spec §6 mandates implementations MUST
-            # accept and forward undeclared fields untouched.
+            # accept and forward undeclared fields untouched. Spec 0047
+            # §8: canonicalize each extra value at the user-input
+            # boundary so dict-typed extras (vLLM ``guided_decoding``,
+            # etc.) render with stable key ordering.
             extras = config.model_extra or {}
             for k, v in extras.items():
-                body.setdefault(k, v)
+                body.setdefault(k, _canonicalize_dict_keys(v))
         # response_format is omitted entirely on the fallback path —
         # the schema travels in the augmented system message instead.
         if schema_dict is not None and include_response_format:
+            # Spec 0047 §8.1.5 / Q5 ack: response_format.json_schema.schema
+            # is a user-supplied JSON Schema and flows through the same
+            # canonicalization path as tool.parameters.
             body["response_format"] = {
                 "type": "json_schema",
                 "json_schema": {
                     "name": _derive_schema_name(schema_dict),
-                    "schema": schema_dict,
+                    "schema": _canonicalize_dict_keys(schema_dict),
                     "strict": strict_mode_supported(schema_dict),
                 },
             }
@@ -752,7 +758,12 @@ def _build_request_body(
                 }
             else:
                 body["tool_choice"] = tool_choice
-        return body
+        # Spec 0047 §8 belt-and-suspenders: walk the assembled body
+        # once more sorting any dict at every nesting level, in case
+        # a future code path introduces a user-input boundary the
+        # per-field canonicalization above doesn't cover. Cheap (the
+        # body is small) and explicit.
+        return _canonicalize_dict_keys(body)
 
     # ------------------------------------------------------------------
     # Response parsing (spec §8.1.2)
@@ -1132,8 +1143,11 @@ def _message_to_wire(msg: Message) -> dict[str, Any]:
                         "name": tc.name,
                         # Canonical compact form (no inter-token spaces). Matches
                         # the spec's wire-mapping fixture (005, cases shape) and
-                        # the form OpenAI itself emits.
-                        "arguments": json.dumps(tc.arguments or {}, separators=(",", ":")),
+                        # the form OpenAI itself emits. ``sort_keys=True`` per
+                        # spec 0047 §8 — tool-call arguments are a
+                        # caller-supplied dict and the JSON-encoded string
+                        # MUST be byte-stable across equivalent inputs.
+                        "arguments": json.dumps(tc.arguments or {}, separators=(",", ":"), sort_keys=True),
                     },
                 }
                 for tc in msg.tool_calls
@@ -1169,14 +1183,55 @@ def _block_to_wire(block: ContentBlock) -> dict[str, Any]:
     return {"type": "image_url", "image_url": image_url}
 
 
+# Spec 0047 §8 *Intra-impl wire-byte stability* canonicalizer.
+# Recursively sorts dict keys at every nesting level; preserves list
+# ordering (per Q5 ack on the proposal-0047 coord thread — array
+# ORDER is caller-supplied and stays as-is; object KEYS inside
+# arrays get sorted via the dict-recursion branch). Applied at every
+# user-supplied-dict boundary in the wire body so equivalent OA
+# inputs produce byte-identical wire output for APC hit reliability.
+#
+# Recursion depth: bounded by the depth of the input dict, not by
+# any internal accumulator. Python's default recursion limit (1000)
+# is two orders of magnitude above realistic JSON Schema depths
+# (typical schemas top out at 5-10 nesting levels — OpenAI's API
+# rejects deeper ones at the wire layer before the cache prefix
+# matters). We don't impose our own cap; if a caller hands us a
+# 1000-deep nested dict, RecursionError surfaces immediately at
+# canonicalization time rather than producing silently-broken wire
+# bytes downstream.
+#
+# Byte-stability requires Python's dict insertion-order preservation
+# guarantee (language spec, 3.7+) AND httpx serializing the body via the
+# stdlib ``json.dumps`` default (which respects dict iteration
+# order). Both are stable contracts on the supported Python versions
+# + httpx 0.27+. If a future httpx release internalizes ordering
+# (e.g., switches to alphabetical key emission), the canonicalizer
+# becomes redundant but tests would continue to pass; if it
+# randomizes ordering, the wire-byte tests in
+# ``tests/unit/test_llm_provider.py`` would fail loudly.
+def _canonicalize_dict_keys(value: Any) -> Any:
+    if isinstance(value, dict):
+        return {k: _canonicalize_dict_keys(value[k]) for k in sorted(cast("dict[str, Any]", value))}
+    if isinstance(value, list):
+        return [_canonicalize_dict_keys(v) for v in cast("list[Any]", value)]
+    return value
+
+
 def _tool_to_wire(tool: Tool) -> dict[str, Any]:
+    # Per spec 0047 §8 ack (coord Q5): the byte-stability rule covers
+    # tool DEFINITIONS broadly — not just the parameters subtree.
+    # Sort the function record's top-level keys + recursively
+    # canonicalize the parameters JSON Schema.
     return {
         "type": "function",
-        "function": {
-            "name": tool.name,
-            "description": tool.description,
-            "parameters": tool.parameters,
-        },
+        "function": _canonicalize_dict_keys(
+            {
+                "name": tool.name,
+                "description": tool.description,
+                "parameters": tool.parameters,
+            }
+        ),
     }
 
 
diff --git a/tests/unit/test_llm_provider.py b/tests/unit/test_llm_provider.py
index 4c5aa55..4d93bb5 100644
--- a/tests/unit/test_llm_provider.py
+++ b/tests/unit/test_llm_provider.py
@@ -11,9 +11,10 @@
 
 from __future__ import annotations
 
+import json
 from collections.abc import Callable
 from contextvars import Token
-from typing import cast
+from typing import Any, cast
 
 import httpx
 import pytest
@@ -1902,3 +1903,426 @@ async def test_llm_completion_event_sources_node_identity_from_calling_context()
     assert typed.attempt_index == 2
     assert typed.fan_out_index == 3
     assert typed.branch_name == "fast"
+
+
+# ---------------------------------------------------------------------------
+# Proposal 0047: intra-impl wire-byte stability
+# ---------------------------------------------------------------------------
+
+
+async def test_wire_byte_equality_across_dict_key_insertion_order_on_tool_parameters() -> None:
+    # Spec 0047 §8 intra-impl wire-byte stability: two structurally-
+    # equivalent calls whose tool.parameters dicts differ only in
+    # key insertion order MUST produce byte-identical wire bytes.
+    # Caller-supplied JSON Schemas are the primary source of byte
+    # drift under APC; locking them down here pins the contract.
+    from openarmature.llm import Tool
+
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    # Tool A: parameters dict built with one key order.
+    tool_a = Tool(
+        name="lookup",
+        description="Look something up.",
+        parameters={
+            "type": "object",
+            "properties": {
+                "query": {"type": "string"},
+                "limit": {"type": "integer"},
+            },
+            "required": ["query"],
+        },
+    )
+    # Tool B: SAME schema, but ``properties`` keys + top-level keys
+    # in a different insertion order.
+    tool_b = Tool(
+        name="lookup",
+        description="Look something up.",
+        parameters={
+            "required": ["query"],
+            "properties": {
+                "limit": {"type": "integer"},
+                "query": {"type": "string"},
+            },
+            "type": "object",
+        },
+    )
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([UserMessage(content="hi")], tools=[tool_a])
+        await provider.complete([UserMessage(content="hi")], tools=[tool_b])
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] == captured[1], (
+        f"wire bytes differ under permuted dict keys:\n  A: {captured[0]!r}\n  B: {captured[1]!r}"
+    )
+
+
+async def test_wire_byte_equality_across_runtime_config_extras_dict_order() -> None:
+    # Spec 0047 §8: RuntimeConfig.extras keys flow through with sorted
+    # ordering even when the caller supplied them in a different
+    # insertion order. Catches the vLLM ``guided_decoding={"choice":
+    # ["a", "b"]}``-style extras where dict-typed values are the
+    # primary cache-stability hit.
+    from openarmature.llm import RuntimeConfig
+
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    config_a = RuntimeConfig.model_validate(
+        {"guided_decoding": {"choice": ["a", "b"], "backend": "outlines"}}
+    )
+    config_b = RuntimeConfig.model_validate(
+        {"guided_decoding": {"backend": "outlines", "choice": ["a", "b"]}}
+    )
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([UserMessage(content="hi")], config=config_a)
+        await provider.complete([UserMessage(content="hi")], config=config_b)
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] == captured[1]
+
+
+async def test_wire_byte_array_ordering_preserved() -> None:
+    # Spec 0047 §8 / Q5: array ORDER is caller-supplied and MUST be
+    # preserved — only dict KEYS get sorted. Verify that swapping
+    # the order of items in ``stop_sequences`` produces DIFFERENT
+    # wire bytes (the canonicalizer must not silently sort the list).
+    from openarmature.llm import RuntimeConfig
+
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    config_a = RuntimeConfig(stop_sequences=["foo", "bar"])
+    config_b = RuntimeConfig(stop_sequences=["bar", "foo"])
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([UserMessage(content="hi")], config=config_a)
+        await provider.complete([UserMessage(content="hi")], config=config_b)
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] != captured[1], (
+        "caller-supplied list order MUST be preserved on the wire; "
+        f"got identical bytes for [foo,bar] and [bar,foo]: {captured[0]!r}"
+    )
+
+
+async def test_wire_byte_equality_across_tool_call_arguments_dict_order() -> None:
+    # Spec 0047 §8: tool_call.arguments is a caller-supplied dict
+    # JSON-encoded into a string field. The encoded string MUST be
+    # byte-stable across equivalent dicts with different key orders.
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    assistant_a = AssistantMessage(
+        content="",
+        tool_calls=[ToolCall(id="c1", name="lookup", arguments={"query": "x", "limit": 5})],
+    )
+    assistant_b = AssistantMessage(
+        content="",
+        tool_calls=[ToolCall(id="c1", name="lookup", arguments={"limit": 5, "query": "x"})],
+    )
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete(
+            [
+                UserMessage(content="hi"),
+                assistant_a,
+                ToolMessage(content="result", tool_call_id="c1"),
+            ]
+        )
+        await provider.complete(
+            [
+                UserMessage(content="hi"),
+                assistant_b,
+                ToolMessage(content="result", tool_call_id="c1"),
+            ]
+        )
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] == captured[1]
+
+
+async def test_wire_byte_equality_response_format_schema_under_key_permutation() -> None:
+    # Spec 0047 §8 / Q5: response_format.json_schema.schema is a
+    # caller-supplied JSON Schema that flows through the same
+    # canonicalization path as tool.parameters. Verify byte-equality
+    # under recursive key permutation including nested ``properties``.
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": '{"answer": "ok"}'},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    schema_a: dict[str, Any] = {
+        "type": "object",
+        "properties": {
+            "answer": {"type": "string"},
+            "score": {"type": "number"},
+        },
+        "required": ["answer"],
+        "additionalProperties": False,
+    }
+    schema_b: dict[str, Any] = {
+        "additionalProperties": False,
+        "required": ["answer"],
+        "properties": {
+            "score": {"type": "number"},
+            "answer": {"type": "string"},
+        },
+        "type": "object",
+    }
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([UserMessage(content="hi")], response_schema=schema_a)
+        await provider.complete([UserMessage(content="hi")], response_schema=schema_b)
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] == captured[1]
+
+
+def test_canonicalize_dict_keys_sorts_recursively_and_preserves_lists() -> None:
+    # Locks down the helper's contract directly — defensive in case
+    # the wire-byte tests above ever miss a regression that surfaces
+    # only in deeply nested or list-of-objects shapes.
+    from openarmature.llm.providers.openai import _canonicalize_dict_keys
+
+    src: dict[str, Any] = {
+        "z": 1,
+        "a": {
+            "y": [{"d": 4, "c": 3}, {"b": 2, "a": 1}],
+            "x": "v",
+        },
+    }
+    result = _canonicalize_dict_keys(src)
+    # Top-level keys sorted.
+    assert list(result.keys()) == ["a", "z"]
+    # Nested dict keys sorted.
+    assert list(result["a"].keys()) == ["x", "y"]
+    # List ordering preserved (the two objects stay in source order).
+    assert result["a"]["y"][0] == {"c": 3, "d": 4}
+    assert result["a"]["y"][1] == {"a": 1, "b": 2}
+    # Inside-list dicts have sorted keys.
+    assert list(result["a"]["y"][0].keys()) == ["c", "d"]
+    assert list(result["a"]["y"][1].keys()) == ["a", "b"]
+
+
+async def test_wire_body_top_level_keys_arrive_sorted() -> None:
+    # Direct assertion on the belt-and-suspenders pass at the end of
+    # _build_request_body — independent of any single apply site. Walks
+    # the captured JSON body and confirms every dict at every nesting
+    # level has lexicographically-sorted keys. Catches a regression
+    # where a future code path adds a key after the belt-and-suspenders
+    # pass would have run, or where the pass itself gets removed.
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([UserMessage(content="hi")])
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 1
+    body = json.loads(captured[0])
+
+    def _assert_sorted(node: Any, path: str) -> None:
+        if isinstance(node, dict):
+            keys = list(cast("dict[str, Any]", node).keys())
+            assert keys == sorted(keys), f"keys at {path} not sorted: {keys}"
+            for k, v in cast("dict[str, Any]", node).items():
+                _assert_sorted(v, f"{path}.{k}")
+        elif isinstance(node, list):
+            for i, v in enumerate(cast("list[Any]", node)):
+                _assert_sorted(v, f"{path}[{i}]")
+
+    _assert_sorted(body, "<root>")
+
+
+async def test_wire_byte_equality_across_image_content_blocks() -> None:
+    # The image content-block wire shape (``_block_to_wire``) is
+    # fully OA-controlled — no caller-supplied dict passes through.
+    # But the canonicalization pass at the body root walks through
+    # it, and we want byte-equality across equivalent calls to be
+    # observably stable (a future refactor that introduces a caller-
+    # supplied source dict at this boundary would need to keep this
+    # test passing). Two calls with the same image + same surrounding
+    # text produce identical wire bytes.
+    from openarmature.llm.messages import ImageBlock, ImageSourceURL, TextBlock
+
+    captured: list[bytes] = []
+
+    def _handler(req: httpx.Request) -> httpx.Response:
+        captured.append(bytes(req.content))
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+            },
+        )
+
+    def _msg() -> UserMessage:
+        return UserMessage(
+            content=[
+                TextBlock(text="what is this?"),
+                ImageBlock(source=ImageSourceURL(url="https://example.com/img.png"), detail="auto"),
+            ]
+        )
+
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_handler)
+    )
+    try:
+        await provider.complete([_msg()])
+        await provider.complete([_msg()])
+    finally:
+        await provider.aclose()
+
+    assert len(captured) == 2
+    assert captured[0] == captured[1]
diff --git a/tests/unit/test_prompts.py b/tests/unit/test_prompts.py
index 475e534..24e601d 100644
--- a/tests/unit/test_prompts.py
+++ b/tests/unit/test_prompts.py
@@ -729,3 +729,98 @@ def __init__(self, prompt: Prompt) -> None:
 
     async def fetch(self, name: str, label: str = "production") -> Prompt:
         return self._prompt
+
+
+# ---------------------------------------------------------------------------
+# Proposal 0047 §13: cross-variable substring stability
+# ---------------------------------------------------------------------------
+
+
+def test_cross_variable_substring_stability_text_prompt() -> None:
+    # Spec 0047 §13 *Determinism* — *Cross-variable substring stability*
+    # (normative clause): the static substring of a rendered output —
+    # the portion not derived from variable substitution — MUST be
+    # identical across renders that differ ONLY in unrelated variable
+    # bindings. Two renders of the same template with different
+    # user-bound values flanking a common static segment must agree on
+    # that static segment byte-for-byte. Jinja2's StrictUndefined render
+    # path satisfies this naturally; the test pins the contract so a
+    # future render-time mutation (e.g., introducing context-aware
+    # whitespace normalization) would fail loudly rather than silently
+    # break APC hit rates.
+    template = "system: classify the input.\nuser: {{ user_text }}\n\ncontext: {{ context }}\n"
+    prompt = _make_prompt(template)
+    manager = PromptManager(_StubBackend(prompt))
+
+    result_a = manager.render(prompt, {"user_text": "alice", "context": "ctx1"})
+    result_b = manager.render(prompt, {"user_text": "bob", "context": "ctx2"})
+
+    rendered_a = result_a.messages[0].content
+    rendered_b = result_b.messages[0].content
+    assert isinstance(rendered_a, str) and isinstance(rendered_b, str)
+
+    # The static prefix (everything before the first substitution) MUST
+    # be byte-identical across renders.
+    static_prefix = "system: classify the input.\nuser: "
+    assert rendered_a.startswith(static_prefix)
+    assert rendered_b.startswith(static_prefix)
+    # The static infix between the two substitutions MUST be byte-
+    # identical too.
+    static_infix = "\n\ncontext: "
+    assert static_infix in rendered_a
+    assert static_infix in rendered_b
+    # Confirm the substitutions actually landed in their slots (so the
+    # test is verifying substring stability, not just unconditional
+    # equality on a degenerate render).
+    assert "alice" in rendered_a and "bob" in rendered_b
+    assert "ctx1" in rendered_a and "ctx2" in rendered_b
+
+
+def test_cross_variable_substring_stability_chat_prompt() -> None:
+    # Spec 0047 §13's substring stability rule applies to the multi-
+    # segment chat-prompt variant too — the proposal's normative text
+    # calls out "system prefix text, few-shot exchange text, segment
+    # role markers" explicitly. Each rendered segment's static portions
+    # (the role marker shape + the inter-segment formatting + the
+    # template's literal substrings) MUST be byte-identical across
+    # renders that differ only in variable bindings.
+    from openarmature.prompts import ChatPrompt, ContentSegment
+
+    chat_prompt = ChatPrompt(
+        name="classifier",
+        version="v1",
+        label="production",
+        chat_template=[
+            ContentSegment(role="system", content="Classify the input as ham or spam."),
+            ContentSegment(role="user", content="Subject: {{ subject }}\n\nBody: {{ body }}"),
+        ],
+        template_hash="sha256:chat-v1",
+        fetched_at=datetime.now(UTC),
+    )
+    manager = PromptManager(_StubBackend(chat_prompt))
+
+    result_a = manager.render(chat_prompt, {"subject": "alice's email", "body": "hello"})
+    result_b = manager.render(chat_prompt, {"subject": "bob's email", "body": "world"})
+
+    # Both renders MUST produce the same segment shape (same number of
+    # messages, same message types in the same order).
+    assert len(result_a.messages) == len(result_b.messages)
+    for msg_a, msg_b in zip(result_a.messages, result_b.messages, strict=True):
+        assert type(msg_a) is type(msg_b)
+    # Static (non-substituted) system segment MUST be byte-identical.
+    sys_a = result_a.messages[0].content
+    sys_b = result_b.messages[0].content
+    assert sys_a == sys_b
+    # User segment's static infix between the two substitutions MUST
+    # be byte-identical.
+    user_a = result_a.messages[1].content
+    user_b = result_b.messages[1].content
+    assert isinstance(user_a, str) and isinstance(user_b, str)
+    static_prefix = "Subject: "
+    static_infix = "\n\nBody: "
+    assert user_a.startswith(static_prefix) and user_b.startswith(static_prefix)
+    assert static_infix in user_a and static_infix in user_b
+    # Confirm the substitutions actually differ — guards against a
+    # degenerate-equality false pass.
+    assert "alice's email" in user_a and "bob's email" in user_b
+    assert user_a.endswith("hello") and user_b.endswith("world")

From 1bbbb82bd494a5cce18c0bc7e056587a1c2a2ac9 Mon Sep 17 00:00:00 2001
From: chris-colinsky <chris@lunarcommand.xyz>
Date: Tue, 9 Jun 2026 17:05:33 -0700
Subject: [PATCH 2/2] Address PR 145 review

Two dead-pointer fixes flagged by Copilot, both casualties of
renames made during review:

1. CHANGELOG entry referenced ``_canonicalize_json_schema``; the
   helper was renamed to ``_canonicalize_dict_keys`` because it
   canonicalizes every user-supplied dict on the wire, not just
   JSON Schemas.

2. ``conformance.toml`` 0047 leading-comment block pointed at
   ``test_cross_variable_substring_stability``; that test got
   split into ``..._text_prompt`` and ``..._chat_prompt`` when
   coverage extended to the ChatPrompt variant.
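
For reviewers who want the behavior behind the renamed helper in
item 1 at a glance, here is a minimal sketch of recursive key sorting
with caller-controlled array order. It is an illustration only, not
the helper's actual source: ``canonicalize_dict_keys_sketch`` and the
sample schemas are hypothetical names and data made up for this note.

```python
import json
from typing import Any


def canonicalize_dict_keys_sketch(node: Any) -> Any:
    """Hypothetical stand-in, not the real ``_canonicalize_dict_keys``."""
    if isinstance(node, dict):
        # Sort keys at this level, then recurse into each value.
        return {k: canonicalize_dict_keys_sketch(node[k]) for k in sorted(node)}
    if isinstance(node, list):
        # Array order is caller-controlled: recurse, but never reorder.
        return [canonicalize_dict_keys_sketch(v) for v in node]
    return node


# Two equivalent schemas that differ only in dict insertion order.
schema_a = {"type": "object", "required": ["b", "a"], "properties": {"b": {}, "a": {}}}
schema_b = {"properties": {"a": {}, "b": {}}, "required": ["b", "a"], "type": "object"}

wire_a = json.dumps(canonicalize_dict_keys_sketch(schema_a)).encode()
wire_b = json.dumps(canonicalize_dict_keys_sketch(schema_b)).encode()
assert wire_a == wire_b
```

The sketch applies to any nested dict, not only JSON Schemas, which is
the reason the rename in item 1 gives.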
---
 CHANGELOG.md     | 2 +-
 conformance.toml | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 43bebee..613754c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Added
 
-- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_json_schema` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
+- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
 - **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.
 
 ### Changed
diff --git a/conformance.toml b/conformance.toml
index c6b1914..4d31243 100644
--- a/conformance.toml
+++ b/conformance.toml
@@ -288,7 +288,9 @@ since = "0.11.0"
 # today).  Prompt-management §13 cross-variable substring stability
 # is satisfied by the existing Jinja2 ``StrictUndefined`` render
 # path; pinned by ``tests/unit/test_prompts.py::
-# test_cross_variable_substring_stability``.  Anthropic / Gemini
+# test_cross_variable_substring_stability_text_prompt`` and
+# ``test_cross_variable_substring_stability_chat_prompt``.
+# Anthropic / Gemini
 # wire-byte conformance fixtures stay deferred — neither provider
 # is implemented in python today.
 [proposals."0047"]