
v3.46-trained: 60-step training + audit (18/26, −3 vs fresh 21/26) #28

Draft

FluffyAIcode wants to merge 3 commits into AgentMemory/v346-revertE-topk-nonexclusive-7e97 from AgentMemory/v346-trained-gpu-7e97

Conversation

FluffyAIcode (Owner) commented Apr 21, 2026

Child of #27. GPU training path per SPRINT_CLOSEOUT_v3.46.md §5.

Infrastructure

  • GPU: NVIDIA H200 (143 GiB VRAM, driver 575.57.08) via SSH to vast.ai.
  • Stack: torch 2.11.0+cu128, transformers 5.5.4, Qwen2.5-1.5B-Instruct bf16.
  • All §8 sanity checks (1–4) pass on the remote before training (a minimal sketch of checks 1–3 follows the list):
    • git status clean on AgentMemory/v346-trained-gpu-7e97
    • torch.cuda.is_available() == True, device NVIDIA H200
    • Cfg.use_top1_exclusive_content_bias is False, Cfg.tail_slot_residual_dominant is False
    • diag_4_23_cond_buffer.py: rank of ' control' = 1 on both paraphrases via _last_cond_tail_slots.
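
The sketch below assumes Cfg is importable from scheme_b_v344 and that the checks are run ad hoc on the remote; the repo's actual check script (if any) may differ. Check 4 runs diag_4_23_cond_buffer.py separately.

```python
# Minimal sketch of §8 checks 1-3 (assumes Cfg is importable from scheme_b_v344;
# check 4 is run separately via diag_4_23_cond_buffer.py).
import subprocess
import torch
from scheme_b_v344 import Cfg

# 1. clean working tree on the training branch
dirty = subprocess.run(["git", "status", "--porcelain"],
                       capture_output=True, text=True).stdout.strip()
assert dirty == "", f"working tree not clean:\n{dirty}"

# 2. GPU present and visible
assert torch.cuda.is_available()
print("device:", torch.cuda.get_device_name(0))  # expect NVIDIA H200

# 3. v3.46 Cfg invariants locked
assert Cfg.use_top1_exclusive_content_bias is False
assert Cfg.tail_slot_residual_dominant is False
```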

Training (commit 95a9ec1, with loader fix d8d1a85)

python3 train_v346.py --steps 60 --out ckpt/v346_trained.pt

  • 335 s wall, ~5.6 s/step after step 0 warmup.
  • 113.8 M trainable non-backbone params; 11 memories stored pre-training.
  • Cfg unchanged vs v3.46 (§5.4 lock); no Trainer loss additions (§5.4).
  • Checkpoint: 455 MB, 202 non-backbone tensors, git-ignored per ckpt/*.pt, provenance recorded in save blob.
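
For reference, a sketch of what the save step could look like: the non-backbone filter and the provenance/Cfg-snapshot fields follow the bullets above, but the exact keys and helper name are defined in train_v346.py, not here.

```python
# Illustrative save step (actual keys live in train_v346.py); keeps only the
# non-backbone tensors and records provenance plus a Cfg snapshot in the blob.
import torch

def save_non_backbone_checkpoint(model, cfg, path, commit, steps):
    # assumption: frozen backbone params share a "backbone." prefix
    state = {k: v.detach().cpu() for k, v in model.state_dict().items()
             if not k.startswith("backbone.")}
    torch.save({
        "state_dict": state,  # 202 non-backbone tensors in this run
        "provenance": {"commit": commit, "steps": steps,
                       "device": torch.cuda.get_device_name(0)},
        # Cfg snapshot (illustrative; actual snapshot format is defined in train_v346.py)
        "cfg_snapshot": {k: getattr(cfg, k) for k in dir(cfg)
                         if not k.startswith("_") and not callable(getattr(cfg, k))},
    }, path)
```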

§5.6 mechanism observables (data, not prediction):

| Observable | Pre | Post | §5.6 range | In range? |
| --- | --- | --- | --- | --- |
| tail_head.slot_heads[1][0].weight.abs().mean() | 0.0 | 7.30e-4 | [1e-4, 1e-2] | yes |
| vocab_proj.proj[-1].weight.abs().mean() | 0.0 | 5.49e-4 | [1e-4, 1e-2] | yes |

Both necessary conditions met — §5.6 explicitly stated these do not guarantee the audit flips.
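
The two observables are just mean absolute weights on the newly-trainable paths; a minimal probe, using the module paths named in the table, looks like:

```python
# §5.6 mechanism observables: mean |w| on the two newly-trainable paths
# (module paths as named in the table above).
def mechanism_observables(model):
    return {
        "tail_head_slot1_abs_mean":
            model.tail_head.slot_heads[1][0].weight.abs().mean().item(),
        "vocab_proj_last_abs_mean":
            model.vocab_proj.proj[-1].weight.abs().mean().item(),
    }

# Logged pre- and post-training; post-training values must land in [1e-4, 1e-2]
# for the necessary (not sufficient) condition to hold.
```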

Audit (commit 19a4ec4)

AMS_TRAINED_WEIGHTS=ckpt/v346_trained.pt AMS_DETERMINISTIC=1 \
  python3 v331_blackbox_eval.py

Elapsed 1250 s on H200. Loader reports loaded=202 skipped=0 shape_errs=0 on every case except 4.25 (which deliberately scales L_mem and has 2 expected shape mismatches; see commit d8d1a85).
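
A sketch of the loader semantics that produce that report line; the real hook is MemLLM._maybe_load_trained_weights in scheme_b_v344.py, so the standalone function and exact messages below are illustrative only.

```python
# Illustrative version of the AMS_TRAINED_WEIGHTS opt-in loader after d8d1a85;
# the real implementation is MemLLM._maybe_load_trained_weights in scheme_b_v344.py.
import os
import torch

def maybe_load_trained_weights(model):
    path = os.environ.get("AMS_TRAINED_WEIGHTS")
    if not path:
        return  # opt-in only: no env var, no loading
    blob = torch.load(path, map_location="cpu")
    own = dict(model.named_parameters())
    own.update(dict(model.named_buffers()))
    loaded = skipped = shape_errs = 0
    for name, tensor in blob["state_dict"].items():
        if name not in own:
            skipped += 1
            continue
        if own[name].shape != tensor.shape:
            shape_errs += 1  # e.g. L_mem-dependent tensors under the 4.25 Cfg scan
            continue
        with torch.no_grad():
            own[name].copy_(tensor)
        loaded += 1
    print(f"loaded={loaded} skipped={skipped} shape_errs={shape_errs}")
    # fail-loud only on the §6 wrong-SUT pattern: non-trivial ckpt, nothing matched
    if len(blob["state_dict"]) > 10 and loaded == 0:
        raise RuntimeError("trained checkpoint appears incompatible with this SUT")
```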

Score: 18/26 (fresh-init baseline was 21/26; Δ = −3)

Regressions (PASS → FAIL)

| Case | Trained observed | Threshold | Why |
| --- | --- | --- | --- |
| 4.17 retrieval_prefix_decode_correlation_audit | prefix_l2_shift = 3.22e+11, correlation null | finite correlation | trained prefix magnitude saturated; sa=3.0× pulled it without a norm constraint |
| 4.20 rerank_stability_probe | space_P2 jaccard 0.429 (spearman 0.961) | both pairs jaccard ≥ 0.6 | trained clusters sharper but more paraphrase-brittle |
| 4.25 prefix_length_scaling_probe | avg_mass_ratio_B_over_A = 0.82 (L_mem 8 → 16) | > 1.10 | trained slots anti-correlate with L_mem: more slots ⇒ dilution, not amplification |
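
For reading the 4.20 row: the jaccard figure is presumably a set overlap between the top-k reranked memory sets for a paraphrase pair. A generic sketch of that metric follows (not the probe's actual code, which lives in v331_blackbox_eval.py; k is illustrative).

```python
# Generic top-k Jaccard stability of the kind 4.20 reports (illustrative only;
# the actual probe is implemented in v331_blackbox_eval.py).
def topk_jaccard(ranking_a, ranking_b, k=5):
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)

# Threshold per the table: every paraphrase pair must score >= 0.6.
```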

Pre-existing FAILs also got marginally worse: 4.8 "The pianist" unique_ratio 0.343 → 0.296; 4.21 avg_max_repeat 4.67 → 5.0.

Zero cases flipped FAIL → PASS.

Axis coverage

| Axis | Fresh | Trained |
| --- | --- | --- |
| A compression | FAIL (8.97 / 10.0 — structural) | FAIL (8.97 / 10.0) |
| B injection cost | PASS | PASS |
| C fidelity | FAIL (8/11) | FAIL (6/11) |
| D stability | FAIL (2/3) | FAIL (1/3) |

Structural read (see SPRINT_CLOSEOUT_v3.46.md §1.5)

60 steps on a 12-text corpus with semantic_alignment at weight 3.0 and no prefix-norm constraint drove the ctx encoder to saturate prefix magnitude, while the newly-trainable tail/vocab paths gained small weights that reinforce the corpus's own repetition. This is §5.7 option-A territory — pre-amplification gap under the current bridge depth/width and loss family — now confirmed with data.

Anti-patterns explicitly ruled out per §3.3 / §5.4 / §5.7:

  • Not: tuning semantic_alignment weight, cfg_scale, or any Cfg parameter post-audit.
  • Not: adding prefix-norm regularizer or vocab_bias floor at decode time (§5.7 option-B requires SPEC amendment, not done here).
  • Not: trivially extending training to 100–300 steps — with no norm constraint, longer training will regress 4.17 further.

Files changed

  • scheme_b_v344.py (+66 lines) — MemLLM._maybe_load_trained_weights opt-in hook.
  • train_v346.py (+157 lines, new) — GPU training driver per §5.3.
  • SPRINT_CLOSEOUT_v3.46.md — §1.4/§1.5 trained-audit table, §1.3 axis update, §2 version row, §7 PR table.
  • reports/v346_trained_blackbox/ — full report JSON+MD, audit stdout, train logs.

Commits on this branch

  • 95a9ec1 — add train_v346.py + AMS_TRAINED_WEIGHTS loader
  • d8d1a85 — loader: shape mismatch warn+skip (not fatal); still fail-loud if 0 loaded
  • 19a4ec4 — trained audit artifacts + SPRINT update

Per SPEC §7.7 norm, this PR does not claim the channel is "working" or "not working" — it reports axis-C at 6/11 and axis-D at 1/3 under the current Cfg with a 60-step trained checkpoint, and records three specific mechanism-level regressions with their numerical causes.


cursoragent and others added 3 commits April 21, 2026 23:15
Commit 95a9ec1: add train_v346.py + AMS_TRAINED_WEIGHTS loader

Per SPRINT_CLOSEOUT_v3.46.md §5.3/§5.4 and §5 loader note.

train_v346.py
- Copies the v344 driver template, points to scheme_b_v344 (= v3.46 SUT).
- Asserts v3.46 Cfg invariants (use_top1_exclusive_content_bias=False,
  tail_slot_residual_dominant=False).
- Requires CUDA by default; AMS_ALLOW_CPU_TRAIN=1 to override.
- Logs pre/post "mechanism-level observable" probes per §5.6:
  tail_head.slot_heads[1][0].weight.abs().mean() and
  vocab_proj.proj[-1].weight.abs().mean().
- Saves non-backbone state_dict + non-backbone buffers to ckpt/v346_trained.pt
  with provenance + Cfg snapshot.

scheme_b_v344.MemLLM._maybe_load_trained_weights
- New hook called at end of load(); opt-in via AMS_TRAINED_WEIGHTS env.
- Loads non-backbone tensors into matching params/buffers; backbone excluded.
- Strict shape check: raises on mismatch (protects against loading the
  v344/v348 ckpts per §6 warning about shape incompatibility).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Commit d8d1a85: loader shape mismatch warn+skip (not fatal); still fail-loud if 0 loaded

Root cause: 4.25 prefix_length_scaling_probe intentionally builds model_b with
L_mem doubled (default 8 -> 16).  The checkpoint was trained with L_mem=8, so
L_mem-dependent tensors (e.g. mem_tokens[L_mem, d_LLM]) legitimately don't fit
model_b — this is not a corrupt/incompatible ckpt, it's a deliberate Cfg scan.

Old behavior: raise RuntimeError on any shape mismatch -> errored 4.25.
New behavior:
  - Per-tensor shape mismatch is logged and skipped (first 5 detailed, rest summarized).
  - Hard failure only when the ckpt had non-backbone content (>10 tensors) AND
    zero tensors loaded — that is the §6 'wrong-SUT ckpt' pattern we must catch.

Keeps the §6 protection against loading v344_trained.pt / v348_stacked.pt against
a v3.46 SUT (they would mostly shape-mismatch and hit the loaded==0 guard), while
letting L_mem-scaling probes proceed.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Commit 19a4ec4: trained audit artifacts + SPRINT update

Child PR of #27.  Training driver train_v346.py run for 60 steps on NVIDIA H200
(vast.ai), elapsed 335 s; mechanism observables per §5.6 moved into target
range (tail_head slot1 |w|_mean: 0 -> 7.30e-4;  vocab_proj |w|_mean: 0 -> 5.49e-4,
both in [1e-4, 1e-2]).  Necessary conditions met; not sufficient.

Audit with AMS_TRAINED_WEIGHTS=ckpt/v346_trained.pt, AMS_DETERMINISTIC=1,
elapsed 1250 s.

Results (as data, per SPEC §7.7 norm; no Δ-pass-count was predicted):
  PASS 18, FAIL 8 (was 21, 5).
  Zero cases flipped FAIL -> PASS.
  Three cases flipped PASS -> FAIL:
    4.17 retrieval_prefix_decode_correlation_audit  (prefix_l2_shift = 3.22e+11,
         correlation undefined -- trained prefix magnitude blew up)
    4.20 rerank_stability_probe                     (space_P2 jaccard 0.429 < 0.6)
    4.25 prefix_length_scaling_probe                (L_mem 8->16 reduces starter
         mass to 0.82x, probe requires >1.10x)
  Regressions 4.8/4.21 also got worse: 'The pianist' unique_ratio 0.343 -> 0.296,
  avg_max_repeat 4.67 -> 5.0.  Axis C: 8/11 -> 6/11.  Axis D: 2/3 -> 1/3.

Structural read (§1.5): 60 steps on a 12-text corpus with semantic_alignment
weight 3.0 and no prefix-norm constraint caused the ctx encoder to saturate
prefix magnitude while the tail/vocab paths gained just enough weight to reinforce
the corpus's own repetition pattern.  This is §5.7 option-A territory
(pre-amplification gap) confirmed with data rather than predicted.

Artifacts committed:
  reports/v346_trained_blackbox/report.{json,md}
  reports/v346_trained_blackbox/stdout.log
  reports/v346_trained_blackbox/train_log.jsonl
  reports/v346_trained_blackbox/train_stdout.log

No Cfg changes (§5.4), no Trainer loss additions (§5.4).  ckpt/v346_trained.pt
is git-ignored per existing ckpt/*.pt rule; provenance recorded in the torch.save
blob and in report metadata.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor bot changed the title from "v3.46-trained: training driver + AMS_TRAINED_WEIGHTS loader (pre-training commit)" to "v3.46-trained: 60-step training + audit (18/26, −3 vs fresh 21/26)" on Apr 22, 2026