This guide explains the knobs used by azchess.tools.bench_local_loop during Matrix0 self-play, training, and checkpoint comparison. It is optimized for the current Apple Silicon local-loop workflow.
Use the local loop to answer one narrow question at a time:
- did generator data get better?
- did policy labels get sharper?
- did value targets get less biased?
- did training improve heldout metrics?
- did a candidate generate better games than its parent?
Do not run a full training cycle until the generator-only run passes basic data-quality checks. The current bootstrap_007 path is a guarded fresh self-play loop: small fresh batches, stable anchor data, full heldout source-sliced eval, parent policy distillation, and promotion only when aggregate value improves without source-slice regression.
Current mainline parent:
checkpoints/bootstrap_007_fresh_anchor_best.pt
Inspect these in local_loop_report.json under fresh_data.quality.
- legal_policy_mass: target policy mass on legal moves. Should be near 1.0.
- policy_top_prob: average probability assigned to the most visited move. Higher means sharper labels.
- policy_entropy: entropy of the MCTS visit distribution. Lower means sharper labels.
- policy_support: number of moves with non-trivial target probability. Lower means more focused labels.
- avg_sims: actual average simulations per move.
- source_metrics: the same metrics split by capped, tablebase, terminal, and draw_adjudication.
- source_metrics.<source>.final_piece_count: material remaining when the game ended or capped.
- source_metrics.<source>.final_halfmove_clock: final halfmove clock; high values indicate fifty-move pressure.
- source_metrics.<source>.final_legal_count: final mobility at termination/cap.
- source_metrics.<source>.final_can_claim_draw: count of games where a draw claim was available at the final position.
Current rough targets:
- legal_policy_mass ~= 1.0
- policy_top_prob >= 0.25 for sharp-search experiments
- policy_entropy < 1.9 for sharp-search experiments
- low draw adjudication unless a draw-specific run is intentional
- enough tablebase/terminal data for value learning, or explicit anchor data
- capped games whose final-position metadata explains why they capped
- fresh capped fraction below the active gate, currently 0.67 to 0.75
- heldout source value deltas no worse than the active source gate, currently +2e-6 for tablebase, terminal, and capped
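
The sharp-search targets above can be checked mechanically before anyone reads the full report. The following is a minimal sketch, assuming the report is a JSON object with the metrics nested under `fresh_data.quality` as described; the helper names and the `1e-3` tolerance on legal mass are hypothetical and not part of the tool.

```python
import json


def load_quality(report_path):
    """Read fresh_data.quality from a local_loop_report.json (layout assumed)."""
    with open(report_path, "r", encoding="utf-8") as handle:
        report = json.load(handle)
    return report["fresh_data"]["quality"]


def check_fresh_quality(quality, sharp_search=True, mass_tolerance=1e-3):
    """Return a list of target misses; an empty list means the rough targets hold.

    mass_tolerance is an illustrative choice, not a documented threshold.
    """
    misses = []
    mass = quality.get("legal_policy_mass")
    if mass is None or abs(mass - 1.0) > mass_tolerance:
        misses.append(f"legal_policy_mass={mass} (want ~1.0)")
    if sharp_search:
        top_prob = quality.get("policy_top_prob")
        if top_prob is None or top_prob < 0.25:
            misses.append(f"policy_top_prob={top_prob} (want >= 0.25)")
        entropy = quality.get("policy_entropy")
        if entropy is None or entropy >= 1.9:
            misses.append(f"policy_entropy={entropy} (want < 1.9)")
    return misses
```

Calling `check_fresh_quality(load_quality(path))` on a generator-only run gives a quick pass/fail list; the outcome-mix and source-gate targets still need the per-source sections of the report.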
MCTS simulations per move.
- Higher: better search, slower games, possibly more decisive outcomes.
- Lower: faster, noisier labels, often more capped games.
- Current probes: 50, 75, 100.
Base PUCT exploration strength.
- Higher: explores more moves, broader visit targets.
- Lower: exploits policy/value more, sharper labels.
- Current sharp-search baseline: 1.6.
Linear schedule for cpuct by ply.
- Higher start helps opening exploration.
- Lower end sharpens later search.
- Current sharp-search baseline: 1.8 -> 1.2 over 32 plies.
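
To make the schedule concrete, here is a small sketch of a linear cpuct interpolation by ply. It assumes the schedule holds the end value once the ply count is reached; the function name is hypothetical and the real implementation may clamp or round differently.

```python
def cpuct_at_ply(ply, start=1.8, end=1.2, plies=32):
    """Linearly interpolate cpuct from `start` at ply 0 to `end` at `plies`.

    Assumption: after `plies` the value stays at `end`.
    """
    if plies <= 0 or ply >= plies:
        return end
    fraction = max(ply, 0) / plies
    return start + (end - start) * fraction
```

With the baseline values this gives 1.8 at the first move, roughly 1.5 at ply 16, and 1.2 from ply 32 onward.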
Root exploration noise.
- dirichlet-frac: how much root prior mass is replaced by noise.
- dirichlet-alpha: shape of noise distribution.
- dirichlet-plies: only apply noise before this ply.
Current sharp-search baseline uses lower noise:
--dirichlet-frac 0.10
--dirichlet-plies 12
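
As an illustration of how these three knobs interact, the sketch below mixes a Dirichlet sample into root priors using only the standard library. The mixing rule `(1 - frac) * prior + frac * noise` is the usual AlphaZero-style formulation and is assumed here rather than taken from the Matrix0 source; the function names and the `alpha=0.3` default are hypothetical.

```python
import random


def sample_dirichlet(size, alpha, rng):
    """Draw one symmetric Dirichlet sample by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total <= 0.0:
        # Degenerate draw (possible for very small alpha): fall back to uniform.
        return [1.0 / size] * size
    return [draw / total for draw in draws]


def noisy_root_priors(priors, ply, frac=0.10, alpha=0.3, plies=12, rng=None):
    """Replace `frac` of the root prior mass with noise, but only before `plies`."""
    if frac <= 0.0 or ply >= plies:
        return list(priors)
    if rng is None:
        rng = random.Random()
    noise = sample_dirichlet(len(priors), alpha, rng)
    return [(1.0 - frac) * p + frac * n for p, n in zip(priors, noise)]
```

With `frac=0.10` each legal move keeps at least 90% of its prior, which is why the lower-noise baseline perturbs labels much less than a larger fraction would.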
Adds random jitter to child selection scores.
- Useful for diversity.
- Bad for clean label-quality experiments.
- Current sharp-search baseline: 0.0.
As of the May 10, 2026 local-loop hardening pass, 0.0 is exact: MCTS selection no longer adds hidden random tie-breaking jitter when this is disabled. Rerun old no-jitter generator probes before drawing conclusions from them.
The batched collector also applies virtual loss through the actual leaf-collection path. Restart generator processes after updating this code; an already-running generator will keep the old search behavior.
Disables extra prior noise when policy logits look uniform.
For the current model, this is important. The model often has broad priors; adding entropy noise makes already-soft labels softer and noisier.
Controls move sampling from MCTS visit counts.
- Higher temperature: more diverse games, more noise.
- Lower temperature: more deterministic/self-consistent games.
- Current sharp-search baseline: 0.8 -> 0.15 over 24 moves.
Uniform random legal moves before MCTS starts.
- More: more opening diversity, lower reproducibility.
- Less: cleaner comparison, less variety.
- Current sharp-search baseline: 8.
Applies temperature only to saved MCTS policy targets. It does not affect move sampling.
- 1.0: preserve raw visit-count distribution.
- < 1.0: sharpen low-simulation targets before saving.
- 0.0: save one-hot argmax targets.
Use this only after search correctness is verified. With a weak model and 50 simulations, correct MCTS can visit almost every legal move; target-only sharpening is a pragmatic way to train a clearer policy signal while keeping generator behavior unchanged.
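
The three temperature regimes can be sketched as follows. This assumes the common power-law form, where each visit count is raised to `1 / temperature` and renormalized; whether the tool uses exactly this form is an assumption, and the function name is hypothetical.

```python
def sharpen_policy_target(visits, temperature):
    """Turn raw visit counts into a saved policy target at a given temperature."""
    peak = max(visits)
    if peak <= 0:
        raise ValueError("visit counts must contain at least one positive entry")
    if temperature == 0.0:
        # One-hot on the most visited move (first index wins ties).
        best = max(range(len(visits)), key=visits.__getitem__)
        return [1.0 if index == best else 0.0 for index in range(len(visits))]
    # Scale by the peak first so large counts cannot overflow at low temperature.
    powered = [(count / peak) ** (1.0 / temperature) for count in visits]
    norm = sum(powered)
    return [value / norm for value in powered]
```

For visits `[10, 30, 60]`, temperature 1.0 keeps the top move at 0.6, temperature 0.5 raises it to about 0.78, and temperature 0.0 makes it 1.0. Move sampling during the game is untouched because only the saved target goes through this transform.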
Draw adjudication can dominate training data. Use it intentionally.
Adjudicate draw when board.halfmove_clock reaches this value.
Adjudicate low-material positions as draws when total material is below this threshold.
- Set 0 to disable material-based early draws.
Minimum plies before heuristic draw rules apply.
This applies to heuristic draw rules, not forced draw conditions.
Sliding-window repetition heuristic.
- Set both to 0 to disable this heuristic.
Minimum plies before claimable repetition/fifty-move draws are adjudicated.
Do not adjudicate claimable threefold repetition. This is useful for diagnosing whether repetition claims are flooding data with neutral labels.
Do not adjudicate claimable fifty-move draws. The explicit --draw-halfmove-cap can still apply if heuristics are enabled.
Weight for capped/unfinished games when using search values as weak bootstrap targets.
- 0.0: ignore capped games for value.
- 0.25: weakly train value on search estimates for capped games.
- Higher values are risky unless search values are demonstrably reliable.
For capped-heavy datasets, value learning is weak even when policy labels are useful.
Utility for creating a copied data directory with value weights changed for selected result sources. This is useful when existing capped self-play labels are policy-useful but should not contribute to value loss.
Example:
.venv/bin/python -m azchess.tools.reweight_npz_values \
--input-dir logs/local_loop/bootstrap_006_anchor_only_nossl_candidate_generator_32g_ptt050_hbfix/data \
--output-dir logs/local_loop/bootstrap_006_anchor_only_nossl_candidate_generator_32g_ptt050_vw0/data \
--source capped \
--source unfinished \
--value-weight 0.0
This preserves policy targets and terminal/tablebase value labels while setting matching value_weight, meta_value_weight, and zero-valued bootstrap metadata in the copied shards.
Number of training steps in the local loop.
Use generator-only first with --skip-train. Only train after data quality passes.
Training batch size.
On Apple Silicon, keep this conservative unless memory metrics are stable.
Penalty that trains raw policy logits to put probability on legal moves.
Current baseline: 0.05.
Strict local-loop data mode requires legal masks when this is enabled.
Extra policy CE term computed after masking logits and targets to legal moves.
Use this when full policy CE improves mostly by increasing legal probability mass,
but policy_legal_ce / policy_legal_kl are flat or worse. This directly
trains ranking among legal moves.
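
A minimal sketch of what a legal-masked cross-entropy computes, written with plain Python lists for clarity. It assumes both the logits and the targets are renormalized over legal moves only; the function name is hypothetical and the real training code operates on batched tensors.

```python
import math


def legal_policy_ce(logits, target, legal_mask):
    """Cross-entropy between target and softmax(logits), both restricted to legal moves."""
    legal = [index for index, is_legal in enumerate(legal_mask) if is_legal]
    if not legal:
        raise ValueError("position has no legal moves")
    target_mass = sum(target[index] for index in legal)
    if target_mass <= 0.0:
        raise ValueError("target has no mass on legal moves")
    # Log-sum-exp over legal logits only; illegal logits never enter the loss.
    peak = max(logits[index] for index in legal)
    log_norm = peak + math.log(sum(math.exp(logits[index] - peak) for index in legal))
    loss = 0.0
    for index in legal:
        weight = target[index] / target_mass
        if weight > 0.0:
            loss -= weight * (logits[index] - log_norm)
    return loss
```

Because illegal logits are excluded from the normalizer, this term cannot be reduced by simply moving probability mass onto legal moves; it only drops when the ranking among legal moves improves.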
Source-aware value-loss gates. These operate on shard meta_result_source values and only affect value loss.
Recommended local-loop split while capped games dominate:
--value-include-source terminal \
--value-include-source tablebase \
--value-include-source draw_adjudication \
--value-include-source resignation
This keeps capped/unfinished self-play useful for policy targets but prevents weak capped bootstrap values from steering value learning.
Alternative:
--value-exclude-source capped \
--value-exclude-source unfinished
Use include mode for promotion-oriented runs because it fails closed when new/unknown result sources appear.
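
The fail-closed difference between the two modes is easy to see in a small sketch. It assumes prefix matching on the result-source string and that an include list, when present, takes precedence over an exclude list; both the function name and that precedence rule are hypothetical.

```python
def value_source_allowed(source, include=(), exclude=()):
    """Decide whether a position's result source contributes to value loss."""
    if include:
        # Include mode: anything not explicitly listed is dropped (fails closed).
        return any(source.startswith(prefix) for prefix in include)
    # Exclude mode: anything not explicitly listed is kept (fails open).
    return not any(source.startswith(prefix) for prefix in exclude)


INCLUDE = ("terminal", "tablebase", "draw_adjudication", "resignation")
EXCLUDE = ("capped", "unfinished")
```

A previously unseen source such as a hypothetical `timeout` label is rejected under `INCLUDE` but silently accepted under `EXCLUDE`, which is the reason to prefer include mode when promotion depends on the result.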
Source-aware value-loss multipliers. Use this when a source is rare in the training mix but is a hard promotion guard. Matching uses the same result-source prefix style as include/exclude filters.
Current bootstrap_007 diagnosis: terminal positions are underrepresented in the anchor data, but terminal value MSE is repeatedly the source slice that blocks otherwise improving candidates. Prefer reweighting over duplicating shards:
--value-source-weight terminal=2.0 \
--value-source-weight capped=1.5
Keep tablebase at the default 1.0 unless a run shows tablebase becoming the limiting guard. Do not combine high terminal weights with loose source gates; the point is to align training with the existing gate, not bypass it.
Parent value distillation guards against the source-conflict failure mode where a candidate improves one source slice by shifting the whole value head up/down, then regresses another heldout source.
--value-distill-weight is a per-position parent value anchor. Use this first for guarded local-loop scouts; it directly damps prediction drift while still allowing target MSE to move the model.
--value-mean-distill-weight is a source-wise mean anchor. It is useful as a diagnostic, but early bootstrap_007 probes showed that mean-only anchoring can still permit same-sign value prediction drift across all sources.
Both options require a teacher checkpoint:
--policy-distill-checkpoint {parent} \
--policy-distill-weight 1.0 \
--value-distill-weight 0.25
Anchor-only probes on May 14 showed the current root issue clearly:
- tablebase+capped value training improved capped/tablebase but regressed terminal.
- terminal_only improved terminal but regressed capped/tablebase.
- balanced source training mostly behaved like a same-sign value bias shift.
- --value-distill-weight 0.25 reduced the bias shift and improved aggregate heldout value, but terminal still needs a source-gate check before an unattended loop.
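
To show how the two anchors differ, the sketch below writes both as simple penalties. The exact loss used in training is not shown in this guide, so the forms here (squared error to the parent per position, and squared difference of per-source means) are assumptions, and all function names are hypothetical.

```python
def value_loss_with_distill(preds, targets, parent_preds, distill_weight=0.25):
    """Target MSE plus a per-position squared-error anchor to the parent's values."""
    count = len(preds)
    target_mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / count
    anchor_mse = sum((p - q) ** 2 for p, q in zip(preds, parent_preds)) / count
    return target_mse + distill_weight * anchor_mse


def mean_distill_penalty(preds, parent_preds, sources):
    """Average over sources of (mean candidate value - mean parent value) squared."""
    grouped = {}
    for pred, parent, source in zip(preds, parent_preds, sources):
        bucket = grouped.setdefault(source, [0.0, 0.0, 0])
        bucket[0] += pred
        bucket[1] += parent
        bucket[2] += 1
    gaps = [(s_pred / n - s_parent / n) ** 2 for s_pred, s_parent, n in grouped.values()]
    return sum(gaps) / len(gaps)
```

In this formulation the per-position anchor penalizes every individual prediction that moves away from the parent, while the mean anchor only sees the average shift within each source, so it is a weaker constraint.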
Source-aware policy-loss gates. These operate on shard meta_result_source values and affect the main policy CE plus legal-policy CE. Legal-mass regularization remains global.
Use this when teacher data is useful for value but hurts heldout legal move ranking:
--policy-exclude-source teacher:
This keeps teacher positions in the batch for value loss if allowed by the value source filter, while preventing the teacher policy distribution from overriding the current self-play/anchor policy curriculum.
Overrides SSL loss weight.
Use 0.0 for no-SSL ablations. With the current training path this disables SSL target creation and SSL forward compute for the training loop, while the checkpoint architecture can still contain SSL heads.
Policy target smoothing.
Use 0.0 for current self-play label-quality experiments. Smoothing can hide whether MCTS labels are already too soft.
Copies stable prior shards into the training replay buffer after fresh self-play generation.
Use this when fresh data has useful policy labels but poor value outcome mix.
Example:
--train-anchor-data-dir logs/local_loop/bootstrap_003_capped_value_48g/data
--train-anchor-max-files 16
Limits how many fresh self-play shards remain visible to the training stage.
The local-loop report still records the full generator output in data_after_selfplay. Extra fresh shards are moved under data/excluded_selfplay, outside the training scan paths. Use this when a long generator run produces good policy labels but too many capped games compared with the anchor set.
Moved fresh shards are also removed from local-loop metadata so the training data manager does not sample excluded files.
Example:
--train-fresh-max-files 48
--train-anchor-data-dir logs/local_loop/bootstrap_003_capped_value_48g/data
Runs repeated bench_local_loop cycles and promotes candidates only if the configured gate passes. Current fresh-loop runs use one cycle at a time while the gains are small:
.venv/bin/python -m azchess.tools.local_loop_cycle \
--cycles 1 \
--seed-best-checkpoint \
--stop-on-reject \
--prune-cycle-checkpoints
Use multi-cycle runs only after repeated single-cycle promotions are stable.
Promotion gate for per-source heldout value regression. This rejects candidates that improve aggregate heldout value by overfitting one source while damaging another. The current fresh-loop source guard has been relaxed to 0.000002 after repeated healthy candidates missed only by small source-slice noise:
--max-source-value-mse-delta 0.000002
The active heldout sources are:
--eval-result-source tablebase
--eval-result-source terminal
--eval-result-source capped
capped is included only because it is part of the current anchor/training mix at low value weight. Treat capped-source improvements as secondary; tablebase and terminal slices remain the sources that decide whether value learning is actually moving forward.
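
The promotion logic described here reduces to two checks: the aggregate value delta must improve, and no heldout source may regress past the guard. The sketch below is a hypothetical restatement of that rule, not the tool's code; the `max_aggregate_delta` parameter and the failure-string format are illustrative.

```python
def promotion_decision(aggregate_delta, source_deltas,
                       max_source_delta=0.000002, max_aggregate_delta=0.0,
                       metric="value_weighted_mse"):
    """Return (promote, failures) for candidate-minus-parent heldout deltas.

    Negative deltas are improvements. The aggregate must be strictly below
    max_aggregate_delta and every source at or below max_source_delta.
    """
    failures = []
    if aggregate_delta >= max_aggregate_delta:
        failures.append(f"aggregate:{metric}")
    for source in sorted(source_deltas):
        if source_deltas[source] > max_source_delta:
            failures.append(f"source:{source}:{metric}")
    return (not failures, failures)
```

A candidate with an aggregate delta of `-1.1e-6` but a terminal delta of `+1.1e-5` is rejected on the terminal slice alone, which is the overfitting pattern this gate exists to catch.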
Promotion gate on fresh self-play outcome mix. Capped games are still useful at low value weight, but an all-capped fresh batch has been a bad promotion signal. Current settings:
--max-fresh-capped-fraction 0.67
Use 0.75 only for a scout or when the terminal/tablebase source guards are comfortably clean.
local_loop_cycle passes this cap down to bench_local_loop as a pre-train fresh-quality gate by default. A run whose fresh self-play exceeds the capped fraction writes a report with train_skipped=true and stops before anchor copy, training, and eval. Use --disable-pretrain-fresh-quality-gate only for diagnostics where you intentionally want the full train/eval report from a bad fresh batch.
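
As a rough sketch of the pre-train gate, the capped fraction can be computed from per-source game counts and compared against the cap. This assumes the fraction is counted per game rather than per position; the function names and the returned dict are hypothetical, with only `train_skipped` borrowed from the report field named above.

```python
def fresh_capped_fraction(source_counts):
    """Fraction of fresh games whose result source is 'capped'."""
    total = sum(source_counts.values())
    if total <= 0:
        raise ValueError("no fresh games to evaluate")
    return source_counts.get("capped", 0) / total


def pretrain_fresh_gate(source_counts, max_capped_fraction=0.67):
    """Mirror the gate outcome: skip training when the capped fraction exceeds the cap."""
    fraction = fresh_capped_fraction(source_counts)
    return {"capped_fraction": fraction, "train_skipped": fraction > max_capped_fraction}
```

For a 24-game batch, 16 capped games (about 0.667) passes the 0.67 cap, while 17 capped games (about 0.708) is skipped at 0.67 and would only pass under the 0.75 scout setting.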
Deletes generated cycle checkpoint files and TensorBoard event files after each cycle. Keep this on for local Apple Silicon work; otherwise each rejected or accepted cycle can leave multiple 600MB+ .pt files.
Promotion archives are intentionally left for safety and should be pruned manually after confirming the promoted best checkpoint exists:
find logs/local_loop/<run>/archives -type f -name '*.pt' -delete
Selection-time source guard inside bench_local_loop. This prevents the selected training chunk from being chosen solely by aggregate value improvement when any heldout source slice regresses too much:
--eval-select-max-source-value-mse-delta 0.000002
Use this together with the cycle-level --max-source-value-mse-delta.
As of May 15, 2026, the active loop is a short, source-guarded, value-distilled scout. The model can still promote, but rejected chunks now show the real blocker: aggregate value often improves while the terminal heldout source regresses. The current recipe deliberately keeps runs short, preserves parent policy behavior, and uses terminal source weighting plus parent value distillation to test whether terminal can stay inside the guard.
--games 24
--sims 50
--max-game-len 240
--train-steps 10
--eval-select-interval 5
--lr 1.5e-8
--warmup-steps 10
--capped-value-weight 0.25
--policy-include-source __none__
--policy-distill-checkpoint {parent}
--policy-distill-weight 1.0
--value-distill-weight 0.5
--train-anchor-data-dir logs/local_loop/bootstrap_003_capped_value_48g/data
--train-anchor-source-prefix tablebase
--train-anchor-source-prefix terminal
--train-anchor-source-prefix capped
--train-result-source-mix terminal=0.25
--train-result-source-mix tablebase=0.55
--train-result-source-mix capped=0.20
--value-include-source tablebase
--value-include-source terminal
--value-include-source capped
--value-source-weight terminal=3.0
--value-source-weight capped=1.75
--checkpoint-state model
--initial-checkpoint-state model_ema
This is still a guarded scout recipe, not an unattended long-run recipe. Promote only through local_loop_cycle, keep the policy drift gates active, and inspect eval_selection.candidates on every reject.
Eval-select candidate records include selection_failures. Read these before changing knobs. A failure like source:terminal:value_weighted_mse means terminal protection is still insufficient; source:capped:value_weighted_mse means terminal/tablebase pressure is overcorrecting and capped bootstrap value needs more protection or a lower LR. If terminal remains the blocker after the terminal=0.25, terminal weight=3.0, value-distill=0.5 scout, stop cycle runs and patch diagnostics to split terminal zero-value and terminal decisive positions.
Use --disable-pretrain-fresh-quality-gate for these diagnostics. Otherwise a fresh capped fraction above the pretrain cap can stop before training/eval and hide the source-slice failure that we need to see.
Every run request should include a full copy-paste command with:
source /Users/admin/Downloads/VSCode/Matrix0/.venv/bin/activate
cd /Users/admin/Downloads/VSCode/Matrix0
RUN=...
ANCHOR=logs/local_loop/bootstrap_003_capped_value_48g/data
PARENT=checkpoints/bootstrap_007_fresh_anchor_best.pt
Latest cycle6 notes:
- cycle6d promoted with aggregate value_weighted_mse -4.95e-7 and clean source gates.
- cycle6e rejected despite aggregate candidate gains near -1.1e-6 because terminal regressed around +1.1e-5.
- cycle6f proved the pretrain fresh-quality gate can suppress training/eval when capped fraction is high; use --disable-pretrain-fresh-quality-gate for diagnostics.
- cycle6g trained/evaluated with no preabort, but terminal still failed selection (+6.86e-6 to +1.19e-5).
The next scout is cycle6h: same short-loop shape, terminal mix 0.25, tablebase 0.55, capped 0.20, terminal value weight 3.0, capped weight 1.75, and value distill 0.5.
The May 13 hardening pass fixed two important data-path issues before further runs:
- Stockfish curriculum bucket construction had a malformed positional bucket entry. The exception was swallowed and could make _get_stockfish_mixed_batch return no stockfish data.
- Mixed curriculum/teacher/stockfish/external paths could treat dict batches like legacy tuples, then drop or misalign value_weight and result_source. The data manager now normalizes batches before merging and preserves source metadata through shuffle.
These fixes matter most for teacher, stockfish, and curriculum modes. The explicit --train-result-source-mix local-loop recipe was less exposed, but future teacher/stockfish diagnostics should use the hardened code.
Moves-left supervision is wired as an optional auxiliary objective. It is disabled unless model.moves_left is true and training.moves_left_weight or --moves-left-weight is positive. Self-play shards save per-position moves_left; older single-game shards derive the target from meta_moves.
Scout shape:
--moves-left-weight 0.05
--moves-left-scale 256
--trainable-scope all
--policy-include-source __none__
--policy-distill-checkpoint {parent}
--policy-distill-weight 1.0
Promotion still depends on aggregate/source value gates and policy-drift gates. moves_left_mse is diagnostic; do not promote a checkpoint only because the auxiliary loss improves.
Latest status: the first guarded fresh moves-left scouts were clean but non-promoting. The s40 lr2e-8 run narrowly missed aggregate value promotion (-2.93e-7 vs -3e-7), and the s60 lr2e-8 run was weaker (-2.11e-7). Do not continue this exact moves-left shape as mainline; return to the stable non-moves-left guarded recipe unless a later scout shows a clear aggregate/source-gated value gain.
Use stable heldout data instead of evaluating only on the new run’s data.
This is required for meaningful before/after comparison.
Number of fixed eval batches sampled before training and reused after training.
Use more than one batch for less noisy comparisons.
Filter eval data by source prefix when using mixed replay directories.
Evaluation reports both raw value_mse and value_weighted_mse when eval batches include positive value_weight. If an eval slice has zero total value weight, value_weighted_mse is omitted instead of reported as 0.0; that keeps gates from accepting a metric that had no signal.
Use raw value_mse when the promotion question is "did the candidate fit every heldout position equally?" Use value_weighted_mse when the promotion question should match the training objective, especially with capped bootstrap positions at lower value weight.
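
The relationship between the two metrics, including the omit-on-zero-weight rule, can be sketched in a few lines. This is an illustrative restatement using plain lists; the function name is hypothetical and the real evaluator works on batched tensors.

```python
def value_metrics(preds, targets, weights):
    """Compute value_mse, and value_weighted_mse only when total weight is positive."""
    if not preds:
        raise ValueError("empty eval slice")
    squared = [(p - t) ** 2 for p, t in zip(preds, targets)]
    metrics = {"value_mse": sum(squared) / len(squared)}
    total_weight = sum(weights)
    if total_weight > 0.0:
        weighted = sum(w * e for w, e in zip(weights, squared))
        metrics["value_weighted_mse"] = weighted / total_weight
    # With zero total weight the key is left out rather than reported as 0.0.
    return metrics
```

With one capped position at weight 0.25 and error 1.0, and one terminal position at weight 1.0 and error 0.0, the raw MSE is 0.5 while the weighted MSE is 0.2, so the two metrics can rank candidates differently.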
For local-loop selection:
--eval-select-metric value_weighted_mse \
--eval-select-source-metric value_weighted_mse
For cycle promotion:
--value-gate-metric value_weighted_mse \
--source-value-gate-metric value_weighted_mse
Keep the policy drift gates unchanged. Weighted value gates are not a substitute for policy quality checks.
Local-loop scouts now default to --checkpoint-state model for heldout eval and exported candidates. The initial parent is normalized from --initial-checkpoint-state model_ema by default, then each short candidate chunk is normalized to the raw trained model state before the next chunk, final eval, and promotion.
This avoids a short-run EMA trap: with ema_decay=0.999, a 40-60 step scout's model_ema can contain only a small fraction of the raw candidate update, making real value movement look like stagnation. Long offline training and match tooling can still prefer EMA, but the local-loop gate must evaluate the same state that downstream self-play will load after promotion.
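
The size of the effect follows from the EMA recurrence. If the EMA starts at the parent weights and the raw model moved to its final value immediately, the EMA would contain `1 - decay ** steps` of the update after `steps` updates; a model that moves gradually contributes less, so this is an upper bound. The helper below is a hypothetical worked calculation, not part of the tool.

```python
def ema_update_fraction(steps, decay=0.999):
    """Upper bound on the share of the raw update present in the EMA after `steps`."""
    return 1.0 - decay ** steps


if __name__ == "__main__":
    for steps in (40, 60):
        print(steps, round(ema_update_fraction(steps), 4))
```

At `ema_decay=0.999` this is about 3.9% after 40 steps and about 5.8% after 60 steps, so gating on `model_ema` would measure almost none of what the scout actually learned.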
Force each training batch to contain an explicit mix of result-source prefixes.
Use this when a rare source has a hard promotion gate. Loss multipliers alone increase gradient size after a rare source appears; this knob ensures the source appears every step.
Current balanced-source scout mix:
--train-result-source-mix terminal=0.30 \
--train-result-source-mix tablebase=0.50 \
--train-result-source-mix capped=0.20
This is a training sampler only. Keep --value-source-weight, value include filters, policy distillation, and the same heldout source gates so the training objective and promotion criteria stay aligned.
Within each requested result-source bucket, shards are sampled by recorded sample count. This matters when fresh self-play shards and anchor shards coexist; otherwise small fresh shards can be overrepresented relative to larger anchor shards.
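
A sketch of the two sampling steps described here: turning the requested mix into per-batch counts, then choosing a shard inside a bucket in proportion to its recorded sample count. The largest-remainder rounding and both function names are assumptions for illustration, and the batch sizes in the usage note are hypothetical.

```python
import math
import random


def batch_counts(mix, batch_size):
    """Split batch_size across sources by mix fraction using largest-remainder rounding."""
    total = sum(mix.values())
    exact = {source: batch_size * share / total for source, share in mix.items()}
    counts = {source: math.floor(value) for source, value in exact.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(mix, key=lambda source: exact[source] - counts[source], reverse=True)
    for source in by_remainder[:leftover]:
        counts[source] += 1
    return counts


def pick_shard(shard_sizes, rng):
    """Pick one shard name with probability proportional to its sample count."""
    names = list(shard_sizes)
    weights = [shard_sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

For a hypothetical batch of 10, the `0.30 / 0.50 / 0.20` mix yields 3 terminal, 5 tablebase, and 2 capped positions every step. Weighting `pick_shard` by sample count keeps a 64-sample fresh shard from being drawn as often as a 4096-sample anchor shard in the same bucket.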
Reports include source_pressure diagnostics when this sampler is active. Watch expected_reuse_factor: a high value means the requested source mix is repeatedly drawing the same small source pool, which can overfit a rare guard source and regress other source slices.
Reports also include source_configuration diagnostics. Treat warnings there as run blockers: they catch cases like evaluating draw_adjudication while never copying/training that source, or sampling a source that is excluded from value loss.
If teacher or stockfish data is introduced, first run a short diagnostic that asserts nonzero source contribution in the actual batch stream. Do not assume adding shards means they are sampled; inspect source_pressure, source_configuration, and per-source heldout deltas.
The local loop sets strict mode for subprocesses:
MATRIX0_STRICT_DATA=1
MATRIX0_STRICT_CHECKPOINT=1
This means:
- corrupted shards raise instead of being skipped
- missing legal masks raise where legal masks are required
- mixed-source batches preserve value_weight; shards without the field are treated as weight 1.0 instead of causing weighted shards to be dropped
- curriculum, teacher, stockfish, tactical, and openings batches preserve result_source and value_weight through merges and shuffles
- all-zero value weights after value source filters raise when source filters or source weights are configured
- decoded curriculum/source-prefix batches raise on legal-mask construction failures instead of substituting all-legal masks
- partial checkpoint loads raise unless explicitly configured as a migration
- bad MCTS priors/logits raise instead of becoming uniform labels
- invalid MCTS visit distributions, stale root-policy counts, and NaN/Inf training outputs raise instead of being sanitized or sampled through
This is intentional. A failed run is better than a plausible but corrupted training signal.
Use --skip-train to validate labels and outcome mix before training.
Current active probe:
logs/local_loop/bootstrap_006_anchor_only_nossl_candidate_generator_64g_fixed_jitter_vloss
Use it to decide whether the anchor-only no-SSL candidate can generate data at least as good as the parent. If it remains all capped with weak target sharpness, do not train or promote it; inspect final-position metadata and fix generation/search behavior first. The current tablebase check passed, so material-heavy caps point at search/conversion behavior rather than broken Syzygy wiring.
Use fresh self-play plus a stable heldout eval directory:
--eval-data-dir logs/local_loop/bootstrap_003_capped_value_48g/data
--eval-batches 16
Use sharp fresh data plus anchor data:
--train-anchor-data-dir logs/local_loop/bootstrap_003_capped_value_48g/data
--train-anchor-max-files 16
Accept only if legal policy improves and value does not regress materially.