Technique vector clustering for beam-search diversity + production scheduling refinements by jiannanWang · Pull Request #140 · meta-pytorch/KernelAgent

jiannanWang · 2026-05-26T23:40:08Z

Summary

Builds on #139 (multi-LLM beam search) to add a diversity-preserving selection layer plus the scheduling and locking refinements that the 225-worker XL preset needs.

The motivation: with #139's P × B × M × S fanout, the beam quickly fills with kernels that target the same optimization approach. PTX dedup catches byte-equivalents but can't tell apart kernels that differ structurally yet still pursue the same strategy. Without a diversity term, the beam collapses onto a single lineage and the search plateaus. This PR adds an LLM-classifier-driven technique-vector clustering layer that selects round-robin per cluster, preserving at least one representative of each distinct optimization approach.

It also includes the scheduling/locking changes that became necessary at 225-worker scale (dynamic GPU scheduler, manager-side lock-pool decouple, bottleneck-analysis caching), and an API cleanup that collapses two redundant knobs into one.

All new features are opt-in. Clustering is gated by technique_clustering.enabled: true; the existing #139 presets do not enable it.

Example commands

Two new presets ship with the PR:

# 90-worker clustered (2 parents × 3 bottlenecks × 3 models × 5 samples + clustering)
python examples/run_opt_manager.py \
  --kernel-dir examples/optimize_01_matvec \
  --strategy beam_search_diverse_clustered \
  --max-rounds 8

# 225-worker clustered XL (5 parents × 3 bottlenecks × 3 models × 5 samples + clustering, 8 GPUs × 4 workers/GPU)
python examples/run_opt_manager.py \
  --kernel-dir examples/optimize_01_matvec \
  --strategy beam_search_diverse_clustered_xl \
  --max-rounds 8

Both presets configure models: [claude-opus-4.6, gpt-5-4, gemini-2-5-pro] and use claude-opus-4.6 as the technique classifier. Clustering remains off by default in the existing presets.

Experiment results

Two problems, 8 rounds each (full production runs):

┌───────────────┬────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│    Problem    │ Without clustering (PR-1, 90 workers concentrated) │    With clustering (PR-2, 225 workers XL)    │
├───────────────┼────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ matvec        │ 1.8318 ms (1.12× PyTorch eager)                    │ 1.8323 ms (1.12× eager)                      │
├───────────────┼────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ gemma3_swiglu │ 0.2716 ms (3.10× eager, 1.14× torch.compile)       │ 0.3107 ms (2.71× eager, 1.00× torch.compile) │
└───────────────┴────────────────────────────────────────────────────┴──────────────────────────────────────────────┘

Notes:

Matvec is already DRAM-saturated in the 90-worker run: clustering adds no headroom past that ceiling, but it doesn't hurt either (1.8318 ms vs. 1.8323 ms).
On gemma3_swiglu, clustering enables a late-round breakthrough. R7 of the XL run produced a winner kernel whose lineage would have been culled by pure top-K in round 1; clustering preserved the slow-but-diverse parent slot. (PR-1's 90-worker concentrated run still beat XL clustering on this problem at fixed budget: clustering helps escape plateaus but isn't always the dominant intervention. Both configurations match or beat torch.compile: 1.14× for the concentrated run and 1.00× for XL clustering.)

Verified live on this branch with a 2-round 90-worker clustered run on matvec:

R1: 90/90 succeeded, PTX dedup 92→92, clusters=14, best 1.8753 ms (claude-opus-4.6)
R2: 71/90 succeeded, PTX dedup 73→17 (77% collapse), clusters=7, best 1.8612 ms (claude-opus-4.6)
Final: 1.8612 ms = 1.10× PyTorch eager
Vector cache hit confirmed: R2 only classified 15 new kernels (not 73) — vectors for carry-over kernels were reused from R1's JSON-DB persistence.

What is changed

LLM technique classifier — new searching/technique_vector.py (TechniqueDefinition, load_techniques, classify_kernel, classify_many) that emits a binary 18-bit vector per kernel via one LLM call, fanned out via a thread pool.
18-bit technique taxonomy — new examples/configs/techniques_default.yaml listing optimization techniques (tensor_cores, persistent_kernel, autotuned_config, etc.) with prompt hints for the classifier.
Round-robin per-cluster beam selection via the new select_diverse_top_k (round-robin by depth across clusters), wired into BeamSearchStrategy.update_with_results behind a config flag.
ProgramEntry.technique_vector field — JSON-DB serdes so vectors persist across rounds and runs; the manager only pays the LLM cost for newly-discovered kernels.
Per-parent bottleneck-analysis caching — manager runs the LLM-driven NCU-to-bottleneck analysis once per unique parent per round and shares the result across siblings.
Spawn-on-free-GPU-slot dynamic scheduler — replaces static "spawn N, wait for all" with a pool topped up to workers_per_gpu × len(gpu_ids) and refilled as workers exit. New workers_per_gpu config knob.
Manager GPU lock pool decoupled from workers'; semaphore timeout reduced 900s → 60s so a stuck worker can't block the manager.
candidate_pool_size API simplification — single knob replaces num_top_kernels + num_expanding_parents. All preset YAMLs updated. This is the only breaking config change.
Two new strategy presets — beam_search_diverse_clustered.yaml, beam_search_diverse_clustered_xl.yaml. Existing presets unchanged (clustering off by default).

- BeamSearchStrategy: add models / samples_per_prompt / num_expanding_parents knobs; expansion is now P × M × K × C, with per-candidate openai_model and sample_idx threaded through the worker dispatch.
- PTX fingerprint dedup: new ptx_fingerprint.py captures compiled PTX from a per-call TRITON_CACHE_DIR during benchmarking, normalizes it (strip comments/debug/headers, canonicalize register/label names), and SHA-256 hashes the result. update_with_results dedups the combined pool by hash before sort+truncate; ProgramEntry / json_db carry ptx_hash.
- Multi-GPU: per-GPU mp.Lock pool (a single lock covers both benchmark and NCU on a given GPU), round-robin worker -> GPU assignment, CUDA_VISIBLE_DEVICES=<gpu_id> set in the worker process before any torch import. Manager auto-detects GPUs via nvidia-smi (NOT torch.cuda) to avoid poisoning forked children with an inherited CUDA context.
- Per-parent baseline NCU cache: manager profiles each unique parent once per round and attaches baseline_metrics to each candidate dict; workers skip their own NCU when the cache is populated.
- Bottleneck plumbing fix: num_bottlenecks is now wired from strategy_config -> worker_kwargs[num_bottlenecks_to_request] -> BottleneckAnalyzer. Pre-fix, the analyzer always asked for 1 ranked bottleneck, so workers with bottleneck_id >= 2 silently fell back to rank 1.
- mp.Queue feeder-thread deadlock fix: NvidiaWorkerRunner.run_workers now drains the queue interleaved with join(timeout=0.5) polling instead of joining all workers serially before draining.
- best_runtime_ptx_hash propagation: orchestrator captures the hash after _update_kernels (it was previously checked before, when the comparison was tautologically false), and a parent-hash byte-identity fallback in update_with_results lets unchanged-parent results inherit the parent's hash so they collapse correctly in dedup.
- ncu_profiler.py: NaN-safe units-row detection (str.lower() propagated pd.NA back to float NaN, breaking the substring check).
- Configs: examples/configs/beam_search_diverse.yaml (spread, P=5/C=2), beam_search_diverse_concentrated.yaml (P=2/C=5), beam_search_diverse_smoke.yaml (smoke variant).
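The PTX fingerprint step above (normalize, then SHA-256) can be illustrated with a minimal sketch. This is not the contents of ptx_fingerprint.py: the function names, the list of skipped directives, and the renaming scheme are all assumptions chosen for illustration, and only `//` line comments are handled.

```python
import hashlib
import re

# Directive prefixes treated as header/debug noise in this sketch (assumed list).
_SKIPPED_PREFIXES = (".version", ".target", ".address_size", ".loc", ".file")


def normalize_ptx(ptx: str) -> str:
    """Drop comments and header/debug lines, then rename registers and labels."""
    kept = []
    for raw in ptx.splitlines():
        line = re.sub(r"//.*", "", raw).strip()  # strip line comments
        if line and not line.startswith(_SKIPPED_PREFIXES):
            kept.append(line)
    text = "\n".join(kept)

    registers: dict = {}
    labels: dict = {}

    def rename(match):
        token = match.group(0)
        if token.startswith("%"):
            return registers.setdefault(token, f"%v{len(registers)}")
        return labels.setdefault(token, f"$L{len(labels)}")

    # Numbered registers (%r1, %rd12, %p3) and $-prefixed labels get canonical
    # names in order of first appearance; special registers such as %tid.x
    # carry no numeric suffix and are left alone.
    return re.sub(r"%[A-Za-z_]+\d+|\$\w+", rename, text)


def ptx_fingerprint(ptx: str) -> str:
    """Hex SHA-256 of the normalized PTX text."""
    return hashlib.sha256(normalize_ptx(ptx).encode("utf-8")).hexdigest()
```

Two kernels that differ only in register numbering, label names, comments, or header lines hash to the same value under this scheme, which is the property the dedup step relies on.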

Layered semantic dedup on top of PTX-hash dedup: an LLM emits a binary vector indicating which optimization techniques each kernel uses, and beam truncation preserves the fastest representative of each distinct vector ("at least one per cluster") so the beam doesn't fill with near-clones of the leader.
- New module: opt_worker_component/searching/technique_vector.py
  - TechniqueDefinition dataclass; load_techniques(path) reads from YAML.
  - classify_kernel(...): one LLM call → binary vector of length N.
  - classify_many(...): thread-pooled fan-out for a batch of kernels.
  - select_diverse_top_k(entries, k): diversity-aware truncation. Walk the sorted-by-time pool, keep the first entry per cluster, backfill remaining slots from the fastest unaccepted entries.
- Expandable taxonomy: examples/configs/techniques_default.yaml lists 18 techniques (split-K reduce, tensor cores, software pipelining, swizzled load, persistent kernel, vectorized load, shared-memory tiling, register tiling, autotune, masked access, atomic reduction, precision split, warp specialization, cluster/DSMEM, grid swizzling, epilogue fusion, loop unrolling, async/TMA copy). Append-only entries are safe; the strategy reads the YAML at run start and the vector dimension follows automatically.
- ProgramEntry / JSONProgramDatabase round-trip technique_vector along with the existing ptx_hash field; vectors persist across rounds and across runs.
- BeamSearchStrategy: opt-in via new ctor params (techniques, technique_classifier_provider, technique_classifier_model, technique_classifier_concurrency). When enabled, update_with_results runs PTX-dedup, then classifies any uncached / stale-dimension survivors via thread-pooled LLM, then uses select_diverse_top_k for truncation instead of plain sort+truncate. When disabled, behavior is identical to the previous PTX-only path.
- OptimizationManager._build_technique_clustering_kwargs reads the technique_clustering block from strategy_config:

      technique_clustering:
        enabled: true
        techniques_yaml: examples/configs/techniques_default.yaml
        classifier_model: claude-opus-4.6
        max_concurrency: 4

  Resolves the LLM provider via get_model_provider, falls back to the manager's openai_model when classifier_model is omitted, and silently disables clustering on any misconfiguration (logging a warning) so an experiment never crashes on a missing YAML.
- New preset examples/configs/beam_search_diverse_clustered.yaml mirrors the concentrated 90-worker production config + technique clustering.
- docs/technique_clustering_design.md captures the architecture, the taxonomy, the diversity-aware selection algorithm, the cost / payoff analysis, and open questions.
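The "classify only uncached / stale-dimension survivors via a thread pool" step can be sketched as follows. This is a minimal illustration, not the PR's classify_many: `Entry` is a hypothetical stand-in for ProgramEntry, `classify_missing` is an invented name, and the `classify` callable stands in for the single LLM call per kernel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for ProgramEntry: kernel source plus cached vector."""
    source: str
    technique_vector: Optional[List[int]] = None


def classify_missing(
    entries: List[Entry],
    classify: Callable[[str], List[int]],
    n_techniques: int,
    max_concurrency: int = 4,
) -> int:
    """Classify entries with no vector or a vector of the wrong length.

    Returns how many entries were sent to the classifier.
    """
    stale = [
        e for e in entries
        if e.technique_vector is None or len(e.technique_vector) != n_techniques
    ]
    if not stale:
        return 0
    # One classifier call per stale entry, fanned out over a bounded pool.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        vectors = list(pool.map(classify, [e.source for e in stale]))
    for entry, vector in zip(stale, vectors):
        entry.technique_vector = vector
    return len(stale)
```

Because carried-over entries keep their persisted vectors, a later round pays the classifier cost only for new kernels, which matches the R2 observation in the verification log (15 classified, not 73).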

- Update select_diverse_top_k: replace "first per cluster + backfill" with the round-robin-by-depth algorithm requested for the next experiment. Pass 0 takes the fastest of every cluster (sorted by time across clusters); pass 1 takes the second-fastest of every cluster; etc., until k slots are filled or the pool is exhausted. Output is ordered by selection (most-diverse-first), not by time, so num_expanding_parents=N picks N different clusters as parents.
- New preset examples/configs/beam_search_diverse_clustered_xl.yaml: P=5, M=3, K=3, C=5 (5 × 3 × 3 × 5 = 225 workers), 8 rounds, technique clustering enabled. Pairs with beam_search_diverse_concentrated.yaml for the next A/B comparison.
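The round-robin-by-depth selection described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the PR's select_diverse_top_k: `Kernel` and `select_round_robin` are hypothetical names, and a cluster is taken to be the set of kernels sharing an identical technique vector.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Kernel:
    """Hypothetical pool entry: a name, a runtime, and a binary technique vector."""
    name: str
    runtime_ms: float
    vector: Tuple[int, ...]


def select_round_robin(entries: List[Kernel], k: int) -> List[Kernel]:
    """Pass d takes the (d+1)-th fastest kernel of every cluster, fastest first."""
    clusters = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e.runtime_ms):
        clusters[entry.vector].append(entry)  # each cluster stays time-sorted

    selected: List[Kernel] = []
    depth = 0
    while len(selected) < k:
        layer = [m[depth] for m in clusters.values() if len(m) > depth]
        if not layer:
            break  # pool exhausted before k slots were filled
        layer.sort(key=lambda e: e.runtime_ms)
        selected.extend(layer[: k - len(selected)])
        depth += 1
    return selected


pool = [
    Kernel("A", 1.0, (1, 0)),
    Kernel("B", 1.1, (1, 0)),
    Kernel("E", 1.2, (1, 0)),
    Kernel("C", 1.5, (0, 1)),
    Kernel("D", 2.0, (0, 0)),
]
print([e.name for e in select_round_robin(pool, 4)])  # ['A', 'C', 'D', 'B']
```

A plain sort-and-truncate on the same pool would keep A, B, E, C, three of which share one cluster. The round-robin order puts one representative of each cluster first, so taking the first N entries as parents yields N different clusters whenever that many exist.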

Replace the spawn-all-at-once round-robin worker launcher with a dynamic pool-based scheduler:
- ``workers_per_gpu`` (manager-level, default 2) bounds how many worker processes can be pinned to any single GPU at once. Total active pool = ``workers_per_gpu × len(gpu_ids)``.
- ``run_workers`` maintains a pending queue + per-GPU free-slot counter. At each iteration: top up the pool (spawn pending onto GPUs with free slots, preferring GPUs with the most free capacity); drain the result queue; reap finished workers (frees their slot).
- Worker→GPU is decided at spawn time, not by index-based round-robin, so GPUs that finish quickly immediately get new candidates instead of sitting idle while their statically-assigned share trickles through the LLM phase.

Threading: ``OptimizationManager.workers_per_gpu`` flows via ``registry.create_from_config`` into ``NvidiaWorkerRunner.__init__``. Existing ``mp.Queue`` drain-while-join logic is preserved.

The XL clustered preset now sets ``workers_per_gpu: 4`` (pool capacity 32 vs. all-at-once 225): fewer simultaneous LLM calls, better GPU utilization, and ~7× lower per-GPU CUDA-context residency.
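The slot bookkeeping behind the scheduler above can be sketched without spawning any processes. `SlotScheduler` and its method names are hypothetical; the sketch only models the pending queue, the per-GPU free-slot counter, and the "most free capacity first" preference, and it leaves out process spawning and result-queue draining.

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple


class SlotScheduler:
    """Tracks pending tasks and free worker slots per GPU (bookkeeping only)."""

    def __init__(self, gpu_ids: List[int], workers_per_gpu: int = 2) -> None:
        self.free: Dict[int, int] = {g: workers_per_gpu for g in gpu_ids}
        self.pending: deque = deque()
        self.running: Dict[Hashable, int] = {}  # task -> GPU it is pinned to

    def submit(self, task: Hashable) -> None:
        self.pending.append(task)

    def top_up(self) -> List[Tuple[Hashable, int]]:
        """Start pending tasks on the GPU with the most free slots until full."""
        started = []
        while self.pending:
            gpu = max(self.free, key=self.free.get)
            if self.free[gpu] == 0:
                break  # every GPU is at capacity
            task = self.pending.popleft()
            self.free[gpu] -= 1
            self.running[task] = gpu
            started.append((task, gpu))
        return started

    def reap(self, task: Hashable) -> int:
        """Mark a finished task and free its slot; returns the GPU it used."""
        gpu = self.running.pop(task)
        self.free[gpu] += 1
        return gpu
```

In a real run loop, each iteration would call `top_up()`, drain the result queue, and call `reap()` for every worker that has exited, so a GPU whose workers finish early is refilled immediately rather than waiting for the whole batch.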

Two related fixes for the stuck-baseline-NCU bug surfaced by the 225-worker run: a round-1 worker died holding gpu_locks[0], and the manager's baseline-NCU step (which shared the same lock) then waited 15 minutes per parent before the existing semaphore timeout fired, roughly 75 minutes per round of pure waiting.
- OptimizationManager now keeps a *separate* lock pool for its own GPU work (``_mgr_gpu_locks``), distinct from ``gpu_locks`` used by worker subprocesses. All manager-level GPU operations (initial-kernel verify, PyTorch baselines, baseline-NCU caching) happen between rounds when no workers are running, so they don't actually need to coordinate with workers, and a worker dying while holding gpu_locks[g] can no longer strand the manager. ``self.benchmark_lock`` and ``self.profiling_semaphore`` (the two back-compat aliases consumed by NvidiaBenchmarker / NvidiaVerifier / the manager's _mgr_profiler) point at the new dedicated pool.
- Reduce DEFAULT_SEMAPHORE_TIMEOUT_SECONDS in kernel_profiler.py from 900s → 60s as belt-and-suspenders. Any future stale-lock scenario fails fast instead of stalling 15 min per attempt.
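Both fixes above follow a common pattern: give the manager its own locks, and never wait on a lock without a bound. The sketch below illustrates that pattern with `threading.Lock` so it runs in a single process; `make_lock_pools` and `acquire_or_fail` are hypothetical names, and `multiprocessing.Lock.acquire` accepts the same `timeout` keyword.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple


def make_lock_pools(
    gpu_ids: List[int],
) -> Tuple[Dict[int, threading.Lock], Dict[int, threading.Lock]]:
    """Return (worker_locks, manager_locks): two independent per-GPU pools."""
    worker_locks = {g: threading.Lock() for g in gpu_ids}
    manager_locks = {g: threading.Lock() for g in gpu_ids}
    return worker_locks, manager_locks


@contextmanager
def acquire_or_fail(lock, timeout_s: float = 60.0) -> Iterator[None]:
    """Hold `lock` for the duration of the block, or raise after `timeout_s`."""
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError(f"lock not acquired within {timeout_s}s")
    try:
        yield
    finally:
        lock.release()


worker_locks, manager_locks = make_lock_pools([0, 1])
worker_locks[0].acquire()  # simulate a worker that died while holding its lock

# The manager's own pool is unaffected by the stranded worker lock.
with acquire_or_fail(manager_locks[0], timeout_s=0.1):
    manager_ran = True

# Anything that still waits on the stranded lock now fails fast.
try:
    with acquire_or_fail(worker_locks[0], timeout_s=0.05):
        timed_out = False
except TimeoutError:
    timed_out = True
```

With a 900 s timeout, the second case would have blocked for 15 minutes per attempt; with a short bound it surfaces as an error the caller can log and skip.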

…l_size

Empirical evidence from the 225-worker / 8-round / clustered run showed that beam slots holding kernels not selected as expansion parents (the old ``num_top_kernels`` slots that exceeded ``num_expanding_parents``) never drove subsequent expansion: every round 2+ parent came directly from the previous round's children, never from carried-over slot-6+ entries. With technique-vector clustering doing the diversity job the buffer was meant for, those slots were pure overhead.

This commit replaces both knobs with a single ``candidate_pool_size``: every member of the pool is expanded each round.
- BeamSearchStrategy
  - ``candidate_pool_size`` ctor param (was ``num_top_kernels``).
  - Drop ``num_expanding_parents`` and the ``_effective_num_parents`` derived property.
  - ``select_candidates`` iterates over the entire pool.
  - ``num_workers_needed`` = pool × bottlenecks × models × samples.
  - ``initialize`` seeds N copies of initial; ``update_with_results`` truncates pooled survivors to N via the dedup + (optional) diversity-aware selection path.
- OptimizationManager._create_strategy reads ``candidate_pool_size`` from strategy_config; old keys are no longer recognized.
- All YAML presets renamed:
  - beam_search.yaml: num_top_kernels=2 → candidate_pool_size=2
  - nvidia.yaml: num_top_kernels=2 → candidate_pool_size=2
  - beam_search_diverse_smoke: (4,1) → candidate_pool_size=1
  - beam_search_diverse: (10,5) → candidate_pool_size=5
  - beam_search_diverse_concentrated: (10,2) → candidate_pool_size=2
  - beam_search_diverse_clustered: (10,2) → candidate_pool_size=2
  - beam_search_diverse_clustered_xl: (10,5) → candidate_pool_size=5

All declared ``num_workers`` values reconcile with the new formula.
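The worker-count formula can be checked against the two clustered presets, using the factors quoted in the example commands (2 or 5 parents, 3 bottlenecks, 3 models, 5 samples). The function below is a standalone illustration with assumed parameter names, not the strategy's actual property.

```python
def num_workers_needed(
    candidate_pool_size: int,
    num_bottlenecks: int,
    num_models: int,
    samples_per_prompt: int,
) -> int:
    """Workers per round = pool × bottlenecks × models × samples."""
    return candidate_pool_size * num_bottlenecks * num_models * samples_per_prompt


# Factors taken from the preset descriptions in this PR.
clustered = num_workers_needed(2, 3, 3, 5)
clustered_xl = num_workers_needed(5, 3, 3, 5)
print(clustered, clustered_xl)  # 90 225
```

Since every pool member is now expanded each round, `candidate_pool_size` is the only parent-side factor in the product.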

jiannanWang added 7 commits May 26, 2026 09:46

Dedup bottleneck analysis and fail fast on malformed kernels

3668ce5

meta-cla Bot added the CLA Signed label May 26, 2026

jiannanWang wants to merge 7 commits into main from jiannanWang/technique-vector-cluster
