Technique vector clustering for beam-search diversity + production scheduling refinements#140

Draft
jiannanWang wants to merge 7 commits into main from jiannanWang/technique-vector-cluster

@jiannanWang
Contributor

Summary

Builds on #139 (multi-LLM beam search) to add a diversity-preserving selection layer plus the scheduling and locking refinements that the 225-worker XL preset needs.

The motivation: with #139's P × B × M × S fanout (parents × bottlenecks × models × samples), the beam quickly fills with kernels that target the same optimization approach. PTX dedup catches byte-equivalents but can't tell apart kernels that differ structurally yet still pursue the same strategy. Without a diversity term, the beam collapses onto a single lineage and the search plateaus. This PR adds an LLM-classifier-driven technique-vector clustering layer that selects round-robin per cluster, preserving at least one representative of each distinct optimization approach.

It also includes the scheduling/locking changes that became necessary at 225-worker scale (dynamic GPU scheduler, manager-side lock-pool decouple, bottleneck-analysis caching), and an API cleanup that collapses two redundant knobs into one.

All new features are opt-in. Clustering is gated by technique_clustering.enabled: true; the existing #139 presets do not enable it.

Example commands

Two new presets ship with the PR:

# 90-worker clustered (2 parents × 3 bottlenecks × 3 models × 5 samples + clustering)
python examples/run_opt_manager.py \
  --kernel-dir examples/optimize_01_matvec \
  --strategy beam_search_diverse_clustered \
  --max-rounds 8

# 225-worker clustered XL (5 parents × 3 bottlenecks × 3 models × 5 samples + clustering, 8 GPUs × 4 workers/GPU)
python examples/run_opt_manager.py \
  --kernel-dir examples/optimize_01_matvec \
  --strategy beam_search_diverse_clustered_xl \
  --max-rounds 8

Both presets configure models: [claude-opus-4.6, gpt-5-4, gemini-2-5-pro] and use claude-opus-4.6 as the technique classifier. Clustering remains off by default in the existing presets.
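The worker counts in the two preset names follow directly from the fanout product given in the command comments. A minimal sketch of that arithmetic; the helper name `workers_needed` is illustrative and is not taken from the codebase:

```python
from math import prod


def workers_needed(parents: int, bottlenecks: int, models: int, samples: int) -> int:
    """Workers launched per round: one per (parent, bottleneck, model, sample) tuple."""
    return prod((parents, bottlenecks, models, samples))


# Values copied from the preset comments above.
clustered = workers_needed(parents=2, bottlenecks=3, models=3, samples=5)
clustered_xl = workers_needed(parents=5, bottlenecks=3, models=3, samples=5)
print(clustered, clustered_xl)  # 90 225
```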

Experiment results

Two problems, 8 rounds each (full production runs):

┌───────────────┬────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│    Problem    │ Without clustering (PR-1, 90 workers concentrated) │    With clustering (PR-2, 225 workers XL)    │
├───────────────┼────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ matvec        │ 1.8318 ms (1.12× PyTorch eager)                    │ 1.8323 ms (1.12× eager)                      │
├───────────────┼────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ gemma3_swiglu │ 0.2716 ms (3.10× eager, 1.14× torch.compile)       │ 0.3107 ms (2.71× eager, 1.00× torch.compile) │
└───────────────┴────────────────────────────────────────────────────┴──────────────────────────────────────────────┘

Notes:

  • Matvec is DRAM-saturated by 90 workers — clustering doesn't add headroom past that ceiling, but doesn't hurt.
  • On gemma3_swiglu, clustering enables a late-round breakthrough. R7 of the XL run produced a winner kernel whose lineage would have been culled by pure top-K in round 1; clustering preserved the slow-but-diverse parent slot. (PR-1's 90-worker concentrated run still beat XL clustering on this problem at fixed budget: clustering helps escape plateaus but isn't always the dominant intervention. Per the table, PR-1 beats torch.compile by 1.14×, while the XL clustered run matches it at 1.00×.)

Verified live on this branch with a 2-round 90-worker clustered run on matvec:

  • R1: 90/90 succeeded, PTX dedup 92→92, clusters=14, best 1.8753 ms (claude-opus-4.6)
  • R2: 71/90 succeeded, PTX dedup 73→17 (77% collapse), clusters=7, best 1.8612 ms (claude-opus-4.6)
  • Final: 1.8612 ms = 1.10× PyTorch eager
  • Vector cache hit confirmed: R2 only classified 15 new kernels (not 73) — vectors for carry-over kernels were reused from R1's JSON-DB persistence.

What is changed

  • LLM technique classifier — new searching/technique_vector.py (TechniqueDefinition, load_techniques, classify_kernel, classify_many) that emits a binary 18-bit vector per kernel via one LLM call, fanned out via a thread pool.
  • 18-bit technique taxonomy — new examples/configs/techniques_default.yaml listing optimization techniques (tensor_cores, persistent_kernel, autotuned_config, etc.) with prompt hints for the classifier.
  • Round-robin per-cluster beam selection — new select_diverse_top_k (round-robin by depth: fastest of every cluster first, then second-fastest, and so on); wired into BeamSearchStrategy.update_with_results behind a config flag.
  • ProgramEntry.technique_vector field — JSON-DB serdes so vectors persist across rounds and runs; the manager only pays the LLM cost for newly-discovered kernels.
  • Per-parent bottleneck-analysis caching — manager runs the LLM-driven NCU-to-bottleneck analysis once per unique parent per round and shares the result across siblings.
  • Spawn-on-free-GPU-slot dynamic scheduler — replaces static "spawn N, wait for all" with a pool topped up to workers_per_gpu × len(gpu_ids) and refilled as workers exit. New workers_per_gpu config knob.
  • Manager GPU lock pool decoupled from workers'; semaphore timeout reduced 900s → 60s so a stuck worker can't block the manager.
  • candidate_pool_size API simplification — single knob replaces num_top_kernels + num_expanding_parents. All preset YAMLs updated. This is the only breaking config change.
  • Two new strategy presets — beam_search_diverse_clustered.yaml, beam_search_diverse_clustered_xl.yaml. Existing presets unchanged (clustering off by default).

- BeamSearchStrategy: add models / samples_per_prompt / num_expanding_parents
  knobs; expansion now P × M × K × C with per-candidate openai_model and
  sample_idx threaded through the worker dispatch.

- PTX fingerprint dedup: new ptx_fingerprint.py captures compiled PTX
  from a per-call TRITON_CACHE_DIR during benchmarking, normalizes
  (strip comments/debug/headers, canonicalize register/label names),
  SHA-256 hashes. update_with_results dedups the combined pool by hash
  before sort+truncate; ProgramEntry / json_db carry ptx_hash.

- Multi-GPU: per-GPU mp.Lock pool (single lock covers both benchmark
  and NCU on a given GPU), round-robin worker -> GPU assignment,
  CUDA_VISIBLE_DEVICES=<gpu_id> set in the worker process before any
  torch import. Manager auto-detects via nvidia-smi (NOT torch.cuda) to
  avoid poisoning forked children with an inherited CUDA context.

- Per-parent baseline NCU cache: manager profiles each unique parent
  once per round and attaches baseline_metrics to each candidate dict;
  workers skip their own NCU when the cache is populated.

- Bottleneck plumbing fix: num_bottlenecks is now wired from
  strategy_config -> worker_kwargs[num_bottlenecks_to_request] ->
  BottleneckAnalyzer. Pre-fix the analyzer always asked for 1 ranked
  bottleneck so workers with bottleneck_id >= 2 silently fell back to
  rank 1.

- mp.Queue feeder-thread deadlock fix: NvidiaWorkerRunner.run_workers
  now drains the queue interleaved with join(timeout=0.5) polling
  instead of joining all workers serially before draining.

- best_runtime_ptx_hash propagation: orchestrator captures the hash
  after _update_kernels (was previously checked before, when the
  comparison was tautologically false), and parent-hash byte-identity
  fallback in update_with_results lets unchanged-parent results inherit
  the parent's hash so they collapse correctly in dedup.

- ncu_profiler.py: NaN-safe units-row detection (str.lower()
  propagated pd.NA back to float NaN, breaking the substring check).

- Configs: examples/configs/beam_search_diverse.yaml (spread, P=5/C=2),
  beam_search_diverse_concentrated.yaml (P=2/C=5),
  beam_search_diverse_smoke.yaml (smoke variant).
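
The PTX fingerprint step described above (normalize, then SHA-256) can be sketched as follows. This is an illustration under stated assumptions, not the contents of ptx_fingerprint.py: the list of skipped directives and the register-renaming regex are guesses at what "strip comments/debug/headers, canonicalize register names" might involve, and label canonicalization is omitted for brevity.

```python
import hashlib
import re

# Assumed (not exhaustive) set of header/debug directives to ignore.
SKIPPED_DIRECTIVES = (".version", ".target", ".address_size", ".loc", ".file")


def ptx_fingerprint(ptx):
    """Hash PTX text after removing details that do not change the generated code."""
    lines = []
    for raw in ptx.splitlines():
        line = re.sub(r"//.*", "", raw).strip()  # drop line comments
        if not line or line.startswith(SKIPPED_DIRECTIVES):
            continue
        lines.append(line)
    body = "\n".join(lines)

    # Rename registers in order of first appearance: %f7 -> %v0, %rd4 -> %v1, ...
    names = {}

    def rename(match):
        return names.setdefault(match.group(0), "%v{}".format(len(names)))

    body = re.sub(r"%[A-Za-z_]+\d+", rename, body)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


kernel_a = """.version 8.0
.target sm_90
// generated by run A
ld.global.f32 %f1, [%rd1];
add.f32 %f2, %f1, %f1;
"""
kernel_b = """.version 8.3
.target sm_90
ld.global.f32 %f7, [%rd4];   // same load, different register numbers
add.f32 %f9, %f7, %f7;
"""
kernel_c = kernel_a.replace("add.f32", "mul.f32")
```

Here `kernel_a` and `kernel_b` hash to the same fingerprint because they differ only in header version, comments, and register numbering, while `kernel_c` changes an instruction and therefore gets a different hash.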

Layered semantic dedup on top of PTX-hash dedup: an LLM emits a binary
vector indicating which optimization techniques each kernel uses, and
beam truncation preserves the fastest representative of each distinct
vector ("at least one per cluster") so the beam doesn't fill with
near-clones of the leader.

- New module: opt_worker_component/searching/technique_vector.py
  - TechniqueDefinition dataclass; load_techniques(path) reads from YAML.
  - classify_kernel(...) — one LLM call → binary vector of length N.
  - classify_many(...) — thread-pooled fan-out for a batch of kernels.
  - select_diverse_top_k(entries, k) — diversity-aware truncation:
    walk sorted-by-time pool, keep first per cluster, backfill remaining
    slots from fastest unaccepted entries.

- Expandable taxonomy: examples/configs/techniques_default.yaml lists
  18 techniques (split-K reduce, tensor cores, software pipelining,
  swizzled load, persistent kernel, vectorized load, shared-memory
  tiling, register tiling, autotune, masked access, atomic reduction,
  precision split, warp specialization, cluster/DSMEM, grid swizzling,
  epilogue fusion, loop unrolling, async/TMA copy).  Append-only entries
  are safe; the strategy reads the YAML at run start and the vector
  dimension follows automatically.

- ProgramEntry / JSONProgramDatabase round-trip technique_vector along
  with the existing ptx_hash field; vectors persist across rounds and
  across runs.

- BeamSearchStrategy: opt-in via new ctor params (techniques,
  technique_classifier_provider, technique_classifier_model,
  technique_classifier_concurrency).  When enabled, update_with_results
  runs PTX-dedup, then classifies any uncached / stale-dimension
  survivors via thread-pooled LLM, then uses select_diverse_top_k for
  truncation instead of plain sort+truncate.  When disabled, behavior
  is identical to the previous PTX-only path.

- OptimizationManager._build_technique_clustering_kwargs reads the
  technique_clustering block from strategy_config:
    technique_clustering:
      enabled: true
      techniques_yaml: examples/configs/techniques_default.yaml
      classifier_model: claude-opus-4.6
      max_concurrency: 4
  Resolves the LLM provider via get_model_provider, falls back to the
  manager's openai_model when classifier_model is omitted, and
  silently disables clustering on any misconfiguration (logging a
  warning) so an experiment never crashes on a missing YAML.

- New preset examples/configs/beam_search_diverse_clustered.yaml mirrors
  the concentrated 90-worker production config + technique clustering.

- docs/technique_clustering_design.md captures the architecture, the
  taxonomy, the diversity-aware selection algorithm, the cost / payoff
  analysis, and open questions.
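
The "classify only uncached / stale-dimension survivors" behavior described above can be sketched with a thread pool. Everything here is hypothetical scaffolding: entries are plain dicts rather than ProgramEntry objects, `classify_missing` is not the real classify_many signature, and `stub_classifier` is a keyword matcher standing in for the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

# Three-technique subset for illustration; the shipped taxonomy lists 18.
TECHNIQUES = ["tensor_cores", "persistent_kernel", "autotuned_config"]


def stub_classifier(source):
    """Stand-in for the LLM call: flag a technique if its name appears in the source."""
    return [1 if name in source else 0 for name in TECHNIQUES]


def classify_missing(entries, classify=stub_classifier, max_concurrency=4):
    """Fill in vectors for entries lacking a usable cached one; return how many were classified."""
    dim = len(TECHNIQUES)
    todo = [
        e for e in entries
        if e.get("technique_vector") is None or len(e["technique_vector"]) != dim
    ]
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map preserves input order, so results line up with `todo`.
        vectors = pool.map(classify, [e["source"] for e in todo])
        for entry, vector in zip(todo, vectors):
            entry["technique_vector"] = vector
    return len(todo)


entries = [
    {"source": "uses tensor_cores", "technique_vector": [1, 0, 0]},  # cached, reused
    {"source": "persistent_kernel with autotuned_config"},           # never classified
    {"source": "plain loop", "technique_vector": [0, 0]},            # stale dimension
]
classified = classify_missing(entries)
```

Only the second and third entries trigger a classifier call, which mirrors how persisted vectors let later rounds pay only for newly discovered kernels.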

- Update select_diverse_top_k: replace "first per cluster + backfill"
  with the round-robin-by-depth algorithm requested for the next
  experiment.  Pass 0 takes the fastest of every cluster (sorted by
  time across clusters); pass 1 takes the second-fastest of every
  cluster; etc., until k slots are filled or the pool is exhausted.
  Output is ordered by selection (most-diverse-first), not by time, so
  num_expanding_parents=N picks N different clusters as parents.

- New preset examples/configs/beam_search_diverse_clustered_xl.yaml:
  P=5, M=3, K=3, C=5 = 225 workers, 8 rounds, technique clustering
  enabled.  Pairs with beam_search_diverse_concentrated.yaml for the
  next A/B comparison.
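
The round-robin-by-depth selection described above can be sketched as below. This is a minimal reading of the commit text, not the shipped function: entries are assumed to be `(name, runtime_ms, technique_vector)` tuples, and a cluster is assumed to be the set of entries sharing an identical vector.

```python
from collections import defaultdict


def select_diverse_top_k(entries, k):
    """Pick up to k entries, taking one depth level across all clusters per pass."""
    clusters = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e[1]):
        clusters[tuple(entry[2])].append(entry)  # each cluster ends up fastest-first

    selected = []
    depth = 0
    while len(selected) < k:
        # One entry per cluster that is deep enough, ordered by runtime across clusters.
        layer = sorted(
            (c[depth] for c in clusters.values() if len(c) > depth),
            key=lambda e: e[1],
        )
        if not layer:
            break  # pool exhausted before k slots were filled
        selected.extend(layer[: k - len(selected)])
        depth += 1
    return selected


pool = [
    ("a", 1.0, (1, 0)),
    ("b", 1.1, (1, 0)),
    ("e", 1.2, (1, 0)),
    ("c", 1.5, (0, 1)),
    ("d", 2.0, (1, 1)),
]
picked = [name for name, _, _ in select_diverse_top_k(pool, 4)]
print(picked)  # ['a', 'c', 'd', 'b']
```

Plain sort-and-truncate on the same pool would keep a, b, e, c, so three of the four slots would go to the `(1, 0)` cluster; the round-robin version keeps the slower `d` because it is the only representative of its cluster.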

Replace the spawn-all-at-once round-robin worker launcher with a
dynamic pool-based scheduler:

- ``workers_per_gpu`` (manager-level, default 2) bounds how many worker
  processes can be pinned to any single GPU at once.  Total active
  pool = ``workers_per_gpu × len(gpu_ids)``.
- ``run_workers`` maintains a pending queue + per-GPU free-slot
  counter.  At each iteration: top up the pool (spawn pending onto
  GPUs with free slots, preferring GPUs with the most free capacity);
  drain the result queue; reap finished workers (frees their slot).
- Worker→GPU assignment is decided at spawn time rather than by index-based round-robin,
  so GPUs that finish quickly immediately get new candidates instead
  of sitting idle while their statically-assigned share trickles
  through the LLM phase.

Threading: ``OptimizationManager.workers_per_gpu`` flows via
``registry.create_from_config`` into ``NvidiaWorkerRunner.__init__``.
Existing ``mp.Queue`` drain-while-join logic is preserved.

The XL clustered preset now sets ``workers_per_gpu: 4`` (pool
capacity 32 vs. all-at-once 225) — fewer simultaneous LLM calls,
better GPU utilization, and ~7× lower per-GPU CUDA-context residency.
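
The top-up / reap loop described above can be illustrated with a small simulation. This is a sketch only: it replaces real worker processes with fixed durations, and `simulate_pool` is a hypothetical name, not part of NvidiaWorkerRunner. It assumes at least one GPU and `workers_per_gpu >= 1`.

```python
import heapq


def simulate_pool(durations, gpu_ids, workers_per_gpu):
    """Return (task -> gpu assignments, makespan) for a spawn-on-free-slot schedule."""
    free = {g: workers_per_gpu for g in gpu_ids}  # per-GPU free-slot counter
    pending = list(enumerate(durations))          # FIFO of (task_id, duration)
    running = []                                  # heap of (finish_time, task_id, gpu)
    assignments = {}
    now = 0.0
    while pending or running:
        # Top up: spawn pending tasks, preferring the GPU with the most free slots.
        while pending and any(free.values()):
            gpu = max(free, key=free.get)
            task_id, duration = pending.pop(0)
            free[gpu] -= 1
            assignments[task_id] = gpu
            heapq.heappush(running, (now + duration, task_id, gpu))
        # Reap: advance to the next worker exit and free its slot.
        now, _, gpu = heapq.heappop(running)
        free[gpu] += 1
    return assignments, now


# One slow task followed by three fast ones on two single-slot GPUs.
assignments, makespan = simulate_pool([4, 1, 1, 1], gpu_ids=[0, 1], workers_per_gpu=1)
print(assignments, makespan)  # {0: 0, 1: 1, 2: 1, 3: 1} 4.0

xl_pool_capacity = 4 * 8  # workers_per_gpu × len(gpu_ids) for the XL preset
```

With static index-based round-robin, tasks 0 and 2 would both land on GPU 0 and finish at t=5 while GPU 1 sits idle from t=2; assigning at spawn time lets GPU 1 absorb all three short tasks and the round ends at t=4.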

Two related fixes for the stuck-baseline-NCU bug surfaced by the
225-worker run: a round-1 worker died holding gpu_locks[0], and the
manager's baseline-NCU step (which shared the same lock) then waited
15 minutes per parent before the existing semaphore timeout fired —
roughly 75 minutes per round of pure waiting.

- OptimizationManager now keeps a *separate* lock pool for its own
  GPU work (``_mgr_gpu_locks``), distinct from ``gpu_locks`` used by
  worker subprocesses.  All manager-level GPU operations
  (initial-kernel verify, PyTorch baselines, baseline-NCU caching)
  happen between rounds when no workers are running, so they don't
  actually need to coordinate with workers — and a worker dying
  holding gpu_locks[g] now can't strand the manager.
  ``self.benchmark_lock`` and ``self.profiling_semaphore`` (the two
  back-compat aliases consumed by NvidiaBenchmarker / NvidiaVerifier
  / the manager's _mgr_profiler) point at the new dedicated pool.

- Reduce DEFAULT_SEMAPHORE_TIMEOUT_SECONDS in kernel_profiler.py
  from 900s → 60s as belt-and-suspenders.  Any future stale-lock
  scenario fails fast instead of stalling 15 min per attempt.
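
Both fixes can be illustrated together: a separate manager-side lock pool, and a bounded acquire that raises instead of waiting. This is a sketch under assumptions: it uses `threading.Lock` so it runs in one process (the real pools are described as `mp.Lock`, whose `acquire` also accepts a timeout), and `run_with_gpu_lock` and `GpuLockTimeout` are hypothetical names.

```python
import threading

DEFAULT_TIMEOUT_SECONDS = 60  # the new value named in the commit message


class GpuLockTimeout(RuntimeError):
    """Raised when a GPU lock cannot be acquired within the timeout."""


def run_with_gpu_lock(lock, fn, timeout_s=DEFAULT_TIMEOUT_SECONDS):
    """Run fn() while holding lock, failing fast if the lock stays busy."""
    # Positional (blocking, timeout) works for threading and multiprocessing locks.
    if not lock.acquire(True, timeout_s):
        raise GpuLockTimeout("GPU lock not acquired within {}s".format(timeout_s))
    try:
        return fn()
    finally:
        lock.release()


# Separate pools: a worker that dies holding its lock cannot strand the manager.
worker_gpu_locks = [threading.Lock()]
mgr_gpu_locks = [threading.Lock()]
worker_gpu_locks[0].acquire()  # simulate a dead worker that never released

manager_result = run_with_gpu_lock(mgr_gpu_locks[0], lambda: "baseline-ncu ok")

try:
    run_with_gpu_lock(worker_gpu_locks[0], lambda: "unreachable", timeout_s=0.05)
    stale_lock_detected = False
except GpuLockTimeout:
    stale_lock_detected = True
```

The manager call succeeds immediately because it never touches the worker pool, and the attempt on the stale worker lock gives up after the short timeout rather than blocking.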
…l_size

Empirical evidence from the 225-worker / 8-round / clustered run showed
that beam slots holding kernels not selected as expansion parents (the
old ``num_top_kernels`` slots that exceeded ``num_expanding_parents``)
never drove subsequent expansion: every round 2+ parent came directly
from the previous round's children, never from carried-over slot-6+
entries.  With technique-vector clustering doing the diversity job the
buffer was meant for, those slots were pure overhead.

This commit replaces both knobs with a single ``candidate_pool_size``:
every member of the pool is expanded each round.

- BeamSearchStrategy
  - ``candidate_pool_size`` ctor param (was ``num_top_kernels``).
  - Drop ``num_expanding_parents`` and the ``_effective_num_parents``
    derived property.
  - ``select_candidates`` iterates over the entire pool.
  - ``num_workers_needed`` = pool × bottlenecks × models × samples.
  - ``initialize`` seeds N copies of initial; ``update_with_results``
    truncates pooled survivors to N via the dedup + (optional)
    diversity-aware selection path.

- OptimizationManager._create_strategy reads ``candidate_pool_size``
  from strategy_config; old keys are no longer recognized.

- All YAML presets migrated (pairs are old num_top_kernels, num_expanding_parents):
  - beam_search.yaml:                num_top_kernels=2          → candidate_pool_size=2
  - nvidia.yaml:                     num_top_kernels=2          → candidate_pool_size=2
  - beam_search_diverse_smoke:       (4,1)                      → candidate_pool_size=1
  - beam_search_diverse:             (10,5)                     → candidate_pool_size=5
  - beam_search_diverse_concentrated:(10,2)                     → candidate_pool_size=2
  - beam_search_diverse_clustered:   (10,2)                     → candidate_pool_size=2
  - beam_search_diverse_clustered_xl:(10,5)                     → candidate_pool_size=5

  All declared ``num_workers`` values reconcile with the new formula.
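
For out-of-tree configs that still carry the old keys, the migration table above suggests a simple rule: when both old keys were set, the new value equals the old ``num_expanding_parents``; when only ``num_top_kernels`` was set, it carries over unchanged. The helper below is a hypothetical sketch of that inferred rule and is not shipped with the PR.

```python
def migrate_beam_config(cfg):
    """Return a copy of cfg that uses candidate_pool_size instead of the two old keys."""
    out = dict(cfg)
    top_k = out.pop("num_top_kernels", None)
    parents = out.pop("num_expanding_parents", None)
    if "candidate_pool_size" not in out:
        # Rule inferred from the preset table: parents wins when both were set.
        size = parents if parents is not None else top_k
        if size is not None:
            out["candidate_pool_size"] = size
    return out


old_xl = {"num_top_kernels": 10, "num_expanding_parents": 5, "num_bottlenecks": 3}
new_xl = migrate_beam_config(old_xl)
print(new_xl)  # {'num_bottlenecks': 3, 'candidate_pool_size': 5}
```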
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 26, 2026
