Technique vector clustering for beam-search diversity + production scheduling refinements #140
Draft
jiannanWang wants to merge 7 commits into
- BeamSearchStrategy: add models / samples_per_prompt / num_expanding_parents knobs; expansion now P × M × K × C with per-candidate openai_model and sample_idx threaded through the worker dispatch.
- PTX fingerprint dedup: new ptx_fingerprint.py captures compiled PTX from a per-call TRITON_CACHE_DIR during benchmarking, normalizes (strip comments/debug/headers, canonicalize register/label names), SHA-256 hashes. update_with_results dedups the combined pool by hash before sort+truncate; ProgramEntry / json_db carry ptx_hash.
- Multi-GPU: per-GPU mp.Lock pool (single lock covers both benchmark and NCU on a given GPU), round-robin worker -> GPU assignment, CUDA_VISIBLE_DEVICES=<gpu_id> set in the worker process before any torch import. Manager auto-detects via nvidia-smi (NOT torch.cuda) to avoid poisoning forked children with an inherited CUDA context.
- Per-parent baseline NCU cache: manager profiles each unique parent once per round and attaches baseline_metrics to each candidate dict; workers skip their own NCU when the cache is populated.
- Bottleneck plumbing fix: num_bottlenecks is now wired from strategy_config -> worker_kwargs[num_bottlenecks_to_request] -> BottleneckAnalyzer. Pre-fix the analyzer always asked for 1 ranked bottleneck so workers with bottleneck_id >= 2 silently fell back to rank 1.
- mp.Queue feeder-thread deadlock fix: NvidiaWorkerRunner.run_workers now drains the queue interleaved with join(timeout=0.5) polling instead of joining all workers serially before draining.
- best_runtime_ptx_hash propagation: orchestrator captures the hash after _update_kernels (was previously checked before, when the comparison was tautologically false), and parent-hash byte-identity fallback in update_with_results lets unchanged-parent results inherit the parent's hash so they collapse correctly in dedup.
- ncu_profiler.py: NaN-safe units-row detection (str.lower() propagated pd.NA back to float NaN, breaking the substring check).
- Configs: examples/configs/beam_search_diverse.yaml (spread, P=5/C=2), beam_search_diverse_concentrated.yaml (P=2/C=5), beam_search_diverse_smoke.yaml (smoke variant).
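The PTX fingerprint step above (normalize, then SHA-256, then dedup by hash) can be sketched roughly as follows. This is a minimal illustration, not the contents of ptx_fingerprint.py: the skipped directive list, the regexes, and the names normalize_ptx, ptx_fingerprint, and dedup_by_hash are all assumptions made for the example.

```python
import hashlib
import re

# Header/debug directives dropped before hashing (assumed list, not the PR's).
_SKIP_PREFIXES = (".version", ".target", ".address_size", ".file", ".loc")


def normalize_ptx(ptx: str) -> str:
    """Strip comments and header/debug lines, canonicalize register and label names."""
    regs, labels = {}, {}

    def canon(table, prefix, name):
        # Rename by order of first appearance so numbering differences vanish.
        return table.setdefault(name, f"{prefix}{len(table)}")

    out = []
    for raw in ptx.splitlines():
        line = re.sub(r"//.*", "", raw).strip()
        if not line or line.startswith(_SKIP_PREFIXES):
            continue
        line = re.sub(r"%[A-Za-z_]+\d+", lambda m: canon(regs, "%v", m.group(0)), line)
        line = re.sub(r"\$L__\w+", lambda m: canon(labels, "L", m.group(0)), line)
        out.append(" ".join(line.split()))  # collapse whitespace runs
    return "\n".join(out)


def ptx_fingerprint(ptx: str) -> str:
    """SHA-256 hex digest of the normalized PTX text."""
    return hashlib.sha256(normalize_ptx(ptx).encode("utf-8")).hexdigest()


def dedup_by_hash(pool):
    """pool: iterable of (runtime_ms, ptx). Keep the fastest entry per fingerprint."""
    best = {}
    for runtime_ms, ptx in sorted(pool, key=lambda e: e[0]):
        best.setdefault(ptx_fingerprint(ptx), (runtime_ms, ptx))
    return list(best.values())
```

Renaming registers and labels by first appearance means two kernels that differ only in compiler-assigned numbering hash identically, which is the property the dedup step relies on.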
Layered semantic dedup on top of PTX-hash dedup: an LLM emits a binary
vector indicating which optimization techniques each kernel uses, and
beam truncation preserves the fastest representative of each distinct
vector ("at least one per cluster") so the beam doesn't fill with
near-clones of the leader.
- New module: opt_worker_component/searching/technique_vector.py
- TechniqueDefinition dataclass; load_techniques(path) reads from YAML.
- classify_kernel(...) — one LLM call → binary vector of length N.
- classify_many(...) — thread-pooled fan-out for a batch of kernels.
- select_diverse_top_k(entries, k) — diversity-aware truncation:
walk sorted-by-time pool, keep first per cluster, backfill remaining
slots from fastest unaccepted entries.
- Expandable taxonomy: examples/configs/techniques_default.yaml lists
18 techniques (split-K reduce, tensor cores, software pipelining,
swizzled load, persistent kernel, vectorized load, shared-memory
tiling, register tiling, autotune, masked access, atomic reduction,
precision split, warp specialization, cluster/DSMEM, grid swizzling,
epilogue fusion, loop unrolling, async/TMA copy). Append-only entries
are safe; the strategy reads the YAML at run start and the vector
dimension follows automatically.
- ProgramEntry / JSONProgramDatabase round-trip technique_vector along
with the existing ptx_hash field; vectors persist across rounds and
across runs.
- BeamSearchStrategy: opt-in via new ctor params (techniques,
technique_classifier_provider, technique_classifier_model,
technique_classifier_concurrency). When enabled, update_with_results
runs PTX-dedup, then classifies any uncached / stale-dimension
survivors via thread-pooled LLM, then uses select_diverse_top_k for
truncation instead of plain sort+truncate. When disabled, behavior
is identical to the previous PTX-only path.
- OptimizationManager._build_technique_clustering_kwargs reads the
technique_clustering block from strategy_config:
    technique_clustering:
      enabled: true
      techniques_yaml: examples/configs/techniques_default.yaml
      classifier_model: claude-opus-4.6
      max_concurrency: 4
Resolves the LLM provider via get_model_provider, falls back to the
manager's openai_model when classifier_model is omitted, and
silently disables clustering on any misconfiguration (logging a
warning) so an experiment never crashes on a missing YAML.
- New preset examples/configs/beam_search_diverse_clustered.yaml mirrors
the concentrated 90-worker production config + technique clustering.
- docs/technique_clustering_design.md captures the architecture, the
taxonomy, the diversity-aware selection algorithm, the cost / payoff
analysis, and open questions.
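A minimal sketch of the "first per cluster + backfill" truncation described in this commit, assuming each entry carries a runtime and a hashable technique vector. The Entry type, its field names, and the choice to return the result sorted by time are assumptions for illustration; the real select_diverse_top_k operates on ProgramEntry objects and may differ.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    # Hypothetical stand-in for ProgramEntry.
    name: str
    time_ms: float
    vector: tuple  # binary technique vector, e.g. (1, 0, 1)


def select_first_per_cluster(entries, k):
    """Keep the fastest entry of each distinct vector, then backfill by speed."""
    ranked = sorted(entries, key=lambda e: e.time_ms)
    seen, kept, leftovers = set(), [], []
    for e in ranked:
        if e.vector not in seen:
            seen.add(e.vector)
            kept.append(e)       # fastest representative of a new cluster
        else:
            leftovers.append(e)  # near-clone of an already-kept cluster
    kept = kept[:k]              # more clusters than slots: fastest clusters win
    kept.extend(leftovers[: k - len(kept)])  # backfill any remaining slots
    return sorted(kept, key=lambda e: e.time_ms)


pool = [
    Entry("a", 1.0, (1, 0)),
    Entry("b", 1.1, (1, 0)),
    Entry("c", 1.2, (1, 0)),
    Entry("d", 2.0, (0, 1)),
]
top2 = [e.name for e in select_first_per_cluster(pool, 2)]
top3 = [e.name for e in select_first_per_cluster(pool, 3)]
```

With plain sort+truncate, k=2 would keep "a" and "b" (same vector); here the slower "d" survives because it is the only representative of its cluster.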
- Update select_diverse_top_k: replace "first per cluster + backfill" with the round-robin-by-depth algorithm requested for the next experiment. Pass 0 takes the fastest of every cluster (sorted by time across clusters); pass 1 takes the second-fastest of every cluster; etc., until k slots are filled or the pool is exhausted. Output is ordered by selection (most-diverse-first), not by time, so num_expanding_parents=N picks N different clusters as parents.
- New preset examples/configs/beam_search_diverse_clustered_xl.yaml: P=5, M=3, K=3, C=5 = 225 workers, 8 rounds, technique clustering enabled. Pairs with beam_search_diverse_concentrated.yaml for the next A/B comparison.
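The round-robin-by-depth selection can be sketched as below, under the assumption that entries are (name, time_ms, vector) tuples. The function name and the tuple layout are illustrative only and are not taken from the PR's code.

```python
from collections import defaultdict


def select_round_robin(entries, k):
    """Pass d takes the d-th fastest member of every cluster, fastest first."""
    clusters = defaultdict(list)
    for e in sorted(entries, key=lambda e: e[1]):
        clusters[e[2]].append(e)  # each cluster list ends up sorted by time

    selected, depth = [], 0
    while len(selected) < k:
        layer = [m[depth] for m in clusters.values() if len(m) > depth]
        if not layer:
            break  # pool exhausted before k slots were filled
        layer.sort(key=lambda e: e[1])
        selected.extend(layer[: k - len(selected)])
        depth += 1
    return selected  # ordered by selection, not by time


pool = [
    ("a", 1.0, (1, 0)),
    ("b", 1.1, (1, 0)),
    ("c", 1.2, (1, 0)),
    ("d", 2.0, (0, 1)),
    ("e", 2.5, (0, 1)),
]
picked = [name for name, _, _ in select_round_robin(pool, 4)]
```

Because the output is in selection order, taking the first N entries as parents yields N distinct clusters whenever at least N clusters exist.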
Replace the spawn-all-at-once round-robin worker launcher with a dynamic pool-based scheduler:
- ``workers_per_gpu`` (manager-level, default 2) bounds how many worker processes can be pinned to any single GPU at once. Total active pool = ``workers_per_gpu × len(gpu_ids)``.
- ``run_workers`` maintains a pending queue + per-GPU free-slot counter. At each iteration: top up the pool (spawn pending onto GPUs with free slots, preferring GPUs with the most free capacity); drain the result queue; reap finished workers (frees their slot).
- Worker→GPU is decided at spawn time, not at index-based round-robin, so GPUs that finish quickly immediately get new candidates instead of sitting idle while their statically-assigned share trickles through the LLM phase.
Threading: ``OptimizationManager.workers_per_gpu`` flows via ``registry.create_from_config`` into ``NvidiaWorkerRunner.__init__``. Existing ``mp.Queue`` drain-while-join logic is preserved.
The XL clustered preset now sets ``workers_per_gpu: 4`` (pool capacity 32 vs. all-at-once 225): fewer simultaneous LLM calls, better GPU utilization, and ~7× lower per-GPU CUDA-context residency.
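The "top up the pool" step can be illustrated without any processes or GPUs. The sketch below only models the slot bookkeeping; top_up, the dict of free slots, and the lowest-id tie-break are assumptions for the example, not NvidiaWorkerRunner's actual code.

```python
from collections import deque


def top_up(pending, free_slots):
    """Assign pending workers to GPUs, preferring the GPU with most free slots.

    pending: deque of worker ids; free_slots: {gpu_id: free slot count}.
    Both are mutated. Returns the (worker_id, gpu_id) pairs launched.
    """
    launched = []
    while pending:
        # Most free capacity first; ties go to the lowest GPU id (assumed).
        gpu = max(free_slots, key=lambda g: (free_slots[g], -g))
        if free_slots[gpu] == 0:
            break  # pool is full; wait for a worker to be reaped
        free_slots[gpu] -= 1
        launched.append((pending.popleft(), gpu))
    return launched


workers_per_gpu, gpu_ids = 2, [0, 1]
free = {g: workers_per_gpu for g in gpu_ids}
pending = deque(range(5))

first = top_up(pending, free)   # fills all 4 slots, worker 4 stays pending
free[1] += 1                    # a worker on GPU 1 finished and was reaped
second = top_up(pending, free)  # the freed slot is reused immediately
```

The GPU is chosen at launch time from the current slot counts, so a GPU that drains quickly picks up the next pending worker rather than waiting on a static assignment.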
Two related fixes for the stuck-baseline-NCU bug surfaced by the 225-worker run: a round-1 worker died holding gpu_locks[0], and the manager's baseline-NCU step (which shared the same lock) then waited 15 minutes per parent before the existing semaphore timeout fired, roughly 75 minutes per round of pure waiting.
- OptimizationManager now keeps a *separate* lock pool for its own GPU work (``_mgr_gpu_locks``), distinct from ``gpu_locks`` used by worker subprocesses. All manager-level GPU operations (initial-kernel verify, PyTorch baselines, baseline-NCU caching) happen between rounds when no workers are running, so they don't actually need to coordinate with workers, and a worker dying while holding gpu_locks[g] can no longer strand the manager. ``self.benchmark_lock`` and ``self.profiling_semaphore`` (the two back-compat aliases consumed by NvidiaBenchmarker / NvidiaVerifier / the manager's _mgr_profiler) point at the new dedicated pool.
- Reduce DEFAULT_SEMAPHORE_TIMEOUT_SECONDS in kernel_profiler.py from 900s → 60s as belt-and-suspenders. Any future stale-lock scenario fails fast instead of stalling 15 min per attempt.
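A small sketch of why the two fixes help, using threading.Lock for brevity (multiprocessing.Lock accepts the same timeout argument to acquire). The helper run_with_lock and the variable names are hypothetical, and the timeouts are shortened so the example finishes quickly.

```python
import threading


def run_with_lock(lock, fn, timeout_s=60.0):
    """Run fn under lock, failing fast if the lock cannot be acquired in time."""
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError(f"lock not acquired within {timeout_s}s")
    try:
        return fn()
    finally:
        lock.release()


# Separate pools: one lock per GPU for workers, another per GPU for the manager.
worker_locks = [threading.Lock() for _ in range(2)]
mgr_locks = [threading.Lock() for _ in range(2)]

# Simulate a worker that died while holding its GPU 0 lock.
worker_locks[0].acquire()

# The manager uses its own pool, so the stale worker lock does not block it.
mgr_result = run_with_lock(mgr_locks[0], lambda: "baseline profiled", timeout_s=0.05)

# Anything still sharing the stale lock now times out quickly instead of hanging.
try:
    run_with_lock(worker_locks[0], lambda: None, timeout_s=0.05)
    stale_timed_out = False
except TimeoutError:
    stale_timed_out = True
```

The separate pool removes the shared dependency; the shorter timeout only bounds the damage if some other stale-lock path remains.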
…l_size
Empirical evidence from the 225-worker / 8-round / clustered run showed
that beam slots holding kernels not selected as expansion parents (the
old ``num_top_kernels`` slots that exceeded ``num_expanding_parents``)
never drove subsequent expansion: every round 2+ parent came directly
from the previous round's children, never from carried-over slot-6+
entries. With technique-vector clustering doing the diversity job the
buffer was meant for, those slots were pure overhead.
This commit replaces both knobs with a single ``candidate_pool_size``:
every member of the pool is expanded each round.
- BeamSearchStrategy
- ``candidate_pool_size`` ctor param (was ``num_top_kernels``).
- Drop ``num_expanding_parents`` and the ``_effective_num_parents``
derived property.
- ``select_candidates`` iterates over the entire pool.
- ``num_workers_needed`` = pool × bottlenecks × models × samples.
- ``initialize`` seeds N copies of initial; ``update_with_results``
truncates pooled survivors to N via the dedup + (optional)
diversity-aware selection path.
- OptimizationManager._create_strategy reads ``candidate_pool_size``
from strategy_config; old keys are no longer recognized.
- All YAML presets renamed:
- beam_search.yaml: num_top_kernels=2 → candidate_pool_size=2
- nvidia.yaml: num_top_kernels=2 → candidate_pool_size=2
- beam_search_diverse_smoke: (4,1) → candidate_pool_size=1
- beam_search_diverse: (10,5) → candidate_pool_size=5
- beam_search_diverse_concentrated:(10,2) → candidate_pool_size=2
- beam_search_diverse_clustered: (10,2) → candidate_pool_size=2
- beam_search_diverse_clustered_xl:(10,5) → candidate_pool_size=5
All declared ``num_workers`` values reconcile with the new formula.
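The reconciliation can be checked by hand with the formula above. In the sketch below the function signature is illustrative, and reading C as the bottleneck count, M as the model count, and K as samples per prompt is an assumption; the worker totals of 90 and 225 and the pool sizes of 2 and 5 are the ones quoted in these commit messages.

```python
from math import prod


def num_workers_needed(candidate_pool_size, num_bottlenecks, num_models, samples_per_prompt):
    """pool × bottlenecks × models × samples, as stated in the commit message."""
    return prod((candidate_pool_size, num_bottlenecks, num_models, samples_per_prompt))


# beam_search_diverse_clustered_xl: pool=5, C=5, M=3, K=3 (letter mapping assumed).
xl_workers = num_workers_needed(5, 5, 3, 3)

# beam_search_diverse_concentrated: pool=2 with the same C, M, K (assumed).
concentrated_workers = num_workers_needed(2, 5, 3, 3)
```

Since every pool member is now expanded each round, the pool size enters the product directly and there is no separate parent count to reconcile.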
Summary
Builds on #139 (multi-LLM beam search) to add a diversity-preserving selection layer plus the scheduling and locking refinements that the 225-worker XL preset needs.
The motivation: with #139's P × B × M × S fanout, the beam quickly fills with kernels that target the same optimization approach. PTX dedup catches byte-equivalents but can't tell apart kernels that differ structurally yet still pursue the same strategy. Without a diversity term, the beam collapses onto a single lineage and the search plateaus. This PR adds an LLM-classifier-driven technique-vector clustering layer that selects round-robin per cluster, preserving at least one representative of each distinct optimization approach.
It also includes the scheduling/locking changes that became necessary at 225-worker scale (dynamic GPU scheduler, manager-side lock-pool decouple, bottleneck-analysis caching), and an API cleanup that collapses two redundant knobs into one.
All new features are opt-in. Clustering is gated by technique_clustering.enabled: true; the existing #139 presets do not enable it.
Example commands
Two new presets ship with the PR:
Both presets configure models: [claude-opus-4.6, gpt-5-4, gemini-2-5-pro] and use claude-opus-4.6 as the technique classifier. Clustering remains off by default in the existing presets.
Experiment results
Two problems, 8 rounds each (full production runs):
Notes:
problem at fixed budget — clustering helps escape plateaus but isn't always the dominant intervention. Both configurations beat torch.compile.)
Verified live on this branch with a 2-round 90-worker clustered run on matvec:
What is changed