[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: SMT-aware asymmetric CPU capacity idle selection #406

Closed
arighi wants to merge 4 commits into NVIDIA:26.04_linux-nvidia-bos from arighi:linux-nvidia-7.0-bos

Collaborator

@arighi arighi commented May 5, 2026

On Vera Rubin, the firmware exposes CPUs with different capacities through ACPI/CPPC. Unlike Grace systems, Vera Rubin also supports SMT. As a result, the Linux scheduler enables the asymmetric CPU capacity idle selection policy, but the current implementation is not SMT-aware. This can lead to suboptimal task placement, where tasks are scheduled on both SMT siblings of the same core even when fully idle SMT cores are available elsewhere in the system.

This behavior can significantly reduce performance: slowdowns of up to 2x have been observed in certain CPU-intensive workloads.

This series is a backport of the upstream patch series available at:
https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

NOTE: the original series includes additional patches that are not needed in linux-nvidia-7.0:

  • PATCH 1/6 is a refactoring that is valid only in kernel >= 7.0, because it requires 71fedc41c23b ("sched/fair: Switch to rcu_dereference_all()"), and is not worth backporting;
  • PATCH 6/6 is incorrect and will be dropped, so it is not backported.

The series is currently under review on the mailing list, but consensus has been reached with the scheduler maintainers and the changes are expected to be merged for v7.2.

Given the potential impact on Vera Rubin performance, it seems reasonable to backport and apply these patches to the linux-nvidia kernel and carry them as NVIDIA SAUCE for now, until the upstream solution becomes available.

The patch series has been validated on both Vera and Grace with DCPerf MediaWiki and benchblas (NVBLAS).

NOTE: the same series has been applied to the linux-nvidia-6.17 kernel (see also #395)


LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-bos/+bug/2150671

@github-actions
Contributor

github-actions Bot commented May 5, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 4 commits...

Cherry-pick digest:
E: 9a5f1d28d23a ("NVIDIA: VR: SAUCE: sched/fair: Attach sc"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)
┌──────────────┬───────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject           │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ ba65b4eb0735 │ [SAUCE] sched/fair: add sis_util support to s │ N/A        │ N/A     │ arighi, nayak, arighi     │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fea7499e89a7 │ [SAUCE] sched/fair: reject misfit pulls onto  │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ eeb53bc4dac6 │ [SAUCE] sched/fair: prefer fully-idle smt cor │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9a5f1d28d23a │ sched/fair: attach sched_domain_shared to sd_ │ MISMATCH   │ found   │ ok, backporter: arighi    │
└──────────────┴───────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

Collaborator

@jamieNguyenNVIDIA jamieNguyenNVIDIA left a comment

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nvmochs
Collaborator

nvmochs commented May 5, 2026

Since this is VR-specific, can you make the tag "NVIDIA: VR: SAUCE" ?

@arighi
Collaborator Author

arighi commented May 7, 2026

> Since this is VR-specific, can you make the tag "NVIDIA: VR: SAUCE" ?

@nvmochs Actually the last patch (sched/fair: Add SIS_UTIL support to select_idle_capacity()) is also affecting Grace, because it's introducing the same SIS_UTIL logic also for the asym-cpu-capacity idle selection (with and without SMT). And I tested this on Grace as well with DCPerf MediaWiki and benchblas (NVBLAS) and it seems to provide small (<5%) but consistent improvements also there. Sorry, I should have mentioned this in the PR description.

Should we use NVIDIA: VR: SAUCE: for the first 3 patches and NVIDIA: SAUCE: for the last one?

@nvmochs
Collaborator

nvmochs commented May 8, 2026

> Since this is VR-specific, can you make the tag "NVIDIA: VR: SAUCE" ?
>
> @nvmochs Actually the last patch (sched/fair: Add SIS_UTIL support to select_idle_capacity()) is also affecting Grace, because it's introducing the same SIS_UTIL logic also for the asym-cpu-capacity idle selection (with and without SMT). And I tested this on Grace as well with DCPerf MediaWiki and benchblas (NVBLAS) and it seems to provide small (<5%) but consistent improvements also there. Sorry, I should have mentioned this in the PR description.
>
> Should we use NVIDIA: VR: SAUCE: for the first 3 patches and NVIDIA: SAUCE: for the last one?

I think for simplicity, let's just use NVIDIA: VR: SAUCE: for all 4 as I don't think we'll document the "Add SIS_UTIL" as a Grace-specific patch.

arighi and others added 4 commits May 8, 2026 16:53
…cpucapacity

BugLink: https://bugs.launchpad.net/bugs/2150671

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint however lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the wider asym domain; nr_busy_cpus also
lives in the same shared sched_domain data, but it's never used in the
asym CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.
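
The aliasing concern in case 3 can be illustrated with a small toy model. This is only a sketch under stated assumptions: spans are modeled as plain bitmasks, and `toy_first_cpu()` and `toy_spans_alias()` are hypothetical helpers standing in for the idea of keying a shared object off the first CPU of a span. They are not kernel functions.

```c
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for "first CPU of a span": index of the lowest set bit,
 * or -1 for an empty mask. Hypothetical helper, not a kernel API. */
static int toy_first_cpu(unsigned long span)
{
    for (int cpu = 0; cpu < (int)(sizeof(span) * CHAR_BIT); cpu++)
        if (span & (1UL << cpu))
            return cpu;
    return -1;
}

/* Two different spans alias when they would be keyed by the same first CPU. */
static bool toy_spans_alias(unsigned long a, unsigned long b)
{
    int first_a = toy_first_cpu(a);

    return a != b && first_a >= 0 && first_a == toy_first_cpu(b);
}
```

For example, the overlapping spans {0,1,2} (`0x7`) and {0,2,3} (`0xD`) both start at CPU 0, so a key derived only from the first CPU cannot tell them apart, while the disjoint spans {0,1} and {2,3} get distinct keys.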

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(backported from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
[ arighi:
   - backport full logic to attach sd->shared in build_sched_domains()
   - do not rename sd_llc_shared to reduce the risk of conflicts ]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
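
The attachment rule described in the commit message above can be summarized as a small decision function. This is a simplified sketch, not the code from the patch: the enum, the function name, and the two boolean inputs are hypothetical and only mirror the conditions named in the text.

```c
#include <stdbool.h>

/* Hypothetical labels for where the shared object ends up. */
enum toy_shared_anchor {
    TOY_ANCHOR_LLC,  /* keep sched_domain_shared on sd_llc */
    TOY_ANCHOR_ASYM, /* move it to sd_asym_cpucapacity */
};

/*
 * has_asym_full_ancestor: the CPU has an SD_ASYM_CPUCAPACITY_FULL ancestor.
 * asym_built_from_numa:   that ancestor is an overlapping, SD_NUMA-built domain.
 */
static enum toy_shared_anchor
toy_pick_shared_anchor(bool has_asym_full_ancestor, bool asym_built_from_numa)
{
    /* Cases 1 and 2: symmetric system, or a symmetric cpuset island. */
    if (!has_asym_full_ancestor)
        return TOY_ANCHOR_LLC;

    /* Case 3: the asym domain overlaps, so stay on the LLC. */
    if (asym_built_from_numa)
        return TOY_ANCHOR_LLC;

    /* Scope of has_idle_cores now matches the scope of the wakeup scan. */
    return TOY_ANCHOR_ASYM;
}
```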
…pacity idle selection

BugLink: https://bugs.launchpad.net/bugs/2150671

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate
capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads when running a number of tasks equal to the number
of SMT cores.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
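
The preference described in the commit message above can be sketched with a toy model. This is an illustration only: `struct toy_cpu` and both helpers are hypothetical, each core is assumed to have at most two hardware threads, and the real select_idle_capacity() also weighs task utilization and uclamp, which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a CPU; not a kernel data structure. */
struct toy_cpu {
    unsigned long capacity; /* nominal capacity, e.g. 1024 for the biggest CPU */
    bool idle;              /* true if this hardware thread is idle */
    int sibling;            /* index of the SMT sibling, or -1 without SMT */
};

/* A core is fully idle when the CPU and its sibling (if any) are both idle. */
static bool toy_core_fully_idle(const struct toy_cpu *cpus, int cpu)
{
    int sib = cpus[cpu].sibling;

    if (!cpus[cpu].idle)
        return false;
    return sib < 0 || cpus[sib].idle;
}

/*
 * Pick an idle CPU, ranking candidates first by "is on a fully-idle core"
 * and then by nominal capacity. Returns the CPU index, or -1 if none is idle.
 */
static int toy_pick_idle_cpu(const struct toy_cpu *cpus, size_t nr)
{
    int best = -1;
    bool best_full = false;

    for (size_t i = 0; i < nr; i++) {
        bool full;

        if (!cpus[i].idle)
            continue;
        full = toy_core_fully_idle(cpus, (int)i);
        if (best < 0 ||
            (full && !best_full) ||
            (full == best_full && cpus[i].capacity > cpus[best].capacity)) {
            best = (int)i;
            best_full = full;
        }
    }
    return best;
}
```

In this model, an idle thread of capacity 1024 whose sibling is busy loses to an idle thread of capacity 900 on a fully-idle core, which is the trade-off the patch makes.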
…ings on asym-capacity

BugLink: https://bugs.launchpad.net/bugs/2150671

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
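
The rejection rule described in the commit message above can be sketched as a predicate. This is a hedged illustration: the function name and parameters are hypothetical, and the plain "greater than" capacity comparison is an assumption made for the sketch rather than the comparison used by the load balancer.

```c
#include <stdbool.h>

/*
 * Decide whether dst_cpu is a worthwhile destination for a misfit task.
 * All parameters are hypothetical inputs for this toy model.
 */
static bool toy_can_pull_misfit(bool smt_active, bool dst_core_fully_idle,
                                unsigned long dst_capacity,
                                unsigned long src_capacity)
{
    /*
     * With SMT active, a busy sibling means the nominal capacity of
     * dst_cpu overstates what the core can deliver: skip it.
     */
    if (smt_active && !dst_core_fully_idle)
        return false;

    /* Assumption: only pull when the destination is a capacity upgrade. */
    return dst_capacity > src_capacity;
}
```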
…pacity()

BugLink: https://bugs.launchpad.net/bugs/2150671

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the LLC domain has sched_domain_shared data, derive the
per-attempt scan limit from sd->shared->nr_idle_scan.

That bounds the walk on large LLCs and allows an early return once the
scan limit is reached, if we already picked a sufficiently strong
idle-core candidate (best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT).

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
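
The bounded scan described in the commit message above can be sketched as follows. This is a simplified model, not the patch itself: `struct toy_scan_cpu` and `toy_bounded_scan()` are hypothetical, a negative limit stands in for "SIS_UTIL disabled or no shared data", and the scan stops unconditionally at the limit instead of checking whether the current candidate is strong enough.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy CPU for the scan; not a kernel data structure. */
struct toy_scan_cpu {
    unsigned long capacity;
    bool idle;
};

/*
 * Visit at most nr_idle_scan CPUs and return the highest-capacity idle CPU
 * seen so far, or -1 if none was found within the budget.
 * nr_idle_scan < 0 means "no limit"; nr_idle_scan == 0 skips the scan.
 */
static int toy_bounded_scan(const struct toy_scan_cpu *cpus, size_t nr_cpus,
                            int nr_idle_scan)
{
    int best = -1;
    int scanned = 0;

    for (size_t i = 0; i < nr_cpus; i++) {
        /* Budget exhausted: return whatever candidate we already have. */
        if (nr_idle_scan >= 0 && scanned == nr_idle_scan)
            break;
        scanned++;

        if (!cpus[i].idle)
            continue;
        if (best < 0 || cpus[i].capacity > cpus[best].capacity)
            best = (int)i;
    }
    return best;
}
```

The trade-off is visible in the model: a tight budget can return a weaker candidate than an unbounded walk would find, in exchange for a bounded wakeup cost on large domains.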
@arighi arighi force-pushed the linux-nvidia-7.0-bos branch from 3bbc958 to ba65b4e Compare May 8, 2026 14:58
Collaborator

@clsotog clsotog left a comment

Acked-by: Carol L Soto <csoto@nvidia.com>

@nvmochs
Collaborator

nvmochs commented May 8, 2026

Thanks Andrea!

No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@nvmochs
Collaborator

nvmochs commented May 8, 2026

Merged, closing PR.

948d62aafed3 (nresolute/nvidia-bos-next) NVIDIA: VR: SAUCE: sched/fair: Add SIS_UTIL support to select_idle_capacity()
603541313076 NVIDIA: VR: SAUCE: sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
0f4e6215a473 NVIDIA: VR: SAUCE: sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
af65f59bffe0 NVIDIA: VR: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

@nvmochs nvmochs closed this May 8, 2026