Skip to content

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes#1867

Merged
opsiff merged 7 commits into
deepin-community:linux-6.6.yfrom
opsiff:linux-6.6.y-2026-06-16-hugetlb-2
Jun 16, 2026
Merged

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes#1867
opsiff merged 7 commits into
deepin-community:linux-6.6.yfrom
opsiff:linux-6.6.y-2026-06-16-hugetlb-2

Conversation

@opsiff

@opsiff opsiff commented Jun 16, 2026

Copy link
Copy Markdown
Member

Link: https://lore.kernel.org/all/20231006184629.155543-3-nphamcs@gmail.com/T/#m0c97da970b68fd406ecfd5e6cf05b6069442c140

Changelog:
v4:
* Add another prep patch to clean up memory controller migration
logic.
* Fix an issue in hugetlb folio migration where the new folio
is not properly charged (patch 3) (reported by Mike Kravetz)
(suggested by Johannes Weiner).
v3:
* Add a prep patch at the start of the series to extend the memory
controller interface with new helper functions for hugetlb
accounting.
* Do not account hugetlb memory for memcontroller in cgroup v1
(patch 2) (suggested by Johannes Weiner).
* Change the gfp flag passed to mem cgroup charging (patch 2)
(suggested by Michal Hocko).
* Add caveats to cgroup admin guide and commit changelog
(patch 2) (suggested by Michal Hocko).
v2:
* Add a cgroup mount option to enable/disable the new hugetlb memcg
accounting behavior (patch 1) (suggested by Johannes Weiner).
* Add a couple more ksft_print_msg() on error to aid debugging when
the selftest fails. (patch 2)

Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic page, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab etc. fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for
e.g when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.

Nhat Pham (4):
memcontrol: add helpers for hugetlb memcg accounting
memcontrol: only transfer the memcg data for migration
hugetlb: memcg: account hugetlb-backed memory in memory controller
selftests: add a selftest to verify hugetlb usage in memcg

Documentation/admin-guide/cgroup-v2.rst | 29 +++
MAINTAINERS | 2 +
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 37 +++
kernel/cgroup/cgroup.c | 15 +-
mm/filemap.c | 2 +-
mm/hugetlb.c | 35 ++-
mm/memcontrol.c | 139 +++++++++--
mm/migrate.c | 3 +-
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
.../selftests/cgroup/test_hugetlb_memcg.c | 234 ++++++++++++++++++
12 files changed, 478 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c

--
2.34.1

Summary by Sourcery

Enable optional accounting of hugetlb-backed memory in the memory controller and update hugetlb and memcg handling to support correct charging, migration, and reclamation behavior.

New Features:

  • Introduce a cgroup v2 root mount option to enable hugetlb accounting in the memory controller.
  • Charge and uncharge hugetlb folios to memcgs so hugepage usage contributes to memory.current and related limits and protections.
  • Add helpers to obtain the current task's memcg and to explicitly try, commit, and cancel hugetlb-related memcg charges.

Bug Fixes:

  • Correct memcg handling during folio migration so the memcg metadata is transferred without double-charging or leaks, including for hugetlb folios.
  • Fix interactions between folio migration and the deferred split queue for large rmappable folios.
  • Remove size-based restrictions from hugetlb cgroup tracking so smaller hugepages are properly accounted.
  • Tighten validation of hugetlb hstate order to avoid invalid small hugepage configurations.

Enhancements:

  • Refactor memcg charging code to use exported helpers for committing and cancelling charges and distinguish replacement from migration of folios.
  • Extend cgroup v2 feature reporting and root flag handling to cover hugetlb memory accounting.
  • Simplify hugetlb cgroup file creation by initializing controls for all hstates.

Documentation:

  • Document the new hugetlb memory accounting behavior and mount option in the cgroup v2 admin guide.

Tests:

  • Add a cgroup selftest that validates hugetlb usage is reflected in memory.current when hugetlb accounting is enabled.

Frank van der Linden and others added 5 commits June 16, 2026 16:45
mainline inclusion
from mainline-v6.7-rc1
category: performance

Originally, hugetlb_cgroup was the only hugetlb user of tail page
structure fields.  So, the code defined and checked against
HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use.

However, by now, tail page #2 is used to store hugetlb hwpoison and
subpool information as well.  In other words, without that tail page
hugetlb doesn't work.

Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and
checks against it.  Instead, just check for the minimum viable page order
at hstate creation time.

Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 59838b2)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Patch series "hugetlb memcg accounting", v4.

Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory.  This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers.  The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etcetera fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max.  Similar procedure is done to other memory limits
(memory.low for e.g).  However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed.  In
addition, a new selftest is added to demonstrate and verify this new
behavior.

This patch (of 4):

This patch exposes charge committing and cancelling as parts of the memory
controller interface.  These functionalities are useful when the
try_charge() and commit_charge() stages have to be separated by other
actions in between (which can fail).  One such example is the new hugetlb
accounting behavior in the following patch.

The patch also adds a helper function to obtain a reference to the
current task's memcg.

Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4b56938)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

For most migration use cases, only transfer the memcg data from the old
folio to the new folio, and clear the old folio's memcg data.  No charging
and uncharging will be done.

This shaves off some work on the migration path, and avoids the temporary
double charging of a folio during its migration.

The only exception is replace_page_cache_folio(), which will use the old
mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio).  In that
context, the isolation of the old page isn't quite as thorough as with
migration, so we cannot use our new implementation directly.

This patch is the result of the following discussion on the new hugetlb
memcg accounting behavior:

https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/

Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 85ce2c5)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory.  This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers.  The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc.  fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max.  Similar procedure is done to other memory limits
(memory.low for e.g).  However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller).  Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
  * This feature is only available on cgroup v2.
  * There is no hugetlb pool management involved in the memory
    controller. As stated above, hugetlb folios are only charged towards
    the memory controller when it is used. Host overcommit management
    has to consider it when configuring hard limits.
  * Failure to charge towards the memcg results in SIGBUS. This could
    happen even if the hugetlb pool still has pages (but the cgroup
    limit is hit and reclaim attempt fails).
  * When this feature is enabled, hugetlb pages contribute to memory
    reclaim protection. low, min limits tuning must take into account
    hugetlb memory.
  * Hugetlb pages utilized while this option is not selected will not
    be tracked by the memory controller (even if cgroup v2 is remounted
    later on).

Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8cba957)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

This patch add a new kselftest to demonstrate and verify the new hugetlb
memcg accounting behavior.

Link: https://lkml.kernel.org/r/20231006184629.155543-5-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit c0dddb7)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
@sourcery-ai

sourcery-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Reviewer's Guide

Implements hugetlb-backed memory accounting in the memory controller for cgroup v2, adds a mount-time feature flag to enable it, refactors memcg charging/migration helpers to support hugetlb folios correctly (including migration paths and large-folio handling), tightens hugetlb/hugetlb_cgroup invariants, and adds a kselftest to validate the new behavior.

Sequence diagram for hugetlb folio allocation with memcg accounting

sequenceDiagram
    participant Task as current_task
    participant Memcg as mem_cgroup
    participant Hugetlb as alloc_hugetlb_folio

    Task->>Hugetlb: alloc_hugetlb_folio(vma, addr, avoid_reserve)
    Hugetlb->>Memcg: get_mem_cgroup_from_current()
    Memcg-->>Hugetlb: memcg
    Hugetlb->>Memcg: mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages)
    alt charge_failed
        Memcg-->>Hugetlb: -ENOMEM
        Hugetlb->>Memcg: mem_cgroup_put(memcg)
        Hugetlb-->>Task: ERR_PTR(-ENOMEM)
    else charge_ok_or_not_supported
        Memcg-->>Hugetlb: 0 or -EOPNOTSUPP
        Hugetlb->>Hugetlb: vma_needs_reservation()
        alt reservation_or_subpool_fail
            Hugetlb->>Memcg: mem_cgroup_cancel_charge(memcg, nr_pages) [if charge_ok]
            Hugetlb->>Memcg: mem_cgroup_put(memcg)
            Hugetlb-->>Task: ERR_PTR(-ENOMEM or -ENOSPC)
        else hugetlb_folio_allocated
            Hugetlb-->>Task: folio
            opt charge_ok
                Hugetlb->>Memcg: mem_cgroup_commit_charge(folio, memcg)
            end
            Hugetlb->>Memcg: mem_cgroup_put(memcg)
        end
    end

    rect rgb(230,230,230)
    Note over Hugetlb,Memcg: On free_huge_folio(folio): mem_cgroup_uncharge(folio)
    end
Loading

Flow diagram for enabling hugetlb memcg accounting via cgroup v2 mount option

flowchart LR
    A[cgroup2_parse_param]
    B[ctx->flags &#124;= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
    C[apply_cgroup_root_flags]
    D[cgrp_dfl_root.flags &#124;= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
    E[mem_cgroup_hugetlb_try_charge]
    F[hugetlb memcg accounting active]

    A -->|memory_hugetlb_accounting| B --> C --> D --> E
    E -->|flag set and v2 memcg| F
    E -->|flag not set or !v2| F_no[skip memcg charge, return -EOPNOTSUPP]
Loading

File-Level Changes

Change Details Files
Introduce reusable memcg helpers for charging, committing, canceling, and migrating folio charges, and separate page-cache replacement from generic migration semantics.
  • Add get_mem_cgroup_from_current() to safely grab current task’s memcg under RCU with css_tryget retry loop.
  • Expose mem_cgroup_cancel_charge() and mem_cgroup_commit_charge() as exported helpers and refactor charge_memcg() to use them.
  • Add mem_cgroup_hugetlb_try_charge() that conditionally charges hugetlb allocations only when v2 + memory_hugetlb_accounting are enabled and returns -EOPNOTSUPP/-ENOMEM appropriately.
  • Split mem_cgroup_migrate() into mem_cgroup_replace_folio() (page-cache replacement that moves charge and stats) and a new mem_cgroup_migrate() that only transfers memcg metadata without touching counters, updating callers accordingly.
  • Provide stub inline implementations of the new helpers for !CONFIG_MEMCG builds.
mm/memcontrol.c
include/linux/memcontrol.h
mm/filemap.c
Add cgroup v2 mount option and feature flag to control hugetlb accounting by the memory controller and document the behavior.
  • Extend cgroup2 fs parameter parsing to accept memory_hugetlb_accounting and plumb it through apply_cgroup_root_flags() to toggle CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING.
  • Expose the new feature in the cgroup root feature set and seqfs show-options output so users can detect and introspect the knob.
  • Introduce CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING flag in the cgroup root flags enum and update documentation to describe the new mount option and its caveats.
  • Register the new selftest and maintainer entries, and ignore the compiled test binary in git.
kernel/cgroup/cgroup.c
include/linux/cgroup-defs.h
Documentation/admin-guide/cgroup-v2.rst
MAINTAINERS
tools/testing/selftests/cgroup/.gitignore
tools/testing/selftests/cgroup/Makefile
Charge hugetlb folios to memcgs at allocation/free time, integrate with hugetlb_cgroup, and fix migration/large-folio corner cases.
  • In alloc_hugetlb_folio(), obtain current’s memcg, attempt a pre-charge via mem_cgroup_hugetlb_try_charge() with hugetlb-appropriate GFP mask, and on success commit the charge after a folio is allocated, canceling and dropping the reference on every failure path.
  • Ensure free_huge_folio() uncharges the associated memcg when a hugetlb folio is freed so memory.current reflects hugetlb usage.
  • Remove the minimum-order restriction from hugetlb cgroup charge/uncharge paths and always create hugetlb cgroup files per hstate, allowing tracking of smaller hugepage sizes.
  • Tighten hugetlb hstate validation by requiring the order to be at least order_base_2(__NR_USED_SUBPAGE) instead of non-zero, matching metadata layout constraints.
  • Update folio migration to always call mem_cgroup_migrate(), including hugetlb folios, and to carefully undo large-folio rmappable state while ref-frozen so that deferred split queues and memcg data stay consistent during migration.
mm/hugetlb.c
mm/hugetlb_cgroup.c
include/linux/hugetlb_cgroup.h
mm/migrate.c
Add a kselftest that verifies hugetlb usage is reflected in memory.current when memory_hugetlb_accounting is enabled on a cgroup v2 hierarchy.
  • Introduce test_hugetlb_memcg.c which sets up a dedicated cgroup with memory.max and memory.swap.max, adjusts nr_hugepages, and then mmaps/reads/writes/unmaps a 2MB-hugetlb-backed range while sampling memory.current to ensure usage deltas roughly match 2MB and 8MB steps and then return to baseline.
  • Use proc_mount_contains() to skip when the cgroup v2 mount doesn’t have memory_hugetlb_accounting, verify the host uses 2MB hugepages via /proc/meminfo, and integrate with existing cgroup_util helpers and kselftest result reporting.
  • Wire the new test into the cgroup selftests build via the Makefile and ignore its binary in .gitignore.
tools/testing/selftests/cgroup/test_hugetlb_memcg.c
tools/testing/selftests/cgroup/Makefile
tools/testing/selftests/cgroup/.gitignore

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
  • The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
- The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.

## Individual Comments

### Comment 1
<location path="tools/testing/selftests/cgroup/test_hugetlb_memcg.c" line_range="103" />
<code_context>
+	int ret = EXIT_FAILURE;
+
+	old_current = cg_read_long(test_group, "memory.current");
+	set_nr_hugepages(20);
+	current = cg_read_long(test_group, "memory.current");
+	if (current - old_current >= MB(2)) {
</code_context>
<issue_to_address>
**suggestion (testing):** Preserve and restore nr_hugepages so the test does not leak global state into other tests

Since this is a global kernel setting, please read and store the original `/proc/sys/vm/nr_hugepages` value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.

Suggested implementation:

```c
static long get_nr_hugepages(void)
{
	FILE *f;
	long val = -1;

	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (!f)
		return -1;

	if (fscanf(f, "%ld", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}

static int hugetlb_test_program(const char *cgroup, void *arg)

```

```c
{
	char *test_group = (char *)arg;
	void *addr;
	long old_current, expected_current, current;
	long old_nr_hugepages;
	int ret = EXIT_FAILURE;

	old_nr_hugepages = get_nr_hugepages();
	if (old_nr_hugepages < 0) {
		ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n");
		return EXIT_FAILURE;
	}

	old_current = cg_read_long(test_group, "memory.current");

```

```c
	set_nr_hugepages(20);
	current = cg_read_long(test_group, "memory.current");
	if (current - old_current >= MB(2)) {
		ksft_print_msg(
			"setting nr_hugepages should not increase hugepage usage.\n");
		ksft_print_msg("before: %ld, after: %ld\n", old_current, current);
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
	if (addr == MAP_FAILED) {
		ksft_print_msg("fail to mmap.\n");
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;

```

To fully avoid leaking global state, you should also:

1. Restore `nr_hugepages` on all other exit paths in `hugetlb_test_program`:
   - Any other `return` statements later in the function should be updated to call `set_nr_hugepages(old_nr_hugepages);` immediately before returning.
2. If the test has a "success" path at the end of the function (e.g. `ret = 0;` / `return ret;`), ensure that `set_nr_hugepages(old_nr_hugepages);` is called there as well before returning.
3. If the test can be skipped via `ksft_test_result_skip()` or similar, make sure `nr_hugepages` is restored before that path returns from `hugetlb_test_program`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

int ret = EXIT_FAILURE;

old_current = cg_read_long(test_group, "memory.current");
set_nr_hugepages(20);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Preserve and restore nr_hugepages so the test does not leak global state into other tests

Since this is a global kernel setting, please read and store the original /proc/sys/vm/nr_hugepages value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.

Suggested implementation:

static long get_nr_hugepages(void)
{
	FILE *f;
	long val = -1;

	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (!f)
		return -1;

	if (fscanf(f, "%ld", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}

static int hugetlb_test_program(const char *cgroup, void *arg)
{
	char *test_group = (char *)arg;
	void *addr;
	long old_current, expected_current, current;
	long old_nr_hugepages;
	int ret = EXIT_FAILURE;

	old_nr_hugepages = get_nr_hugepages();
	if (old_nr_hugepages < 0) {
		ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n");
		return EXIT_FAILURE;
	}

	old_current = cg_read_long(test_group, "memory.current");
	set_nr_hugepages(20);
	current = cg_read_long(test_group, "memory.current");
	if (current - old_current >= MB(2)) {
		ksft_print_msg(
			"setting nr_hugepages should not increase hugepage usage.\n");
		ksft_print_msg("before: %ld, after: %ld\n", old_current, current);
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
	if (addr == MAP_FAILED) {
		ksft_print_msg("fail to mmap.\n");
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;

To fully avoid leaking global state, you should also:

  1. Restore nr_hugepages on all other exit paths in hugetlb_test_program:
    • Any other return statements later in the function should be updated to call set_nr_hugepages(old_nr_hugepages); immediately before returning.
  2. If the test has a "success" path at the end of the function (e.g. ret = 0; / return ret;), ensure that set_nr_hugepages(old_nr_hugepages); is called there as well before returning.
  3. If the test can be skipped via ksft_test_result_skip() or similar, make sure nr_hugepages is restored before that path returns from hugetlb_test_program.

Baolin Wang and others added 2 commits June 16, 2026 17:25
mainline inclusion
from mainline-v6.7
category: bugfix

When running autonuma with enabling multi-size THP, I encountered the
following kernel crash issue:

[  134.290216] list_del corruption. prev->next should be fffff9ad42e1c490,
but was dead000000000100. (prev=fffff9ad42399890)
[  134.290877] kernel BUG at lib/list_debug.c:62!
[  134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted:
G            E      6.7.0-rc4+ deepin-community#20
[  134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
......
[  134.294252] Call Trace:
[  134.294362]  <TASK>
[  134.294440]  ? die+0x33/0x90
[  134.294561]  ? do_trap+0xe0/0x110
......
[  134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
[  134.295842]  folio_undo_large_rmappable+0x99/0x100
[  134.296003]  destroy_large_folio+0x68/0x70
[  134.296172]  migrate_folio_move+0x12e/0x260
[  134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
[  134.296389]  migrate_pages_batch+0x495/0x6b0
[  134.296523]  migrate_pages+0x1d0/0x500
[  134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
[  134.296799]  migrate_misplaced_folio+0x12d/0x2b0
[  134.296953]  do_numa_page+0x1f4/0x570
[  134.297121]  __handle_mm_fault+0x2b0/0x6c0
[  134.297254]  handle_mm_fault+0x107/0x270
[  134.300897]  do_user_addr_fault+0x167/0x680
[  134.304561]  exc_page_fault+0x65/0x140
[  134.307919]  asm_exc_page_fault+0x22/0x30

The reason for the crash is that, the commit 85ce2c5 ("memcontrol:
only transfer the memcg data for migration") removed the charging and
uncharging operations of the migration folios and cleared the memcg data
of the old folio.

During the subsequent release process of the old large folio in
destroy_large_folio(), if the large folio needs to be removed from the
split queue, an incorrect split queue can be obtained (which is
pgdat->deferred_split_queue) because the old folio's memcg is NULL now.
This can lead to list operations being performed under the wrong split
queue lock protection, resulting in a list crash as above.

After the migration, the old folio is going to be freed, so we can remove
it from the split queue in mem_cgroup_migrate() a bit earlier before
clearing the memcg data to avoid getting incorrect split queue.

[akpm@linux-foundation.org: fix comment, per Zi Yan]
Link: https://lkml.kernel.org/r/61273e5e9b490682388377c20f52d19de4a80460.1703054559.git.baolin.wang@linux.alibaba.com
Fixes: 85ce2c5 ("memcontrol: only transfer the memcg data for migration")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[mm: always initialise folio->_deferred_list merged before]
[Rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()]
Conflicts:
	mm/huge_memory.c
(cherry picked from commit 9bcef59)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.10
category: bugfix

Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.

6.7 commit 9bcef59 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c5 ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.

Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).

Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
Fixes: 9bcef59 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()]
(cherry picked from commit be9581e)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
TEST_GEN_PROGS += test_cpu
TEST_GEN_PROGS += test_cpuset
TEST_GEN_PROGS += test_zswap
TEST_GEN_PROGS += test_hugetlb_memcg

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.

I will take time to design and deploy it.

@deepin-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds optional memory-controller (memcg) accounting for HugeTLB-backed memory on cgroup v2, aligns migration/replacement paths to handle memcg metadata correctly (including HugeTLB folios), and introduces a selftest to validate the new accounting behavior.

Changes:

  • Add a cgroup v2 mount option (memory_hugetlb_accounting) and feature reporting to opt into HugeTLB accounting in the memory controller.
  • Implement memcg charge/uncharge and migration metadata transfer for HugeTLB folios; split “replacement” vs “migration” memcg handling for folios.
  • Add a cgroup selftest to ensure HugeTLB usage is reflected in memory.current when the mount option is enabled.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
Documentation/admin-guide/cgroup-v2.rst Documents the new memory_hugetlb_accounting mount option and caveats.
MAINTAINERS Adds the new selftest to relevant maintainer entries.
include/linux/cgroup-defs.h Introduces CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING root flag.
include/linux/memcontrol.h Declares new memcg helpers for hugetlb charging and folio replacement/migration.
kernel/cgroup/cgroup.c Parses, applies, shows, and reports the new cgroup v2 mount option/feature.
mm/filemap.c Switches page-cache replacement to the new mem_cgroup_replace_folio() helper.
mm/hugetlb.c Adds memcg charge on HugeTLB allocation and uncharge on free; tightens hstate order validation.
mm/memcontrol.c Adds/exports memcg charge commit/cancel helpers, current-task memcg getter, and correct migration metadata transfer.
mm/migrate.c Ensures memcg data is migrated for all folios and adjusts deferred-split queue handling.
tools/testing/selftests/cgroup/.gitignore Ignores the newly built test_hugetlb_memcg binary.
tools/testing/selftests/cgroup/Makefile Builds the new HugeTLB memcg selftest.
tools/testing/selftests/cgroup/test_hugetlb_memcg.c New selftest validating HugeTLB accounting against memory.current.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

memory_hugetlb_accounting
Count HugeTLB memory usage towards the cgroup's overall
memory usage for the memory controller (for the purpose of
statistics reporting and memory protetion). This is a new
Comment thread mm/memcontrol.c
Comment on lines +7681 to +7682
* For e.g, itt could have been allocated when memory_hugetlb_accounting
* was not selected.
Comment on lines +116 to +120

/*
* Enable hugetlb accounting for the memory controller.
*/
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
Comment on lines +4 to +11
#include <linux/limits.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include "../kselftest.h"
#include "cgroup_util.h"
Comment on lines +55 to +59
val = strtol(p, &q, 0);
if (*q != ' ') {
/* Error parsing the file */
return -1;
}
Comment on lines +102 to +105
old_current = cg_read_long(test_group, "memory.current");
set_nr_hugepages(20);
current = cg_read_long(test_group, "memory.current");
if (current - old_current >= MB(2)) {
@opsiff opsiff merged commit 09490f6 into deepin-community:linux-6.6.y Jun 16, 2026
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants