[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes by opsiff · Pull Request #1867 · deepin-community/kernel

opsiff · 2026-06-16T08:49:43Z

Link: https://lore.kernel.org/all/20231006184629.155543-3-nphamcs@gmail.com/T/#m0c97da970b68fd406ecfd5e6cf05b6069442c140

Changelog:
v4:
* Add another prep patch to clean up memory controller migration
logic.
* Fix an issue in hugetlb folio migration where the new folio
is not properly charged (patch 3) (reported by Mike Kravetz)
(suggested by Johannes Weiner).
v3:
* Add a prep patch at the start of the series to extend the memory
controller interface with new helper functions for hugetlb
accounting.
* Do not account hugetlb memory for memcontroller in cgroup v1
(patch 2) (suggested by Johannes Weiner).
* Change the gfp flag passed to mem cgroup charging (patch 2)
(suggested by Michal Hocko).
* Add caveats to cgroup admin guide and commit changelog
(patch 2) (suggested by Michal Hocko).
v2:
* Add a cgroup mount option to enable/disable the new hugetlb memcg
accounting behavior (patch 1) (suggested by Johannes Weiner).
* Add a couple more ksft_print_msg() on error to aid debugging when
the selftest fails. (patch 2)

Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic page, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab etc. fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for
e.g when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.

Nhat Pham (4):
memcontrol: add helpers for hugetlb memcg accounting
memcontrol: only transfer the memcg data for migration
hugetlb: memcg: account hugetlb-backed memory in memory controller
selftests: add a selftest to verify hugetlb usage in memcg

--
2.34.1

Summary by Sourcery

Enable optional accounting of hugetlb-backed memory in the memory controller and update hugetlb and memcg handling to support correct charging, migration, and reclamation behavior.

New Features:

Introduce a cgroup v2 root mount option to enable hugetlb accounting in the memory controller.
Charge and uncharge hugetlb folios to memcgs so hugepage usage contributes to memory.current and related limits and protections.
Add helpers to obtain the current task's memcg and to explicitly try, commit, and cancel hugetlb-related memcg charges.

Bug Fixes:

Correct memcg handling during folio migration so the memcg metadata is transferred without double-charging or leaks, including for hugetlb folios.
Fix interactions between folio migration and the deferred split queue for large rmappable folios.
Remove size-based restrictions from hugetlb cgroup tracking so smaller hugepages are properly accounted.
Tighten validation of hugetlb hstate order to avoid invalid small hugepage configurations.

Enhancements:

Refactor memcg charging code to use exported helpers for committing and cancelling charges and distinguish replacement from migration of folios.
Extend cgroup v2 feature reporting and root flag handling to cover hugetlb memory accounting.
Simplify hugetlb cgroup file creation by initializing controls for all hstates.

Documentation:

Document the new hugetlb memory accounting behavior and mount option in the cgroup v2 admin guide.

Tests:

Add a cgroup selftest that validates hugetlb usage is reflected in memory.current when hugetlb accounting is enabled.

mainline inclusion from mainline-v6.7-rc1 category: performance Originally, hugetlb_cgroup was the only hugetlb user of tail page structure fields. So, the code defined and checked against HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use. However, by now, tail page #2 is used to store hugetlb hwpoison and subpool information as well. In other words, without that tail page hugetlb doesn't work. Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and checks against it. Instead, just check for the minimum viable page order at hstate creation time. Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 59838b2) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

mainline inclusion from mainline-v6.7-rc1 category: performance Patch series "hugetlb memcg accounting", v4. Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etcetera fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch series rectifies this issue by charging the memcg when the hugetlb folio is allocated, and uncharging when the folio is freed. In addition, a new selftest is added to demonstrate and verify this new behavior. This patch (of 4): This patch exposes charge committing and cancelling as parts of the memory controller interface. These functionalities are useful when the try_charge() and commit_charge() stages have to be separated by other actions in between (which can fail). One such example is the new hugetlb accounting behavior in the following patch. The patch also adds a helper function to obtain a reference to the current task's memcg. Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 4b56938) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

mainline inclusion from mainline-v6.7-rc1 category: performance For most migration use cases, only transfer the memcg data from the old folio to the new folio, and clear the old folio's memcg data. No charging and uncharging will be done. This shaves off some work on the migration path, and avoids the temporary double charging of a folio during its migration. The only exception is replace_page_cache_folio(), which will use the old mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that context, the isolation of the old page isn't quite as thorough as with migration, so we cannot use our new implementation directly. This patch is the result of the following discussion on the new hugetlb memcg accounting behavior: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/ Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 85ce2c5) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

mainline inclusion from mainline-v6.7-rc1 category: performance Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etc. fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg. Some caveats to consider: * This feature is only available on cgroup v2. * There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits. * Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails). * When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory. * Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on). Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 8cba957) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

mainline inclusion from mainline-v6.7-rc1 category: performance This patch add a new kselftest to demonstrate and verify the new hugetlb memcg accounting behavior. Link: https://lkml.kernel.org/r/20231006184629.155543-5-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit c0dddb7) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

sourcery-ai · 2026-06-16T08:49:50Z

Reviewer's Guide

Implements hugetlb-backed memory accounting in the memory controller for cgroup v2, adds a mount-time feature flag to enable it, refactors memcg charging/migration helpers to support hugetlb folios correctly (including migration paths and large-folio handling), tightens hugetlb/hugetlb_cgroup invariants, and adds a kselftest to validate the new behavior.

Sequence diagram for hugetlb folio allocation with memcg accounting

sequenceDiagram
    participant Task as current_task
    participant Memcg as mem_cgroup
    participant Hugetlb as alloc_hugetlb_folio

    Task->>Hugetlb: alloc_hugetlb_folio(vma, addr, avoid_reserve)
    Hugetlb->>Memcg: get_mem_cgroup_from_current()
    Memcg-->>Hugetlb: memcg
    Hugetlb->>Memcg: mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages)
    alt charge_failed
        Memcg-->>Hugetlb: -ENOMEM
        Hugetlb->>Memcg: mem_cgroup_put(memcg)
        Hugetlb-->>Task: ERR_PTR(-ENOMEM)
    else charge_ok_or_not_supported
        Memcg-->>Hugetlb: 0 or -EOPNOTSUPP
        Hugetlb->>Hugetlb: vma_needs_reservation()
        alt reservation_or_subpool_fail
            Hugetlb->>Memcg: mem_cgroup_cancel_charge(memcg, nr_pages) [if charge_ok]
            Hugetlb->>Memcg: mem_cgroup_put(memcg)
            Hugetlb-->>Task: ERR_PTR(-ENOMEM or -ENOSPC)
        else hugetlb_folio_allocated
            Hugetlb-->>Task: folio
            opt charge_ok
                Hugetlb->>Memcg: mem_cgroup_commit_charge(folio, memcg)
            end
            Hugetlb->>Memcg: mem_cgroup_put(memcg)
        end
    end

    rect rgb(230,230,230)
    Note over Hugetlb,Memcg: On free_huge_folio(folio): mem_cgroup_uncharge(folio)
    end

Flow diagram for enabling hugetlb memcg accounting via cgroup v2 mount option

flowchart LR
    A[cgroup2_parse_param]
    B[ctx->flags &#124;= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
    C[apply_cgroup_root_flags]
    D[cgrp_dfl_root.flags &#124;= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
    E[mem_cgroup_hugetlb_try_charge]
    F[hugetlb memcg accounting active]

    A -->|memory_hugetlb_accounting| B --> C --> D --> E
    E -->|flag set and v2 memcg| F
    E -->|flag not set or !v2| F_no[skip memcg charge, return -EOPNOTSUPP]

File-Level Changes

Change	Details	Files
Introduce reusable memcg helpers for charging, committing, canceling, and migrating folio charges, and separate page-cache replacement from generic migration semantics.	Add get_mem_cgroup_from_current() to safely grab current task’s memcg under RCU with css_tryget retry loop. Expose mem_cgroup_cancel_charge() and mem_cgroup_commit_charge() as exported helpers and refactor charge_memcg() to use them. Add mem_cgroup_hugetlb_try_charge() that conditionally charges hugetlb allocations only when v2 + memory_hugetlb_accounting are enabled and returns -EOPNOTSUPP/-ENOMEM appropriately. Split mem_cgroup_migrate() into mem_cgroup_replace_folio() (page-cache replacement that moves charge and stats) and a new mem_cgroup_migrate() that only transfers memcg metadata without touching counters, updating callers accordingly. Provide stub inline implementations of the new helpers for !CONFIG_MEMCG builds.	`mm/memcontrol.c` `include/linux/memcontrol.h` `mm/filemap.c`
Add cgroup v2 mount option and feature flag to control hugetlb accounting by the memory controller and document the behavior.	Extend cgroup2 fs parameter parsing to accept memory_hugetlb_accounting and plumb it through apply_cgroup_root_flags() to toggle CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING. Expose the new feature in the cgroup root feature set and seqfs show-options output so users can detect and introspect the knob. Introduce CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING flag in the cgroup root flags enum and update documentation to describe the new mount option and its caveats. Register the new selftest and maintainer entries, and ignore the compiled test binary in git.	`kernel/cgroup/cgroup.c` `include/linux/cgroup-defs.h` `Documentation/admin-guide/cgroup-v2.rst` `MAINTAINERS` `tools/testing/selftests/cgroup/.gitignore` `tools/testing/selftests/cgroup/Makefile`
Charge hugetlb folios to memcgs at allocation/free time, integrate with hugetlb_cgroup, and fix migration/large-folio corner cases.	In alloc_hugetlb_folio(), obtain current’s memcg, attempt a pre-charge via mem_cgroup_hugetlb_try_charge() with hugetlb-appropriate GFP mask, and on success commit the charge after a folio is allocated, canceling and dropping the reference on every failure path. Ensure free_huge_folio() uncharges the associated memcg when a hugetlb folio is freed so memory.current reflects hugetlb usage. Remove the minimum-order restriction from hugetlb cgroup charge/uncharge paths and always create hugetlb cgroup files per hstate, allowing tracking of smaller hugepage sizes. Tighten hugetlb hstate validation by requiring the order to be at least order_base_2(__NR_USED_SUBPAGE) instead of non-zero, matching metadata layout constraints. Update folio migration to always call mem_cgroup_migrate(), including hugetlb folios, and to carefully undo large-folio rmappable state while ref-frozen so that deferred split queues and memcg data stay consistent during migration.	`mm/hugetlb.c` `mm/hugetlb_cgroup.c` `include/linux/hugetlb_cgroup.h` `mm/migrate.c`
Add a kselftest that verifies hugetlb usage is reflected in memory.current when memory_hugetlb_accounting is enabled on a cgroup v2 hierarchy.	Introduce test_hugetlb_memcg.c which sets up a dedicated cgroup with memory.max and memory.swap.max, adjusts nr_hugepages, and then mmaps/reads/writes/unmaps a 2MB-hugetlb-backed range while sampling memory.current to ensure usage deltas roughly match 2MB and 8MB steps and then return to baseline. Use proc_mount_contains() to skip when the cgroup v2 mount doesn’t have memory_hugetlb_accounting, verify the host uses 2MB hugepages via /proc/meminfo, and integrate with existing cgroup_util helpers and kselftest result reporting. Wire the new test into the cgroup selftests build via the Makefile and ignore its binary in .gitignore.	`tools/testing/selftests/cgroup/test_hugetlb_memcg.c` `tools/testing/selftests/cgroup/Makefile` `tools/testing/selftests/cgroup/.gitignore`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
- The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.

## Individual Comments

### Comment 1
<location path="tools/testing/selftests/cgroup/test_hugetlb_memcg.c" line_range="103" />
<code_context>
+	int ret = EXIT_FAILURE;
+
+	old_current = cg_read_long(test_group, "memory.current");
+	set_nr_hugepages(20);
+	current = cg_read_long(test_group, "memory.current");
+	if (current - old_current >= MB(2)) {
</code_context>
<issue_to_address>
**suggestion (testing):** Preserve and restore nr_hugepages so the test does not leak global state into other tests

Since this is a global kernel setting, please read and store the original `/proc/sys/vm/nr_hugepages` value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.

Suggested implementation:

```c
static long get_nr_hugepages(void)
{
	FILE *f;
	long val = -1;

	f = fopen("/proc/sys/vm/nr_hugepages", "r");
	if (!f)
		return -1;

	if (fscanf(f, "%ld", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}

static int hugetlb_test_program(const char *cgroup, void *arg)

```

```c
{
	char *test_group = (char *)arg;
	void *addr;
	long old_current, expected_current, current;
	long old_nr_hugepages;
	int ret = EXIT_FAILURE;

	old_nr_hugepages = get_nr_hugepages();
	if (old_nr_hugepages < 0) {
		ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n");
		return EXIT_FAILURE;
	}

	old_current = cg_read_long(test_group, "memory.current");

```

```c
	set_nr_hugepages(20);
	current = cg_read_long(test_group, "memory.current");
	if (current - old_current >= MB(2)) {
		ksft_print_msg(
			"setting nr_hugepages should not increase hugepage usage.\n");
		ksft_print_msg("before: %ld, after: %ld\n", old_current, current);
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
	if (addr == MAP_FAILED) {
		ksft_print_msg("fail to mmap.\n");
		set_nr_hugepages(old_nr_hugepages);
		return EXIT_FAILURE;

```

To fully avoid leaking global state, you should also:

1. Restore `nr_hugepages` on all other exit paths in `hugetlb_test_program`:
   - Any other `return` statements later in the function should be updated to call `set_nr_hugepages(old_nr_hugepages);` immediately before returning.
2. If the test has a "success" path at the end of the function (e.g. `ret = 0;` / `return ret;`), ensure that `set_nr_hugepages(old_nr_hugepages);` is called there as well before returning.
3. If the test can be skipped via `ksft_test_result_skip()` or similar, make sure `nr_hugepages` is restored before that path returns from `hugetlb_test_program`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-06-16T08:53:06Z

+	int ret = EXIT_FAILURE;
+
+	old_current = cg_read_long(test_group, "memory.current");
+	set_nr_hugepages(20);


suggestion (testing): Preserve and restore nr_hugepages so the test does not leak global state into other tests

Since this is a global kernel setting, please read and store the original /proc/sys/vm/nr_hugepages value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.

Suggested implementation:

static long get_nr_hugepages(void) { FILE *f; long val = -1; f = fopen("/proc/sys/vm/nr_hugepages", "r"); if (!f) return -1; if (fscanf(f, "%ld", &val) != 1) val = -1; fclose(f); return val; } static int hugetlb_test_program(const char *cgroup, void *arg)

{ char *test_group = (char *)arg; void *addr; long old_current, expected_current, current; long old_nr_hugepages; int ret = EXIT_FAILURE; old_nr_hugepages = get_nr_hugepages(); if (old_nr_hugepages < 0) { ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n"); return EXIT_FAILURE; } old_current = cg_read_long(test_group, "memory.current");

set_nr_hugepages(20); current = cg_read_long(test_group, "memory.current"); if (current - old_current >= MB(2)) { ksft_print_msg( "setting nr_hugepages should not increase hugepage usage.\n"); ksft_print_msg("before: %ld, after: %ld\n", old_current, current); set_nr_hugepages(old_nr_hugepages); return EXIT_FAILURE; } addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0); if (addr == MAP_FAILED) { ksft_print_msg("fail to mmap.\n"); set_nr_hugepages(old_nr_hugepages); return EXIT_FAILURE;

To fully avoid leaking global state, you should also:

Restore nr_hugepages on all other exit paths in hugetlb_test_program:

Any other return statements later in the function should be updated to call set_nr_hugepages(old_nr_hugepages); immediately before returning.

If the test has a "success" path at the end of the function (e.g. ret = 0; / return ret;), ensure that set_nr_hugepages(old_nr_hugepages); is called there as well before returning.

If the test can be skipped via ksft_test_result_skip() or similar, make sure nr_hugepages is restored before that path returns from hugetlb_test_program.

mainline inclusion from mainline-v6.7 category: bugfix When running autonuma with enabling multi-size THP, I encountered the following kernel crash issue: [ 134.290216] list_del corruption. prev->next should be fffff9ad42e1c490, but was dead000000000100. (prev=fffff9ad42399890) [ 134.290877] kernel BUG at lib/list_debug.c:62! [ 134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted: G E 6.7.0-rc4+ deepin-community#20 [ 134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0 ...... [ 134.294252] Call Trace: [ 134.294362] <TASK> [ 134.294440] ? die+0x33/0x90 [ 134.294561] ? do_trap+0xe0/0x110 ...... [ 134.295681] ? __list_del_entry_valid_or_report+0x97/0xb0 [ 134.295842] folio_undo_large_rmappable+0x99/0x100 [ 134.296003] destroy_large_folio+0x68/0x70 [ 134.296172] migrate_folio_move+0x12e/0x260 [ 134.296264] ? __pfx_remove_migration_pte+0x10/0x10 [ 134.296389] migrate_pages_batch+0x495/0x6b0 [ 134.296523] migrate_pages+0x1d0/0x500 [ 134.296646] ? __pfx_alloc_misplaced_dst_folio+0x10/0x10 [ 134.296799] migrate_misplaced_folio+0x12d/0x2b0 [ 134.296953] do_numa_page+0x1f4/0x570 [ 134.297121] __handle_mm_fault+0x2b0/0x6c0 [ 134.297254] handle_mm_fault+0x107/0x270 [ 134.300897] do_user_addr_fault+0x167/0x680 [ 134.304561] exc_page_fault+0x65/0x140 [ 134.307919] asm_exc_page_fault+0x22/0x30 The reason for the crash is that, the commit 85ce2c5 ("memcontrol: only transfer the memcg data for migration") removed the charging and uncharging operations of the migration folios and cleared the memcg data of the old folio. During the subsequent release process of the old large folio in destroy_large_folio(), if the large folio needs to be removed from the split queue, an incorrect split queue can be obtained (which is pgdat->deferred_split_queue) because the old folio's memcg is NULL now. This can lead to list operations being performed under the wrong split queue lock protection, resulting in a list crash as above. After the migration, the old folio is going to be freed, so we can remove it from the split queue in mem_cgroup_migrate() a bit earlier before clearing the memcg data to avoid getting incorrect split queue. [akpm@linux-foundation.org: fix comment, per Zi Yan] Link: https://lkml.kernel.org/r/61273e5e9b490682388377c20f52d19de4a80460.1703054559.git.baolin.wang@linux.alibaba.com Fixes: 85ce2c5 ("memcontrol: only transfer the memcg data for migration") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [mm: always initialise folio->_deferred_list merged before] [Rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()] Conflicts: mm/huge_memory.c (cherry picked from commit 9bcef59) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

mainline inclusion from mainline-v6.10 category: bugfix Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on flags when freeing, yet the flags shown are not bad: PG_locked had been set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN symptoms implying double free by deferred split and large folio migration. 6.7 commit 9bcef59 ("mm: memcg: fix split queue list crash when large folio migration") was right to fix the memcg-dependent locking broken in 85ce2c5 ("memcontrol: only transfer the memcg data for migration"), but missed a subtlety of deferred_split_scan(): it moves folios to its own local list to work on them without split_queue_lock, during which time folio->_deferred_list is not empty, but even the "right" lock does nothing to secure the folio and the list it is on. Fortunately, deferred_split_scan() is careful to use folio_try_get(): so folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() while the old folio's reference count is temporarily frozen to 0 - adding such a freeze in the !mapping case too (originally, folio lock and unmapping and no swap cache left an anon folio unreachable, so no freezing was needed there: but the deferred split queue offers a way to reach it). Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com Fixes: 9bcef59 ("mm: memcg: fix split queue list crash when large folio migration") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()] (cherry picked from commit be9581e) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

Avenger-285714 · 2026-06-16T11:37:09Z

 TEST_GEN_PROGS += test_cpu
 TEST_GEN_PROGS += test_cpuset
 TEST_GEN_PROGS += test_zswap
+TEST_GEN_PROGS += test_hugetlb_memcg


@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.

@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.

I will take time to design and deploy it.

deepin-ci-robot · 2026-06-16T11:37:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~deepin/OWNERS~~ [Avenger-285714]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR adds optional memory-controller (memcg) accounting for HugeTLB-backed memory on cgroup v2, aligns migration/replacement paths to handle memcg metadata correctly (including HugeTLB folios), and introduces a selftest to validate the new accounting behavior.

Changes:

Add a cgroup v2 mount option (memory_hugetlb_accounting) and feature reporting to opt into HugeTLB accounting in the memory controller.
Implement memcg charge/uncharge and migration metadata transfer for HugeTLB folios; split “replacement” vs “migration” memcg handling for folios.
Add a cgroup selftest to ensure HugeTLB usage is reflected in memory.current when the mount option is enabled.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
Documentation/admin-guide/cgroup-v2.rst	Documents the new `memory_hugetlb_accounting` mount option and caveats.
MAINTAINERS	Adds the new selftest to relevant maintainer entries.
include/linux/cgroup-defs.h	Introduces `CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING` root flag.
include/linux/memcontrol.h	Declares new memcg helpers for hugetlb charging and folio replacement/migration.
kernel/cgroup/cgroup.c	Parses, applies, shows, and reports the new cgroup v2 mount option/feature.
mm/filemap.c	Switches page-cache replacement to the new `mem_cgroup_replace_folio()` helper.
mm/hugetlb.c	Adds memcg charge on HugeTLB allocation and uncharge on free; tightens hstate order validation.
mm/memcontrol.c	Adds/exports memcg charge commit/cancel helpers, current-task memcg getter, and correct migration metadata transfer.
mm/migrate.c	Ensures memcg data is migrated for all folios and adjusts deferred-split queue handling.
tools/testing/selftests/cgroup/.gitignore	Ignores the newly built `test_hugetlb_memcg` binary.
tools/testing/selftests/cgroup/Makefile	Builds the new HugeTLB memcg selftest.
tools/testing/selftests/cgroup/test_hugetlb_memcg.c	New selftest validating HugeTLB accounting against `memory.current`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  memory_hugetlb_accounting
+        Count HugeTLB memory usage towards the cgroup's overall
+        memory usage for the memory controller (for the purpose of
+        statistics reporting and memory protetion). This is a new


+	 * For e.g, itt could have been allocated when memory_hugetlb_accounting
+	 * was not selected.


+
+	/*
+	 * Enable hugetlb accounting for the memory controller.
+	 */
+	 CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),


+#include <linux/limits.h>
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include "../kselftest.h"
+#include "cgroup_util.h"


+	val = strtol(p, &q, 0);
+	if (*q != ' ') {
+		/* Error parsing the file */
+		return -1;
+	}


+	old_current = cg_read_long(test_group, "memory.current");
+	set_nr_hugepages(20);
+	current = cg_read_long(test_group, "memory.current");
+	if (current - old_current >= MB(2)) {


Frank van der Linden and others added 5 commits June 16, 2026 16:45

deepin-ci-robot requested review from huangbibo and myml June 16, 2026 08:50

sourcery-ai Bot reviewed Jun 16, 2026

View reviewed changes

Baolin Wang and others added 2 commits June 16, 2026 17:25

opsiff force-pushed the linux-6.6.y-2026-06-16-hugetlb-2 branch from 2a4c876 to 225eb56 Compare June 16, 2026 09:27

Avenger-285714 requested review from Avenger-285714 and Copilot June 16, 2026 11:35

Copilot started reviewing on behalf of Avenger-285714 June 16, 2026 11:35 View session

Avenger-285714 approved these changes Jun 16, 2026

View reviewed changes

deepin-ci-robot added the approved label Jun 16, 2026

Copilot AI reviewed Jun 16, 2026

View reviewed changes

opsiff merged commit 09490f6 into deepin-community:linux-6.6.y Jun 16, 2026
14 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes#1867

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes#1867
opsiff merged 7 commits into
deepin-community:linux-6.6.yfrom
opsiff:linux-6.6.y-2026-06-16-hugetlb-2

opsiff commented Jun 16, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented Jun 16, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Jun 16, 2026

Uh oh!

Avenger-285714 Jun 16, 2026

Uh oh!

opsiff Jun 16, 2026

Uh oh!

deepin-ci-robot commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

		* For e.g, itt could have been allocated when memory_hugetlb_accounting
		* was not selected.

Conversation

opsiff commented Jun 16, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for hugetlb folio allocation with memcg accounting

Flow diagram for enabling hugetlb memcg accounting via cgroup v2 mount option

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Avenger-285714 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

opsiff Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

deepin-ci-robot commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

opsiff commented Jun 16, 2026 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Jun 16, 2026 •

edited

Loading