[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] hugetlb memcg accounting and bugfixes#1867
Conversation
mainline inclusion from mainline-v6.7-rc1 category: performance Originally, hugetlb_cgroup was the only hugetlb user of tail page structure fields. So, the code defined and checked against HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use. However, by now, tail page #2 is used to store hugetlb hwpoison and subpool information as well. In other words, without that tail page hugetlb doesn't work. Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and checks against it. Instead, just check for the minimum viable page order at hstate creation time. Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 59838b2) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance Patch series "hugetlb memcg accounting", v4. Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etcetera fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch series rectifies this issue by charging the memcg when the hugetlb folio is allocated, and uncharging when the folio is freed. In addition, a new selftest is added to demonstrate and verify this new behavior. This patch (of 4): This patch exposes charge committing and cancelling as parts of the memory controller interface. These functionalities are useful when the try_charge() and commit_charge() stages have to be separated by other actions in between (which can fail). One such example is the new hugetlb accounting behavior in the following patch. The patch also adds a helper function to obtain a reference to the current task's memcg. Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 4b56938) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance For most migration use cases, only transfer the memcg data from the old folio to the new folio, and clear the old folio's memcg data. No charging and uncharging will be done. This shaves off some work on the migration path, and avoids the temporary double charging of a folio during its migration. The only exception is replace_page_cache_folio(), which will use the old mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that context, the isolation of the old page isn't quite as thorough as with migration, so we cannot use our new implementation directly. This patch is the result of the following discussion on the new hugetlb memcg accounting behavior: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/ Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 85ce2c5) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8cba957)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance This patch add a new kselftest to demonstrate and verify the new hugetlb memcg accounting behavior. Link: https://lkml.kernel.org/r/20231006184629.155543-5-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit c0dddb7) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Reviewer's GuideImplements hugetlb-backed memory accounting in the memory controller for cgroup v2, adds a mount-time feature flag to enable it, refactors memcg charging/migration helpers to support hugetlb folios correctly (including migration paths and large-folio handling), tightens hugetlb/hugetlb_cgroup invariants, and adds a kselftest to validate the new behavior. Sequence diagram for hugetlb folio allocation with memcg accountingsequenceDiagram
participant Task as current_task
participant Memcg as mem_cgroup
participant Hugetlb as alloc_hugetlb_folio
Task->>Hugetlb: alloc_hugetlb_folio(vma, addr, avoid_reserve)
Hugetlb->>Memcg: get_mem_cgroup_from_current()
Memcg-->>Hugetlb: memcg
Hugetlb->>Memcg: mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages)
alt charge_failed
Memcg-->>Hugetlb: -ENOMEM
Hugetlb->>Memcg: mem_cgroup_put(memcg)
Hugetlb-->>Task: ERR_PTR(-ENOMEM)
else charge_ok_or_not_supported
Memcg-->>Hugetlb: 0 or -EOPNOTSUPP
Hugetlb->>Hugetlb: vma_needs_reservation()
alt reservation_or_subpool_fail
Hugetlb->>Memcg: mem_cgroup_cancel_charge(memcg, nr_pages) [if charge_ok]
Hugetlb->>Memcg: mem_cgroup_put(memcg)
Hugetlb-->>Task: ERR_PTR(-ENOMEM or -ENOSPC)
else hugetlb_folio_allocated
Hugetlb-->>Task: folio
opt charge_ok
Hugetlb->>Memcg: mem_cgroup_commit_charge(folio, memcg)
end
Hugetlb->>Memcg: mem_cgroup_put(memcg)
end
end
rect rgb(230,230,230)
Note over Hugetlb,Memcg: On free_huge_folio(folio): mem_cgroup_uncharge(folio)
end
Flow diagram for enabling hugetlb memcg accounting via cgroup v2 mount optionflowchart LR
A[cgroup2_parse_param]
B[ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
C[apply_cgroup_root_flags]
D[cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING]
E[mem_cgroup_hugetlb_try_charge]
F[hugetlb memcg accounting active]
A -->|memory_hugetlb_accounting| B --> C --> D --> E
E -->|flag set and v2 memcg| F
E -->|flag not set or !v2| F_no[skip memcg charge, return -EOPNOTSUPP]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
- The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In alloc_hugetlb_folio(), get_mem_cgroup_from_current() can return NULL (e.g. memcg disabled), but you unconditionally call mem_cgroup_put(memcg) on all paths; this should be guarded so mem_cgroup_put() is only called when memcg is non-NULL.
- The selftest unconditionally sets /proc/sys/vm/nr_hugepages to 20 and never restores the previous value, which can leave the test environment mutated; consider saving and restoring the original sysctl value around the test.
## Individual Comments
### Comment 1
<location path="tools/testing/selftests/cgroup/test_hugetlb_memcg.c" line_range="103" />
<code_context>
+ int ret = EXIT_FAILURE;
+
+ old_current = cg_read_long(test_group, "memory.current");
+ set_nr_hugepages(20);
+ current = cg_read_long(test_group, "memory.current");
+ if (current - old_current >= MB(2)) {
</code_context>
<issue_to_address>
**suggestion (testing):** Preserve and restore nr_hugepages so the test does not leak global state into other tests
Since this is a global kernel setting, please read and store the original `/proc/sys/vm/nr_hugepages` value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.
Suggested implementation:
```c
static long get_nr_hugepages(void)
{
FILE *f;
long val = -1;
f = fopen("/proc/sys/vm/nr_hugepages", "r");
if (!f)
return -1;
if (fscanf(f, "%ld", &val) != 1)
val = -1;
fclose(f);
return val;
}
static int hugetlb_test_program(const char *cgroup, void *arg)
```
```c
{
char *test_group = (char *)arg;
void *addr;
long old_current, expected_current, current;
long old_nr_hugepages;
int ret = EXIT_FAILURE;
old_nr_hugepages = get_nr_hugepages();
if (old_nr_hugepages < 0) {
ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n");
return EXIT_FAILURE;
}
old_current = cg_read_long(test_group, "memory.current");
```
```c
set_nr_hugepages(20);
current = cg_read_long(test_group, "memory.current");
if (current - old_current >= MB(2)) {
ksft_print_msg(
"setting nr_hugepages should not increase hugepage usage.\n");
ksft_print_msg("before: %ld, after: %ld\n", old_current, current);
set_nr_hugepages(old_nr_hugepages);
return EXIT_FAILURE;
}
addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
if (addr == MAP_FAILED) {
ksft_print_msg("fail to mmap.\n");
set_nr_hugepages(old_nr_hugepages);
return EXIT_FAILURE;
```
To fully avoid leaking global state, you should also:
1. Restore `nr_hugepages` on all other exit paths in `hugetlb_test_program`:
- Any other `return` statements later in the function should be updated to call `set_nr_hugepages(old_nr_hugepages);` immediately before returning.
2. If the test has a "success" path at the end of the function (e.g. `ret = 0;` / `return ret;`), ensure that `set_nr_hugepages(old_nr_hugepages);` is called there as well before returning.
3. If the test can be skipped via `ksft_test_result_skip()` or similar, make sure `nr_hugepages` is restored before that path returns from `hugetlb_test_program`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| int ret = EXIT_FAILURE; | ||
|
|
||
| old_current = cg_read_long(test_group, "memory.current"); | ||
| set_nr_hugepages(20); |
There was a problem hiding this comment.
suggestion (testing): Preserve and restore nr_hugepages so the test does not leak global state into other tests
Since this is a global kernel setting, please read and store the original /proc/sys/vm/nr_hugepages value at the start of the test and restore it on all exit paths (success, failure, and skip) to avoid cross-test interference.
Suggested implementation:
static long get_nr_hugepages(void)
{
FILE *f;
long val = -1;
f = fopen("/proc/sys/vm/nr_hugepages", "r");
if (!f)
return -1;
if (fscanf(f, "%ld", &val) != 1)
val = -1;
fclose(f);
return val;
}
static int hugetlb_test_program(const char *cgroup, void *arg){
char *test_group = (char *)arg;
void *addr;
long old_current, expected_current, current;
long old_nr_hugepages;
int ret = EXIT_FAILURE;
old_nr_hugepages = get_nr_hugepages();
if (old_nr_hugepages < 0) {
ksft_print_msg("failed to read /proc/sys/vm/nr_hugepages\n");
return EXIT_FAILURE;
}
old_current = cg_read_long(test_group, "memory.current"); set_nr_hugepages(20);
current = cg_read_long(test_group, "memory.current");
if (current - old_current >= MB(2)) {
ksft_print_msg(
"setting nr_hugepages should not increase hugepage usage.\n");
ksft_print_msg("before: %ld, after: %ld\n", old_current, current);
set_nr_hugepages(old_nr_hugepages);
return EXIT_FAILURE;
}
addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
if (addr == MAP_FAILED) {
ksft_print_msg("fail to mmap.\n");
set_nr_hugepages(old_nr_hugepages);
return EXIT_FAILURE;To fully avoid leaking global state, you should also:
- Restore
nr_hugepageson all other exit paths inhugetlb_test_program:- Any other
returnstatements later in the function should be updated to callset_nr_hugepages(old_nr_hugepages);immediately before returning.
- Any other
- If the test has a "success" path at the end of the function (e.g.
ret = 0;/return ret;), ensure thatset_nr_hugepages(old_nr_hugepages);is called there as well before returning. - If the test can be skipped via
ksft_test_result_skip()or similar, make surenr_hugepagesis restored before that path returns fromhugetlb_test_program.
mainline inclusion from mainline-v6.7 category: bugfix When running autonuma with enabling multi-size THP, I encountered the following kernel crash issue: [ 134.290216] list_del corruption. prev->next should be fffff9ad42e1c490, but was dead000000000100. (prev=fffff9ad42399890) [ 134.290877] kernel BUG at lib/list_debug.c:62! [ 134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted: G E 6.7.0-rc4+ deepin-community#20 [ 134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0 ...... [ 134.294252] Call Trace: [ 134.294362] <TASK> [ 134.294440] ? die+0x33/0x90 [ 134.294561] ? do_trap+0xe0/0x110 ...... [ 134.295681] ? __list_del_entry_valid_or_report+0x97/0xb0 [ 134.295842] folio_undo_large_rmappable+0x99/0x100 [ 134.296003] destroy_large_folio+0x68/0x70 [ 134.296172] migrate_folio_move+0x12e/0x260 [ 134.296264] ? __pfx_remove_migration_pte+0x10/0x10 [ 134.296389] migrate_pages_batch+0x495/0x6b0 [ 134.296523] migrate_pages+0x1d0/0x500 [ 134.296646] ? __pfx_alloc_misplaced_dst_folio+0x10/0x10 [ 134.296799] migrate_misplaced_folio+0x12d/0x2b0 [ 134.296953] do_numa_page+0x1f4/0x570 [ 134.297121] __handle_mm_fault+0x2b0/0x6c0 [ 134.297254] handle_mm_fault+0x107/0x270 [ 134.300897] do_user_addr_fault+0x167/0x680 [ 134.304561] exc_page_fault+0x65/0x140 [ 134.307919] asm_exc_page_fault+0x22/0x30 The reason for the crash is that, the commit 85ce2c5 ("memcontrol: only transfer the memcg data for migration") removed the charging and uncharging operations of the migration folios and cleared the memcg data of the old folio. During the subsequent release process of the old large folio in destroy_large_folio(), if the large folio needs to be removed from the split queue, an incorrect split queue can be obtained (which is pgdat->deferred_split_queue) because the old folio's memcg is NULL now. This can lead to list operations being performed under the wrong split queue lock protection, resulting in a list crash as above. After the migration, the old folio is going to be freed, so we can remove it from the split queue in mem_cgroup_migrate() a bit earlier before clearing the memcg data to avoid getting incorrect split queue. [akpm@linux-foundation.org: fix comment, per Zi Yan] Link: https://lkml.kernel.org/r/61273e5e9b490682388377c20f52d19de4a80460.1703054559.git.baolin.wang@linux.alibaba.com Fixes: 85ce2c5 ("memcontrol: only transfer the memcg data for migration") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [mm: always initialise folio->_deferred_list merged before] [Rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()] Conflicts: mm/huge_memory.c (cherry picked from commit 9bcef59) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.10 category: bugfix Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on flags when freeing, yet the flags shown are not bad: PG_locked had been set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN symptoms implying double free by deferred split and large folio migration. 6.7 commit 9bcef59 ("mm: memcg: fix split queue list crash when large folio migration") was right to fix the memcg-dependent locking broken in 85ce2c5 ("memcontrol: only transfer the memcg data for migration"), but missed a subtlety of deferred_split_scan(): it moves folios to its own local list to work on them without split_queue_lock, during which time folio->_deferred_list is not empty, but even the "right" lock does nothing to secure the folio and the list it is on. Fortunately, deferred_split_scan() is careful to use folio_try_get(): so folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() while the old folio's reference count is temporarily frozen to 0 - adding such a freeze in the !mapping case too (originally, folio lock and unmapping and no swap cache left an anon folio unreachable, so no freezing was needed there: but the deferred split queue offers a way to reach it). Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com Fixes: 9bcef59 ("mm: memcg: fix split queue list crash when large folio migration") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [rename folio_undo_large_rmappable() to folio_unqueue_deferred_split()] (cherry picked from commit be9581e) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
2a4c876 to
225eb56
Compare
| TEST_GEN_PROGS += test_cpu | ||
| TEST_GEN_PROGS += test_cpuset | ||
| TEST_GEN_PROGS += test_zswap | ||
| TEST_GEN_PROGS += test_hugetlb_memcg |
There was a problem hiding this comment.
@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.
There was a problem hiding this comment.
@opsiff I'd like to take this opportunity to discuss the feasibility of adding a CI job that runs the kselftests against the built kernel (the new kernel) to verify they pass.
I will take time to design and deploy it.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Avenger-285714 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR adds optional memory-controller (memcg) accounting for HugeTLB-backed memory on cgroup v2, aligns migration/replacement paths to handle memcg metadata correctly (including HugeTLB folios), and introduces a selftest to validate the new accounting behavior.
Changes:
- Add a cgroup v2 mount option (
memory_hugetlb_accounting) and feature reporting to opt into HugeTLB accounting in the memory controller. - Implement memcg charge/uncharge and migration metadata transfer for HugeTLB folios; split “replacement” vs “migration” memcg handling for folios.
- Add a cgroup selftest to ensure HugeTLB usage is reflected in
memory.currentwhen the mount option is enabled.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| Documentation/admin-guide/cgroup-v2.rst | Documents the new memory_hugetlb_accounting mount option and caveats. |
| MAINTAINERS | Adds the new selftest to relevant maintainer entries. |
| include/linux/cgroup-defs.h | Introduces CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING root flag. |
| include/linux/memcontrol.h | Declares new memcg helpers for hugetlb charging and folio replacement/migration. |
| kernel/cgroup/cgroup.c | Parses, applies, shows, and reports the new cgroup v2 mount option/feature. |
| mm/filemap.c | Switches page-cache replacement to the new mem_cgroup_replace_folio() helper. |
| mm/hugetlb.c | Adds memcg charge on HugeTLB allocation and uncharge on free; tightens hstate order validation. |
| mm/memcontrol.c | Adds/exports memcg charge commit/cancel helpers, current-task memcg getter, and correct migration metadata transfer. |
| mm/migrate.c | Ensures memcg data is migrated for all folios and adjusts deferred-split queue handling. |
| tools/testing/selftests/cgroup/.gitignore | Ignores the newly built test_hugetlb_memcg binary. |
| tools/testing/selftests/cgroup/Makefile | Builds the new HugeTLB memcg selftest. |
| tools/testing/selftests/cgroup/test_hugetlb_memcg.c | New selftest validating HugeTLB accounting against memory.current. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| memory_hugetlb_accounting | ||
| Count HugeTLB memory usage towards the cgroup's overall | ||
| memory usage for the memory controller (for the purpose of | ||
| statistics reporting and memory protetion). This is a new |
| * For e.g, itt could have been allocated when memory_hugetlb_accounting | ||
| * was not selected. |
|
|
||
| /* | ||
| * Enable hugetlb accounting for the memory controller. | ||
| */ | ||
| CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), |
| #include <linux/limits.h> | ||
| #include <sys/mman.h> | ||
| #include <stdio.h> | ||
| #include <stdlib.h> | ||
| #include <string.h> | ||
| #include <fcntl.h> | ||
| #include "../kselftest.h" | ||
| #include "cgroup_util.h" |
| val = strtol(p, &q, 0); | ||
| if (*q != ' ') { | ||
| /* Error parsing the file */ | ||
| return -1; | ||
| } |
| old_current = cg_read_long(test_group, "memory.current"); | ||
| set_nr_hugepages(20); | ||
| current = cg_read_long(test_group, "memory.current"); | ||
| if (current - old_current >= MB(2)) { |
Link: https://lore.kernel.org/all/20231006184629.155543-3-nphamcs@gmail.com/T/#m0c97da970b68fd406ecfd5e6cf05b6069442c140
Changelog:
v4:
* Add another prep patch to clean up memory controller migration
logic.
* Fix an issue in hugetlb folio migration where the new folio
is not properly charged (patch 3) (reported by Mike Kravetz)
(suggested by Johannes Weiner).
v3:
* Add a prep patch at the start of the series to extend the memory
controller interface with new helper functions for hugetlb
accounting.
* Do not account hugetlb memory for memcontroller in cgroup v1
(patch 2) (suggested by Johannes Weiner).
* Change the gfp flag passed to mem cgroup charging (patch 2)
(suggested by Michal Hocko).
* Add caveats to cgroup admin guide and commit changelog
(patch 2) (suggested by Michal Hocko).
v2:
* Add a cgroup mount option to enable/disable the new hugetlb memcg
accounting behavior (patch 1) (suggested by Johannes Weiner).
* Add a couple more ksft_print_msg() on error to aid debugging when
the selftest fails. (patch 2)
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic page, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for
e.g when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
Nhat Pham (4):
memcontrol: add helpers for hugetlb memcg accounting
memcontrol: only transfer the memcg data for migration
hugetlb: memcg: account hugetlb-backed memory in memory controller
selftests: add a selftest to verify hugetlb usage in memcg
Documentation/admin-guide/cgroup-v2.rst | 29 +++
MAINTAINERS | 2 +
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 37 +++
kernel/cgroup/cgroup.c | 15 +-
mm/filemap.c | 2 +-
mm/hugetlb.c | 35 ++-
mm/memcontrol.c | 139 +++++++++--
mm/migrate.c | 3 +-
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
.../selftests/cgroup/test_hugetlb_memcg.c | 234 ++++++++++++++++++
12 files changed, 478 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
--
2.34.1
Summary by Sourcery
Enable optional accounting of hugetlb-backed memory in the memory controller and update hugetlb and memcg handling to support correct charging, migration, and reclamation behavior.
New Features:
Bug Fixes:
Enhancements:
Documentation:
Tests: