Skip to content

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] shmem,tmpfs: general maintenance#1870

Merged
opsiff merged 9 commits into
deepin-community:linux-6.6.yfrom
opsiff:linux-6.6.y-2026-06-16-shmem
Jun 16, 2026
Merged

[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] shmem,tmpfs: general maintenance#1870
opsiff merged 9 commits into
deepin-community:linux-6.6.yfrom
opsiff:linux-6.6.y-2026-06-16-shmem

Conversation

@opsiff

@opsiff opsiff commented Jun 16, 2026

Copy link
Copy Markdown
Member

Link: https://lore.kernel.org/all/6fe379a4-6176-9225-9263-fe60d2633c0@google.com/T/#m2b57bb32528d567002772f1ed9973f9b2268bc56

And here is a series of patches based on v6.6-rc3: mostly just cosmetic
mods in mm/shmem.c, but the last two enforcing the "size=" limit better.
8/8 goes into percpu counter territory, and could stand alone: I'll add
some more Cc's on that one.

Applies to any v6.6-rc so far, and to next-20230929 and to
mm-everything-2023-09-29-23-51: hah, there's now an 09-30-01-16,
I haven't tried it yet, but this should be good on that too.

1/8 shmem: shrink shmem_inode_info: dir_offsets in a union
2/8 shmem: remove vma arg from shmem_get_folio_gfp()
3/8 shmem: factor shmem_falloc_wait() out of shmem_fault()
4/8 shmem: trivial tidyups, removing extra blank lines, etc
5/8 shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
6/8 shmem: move memcg charge out of shmem_add_to_page_cache()
7/8 shmem: _add_to_page_cache() before shmem_inode_acct_blocks()
8/8 shmem,percpu_counter: add _limited_add(fbc, limit, amount)

include/linux/percpu_counter.h | 23 ++
include/linux/shmem_fs.h | 16 +-
lib/percpu_counter.c | 53 ++++
mm/shmem.c | 500 +++++++++++++++++------------------
4 files changed, 333 insertions(+), 259 deletions(-)

Hugh

Hugh Dickins added 9 commits June 16, 2026 20:45
mainline inclusion
from mainline-v6.7-rc1
category: performance

Patch series "shmem,tmpfs: general maintenance".

Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the
"size=" limit better.  8/8 goes into percpu counter territory, and could
stand alone.

This patch (of 8):

Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
pahole after stop_eviction, better filled by fsflags.  And the 24-byte
dir_offsets can only be used by directories, whereas shrinklist and
swaplist only by shmem_mapping() inodes (regular files or long symlinks):
so put those into a union.  No change in mm/shmem.c is required for this.

Link: https://lkml.kernel.org/r/c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com
Link: https://lkml.kernel.org/r/86ebb4b-c571-b9e8-27f5-cb82ec50357e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ee615d4)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

The vma is already there in vmf->vma, so no need for a separate arg.

Link: https://lkml.kernel.org/r/d9ce6f65-a2ed-48f4-4299-fdb0544875c5@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit e3e1a50)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

That Trinity livelock shmem_falloc avoidance block is unlikely, and a
distraction from the proper business of shmem_fault(): separate it out.
(This used to help compilers save stack on the fault path too, but both
gcc and clang nowadays seem to make better choices anyway.)

Link: https://lkml.kernel.org/r/6fe379a4-6176-9225-9263-fe60d2633c0@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit f0a9ad1)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Mostly removing a few superfluous blank lines, joining short arglines,
imposing some 80-column observance, correcting a couple of comments.  None
of it more interesting than deleting a repeated INIT_LIST_HEAD().

Link: https://lkml.kernel.org/r/b3983d28-5d3f-8649-36af-b819285d7a9e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 9be7d5b)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

By historical accident, shmem_acct_block() and shmem_inode_acct_block()
were never pluralized when the pages argument was added, despite their
complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks()
all along.  It has been an irritation: fix their naming at last.

Link: https://lkml.kernel.org/r/9124094-e4ab-8be7-ef80-9a87bdc2e4fc@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4199f51)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Extract shmem's memcg charging out of shmem_add_to_page_cache(): it's
misleading done there, because many calls are dealing with a swapcache
page, whose memcg is nowadays always remembered while swapped out, then
the charge re-levied when it's brought back into swapcache.

Temporarily move it back up to the shmem_get_folio_gfp() level, where the
memcg was charged before v5.8; but the next commit goes on to move it back
down to a new home.

In making this change, it becomes clear that shmem_swapin_folio() does not
need to know the vma, just the fault mm (if any): call it fault_mm rather
than charge_mm - let mem_cgroup_charge() decide whom to charge.

Link: https://lkml.kernel.org/r/4b2143c5-bf32-64f0-841-81a81158dac@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 054a9f7)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
EFAULT) even though all allocations were within the permitted size.

This was a problem since early days, but magnified and complicated by the
addition of huge pages.  We have often worked around it by adding some
slop to the tmpfs size, but it's hard to say how much is needed, and some
users prefer not to do that e.g.  keeping sparse files in a tightly
tailored tmpfs helps to prevent accidental writing to holes.

This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).

Concurrent tasks allocating at the same position could deplete the size
allowance and fail.  Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.

1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.

The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.

Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
splits twice (once huge, once nonhuge) is not exactly equivalent to trying
5 splits (and giving up early on huge), but let's keep it simple unless
more complication proves necessary.

Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock.  Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.

Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3022fd7)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit.  Imagine all the other CPUs fallocating tmpfs
huge pages to the limit, in between this CPU's compare and its add.

I have not seen reports of that happening; but tmpfs's recent addition of
dquot_alloc_block_nodirty() in between the compare and the add makes it
even more likely, and I'd be uncomfortable to leave it unfixed.

Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.

I believe this implementation is correct, and slightly more efficient than
the combination of compare and add (taking the lock once rather than twice
when nearing full - the last 128MiB of a tmpfs volume on a machine with
128 CPUs and 4KiB pages); but it does beg for a better design - when
nearing full, there is no new batching, but the costly percpu counter sum
across CPUs still has to be done, while locked.

Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well
as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs?  But the times when it matters would be vanishingly rare.

Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit beb9868)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion
from mainline-v6.7-rc1
category: performance

Though tmpfs does not need it, percpu_counter_limited_add() can be twice
as useful if it works sensibly with negative amounts (subs) - typically
decrements towards a limit of 0 or nearby: as suggested by Dave Chinner.

And in the course of that reworking, skip the percpu counter sum if it is
already obvious that the limit would be passed: as suggested by Tim Chen.

Extend the comment above __percpu_counter_limited_add(), defining the
behaviour with positive and negative amounts, allowing negative limits,
but not bothering about overflow beyond S64_MAX.

Link: https://lkml.kernel.org/r/8f86083b-c452-95d4-365b-f16a2e4ebcd4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 1431996)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @opsiff, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@deepin-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from opsiff. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This patch series performs maintenance and refactoring in tmpfs/shmem, including moving memcg charging logic, tightening/enforcing tmpfs “size=” block accounting, and introducing a new percpu_counter helper to perform bounded adds.

Changes:

  • Refactors shmem folio allocation/add-to-cache paths (including moving memcg charging out of shmem_add_to_page_cache()).
  • Reworks tmpfs/shmem block accounting helpers (*_acct_blocks) and uses a new bounded percpu counter add to enforce max_blocks.
  • Adds __percpu_counter_limited_add() + inline wrapper percpu_counter_limited_add().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
mm/shmem.c Refactors shmem/tmpfs allocation, charging, and accounting flow; introduces bounded used_blocks accounting usage.
lib/percpu_counter.c Adds the new bounded-add implementation for percpu counters.
include/linux/shmem_fs.h Shrinks shmem_inode_info using a union for dir offsets vs shrink/swap lists.
include/linux/percpu_counter.h Declares/introduces the new limited-add API and wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/percpu_counter.c
Comment on lines +357 to +358
return good;
}
Comment thread include/linux/shmem_fs.h
Comment on lines +26 to +32
union {
struct offset_ctx dir_offsets; /* stable directory offsets */
struct {
struct list_head shrinklist; /* shrinkable hpage inodes */
struct list_head swaplist; /* chain of maybes on swap */
};
};
Comment thread mm/shmem.c
*
* vma, vmf, and fault_type are only supplied by shmem_fault:
* otherwise they are NULL.
* vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL.
@opsiff opsiff merged commit 40b729e into deepin-community:linux-6.6.y Jun 16, 2026
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants