Memory Management
Imported from _research/manual-study-linux/memory-management.md.
Status: memory-management volume verified for the core VM/page-fault, VMA, physical-page allocation, reclaim, page-cache, and slab-cache surfaces.
This volume follows the Linux memory subsystem from process address-space metadata, through VMA creation and page faults, into physical page allocation, file-backed page cache behavior, reclaim under pressure, and slab-cache object allocation. The goal is implementation fluency: what the C code does, which locks and lifetimes matter, and what a Rust or AI-native translation must preserve.
Source Surface
Primary reviewed sources:
- include/linux/mm_types.h
- include/linux/mm.h
- mm/mmap.c
- mm/memory.c
- mm/page_alloc.c
- mm/filemap.c
- mm/vmscan.c
- mm/slab_common.c
- Documentation/admin-guide/mm/concepts.rst
- Documentation/core-api/memory-allocation.rst
Entry Points
Memory management is not a single syscall path; it is a set of cooperating entry points.
- Address-space layout enters through brk(), mmap_pgoff(), old mmap compatibility, munmap(), stack expansion, and fork-time dup_mmap().
- Page faults enter the core VM through handle_mm_fault() after architecture fault code has identified a VMA and lock state.
- Physical page allocation enters through __alloc_pages_noprof() and frees through __free_pages().
- File-backed cache access enters through filemap_read(), generic_file_read_iter(), filemap_fault(), filemap_map_pages(), and generic_perform_write().
- Reclaim enters through direct reclaim and memcg reclaim, while background pressure wakes kswapd() through wakeup_kswapd().
- Slab caches are created and destroyed through __kmem_cache_create_args(), create_cache(), and kmem_cache_destroy().
Core Data Structures
Linux represents virtual memory with mm_struct, vm_area_struct, page-table
levels, and fault descriptors. mm_struct owns process-level address-space
state, including VMA indexing and mmap_lock. vm_area_struct describes a
contiguous virtual range: permissions, flags, file backing, anonymous-memory
metadata, VMA operations, optional per-VMA lock state, and tree linkage.
struct vm_fault is the page-fault work item. It carries the target VMA,
allocation mask, logical page offset, faulting address, fault flags, original
PTE/PMD state, page-table lock, optional COW page, and returned folio/page
state. vm_fault_t is a bitmask result channel for OOM, SIGBUS/SIGSEGV, retry,
fallback, completed COW, lock-dropping, and storage IO outcomes.
The physical allocator is zone and order based. Zones own buddy free lists, GFP
policy drives allocation permissions, and slowpath code coordinates reclaim,
compaction, reserves, and OOM decisions. The page cache uses an address-space
mapping XArray to index folios by file offset. Reclaim uses struct scan_control to carry target, memcg, permissions, counters, priority, order,
and GFP context. Slab allocation wraps repeated fixed-size kernel objects in
named kmem_cache descriptors with alignment, flags, constructors, merge
rules, and lifecycle accounting.
Control Flow
VMA Creation And Teardown
mmap_pgoff() routes mapping requests into do_mmap(). do_mmap() requires a
write-held mmap_lock, validates length and page offset overflow, checks
map-count limits, asks for an unmapped area, rejects
MAP_FIXED_NOREPLACE collisions, validates locked mappings, separates
file-backed from anonymous policy, handles MAP_NORESERVE, and calls
mmap_region() to install the mapping.
Lookup is maple-tree based. find_vma_intersection() and find_vma() search
mm->mm_mt; find_vma_prev() uses VMA_ITERATOR to retrieve adjacent ranges.
Unmap routes through do_munmap() and do_vmi_munmap(). Fork duplicates the
address space in dup_mmap(): it locks old and new mms, duplicates the maple
tree, skips VM_DONTCOPY, allocates new VMA descriptors, forks anon-vma state,
calls VMA open hooks, inserts file interval-tree nodes, and copies page tables.
Page Faults
handle_mm_fault() is the public core VM entry after architecture code has
resolved the faulting VMA. It sanitizes flags, checks architecture access
permission, enters memcg user-fault handling for user faults, chooses hugetlb or
regular fault handling, and accounts the result.
__handle_mm_fault() builds struct vm_fault, walks and allocates page-table
levels, tries huge PUD/PMD paths where valid, handles migration or
device-private entries, and falls back to handle_pte_fault().
handle_pte_fault() dispatches:
- missing PTEs to do_pte_missing()
- swap entries to do_swap_page()
- NUMA-protected PTEs to do_numa_page()
- write or unshare faults on read-only PTEs to do_wp_page()
- present mappings to accessed/dirty/MMU-cache update logic
Anonymous missing-PTE faults can map the shared zero page for reads. Write
faults prepare anon-vma state, allocate a private folio, mark it uptodate, lock
the PTE, recheck races, and install a writable entry. File-backed faults call
vm_ops->fault() via __do_fault() and finish installation through
finish_fault(). COW faults use do_wp_page() to choose shared-writable reuse,
private anonymous reuse, or wp_page_copy() allocation and replacement.
Physical Page Allocation
Freeing enters __free_one_page(). The allocator accounts the page range,
checks whether the corresponding buddy is mergeable, removes free buddies,
coalesces to higher orders, and inserts the merged block into the right free
list.
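
The merge step can be illustrated with a small user-space model. This is a toy sketch, not the kernel's code: `free_order`, `TOY_MAX_ORDER`, and `toy_free_block()` are hypothetical names, and a flat array stands in for the per-order free lists. The only part taken from the buddy scheme itself is the XOR relation that finds a block's buddy at a given order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER 4                     /* hypothetical: largest block is 2^4 pages */
#define TOY_NR_PAGES  (1u << TOY_MAX_ORDER)
#define NOT_FREE      (-1)

/* free_order[pfn] is the order of a free block starting at pfn, or NOT_FREE. */
static int free_order[TOY_NR_PAGES];

static void toy_init(void)
{
    for (size_t i = 0; i < TOY_NR_PAGES; i++)
        free_order[i] = NOT_FREE;
}

/* Buddy of the block at pfn: flip the bit that selects the half at this order. */
static unsigned int toy_buddy_pfn(unsigned int pfn, unsigned int order)
{
    return pfn ^ (1u << order);
}

/*
 * Free a block and coalesce upward while its buddy is free at the same order.
 * Returns the order of the block that is finally recorded as free.
 */
static unsigned int toy_free_block(unsigned int pfn, unsigned int order)
{
    assert(pfn < TOY_NR_PAGES && (pfn & ((1u << order) - 1)) == 0);

    while (order < TOY_MAX_ORDER) {
        unsigned int buddy = toy_buddy_pfn(pfn, order);

        if (free_order[buddy] != (int)order)
            break;                      /* buddy is busy or split: stop merging */
        free_order[buddy] = NOT_FREE;   /* take the buddy off its free list */
        pfn &= buddy;                   /* merged block starts at the lower pfn */
        order++;
    }
    free_order[pfn] = (int)order;
    return order;
}
```

Freeing pages 0 and 1 at order 0 leaves a single order-1 block at pfn 0 in this model; a busy buddy simply ends the loop early.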
Allocation enters __alloc_pages_noprof(), a thin wrapper around
__alloc_frozen_pages_noprof(). The fast path validates order, derives an allocation
context from GFP flags, applies fragmentation-avoidance policy, and tries
get_page_from_freelist(). If watermarks cannot be met, the slow path wakes
kswapd, retries under adjusted flags, checks reserves and cpusets, performs
direct reclaim and compaction when allowed, evaluates retry rules, invokes OOM
handling, and implements __GFP_NOFAIL looping.
Page Cache And File-Backed Memory
filemap_get_entry() performs lockless XArray lookup under RCU, skips
exceptional values, pins a folio, and reloads the slot to ensure it did not race
with truncation or replacement. __filemap_get_folio_mpol() builds on that
lookup with optional locking, accessed/write/stable handling, allocation, and
filemap_add_folio().
Buffered reads use filemap_get_pages() to gather cache folios, trigger
readahead, allocate missing folios, update stale pages, and retry on truncation
races. filemap_fault() is the file-backed mmap fault path; it looks up or
creates a folio, may drop mmap_lock and return a retry result, locks and reads
the folio as needed, and returns a vm_fault_t. filemap_map_pages() maps
already-resident folios into PTEs in batches. generic_perform_write() runs the
generic buffered write loop: dirty throttling, filesystem write_begin, atomic
copy from user iterator, filesystem write_end, and forward-progress handling.
Reclaim
Reclaim policy is centralized in struct scan_control. It carries the reclaim
target, optional memcg, anon/file cost, may-writepage/may-unmap/may-swap
permissions, memcg low-limit state, priority, order, reclaim index, GFP mask,
and scan/reclaim counters.
shrink_inactive_list() isolates LRU folios, calls shrink_folio_list(), and
moves survivors back. shrink_folio_list() locks folios, skips unevictable or
unmappable entries, handles dirty/writeback state, checks references, demotes
where possible, allocates swap for anonymous folios, unmaps, avoids pinned
folios, writes dirty file pages when allowed, frees successful candidates, and
returns failed candidates to LRU state.
Background reclaim runs in kswapd(). wakeup_kswapd() records pressure and
wakes the daemon. balance_pgdat() chooses priority and reclaim index, ages
active lists, reclaims memcg soft-limit pages, calls kswapd_shrink_node(),
wakes direct reclaimers once watermarks improve, and decides when the daemon can
sleep.
Slab Caches
The slab layer builds named fixed-size object caches. __kmem_cache_create_args()
validates names, object sizes, flags, debugging, hardened usercopy ranges,
mergeability, alignment, and cache aliasing under slab_mutex. create_cache()
allocates the cache descriptor, calls allocator-specific creation hooks, sets
the refcount, and links the cache globally. kmem_cache_destroy() waits for
deferred RCU/free work, handles SLAB_TYPESAFE_BY_RCU, takes CPU and slab
locks, decrements the refcount, shuts down allocator state, warns on live
objects, unlinks sysfs/debugfs state, and releases the descriptor only when
safe.
Concurrency And Lifetime
mmap_lock protects address-space layout. VMA paths distinguish read-side
lookup from write-side mutation, and stack expansion can upgrade from a read
lock to a write lock, mutate the VMA, and downgrade again. Page-fault handlers
must treat lock-dropping results carefully: after __handle_mm_fault() returns,
callers cannot assume the original VMA pointer remains valid if the lock was
dropped.
Page-table mutation is protected by page-table locks and original-entry revalidation. Fault paths allocate and copy before publishing PTEs, then lock and recheck that the observed PTE still matches. File-backed faults preallocate PTE pages before taking folio locks to avoid reclaim/writeback deadlocks.
Page-cache lookup uses RCU plus folio references and slot revalidation. Reclaim isolates folios before expensive work and returns survivors to LRU lists. Allocator slow paths are constrained by caller context: reclaim, compaction, IO, reserve access, and no-fail looping are all derived from GFP state and task context.
Resource And Failure Model
The memory subsystem reports failure through typed kernel channels, not a
single error code. vm_fault_t can signal OOM, SIGBUS, SIGSEGV, retry, fallback
to smaller pages, storage IO, completed COW, or dropped locks. Mapping creation
can fail because of invalid flags, length/offset overflow, address collision,
locked-memory permission, file-mode mismatch, map-count limits, or allocation
failure.
Physical allocation has a staged failure model: fast-path miss, kswapd wake, reserve retry, direct reclaim, compaction, retry decision, OOM, no-fail loop, or failure return. Reclaim has policy-limited failure: a folio may be referenced, locked elsewhere, dirty when writeback is disallowed, under writeback, pinned, unmappable, unevictable, or not worth retrying at the current priority.
Extension Points
Memory extension points are operation tables and policy flags:
- VMA vm_ops, especially fault, provide file/special mapping behavior.
- Filesystem address-space operations provide read_folio, write_begin, and write_end.
- GFP flags express caller allocation policy.
- Memcg and cpuset state constrain reclaim and allocation.
- Slab flags, constructors, hardened usercopy ranges, and allocator-specific hooks customize object-cache behavior.
- Architecture page-table helpers define the unsafe hardware boundary for PTE, PMD, PUD, and TLB behavior.
C Implementation Walkthrough
The C implementation is structured around explicit state, labels, bitfields, and lock-coupled helper calls rather than object-oriented wrappers.
In mm/mmap.c, do_mmap() starts at line 336. It immediately asserts the
write-held mmap_lock at line 347, then validates zero length, length
alignment, offset overflow, and map-count pressure at lines 349-380. It asks
for an unmapped range at lines 405-410, checks MAP_FIXED_NOREPLACE
intersection at lines 412-414, validates MAP_LOCKED at lines 417-422, splits
file-backed from anonymous mapping logic at lines 424-543, handles
MAP_NORESERVE at lines 546-558, and calls mmap_region() at line 560.
In mm/memory.c, handle_mm_fault() runs the top-level permission, memcg,
hugetlb, normal-fault, and accounting flow at lines 6644-6716.
__handle_mm_fault() constructs struct vm_fault and walks page-table levels
at lines 6411-6515. handle_pte_fault() is the PTE dispatcher at lines
6328-6408. Anonymous fault allocation and zero-page handling are at lines
5282-5365; file fault callback dispatch is at lines 5393-5428; COW and
write-protect decisions are at lines 3836-3852 and 4240-4315.
In mm/page_alloc.c, the buddy allocator is described at lines 913-934.
__free_one_page() begins at line 936, merges compatible buddies at lines
954-1005, and reinserts the merged block at lines 1007-1019. Allocation
slowpath behavior is concentrated in __alloc_pages_slowpath() at lines
4724-5023. The main zoned buddy entry is documented as the allocator heart at
lines 5265-5267 and implemented at lines 5268-5331.
In mm/filemap.c, the lockless page-cache protocol is documented at lines
1862-1880. filemap_get_entry() performs RCU/XArray lookup and folio
revalidation at lines 1882-1923. filemap_get_pages() runs buffered read
batching, readahead, creation, update, and retry behavior at lines 2677-2744.
filemap_fault() implements file-backed mmap fault behavior at lines
3523-3704, and generic_perform_write() runs the buffered write loop at lines
4335-4415.
In mm/vmscan.c, struct scan_control is defined at lines 74-180.
shrink_folio_list() begins at line 1058, classifies folios at lines
1078-1132, implements writeback cases at lines 1143-1230, runs reference,
demotion, swap, unmap, pinned, dirty-file, and writepage decisions at lines
1233-1441, and completes demotion/free/move-back handling near lines
1553-1594. balance_pgdat() runs background node balancing at lines
7056-7290; kswapd() itself runs at lines 7391-7476.
In mm/slab_common.c, cache creation validates caller-visible cache properties,
allocator flags, merging, and hardened usercopy constraints before publishing a
cache. Cache destruction waits for deferred work, handles RCU-safe caches, and
unlinks the cache only after allocator shutdown and live-object checks.
File-By-File Implementation Analysis
This section is the deeper implementation layer. It describes each core file as code: what the important functions accept, what they mutate, where they branch, which labels are used for retry/error paths, and why the C is shaped the way it is.
include/linux/mm_types.h: State Carried Through The VM
This header is where the memory subsystem’s central nouns are declared. The important design choice is that Linux does not pass “an address” through the VM as a naked scalar. It carries the address with the VMA, original PTE/PMD state, page-table lock, fault flags, folio/page return value, and allocation policy. That is what makes page-fault code resumable after races and retryable after lock dropping.
The vm_area_struct is the unit of virtual-address policy. It answers: which
range is this, what permissions does it have, is it file-backed, what callbacks
does it use, and what anonymous-memory metadata is attached. The mm_struct is
the address-space owner: it owns the VMA index, page tables, counters, locks,
and process-level state.
struct vm_fault is the implementation hinge. It is not merely an error
context; it is a mutable work packet. Fault helpers fill in fields as they walk
from address-space level to page-table level:

    struct vm_fault vmf = {
        .vma = vma,
        .address = address & PAGE_MASK,
        .real_address = address,
        .flags = flags,
        .pgoff = linear_page_index(vma, address),
        .gfp_mask = __get_fault_gfp_mask(vma),
    };

That initializer in mm/memory.c lines 6420-6427 shows the data model in use:
the raw CPU fault address is normalized to a page address, the file/anonymous
offset is derived from the VMA, and the allocation mask is derived from fault
context before any page-table allocation is attempted.
include/linux/mm.h: The Public VM Contract
This header exports the contracts consumed by architecture code, file systems,
device mappings, GUP, and other MM users. The most important shape is the VMA
operation table. File-backed memory does not hard-code ext4, tmpfs, or device
behavior into mm/memory.c; it calls through VMA and address-space operation
tables.
The conceptual contract is:

    struct vm_operations_struct {
        void (*open)(struct vm_area_struct *area);
        void (*close)(struct vm_area_struct *area);
        vm_fault_t (*fault)(struct vm_fault *vmf);
        vm_fault_t (*map_pages)(struct vm_fault *vmf,
                                pgoff_t start_pgoff, pgoff_t end_pgoff);
        ...
    };

The exact struct includes more callbacks, but these fields explain the pattern:
VMA lifecycle and page-fault behavior are delegated. mm/memory.c owns common
page-table installation and COW policy; filesystems/devices own “how do I
materialize backing data for this VMA?”
mm/mmap.c: VMA Creation, Lookup, Fork Copy, And Unmap
do_mmap() is the core mapping constructor. It accepts an optional struct file *, requested address/length/protection/flags, derived VM flags, page
offset, output population length, and optional userfaultfd list. It mutates
vm_flags, may normalize addr, may rewrite pgoff, and eventually asks
mmap_region() to edit the VMA tree.
The opening shape matters:

    *populate = 0;
    mmap_assert_write_locked(mm);

    if (!len)
        return -EINVAL;

    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    if (mm->map_count > get_sysctl_max_map_count())
        return -ENOMEM;

This is mm/mmap.c lines 345-380. The function is deliberately front-loaded
with policy checks before tree mutation: caller must hold the write side of
mmap_lock; zero-length mappings fail as invalid input; length alignment can
overflow to zero and is treated as allocation impossibility; file offset plus
length must not wrap; and the process cannot exceed the configured VMA count.
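
The same ordering of checks can be reproduced in a standalone function. The following is a minimal sketch under stated assumptions: `toy_check_mapping()` and the `TOY_*` macros are hypothetical, a 4 KiB page size is assumed, and the map-count limit is passed in as a parameter rather than read from a sysctl.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                          /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)
#define TOY_PAGE_ALIGN(x) (((x) + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1))

/*
 * Validate a mapping request before any state is touched.
 * Returns 0 and stores the page-aligned length on success,
 * or a negative errno, checking in the order the text describes.
 */
static int toy_check_mapping(uint64_t len, uint64_t pgoff,
                             unsigned int map_count,
                             unsigned int max_map_count,
                             uint64_t *aligned_len)
{
    assert(aligned_len != NULL);

    if (len == 0)
        return -EINVAL;       /* zero-length request is invalid input */

    len = TOY_PAGE_ALIGN(len);
    if (len == 0)
        return -ENOMEM;       /* rounding up wrapped around to zero */

    if (pgoff + (len >> TOY_PAGE_SHIFT) < pgoff)
        return -EOVERFLOW;    /* page offset plus length wraps */

    if (map_count > max_map_count)
        return -ENOMEM;       /* too many mappings already */

    *aligned_len = len;
    return 0;
}
```

A length of `UINT64_MAX` exercises the second check: adding `TOY_PAGE_SIZE - 1` wraps, the mask clears the remainder, and the aligned length becomes zero.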
Next it derives final VMA flags and finds an address:

    vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
                mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
    if (IS_ERR_VALUE(addr))
        return addr;

    if (flags & MAP_FIXED_NOREPLACE) {
        if (find_vma_intersection(mm, addr, addr + len))
            return -EEXIST;
    }

This is lines 402-414. The key point is that “protection” and “capability” are
separate. VM_READ means currently readable; VM_MAYREAD means it may be made
readable later. The address selection is delegated because architecture,
randomization, top-down/bottom-up layout, huge pages, and file constraints all
affect placement.
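
The split between current protection and capability can be sketched as a check applied when protection is changed later. Everything here is illustrative: the `TOY_*` flag values and `toy_change_protection()` are hypothetical and are not copied from kernel headers. The sketch only assumes that each capability bit sits four bits above its access bit, so one shift lines the two sets up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy flags; each MAY bit is its access bit shifted left by 4. */
#define TOY_READ     0x01u
#define TOY_WRITE    0x02u
#define TOY_EXEC     0x04u
#define TOY_MAYREAD  0x10u
#define TOY_MAYWRITE 0x20u
#define TOY_MAYEXEC  0x40u
#define TOY_ACCESS   (TOY_READ | TOY_WRITE | TOY_EXEC)

/*
 * Apply a new set of access bits to a mapping, but only if every requested
 * bit is covered by a capability bit recorded when the mapping was created.
 * Returns 0 on success or -EACCES, leaving the flags untouched on failure.
 */
static int toy_change_protection(unsigned int *vm_flags, unsigned int new_access)
{
    unsigned int allowed;

    assert(vm_flags != NULL);
    assert((new_access & ~TOY_ACCESS) == 0);

    allowed = (*vm_flags >> 4) & TOY_ACCESS;
    if (new_access & ~allowed)
        return -EACCES;

    *vm_flags = (*vm_flags & ~TOY_ACCESS) | new_access;
    return 0;
}
```

A mapping created with `TOY_MAYREAD | TOY_MAYEXEC` can later become executable but can never become writable, regardless of what it is currently allowed to do.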
The file-backed branch enforces the backing object’s rules:

    if (prot & PROT_WRITE) {
        if (!(file->f_mode & FMODE_WRITE))
            return -EACCES;
        if (IS_SWAPFILE(file->f_mapping->host))
            return -ETXTBSY;
    }

    if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
        return -EACCES;

    vm_flags |= VM_SHARED | VM_MAYSHARE;
    if (!(file->f_mode & FMODE_WRITE))
        vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

That is lines 450-466 inside the MAP_SHARED/MAP_SHARED_VALIDATE path. The
C code encodes Unix semantics directly: writable shared mappings require a
writable file; a swapfile cannot be mapped shared and writable (the request
fails with -ETXTBSY); append-only files cannot be bypassed with mmap writes;
and a non-writable file can still be mapped, but not as a writable shared
mapping.
Anonymous mappings use different policy:

    case MAP_PRIVATE:
        /*
         * Set pgoff according to addr for anon_vma.
         */
        pgoff = addr >> PAGE_SHIFT;
        break;

That is lines 535-540. Anonymous private mappings have no file offset, so Linux chooses an offset derived from the address. That value becomes part of anon-vma and reverse-mapping logic.
The end of do_mmap() is the actual mutation point:

    addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;

Lines 560-565 show the split between creating metadata and faulting/populating
pages. mmap() usually creates address-space policy; it does not necessarily
allocate every page immediately.
dup_mmap() is the fork-side counterpart. It has to clone the VMA tree and
then make every child VMA consistent with files, anon-vma state, userfaultfd,
and page tables. Its important sequence is: lock old and new address spaces,
duplicate the maple tree, iterate VMAs, skip VM_DONTCOPY, duplicate the VMA
descriptor, fork anon-vma metadata, call VMA open, insert file interval-tree
state, then call copy_page_range(). This ordering is why fork cleanup is
complicated: after any partial success, there may be tree entries, file
references, anon-vma references, and copied page tables to unwind.
mm/memory.c: Fault Dispatch, Page-Table Walk, Anonymous/File/COW
handle_mm_fault() is the top-level common fault entry. Architecture code has
already found the VMA and acquired either the VMA lock or mmap_lock; this
function is responsible for policy, common dispatch, memcg user-fault state,
and accounting.
The high-level function is short because the complexity is delegated:

    ret = sanitize_fault_flags(vma, &flags);
    if (ret)
        goto out;

    if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                                   flags & FAULT_FLAG_INSTRUCTION,
                                   flags & FAULT_FLAG_REMOTE)) {
        ret = VM_FAULT_SIGSEGV;
        goto out;
    }

    if (flags & FAULT_FLAG_USER)
        mem_cgroup_enter_user_fault();

    if (unlikely(is_vm_hugetlb_page(vma)))
        ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
    else
        ret = __handle_mm_fault(vma, address, flags);

That is from mm/memory.c lines 6661-6686. The function first rejects
impossible flag combinations and architecture permission failures. Then it
enters memcg OOM handling only for user faults. Then it routes either to hugetlb
or normal page-table handling.
The most important lifetime warning follows:

    /*
     * Warning: It is no longer safe to dereference vma-> after this point,
     * because mmap_lock might have been dropped by __handle_mm_fault(), so
     * vma might be destroyed from underneath us.
     */

This is lines 6688-6692. It is one of the most important lessons in the VM: return values are not just success/failure. Some return values encode “the lock was dropped; your pointer may be stale.”
__handle_mm_fault() constructs the vm_fault, walks page-table levels, and
tries huge mappings before falling back:

        pgd = pgd_offset(mm, address);
        p4d = p4d_alloc(mm, pgd, address);
        if (!p4d)
            return VM_FAULT_OOM;

        vmf.pud = pud_alloc(mm, p4d, address);
        if (!vmf.pud)
            return VM_FAULT_OOM;
        ...
        vmf.pmd = pmd_alloc(mm, vmf.pud, address);
        if (!vmf.pmd)
            return VM_FAULT_OOM;
        ...
    fallback:
        return handle_pte_fault(&vmf);

This is lines 6434-6471 and 6516-6517. The walk is allocation-capable: a page fault can allocate page-table pages before it ever allocates a user page.
Huge-page handling is opportunistic. If a PUD/PMD is empty and the VMA allows a
huge fault, Linux tries to create the huge mapping. If that returns
VM_FAULT_FALLBACK, the code resumes at the smaller page-table level. This
means huge pages are an optimization path, not a semantic requirement.
handle_pte_fault() is the dispatcher:

    if (!vmf->pte)
        return do_pte_missing(vmf);

    if (!pte_present(vmf->orig_pte))
        return do_swap_page(vmf);

    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
        return do_numa_page(vmf);

    spin_lock(vmf->ptl);
    entry = vmf->orig_pte;
    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
        update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
        goto unlock;
    }
    if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
        else if (likely(vmf->flags & FAULT_FLAG_WRITE))
            entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);

This is lines 6378-6399. The dispatch order is exact:
missing mapping, swap/migration/non-present entry, NUMA protection, then present
PTE update. Before modifying the PTE, it locks the page-table lock and checks
pte_same() against the originally observed value. That protects against a
parallel fault racing to install or change the same PTE.
The missing-PTE branch eventually splits anonymous from file-backed faults:
anonymous read faults can map the zero page; anonymous write faults allocate a
private folio; file/special mappings call the VMA fault operation. COW write
faults go through do_wp_page() and may reuse an exclusive anonymous folio or
allocate/copy through wp_page_copy(). The common pattern is always:
prepare outside the PTE lock when possible, take the lock, revalidate the
original PTE, publish the new entry, update MMU/TLB state, and release.
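
The prepare, lock, revalidate, publish pattern can be shown with a single slot and a tiny spinlock. This is a user-space sketch under stated assumptions: `struct toy_pte_slot`, `toy_install_pte()`, and the `atomic_flag` spinlock are hypothetical stand-ins for a PTE, the fault path, and the page-table lock, and a plain integer comparison stands in for pte_same().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-slot "page table": entry 0 means nothing is mapped. */
struct toy_pte_slot {
    atomic_flag lock;   /* stands in for the page-table spinlock */
    uint64_t entry;
};

static void toy_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;               /* spin until the holder clears the flag */
}

static void toy_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Publish new_entry only if the slot still holds the value that was observed
 * before the unlocked preparation work. Returns true if this caller won.
 */
static bool toy_install_pte(struct toy_pte_slot *slot, uint64_t observed,
                            uint64_t new_entry)
{
    bool installed = false;

    assert(slot != NULL && new_entry != 0);

    /* Expensive preparation (allocation, copying) happens before this point. */
    toy_lock(&slot->lock);
    if (slot->entry == observed) {   /* revalidate the original value */
        slot->entry = new_entry;     /* publish */
        installed = true;
    }
    toy_unlock(&slot->lock);

    /* On failure the caller discards what it prepared and lets the fault retry. */
    return installed;
}
```

The second of two racing installers sees a changed slot and backs off without overwriting the winner, which is the property the recheck exists to provide.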
mm/page_alloc.c: Zoned Buddy Allocator And Slowpath Policy
__alloc_frozen_pages_noprof() is documented in the source as the heart of the
zoned buddy allocator. Its structure is a classic Linux fast path plus slow
path:

    if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
        return NULL;

    gfp &= gfp_allowed_mask;
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
                             &alloc_gfp, &alloc_flags))
        return NULL;

    alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);

    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    ac.nodemask = nodemask;
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

This is mm/page_alloc.c lines 5276-5317. It validates order, masks GFP by
system-wide allowed bits, applies scoped context like nofs/noio, prepares zone
and cpuset state, tries a no-fragment fast allocation, then restores the caller
nodemask and enters slowpath.
The slow path is not “try harder” as one vague step. It is a sequence of policy gates:

    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    bool can_compact = can_direct_reclaim && gfp_compaction_allowed(gfp_mask);
    bool nofail = gfp_mask & __GFP_NOFAIL;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

Lines 4728-4731 derive the allocator’s legal actions from GFP. If direct reclaim is disallowed, the allocator must not sleep in reclaim. If compaction is disallowed, high-order allocations cannot depend on moving pages. If nofail is set, the function must loop unless the request is nonsensical.
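
Deriving the permitted actions once, up front, can be sketched as a pure function from request flags to a policy record. The `TOY_*` flags, `struct toy_policy`, and `toy_derive_policy()` are hypothetical; the values are not GFP bits, and tying compaction to an IO-permitted flag is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request flags for the sketch; not the kernel's GFP values. */
#define TOY_DIRECT_RECLAIM 0x1u
#define TOY_IO             0x2u   /* assumption: compaction needs IO permission */
#define TOY_NOFAIL         0x4u
#define TOY_COSTLY_ORDER   3u     /* assumption: orders above this are costly */

struct toy_policy {
    bool can_direct_reclaim;
    bool can_compact;
    bool nofail;
    bool costly_order;
};

/* Translate caller flags and order into the actions the slow path may take. */
static struct toy_policy toy_derive_policy(unsigned int flags, unsigned int order)
{
    struct toy_policy p;

    assert(order < 32);
    p.can_direct_reclaim = (flags & TOY_DIRECT_RECLAIM) != 0;
    /* Compaction is only legal when the caller may also reclaim directly. */
    p.can_compact = p.can_direct_reclaim && (flags & TOY_IO) != 0;
    p.nofail = (flags & TOY_NOFAIL) != 0;
    p.costly_order = order > TOY_COSTLY_ORDER;
    return p;
}
```

Computing the record once keeps later branches from re-deriving permissions inconsistently, and it makes odd combinations, such as a no-fail request that cannot reclaim, visible at a single point.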
The main retry loop wakes background reclaim and retries the freelist:

    retry:
        if (alloc_flags & ALLOC_KSWAPD)
            wake_all_kswapds(order, gfp_mask, ac);

        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
            goto got_pg;

That is lines 4812-4823. It is important that kswapd is woken before expensive direct work: background reclaim might satisfy future allocations even if this allocation has to continue into slowpath.
Then it broadens policy if reserves or cpusets can be ignored:

    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
                      (alloc_flags & ALLOC_KSWAPD);

    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(...);
        if (can_retry_reserves) {
            can_retry_reserves = false;
            goto retry;
        }
    }

Lines 4825-4850 show how Linux handles privileged/system allocations: it can ignore memory policy and watermarks once, then retry before doing heavier work.
If the caller cannot reclaim, the path is short:

    if (!can_direct_reclaim) {
        if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
            (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
            alloc_flags &= ~ALLOC_NOFRAGMENT;
            goto retry;
        }
        goto nopage;
    }

Lines 4852-4865 demonstrate why GFP is semantically important: allocation from atomic or nonblocking context cannot simply block until memory appears.
The expensive part is direct reclaim and compaction:

    if (!compact_first) {
        page = __alloc_pages_direct_reclaim(..., &did_some_progress);
        if (page)
            goto got_pg;
    }

    page = __alloc_pages_direct_compact(..., compact_priority, &compact_result);
    if (page)
        goto got_pg;

Lines 4874-4886 show the two big recovery mechanisms. Reclaim frees pages;
compaction moves pages to create high-order contiguous ranges. The retry logic
afterward is guarded by __GFP_NORETRY, costly-order policy,
should_reclaim_retry(), should_compact_retry(), cpuset/zonelist race checks,
OOM handling, and nofail fallback.
The nofail tail is explicit:

    if (unlikely(nofail)) {
        if (!can_direct_reclaim)
            goto fail;
        page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);
        if (page)
            goto got_pg;
        cond_resched();
        goto retry;
    }

Lines 4992-5017 make clear that nofail is not magic. It still needs a context that can reclaim, it tries limited reserve fallback, yields, and retries.
mm/filemap.c: Page Cache, File Faults, And Buffered IO
The page cache is an indexed folio store rooted in address_space->i_pages.
filemap_get_entry() is the primitive lookup:

        rcu_read_lock();
    repeat:
        xas_reset(&xas);
        folio = xas_load(&xas);
        if (xas_retry(&xas, folio))
            goto repeat;
        if (!folio || xa_is_value(folio))
            goto out;

        if (!folio_try_get(folio))
            goto repeat;

        if (unlikely(folio != xas_reload(&xas))) {
            folio_put(folio);
            goto repeat;
        }
    out:
        rcu_read_unlock();

This is mm/filemap.c lines 1899-1920. The sequence is the whole lockless
protocol: load under RCU, handle retry entries, ignore shadow/swap exceptional
values for refcounting, try to pin the folio, then reload the XArray slot to
prove the pinned folio is still the indexed cache entry.
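
The load, pin, reload sequence can be modelled with C11 atomics. This is a simplified sketch, not the kernel protocol: `struct toy_folio`, `struct toy_slot`, and the `toy_*` functions are hypothetical, a single slot stands in for the XArray, and RCU is omitted, so the sketch assumes an object is never freed while a lookup can still reach it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
    atomic_int refcount;   /* 0 means the object is on its way to being freed */
};

/* One cache slot; a real cache would index many of these by file offset. */
struct toy_slot {
    _Atomic(struct toy_folio *) entry;
};

/* Take a reference only if the count is still non-zero. */
static bool toy_try_get(struct toy_folio *f)
{
    int old = atomic_load(&f->refcount);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
            return true;
    }
    return false;
}

static void toy_put(struct toy_folio *f)
{
    atomic_fetch_sub(&f->refcount, 1);
}

/*
 * Lockless lookup: load the slot, pin the object, then reload the slot to
 * confirm the pinned object is still the published entry.
 * Like the original loop, this retries until the slot settles, so it relies
 * on a dying object being removed from the slot promptly.
 */
static struct toy_folio *toy_lookup(struct toy_slot *slot)
{
    struct toy_folio *f;

    assert(slot != NULL);
    for (;;) {
        f = atomic_load(&slot->entry);
        if (f == NULL)
            return NULL;      /* nothing cached */
        if (!toy_try_get(f))
            continue;         /* lost a race with freeing: retry */
        if (f == atomic_load(&slot->entry))
            return f;         /* still indexed: caller now holds a reference */
        toy_put(f);           /* slot changed underneath us: retry */
    }
}
```

The reload matters because pinning only proves the object is alive, not that it is still the entry for this index.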
__filemap_get_folio_mpol() layers policy on top. If FGP_LOCK is requested,
it locks the folio and then verifies it was not truncated out of the mapping:

    if (fgp_flags & FGP_LOCK) {
        ...
        if (unlikely(folio->mapping != mapping)) {
            folio_unlock(folio);
            folio_put(folio);
            goto repeat;
        }
        VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
    }

Lines 1954-1970 show the same pattern as PTE faults: take reference/lock, then revalidate against the mapping because truncation can race with lookup.
The create path uses folio order policy and falls back to smaller orders:

    do {
        gfp_t alloc_gfp = gfp;

        err = -ENOMEM;
        if (order > min_order)
            alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
        folio = filemap_alloc_folio(alloc_gfp, order, policy);
        if (!folio)
            continue;

        err = filemap_add_folio(mapping, folio, index, gfp);
        if (!err)
            break;
        folio_put(folio);
        folio = NULL;
    } while (order-- > min_order);

Lines 2007-2028 show large-folio optimism without making high-order allocation mandatory. If the big folio cannot be allocated or inserted, the loop can try a smaller order.
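
The shape of that loop, try the preferred order and step down one order at a time, can be reproduced in user space. This is a sketch with hypothetical names: `toy_alloc_order()` is a stand-in allocator that refuses anything above `max_ok_order`, and `malloc()` replaces folio allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical allocator for the sketch: refuses any request above
 * max_ok_order, standing in for a high-order allocation that fails.
 */
static void *toy_alloc_order(unsigned int order, unsigned int max_ok_order)
{
    if (order > max_ok_order)
        return NULL;
    return malloc((size_t)4096 << order);
}

/*
 * Try the preferred order first and fall back one order at a time, never
 * going below min_order. On success, stores the order actually used.
 */
static void *toy_alloc_with_fallback(unsigned int order, unsigned int min_order,
                                     unsigned int max_ok_order,
                                     unsigned int *used_order)
{
    void *buf = NULL;

    assert(order >= min_order && used_order != NULL);
    do {
        buf = toy_alloc_order(order, max_ok_order);
        if (buf) {
            *used_order = order;
            break;
        }
        /* The post-decrement compares the order just tried, then steps down. */
    } while (order-- > min_order);

    return buf;
}
```

The loop condition tests the order that was just attempted, so `min_order` itself is still tried before the loop gives up.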
filemap_fault() is the mmap fault implementation for ordinary files. It first
checks file size:

    max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
    if (unlikely(index >= max_idx))
        return VM_FAULT_SIGBUS;

Lines 3557-3559 show why mmap past EOF faults as SIGBUS: the VMA may cover a
range, but the file no longer has data for that page.
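
The arithmetic behind that check is small enough to work through directly. The sketch below assumes 4 KiB pages; `toy_max_index()` and `toy_fault_beyond_eof()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

static_assert((TOY_PAGE_SIZE & (TOY_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Number of pages needed to cover file_size bytes, rounding up. */
static uint64_t toy_max_index(uint64_t file_size)
{
    return file_size / TOY_PAGE_SIZE + (file_size % TOY_PAGE_SIZE != 0);
}

/* True when a fault on page `index` lies entirely beyond the end of the file. */
static bool toy_fault_beyond_eof(uint64_t file_size, uint64_t index)
{
    return index >= toy_max_index(file_size);
}
```

A 4097-byte file covers page indexes 0 and 1, so a fault on index 1 is valid while a fault on index 2 is past EOF. An empty file has no valid index at all.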
Then it tries the cache:

    folio = filemap_get_folio(mapping, index);
    if (likely(!IS_ERR(folio))) {
        if (!(vmf->flags & FAULT_FLAG_TRIED))
            fpin = do_async_mmap_readahead(vmf, folio);
        if (unlikely(!folio_test_uptodate(folio))) {
            filemap_invalidate_lock_shared(mapping);
            mapping_locked = true;
        }
    } else {
        count_vm_event(PGMAJFAULT);
        ret = VM_FAULT_MAJOR;
        fpin = do_sync_mmap_readahead(vmf);
        ...
        folio = __filemap_get_folio(mapping, index,
                                    FGP_CREAT|FGP_FOR_MMAP,
                                    vmf->gfp_mask);
    }

This is lines 3566-3600. Cache hit: maybe async readahead. Cache miss: major fault accounting, synchronous mmap readahead, invalidate-lock coverage, and folio creation.
The lock/drop-retry path is the subtle part:

        if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
            goto out_retry;
        ...
        if (fpin) {
            folio_unlock(folio);
            goto out_retry;
        }
        ...
    out_retry:
        if (!IS_ERR(folio))
            folio_put(folio);
        if (mapping_locked)
            filemap_invalidate_unlock_shared(mapping);
        if (fpin)
            fput(fpin);
        return ret | VM_FAULT_RETRY;

Lines 3608-3609, 3650-3652, and 3690-3702 explain the contract mentioned in
handle_mm_fault(): file IO may require dropping mmap_lock; when that
happens, the upper fault handler must re-find the VMA and retry.
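
The caller's side of that contract can be sketched as a loop that looks the VMA up again on every pass. All names here are hypothetical (`struct toy_mm`, `toy_find_vma()`, `toy_handle_fault()`, `TOY_FAULT_RETRY`), locking is left out entirely, and the `tried` argument only loosely models the role of a "this is a retry" fault flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FAULT_RETRY 0x1u   /* hypothetical result bit: the lock was dropped */

struct toy_vma { unsigned long start, end; };

struct toy_mm {
    struct toy_vma *vmas;
    size_t nr_vmas;
    unsigned int lookups;      /* counts how often a VMA was (re)found */
};

static struct toy_vma *toy_find_vma(struct toy_mm *mm, unsigned long addr)
{
    mm->lookups++;
    for (size_t i = 0; i < mm->nr_vmas; i++)
        if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
            return &mm->vmas[i];
    return NULL;
}

/* Handler: returns 0 when done, or TOY_FAULT_RETRY after dropping the lock. */
typedef unsigned int (*toy_fault_fn)(struct toy_vma *vma, unsigned long addr,
                                     bool tried);

/* Example handler: asks for exactly one retry, then succeeds. */
static unsigned int toy_retry_once(struct toy_vma *vma, unsigned long addr,
                                   bool tried)
{
    (void)vma;
    (void)addr;
    return tried ? 0 : TOY_FAULT_RETRY;
}

/*
 * Never reuse the VMA pointer across a retry: it is looked up afresh on each
 * pass. Returns 0 on success, -1 if no mapping covers addr.
 */
static int toy_handle_fault(struct toy_mm *mm, unsigned long addr,
                            toy_fault_fn handler)
{
    bool tried = false;

    assert(mm != NULL && handler != NULL);
    for (;;) {
        struct toy_vma *vma = toy_find_vma(mm, addr);

        if (!vma)
            return -1;
        if (!(handler(vma, addr, tried) & TOY_FAULT_RETRY))
            return 0;
        tried = true;   /* tell the handler this is the second attempt */
    }
}
```

With `toy_retry_once` as the handler, one fault costs two lookups, which is the visible price of not trusting a pointer obtained before the lock was dropped.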
mm/vmscan.c: Reclaim Decision Engine
shrink_folio_list() is a long function because reclaim has to decide what is
legal and profitable for every folio. The opening sets up private lists and
policy:

    struct folio_batch free_folios;
    LIST_HEAD(ret_folios);
    LIST_HEAD(demote_folios);
    unsigned int nr_reclaimed = 0, nr_demoted = 0;
    ...
    do_demote_pass = can_demote(pgdat->node_id, sc, memcg);

This is mm/vmscan.c lines 1065-1076. Reclaim does not immediately free every
candidate; it separates folios to free, folios to return, and folios to demote
to another memory tier.
The main loop isolates and locks one folio at a time:

    folio = lru_to_folio(folio_list);
    list_del(&folio->lru);

    if (!folio_trylock(folio))
        goto keep;

    nr_pages = folio_nr_pages(folio);
    sc->nr_scanned += nr_pages;

    if (unlikely(!folio_evictable(folio)))
        goto activate_locked;

    if (!sc->may_unmap && folio_mapped(folio))
        goto keep_locked;

Lines 1088-1120 show the first filter: if it cannot lock the folio, keep it; if unevictable, reactivate it; if mapped and this reclaim context cannot unmap, keep it.
Writeback handling is deliberately conservative. The source comments describe three cases, and the code reflects them:
```c
if (folio_test_writeback(folio)) {
    mapping = folio_mapping(folio);

    if (current_is_kswapd() && folio_test_reclaim(folio) &&
        test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
        stat->nr_immediate += nr_pages;
        goto activate_locked;
    } else if (writeback_throttling_sane(sc) ||
               !folio_test_reclaim(folio) ||
               !may_enter_fs(folio, sc->gfp_mask) ||
               (mapping && mapping_writeback_may_deadlock_on_reclaim(mapping))) {
        folio_set_reclaim(folio);
        stat->nr_writeback += nr_pages;
        goto activate_locked;
    } else {
        folio_unlock(folio);
        folio_wait_writeback(folio);
        list_add_tail(&folio->lru, folio_list);
        continue;
    }
}
```

This is lines 1189-1230. Reclaim avoids indefinite stalls and filesystem deadlocks. Sometimes it marks the folio for immediate reclaim later; sometimes legacy memcg waits for writeback; often it activates the folio and keeps scanning for cheaper victims.
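The three writeback cases reduce to a pure decision over a handful of booleans. The sketch below assumes a hypothetical struct wb_state that snapshots the conditions the excerpt tests, and a hypothetical classify_writeback() that returns which branch would run. It does not touch folios, statistics, or lists; it only mirrors the order of the tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the conditions the excerpt tests. */
struct wb_state {
    bool is_kswapd;             /* current_is_kswapd() */
    bool folio_reclaim_flag;    /* folio_test_reclaim() */
    bool node_writeback_flag;   /* PGDAT_WRITEBACK set on the node */
    bool throttling_sane;       /* writeback_throttling_sane() */
    bool may_enter_fs;          /* may_enter_fs() */
    bool mapping_may_deadlock;  /* mapping exists and may deadlock on reclaim */
};

enum wb_action {
    WB_COUNT_IMMEDIATE_AND_ACTIVATE,  /* first branch */
    WB_MARK_RECLAIM_AND_ACTIVATE,     /* second branch */
    WB_WAIT_AND_RESCAN,               /* third branch */
};

static enum wb_action classify_writeback(const struct wb_state *s)
{
    assert(s);
    /* kswapd meets a folio already tagged for reclaim on a congested node. */
    if (s->is_kswapd && s->folio_reclaim_flag && s->node_writeback_flag)
        return WB_COUNT_IMMEDIATE_AND_ACTIVATE;
    /* Waiting would be unsafe or pointless: tag the folio and move on. */
    if (s->throttling_sane || !s->folio_reclaim_flag ||
        !s->may_enter_fs || s->mapping_may_deadlock)
        return WB_MARK_RECLAIM_AND_ACTIVATE;
    /* Only remaining case: wait for writeback, then rescan the folio. */
    return WB_WAIT_AND_RESCAN;
}
```

Writing the decision as a function of plain inputs makes the ordering explicit: the wait branch is reachable only when every escape condition in the second branch is false.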
Reference checking decides whether the folio is still part of the working set:
```c
if (!ignore_references)
    references = folio_check_references(folio, sc);

switch (references) {
case FOLIOREF_ACTIVATE:
    goto activate_locked;
case FOLIOREF_KEEP:
    stat->nr_ref_keep += nr_pages;
    goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
    ; /* try to reclaim the folio below */
}
```

Lines 1233-1245 show the “second chance” behavior. Reclaim is not only about freeing memory; it protects recently used pages from being discarded.
Anonymous pages need swap before reclaim:
```c
if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
    !folio_test_swapcache(folio)) {
    if (!(sc->gfp_mask & __GFP_IO))
        goto keep_locked;
    if (folio_maybe_dma_pinned(folio))
        goto keep_locked;
    ...
    if (folio_alloc_swap(folio)) {
        ...
        goto activate_locked_split;
    }
    folio_mark_dirty(folio);
}
```

Lines 1263-1315 show that reclaim cannot simply drop anonymous memory. It must secure swap backing, avoid pinned memory, handle large folios, and mark special MADV_FREE races dirty to avoid data corruption.
Mapped folios must be unmapped before freeing:
```c
if (folio_mapped(folio)) {
    enum ttu_flags flags = TTU_BATCH_FLUSH;
    bool was_swapbacked = folio_test_swapbacked(folio);

    if (folio_test_pmd_mappable(folio))
        flags |= TTU_SPLIT_HUGE_PMD;
    if (folio_test_large(folio))
        flags |= TTU_SYNC;

    try_to_unmap(folio, flags);
    if (folio_mapped(folio)) {
        stat->nr_unmap_fail += nr_pages;
        ...
        goto activate_locked;
    }
}
```

Lines 1331-1359 show reverse-mapping in action. Reclaim asks every mapping of that folio to remove its PTEs. Large folios add synchronization because partial PTE races can leave subpages mapped.
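Dirty folios are written from reclaim only when the reclaim context allows it. In the excerpt below, dirty file-LRU folios are not written here at all; they are tagged for reclaim and activated, and only the remaining dirty folios can reach pageout():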
```c
if (folio_test_dirty(folio)) {
    if (folio_is_file_lru(folio)) {
        node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE, nr_pages);
        if (!folio_test_reclaim(folio))
            folio_set_reclaim(folio);
        goto activate_locked;
    }

    if (references == FOLIOREF_RECLAIM_CLEAN)
        goto keep_locked;
    if (!may_enter_fs(folio, sc->gfp_mask))
        goto keep_locked;
    if (!sc->may_writepage)
        goto keep_locked;

    try_to_unmap_flush_dirty();
    switch (pageout(folio, mapping, &plug, folio_list)) {
    ...
    }
}
```

Lines 1372-1441 show that IO permission is a first-class part of reclaim. A nofs/noio allocation context must not deadlock by entering filesystem writeback from reclaim.
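The gating order for dirty folios can be sketched as a small classifier. Everything named here is an assumption for illustration: the TOY_GFP_* bits are not the kernel's GFP values, struct dirty_ctx is a hypothetical snapshot, and toy_may_enter_fs() consults only the FS bit, which is simpler than the real may_enter_fs().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real GFP bits live in the kernel headers. */
#define TOY_GFP_IO (1u << 0)
#define TOY_GFP_FS (1u << 1)

struct dirty_ctx {
    unsigned int gfp;     /* allocation context that triggered reclaim */
    bool may_writepage;   /* reclaim policy allows writing pages */
    bool file_lru;        /* folio sits on a file LRU */
    bool reclaim_clean;   /* reference check asked for clean-only reclaim */
};

enum dirty_action { DIRTY_TAG_AND_ACTIVATE, DIRTY_KEEP, DIRTY_PAGEOUT };

/* Simplification: only the FS bit is consulted here. */
static bool toy_may_enter_fs(unsigned int gfp)
{
    return (gfp & TOY_GFP_FS) != 0;
}

static enum dirty_action classify_dirty(const struct dirty_ctx *c)
{
    assert(c);
    if (c->file_lru)
        return DIRTY_TAG_AND_ACTIVATE;  /* tag for reclaim, activate, move on */
    if (c->reclaim_clean)
        return DIRTY_KEEP;
    if (!toy_may_enter_fs(c->gfp))
        return DIRTY_KEEP;              /* restricted caller: no writeback here */
    if (!c->may_writepage)
        return DIRTY_KEEP;
    return DIRTY_PAGEOUT;
}
```

A caller that passes only TOY_GFP_IO gets DIRTY_KEEP for a non-file dirty folio, which is the deadlock-avoidance property the excerpt protects.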
mm/slab_common.c: Object Cache Lifecycle
The slab common layer is the allocator surface for fixed-size kernel objects. It exists because many kernel types are allocated frequently and need constructor hooks, alignment, debugging, hardened usercopy metadata, and allocator-specific backend setup.
The creation path validates user-visible and allocator-visible properties before publishing a cache. The common pattern is:
- Validate name, size, alignment, flags, usercopy range, and constructor.
- Normalize flags and reject impossible combinations.
- Under slab_mutex, try to merge with an existing compatible cache when allowed.
- Allocate and initialize a kmem_cache descriptor.
- Call allocator-specific creation hooks.
- Link the cache into global/sysfs/debugfs state only after backend creation succeeds.
The destruction path is careful because cached objects may have RCU-delayed
lifetime or in-flight frees. kmem_cache_destroy() has to flush deferred work,
handle SLAB_TYPESAFE_BY_RCU, take the right locks, decrement references,
invoke backend shutdown, warn if live objects remain, unlink external
visibility, and free the descriptor only after no allocator user can discover
it.
The conceptual contrast with the buddy allocator is important: the buddy allocator manages physical page blocks; slab manages typed object reuse on top of pages. Most kernel subsystems should not hand-roll object pools when slab can encode the object size, alignment, constructor, and debug policy centrally.
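A user-space analogue makes the object-reuse idea concrete. The sketch below is not the kernel slab API: toy_cache and its four functions are hypothetical, malloc() stands in for page-backed slabs, and the constructor runs on every allocation, whereas the kernel runs constructors when slab memory is set up. It shows validation before creation, reuse of freed objects through a free list, and a destroy path that refuses to proceed while objects are still live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space analogue of a slab cache; not the kernel API. */
struct toy_cache {
    size_t obj_size;        /* fixed object size, at least one pointer wide */
    void (*ctor)(void *);   /* optional constructor */
    void *free_list;        /* list threaded through freed objects */
    size_t live;            /* objects handed out and not yet returned */
};

static struct toy_cache *toy_cache_create(size_t size, void (*ctor)(void *))
{
    if (size == 0)
        return NULL;                      /* validate before publishing */
    struct toy_cache *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->obj_size = size < sizeof(void *) ? sizeof(void *) : size;
    c->ctor = ctor;
    c->free_list = NULL;
    c->live = 0;
    return c;
}

static void *toy_cache_alloc(struct toy_cache *c)
{
    assert(c);
    void *obj = c->free_list;
    if (obj) {
        c->free_list = *(void **)obj;     /* reuse a previously freed object */
    } else {
        obj = malloc(c->obj_size);        /* stands in for a page-backed slab */
        if (!obj)
            return NULL;
    }
    if (c->ctor)
        c->ctor(obj);                     /* toy choice: run on every allocation */
    c->live++;
    return obj;
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
    assert(c && obj && c->live > 0);
    *(void **)obj = c->free_list;         /* link the object into the free list */
    c->free_list = obj;
    c->live--;
}

/* Returns -1 and leaves the cache intact if objects are still live. */
static int toy_cache_destroy(struct toy_cache *c)
{
    assert(c);
    if (c->live != 0)
        return -1;
    while (c->free_list) {
        void *next = *(void **)c->free_list;
        free(c->free_list);
        c->free_list = next;
    }
    free(c);
    return 0;
}
```

Allocating, freeing, and allocating again returns the same address, which is the reuse behavior that distinguishes an object cache from a general-purpose allocator call per object.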
Cross-File Execution Traces
Anonymous Write Fault To A New Page
- Architecture fault code identifies the VMA and calls handle_mm_fault().
- handle_mm_fault() validates flags and routes to __handle_mm_fault().
- __handle_mm_fault() walks/allocates page-table levels and falls back to handle_pte_fault().
- handle_pte_fault() sees no PTE and calls do_pte_missing().
- do_pte_missing() identifies anonymous memory and calls do_anonymous_page().
- do_anonymous_page() allocates a folio using the page allocator, prepares anon-vma/rmap state, marks the folio uptodate, locks the PTE, revalidates the PTE is still missing, installs a writable PTE, updates MMU cache state, and unlocks.
- If allocation fails, the fault result carries VM_FAULT_OOM; if a race installed the PTE first, the handler drops its prepared state and retries or treats the race as resolved.
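The last two steps of this trace follow a prepare, lock, revalidate, install pattern. The sketch below models it with a single hypothetical toy_pte slot; toy_anon_fault(), the TOY_* outcomes, and the racer parameter (which simulates another thread installing a page before the lock is taken) are assumptions for illustration, and a boolean stands in for the page-table lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

enum { TOY_OK = 0, TOY_OOM = 1, TOY_RACED = 2 };   /* hypothetical outcomes */

struct toy_pte {
    void *page;      /* NULL means no PTE is installed */
    bool writable;
    bool locked;     /* stands in for the page-table lock */
};

/*
 * alloc can be replaced by a failing allocator; racer, if non-NULL,
 * simulates a concurrent fault that installs a page first.
 */
static int toy_anon_fault(struct toy_pte *pte, void *(*alloc)(size_t), void *racer)
{
    assert(pte && alloc);
    void *page = alloc(4096);          /* prepare outside the lock */
    if (!page)
        return TOY_OOM;

    if (racer)
        pte->page = racer;             /* the concurrent fault wins the race */

    pte->locked = true;
    if (pte->page != NULL) {           /* revalidate: is the PTE still missing? */
        pte->locked = false;
        free(page);                    /* drop the prepared state */
        return TOY_RACED;
    }
    pte->page = page;                  /* install a writable mapping */
    pte->writable = true;
    pte->locked = false;
    return TOY_OK;
}

/* Allocator that always fails, for exercising the OOM outcome. */
static void *toy_failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

The point of the ordering is that allocation, which may be slow, happens before the lock, and nothing prepared earlier is trusted until the PTE has been rechecked under the lock.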
File-Backed Mmap Read Fault
- handle_mm_fault() routes normal VMA fault handling to __handle_mm_fault().
- Missing PTE dispatch reaches do_fault() because the VMA has vm_ops.
- The file VMA’s fault callback is filemap_fault().
- filemap_fault() checks file size and returns VM_FAULT_SIGBUS if the page offset is beyond EOF.
- It searches the page cache with filemap_get_folio().
- On cache hit, it may start async readahead and lock the folio.
- On cache miss, it accounts a major fault, performs sync readahead, creates a cache folio under invalidate-lock coverage, and may allocate pages through mm/page_alloc.c.
- If IO requires dropping mmap_lock, it returns VM_FAULT_RETRY; the upper fault path must re-find the VMA.
- Once filemap_fault() returns an uptodate, locked folio (signalled by VM_FAULT_LOCKED in the result), common fault code installs the PTE.
Memory Pressure During Allocation
- Caller requests pages with GFP flags.
- __alloc_frozen_pages_noprof() derives context, tries the freelist fast path, and falls into __alloc_pages_slowpath() on miss.
- Slowpath wakes kswapd, retries with adjusted watermarks/reserves, and checks whether direct reclaim is legal.
- Direct reclaim enters vmscan with a scan_control.
- shrink_inactive_list() isolates candidate folios.
- shrink_folio_list() filters locked, unevictable, mapped, dirty, writeback, referenced, pinned, and non-IO-safe pages.
- Reclaim either frees pages, writes pages, demotes pages, activates pages, or returns them to LRU.
- The allocator retries the freelist, may compact memory, may invoke OOM, and may loop if __GFP_NOFAIL allows it.
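The control flow of this trace can be sketched with page counters in place of real free lists. All names are hypothetical (toy_zone, toy_alloc_pages(), TOY_GFP_DIRECT_RECLAIM), reclaim is reduced to moving a count from reclaimable to free_pages, and the sketch gives up when reclaim makes no progress instead of modelling compaction, OOM, or no-fail looping.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GFP_DIRECT_RECLAIM (1u << 0)   /* hypothetical: caller may reclaim */

struct toy_zone {
    unsigned long free_pages;
    unsigned long reclaimable;      /* pages reclaim could still free */
    unsigned long kswapd_wakeups;   /* times background reclaim was nudged */
};

/* Fast path: succeed only if enough pages are already free. */
static bool toy_fastpath(struct toy_zone *z, unsigned long n)
{
    if (z->free_pages < n)
        return false;
    z->free_pages -= n;
    return true;
}

/* Frees up to batch pages and returns how many were reclaimed. */
static unsigned long toy_direct_reclaim(struct toy_zone *z, unsigned long batch)
{
    unsigned long got = z->reclaimable < batch ? z->reclaimable : batch;
    z->reclaimable -= got;
    z->free_pages += got;
    return got;
}

static bool toy_alloc_pages(struct toy_zone *z, unsigned long n, unsigned int gfp)
{
    assert(z && n > 0);
    if (toy_fastpath(z, n))
        return true;

    z->kswapd_wakeups++;                   /* slow path: wake background reclaim */
    for (;;) {
        if (!(gfp & TOY_GFP_DIRECT_RECLAIM))
            return false;                  /* caller may not reclaim directly */
        if (toy_direct_reclaim(z, n) == 0)
            return false;                  /* no progress: give up in this toy */
        if (toy_fastpath(z, n))
            return true;                   /* retry the free list after reclaim */
    }
}
```

The loop terminates because each iteration either succeeds or strictly reduces reclaimable, which is a much simpler progress argument than the real slow path needs.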
Rust Translation
A Rust translation should preserve the same state boundaries:
- AddressSpace for mm_struct, with MmapReadGuard and MmapWriteGuard.
- Vma handles tied to address-space guard lifetimes.
- FaultContext for struct vm_fault.
- FaultResult bitflags or enum variants for vm_fault_t.
- PageTableWalk and PteGuard types for mutation under page-table locks.
- ZoneAllocator with order-indexed free lists and explicit AllocPolicy.
- PageCache indexed by file offset, returning revalidated pinned folios.
- ScanControl plus FolioReclaimState for reclaim.
- SlabCache<T> for typed fixed-size object allocation.
Unsafe code should be narrow and hardware-facing: page-table writes, atomic PTE updates, TLB/cache operations, and architecture-specific memory ordering. The safe layer should encode lock ownership, VMA validity, folio lock state, and retry/drop-lock outcomes in types so stale handles are hard to misuse.
AI-Native Translation
AI-native runtimes can borrow the same architecture for large context and tool memory:
- Address spaces become tenant/session memory domains.
- VMAs become typed context regions with permissions, backing store, and lazy materialization policy.
- Page faults become cache misses with typed outcomes: synthesize, fetch, clone-on-write, retry, throttle, demote, or fail.
- GFP policy becomes allocation intent: latency-sensitive, reclaimable, no-IO, no-wait, no-fail, or background.
- Reclaim becomes pressure handling for conversation context, embedding caches, tool outputs, and derived artifacts.
- Slab caches become typed pools for frequently allocated runtime objects.
The key lesson is that memory policy must be explicit. Hidden allocation and implicit cache growth make agent systems unpredictable under load; Linux keeps allocation intent, reclaim permission, lock dropping, and retry behavior visible at every important boundary.
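One way to make that policy explicit is to resolve every access through a function that takes the caller's intent and returns a typed outcome. The sketch below is purely illustrative: enum intent, enum miss_outcome, struct context_region, and resolve_access() are hypothetical names, and the precedence among outcomes is an assumption, not something the Linux design dictates.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical; they illustrate the mapping, nothing more. */
enum intent {
    INTENT_LATENCY_SENSITIVE,
    INTENT_BACKGROUND,
    INTENT_NO_IO,
    INTENT_NO_WAIT,
};

enum miss_outcome { MISS_NONE, MISS_FETCH, MISS_SYNTHESIZE, MISS_RETRY, MISS_FAIL };

struct context_region {
    bool resident;        /* content already materialized in memory */
    bool has_backing;     /* a backing store can supply the content */
    bool can_synthesize;  /* content can be recomputed without IO */
};

static enum miss_outcome resolve_access(const struct context_region *r, enum intent in)
{
    assert(r);
    if (r->resident)
        return MISS_NONE;
    if (r->can_synthesize)
        return MISS_SYNTHESIZE;   /* assumed cheapest: no IO, no waiting */
    if (!r->has_backing)
        return MISS_FAIL;
    if (in == INTENT_NO_IO)
        return MISS_FAIL;         /* caller forbade IO, like a noio context */
    if (in == INTENT_NO_WAIT)
        return MISS_RETRY;        /* fetch would block; let the caller retry */
    return MISS_FETCH;
}
```

Because the intent is a parameter rather than ambient state, a caller cannot trigger a blocking fetch without having asked for one.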
Evidence Table
| Source | Evidence |
|---|---|
| include/linux/mm_types.h | mm_struct, vm_area_struct, struct vm_fault, and vm_fault_t define the central VM state and fault result model. |
| include/linux/mm.h | Public MM APIs and VMA operation tables define the boundary used by architecture, file, and special-mapping code. |
| mm/mmap.c | do_mmap() lines 336-565 implement VMA creation policy; dup_mmap() lines 1731-1840 implements fork-time address-space copy. |
| mm/memory.c | handle_mm_fault() lines 6644-6716, __handle_mm_fault() lines 6411-6515, and handle_pte_fault() lines 6328-6408 implement core fault dispatch. |
| mm/page_alloc.c | Buddy allocator comments and code at lines 913-1019 plus allocation slowpath lines 4724-5023 show physical-page policy. |
| mm/filemap.c | Page-cache lookup lines 1862-1923, read path lines 2677-2744, mmap fault lines 3523-3704, and write path lines 4335-4415 show file-backed memory. |
| mm/vmscan.c | scan_control lines 74-180, shrink_folio_list() lines 1058-1594, and balance_pgdat() lines 7056-7290 show reclaim policy. |
| mm/slab_common.c | Cache creation and destruction paths define fixed-size object cache lifecycle. |
Source Notes
- file-notes/linux__include__linux__mm_types.h.md
- file-notes/linux__include__linux__mm.h.md
- file-notes/linux__mm__mmap.c.md
- file-notes/linux__mm__memory.c.md
- file-notes/linux__mm__page_alloc.c.md
- file-notes/linux__mm__filemap.c.md
- file-notes/linux__mm__vmscan.c.md
- file-notes/linux__mm__slab_common.c.md
- file-notes/linux__Documentation__admin-guide__mm__concepts.rst.md
- file-notes/linux__Documentation__core-api__memory-allocation.rst.md