Memory Management

Imported from _research/manual-study-linux/memory-management.md.

Status: memory-management volume verified for the core VM/page-fault, VMA, physical-page allocation, reclaim, page-cache, and slab-cache surfaces.

This volume follows the Linux memory subsystem from process address-space metadata, through VMA creation and page faults, into physical page allocation, file-backed page cache behavior, reclaim under pressure, and slab-cache object allocation. The goal is implementation fluency: what the C code does, which locks and lifetimes matter, and what a Rust or AI-native translation must preserve.

Source Surface

Primary reviewed sources:

  • include/linux/mm_types.h
  • include/linux/mm.h
  • mm/mmap.c
  • mm/memory.c
  • mm/page_alloc.c
  • mm/filemap.c
  • mm/vmscan.c
  • mm/slab_common.c
  • Documentation/admin-guide/mm/concepts.rst
  • Documentation/core-api/memory-allocation.rst

Entry Points

Memory management is not a single syscall path; it is a set of cooperating entry points.

  • Address-space layout enters through brk(), mmap_pgoff(), old mmap compatibility, munmap(), stack expansion, and fork-time dup_mmap().
  • Page faults enter the core VM through handle_mm_fault() after architecture fault code has identified a VMA and lock state.
  • Physical page allocation enters through __alloc_pages_noprof() and frees through __free_pages().
  • File-backed cache access enters through filemap_read(), generic_file_read_iter(), filemap_fault(), filemap_map_pages(), and generic_perform_write().
  • Reclaim enters through direct reclaim and memcg reclaim, while background pressure wakes kswapd() through wakeup_kswapd().
  • Slab caches are created and destroyed through __kmem_cache_create_args(), create_cache(), and kmem_cache_destroy().

Core Data Structures

Linux represents virtual memory with mm_struct, vm_area_struct, page-table levels, and fault descriptors. mm_struct owns process-level address-space state, including VMA indexing and mmap_lock. vm_area_struct describes a contiguous virtual range: permissions, flags, file backing, anonymous-memory metadata, VMA operations, optional per-VMA lock state, and tree linkage.

struct vm_fault is the page-fault work item. It carries the target VMA, allocation mask, logical page offset, faulting address, fault flags, original PTE/PMD state, page-table lock, optional COW page, and returned folio/page state. vm_fault_t is a bitmask result channel for OOM, SIGBUS/SIGSEGV, retry, fallback, completed COW, lock-dropping, and storage IO outcomes.

The physical allocator is zone and order based. Zones own buddy free lists, GFP policy drives allocation permissions, and slowpath code coordinates reclaim, compaction, reserves, and OOM decisions. The page cache uses an address-space mapping XArray to index folios by file offset. Reclaim uses struct scan_control to carry target, memcg, permissions, counters, priority, order, and GFP context. Slab allocation wraps repeated fixed-size kernel objects in named kmem_cache descriptors with alignment, flags, constructors, merge rules, and lifecycle accounting.

Control Flow

VMA Creation And Teardown

mmap_pgoff() routes mapping requests into do_mmap(). do_mmap() requires a write-held mmap_lock, validates length and page offset overflow, checks map-count limits, asks for an unmapped area, rejects MAP_FIXED_NOREPLACE collisions, validates locked mappings, separates file-backed from anonymous policy, handles MAP_NORESERVE, and calls mmap_region() to install the mapping.

Lookup is maple-tree based. find_vma_intersection() and find_vma() search mm->mm_mt; find_vma_prev() uses VMA_ITERATOR to retrieve adjacent ranges. Unmap routes through do_munmap() and do_vmi_munmap(). Fork duplicates the address space in dup_mmap(): it locks old and new mms, duplicates the maple tree, skips VM_DONTCOPY, allocates new VMA descriptors, forks anon-vma state, calls VMA open hooks, inserts file interval-tree nodes, and copies page tables.

Page Faults

handle_mm_fault() is the public core VM entry after architecture code has resolved the faulting VMA. It sanitizes flags, checks architecture access permission, enters memcg user-fault handling for user faults, chooses hugetlb or regular fault handling, and accounts the result.

__handle_mm_fault() builds struct vm_fault, walks and allocates page-table levels, tries huge PUD/PMD paths where valid, handles migration or device-private entries, and falls back to handle_pte_fault(). handle_pte_fault() dispatches:

  • missing PTEs to do_pte_missing()
  • swap entries to do_swap_page()
  • NUMA-protected PTEs to do_numa_page()
  • write or unshare faults on read-only PTEs to do_wp_page()
  • present mappings to accessed/dirty/MMU-cache update logic

Anonymous missing-PTE faults can map the shared zero page for reads. Write faults prepare anon-vma state, allocate a private folio, mark it uptodate, lock the PTE, recheck races, and install a writable entry. File-backed faults call vm_ops->fault() via __do_fault() and finish installation through finish_fault(). COW faults use do_wp_page() to choose shared-writable reuse, private anonymous reuse, or wp_page_copy() allocation and replacement.

Physical Page Allocation

Freeing enters __free_one_page(). The allocator accounts the page range, checks whether the corresponding buddy is mergeable, removes free buddies, coalesces to higher orders, and inserts the merged block into the right free list.
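
The coalescing step can be illustrated with a small userspace model. This is a minimal sketch under stated assumptions, not the kernel's code: `sketch_zone`, `sketch_buddy_pfn()`, and `sketch_free_block()` are hypothetical names, the zone is a 16-page array instead of per-order free lists, and migratetypes and accounting are left out. It only shows the buddy arithmetic and the merge-upward loop.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER 4u                    /* assumed tiny limit for the demo */
#define SKETCH_NPAGES    (1u << SKETCH_MAX_ORDER)

/* free_order[pfn] holds the order of a free block starting at pfn, or -1. */
struct sketch_zone {
    int free_order[SKETCH_NPAGES];
};

static void sketch_zone_init(struct sketch_zone *z)
{
    for (size_t i = 0; i < SKETCH_NPAGES; i++)
        z->free_order[i] = -1;
}

/* A block and its buddy differ only in bit `order` of the page frame number. */
static unsigned int sketch_buddy_pfn(unsigned int pfn, unsigned int order)
{
    return pfn ^ (1u << order);
}

/*
 * Free a block, merging upward while the buddy is free at the same order.
 * Returns the order of the block that ends up on the free list.
 */
static unsigned int sketch_free_block(struct sketch_zone *z,
                                      unsigned int pfn, unsigned int order)
{
    assert(order <= SKETCH_MAX_ORDER);
    assert(pfn < SKETCH_NPAGES && (pfn & ((1u << order) - 1)) == 0);

    while (order < SKETCH_MAX_ORDER) {
        unsigned int buddy = sketch_buddy_pfn(pfn, order);

        if (z->free_order[buddy] != (int)order)
            break;                      /* buddy busy or split: stop merging */
        z->free_order[buddy] = -1;      /* take the buddy off its free list */
        pfn &= buddy;                   /* merged block starts at the lower pfn */
        order++;
    }
    z->free_order[pfn] = (int)order;
    return order;
}
```

Freeing page 0 and then page 1 at order 0 leaves a single order-1 block at page 0; freeing the order-1 block at page 2 afterwards merges again into an order-2 block.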

Allocation enters through __alloc_pages_noprof(), which wraps __alloc_frozen_pages_noprof(). The fast path validates order, derives an allocation context from GFP flags, applies fragmentation-avoidance policy, and tries get_page_from_freelist(). If watermarks cannot be met, the slow path wakes kswapd, retries under adjusted flags, checks reserves and cpusets, performs direct reclaim and compaction when allowed, evaluates retry rules, invokes OOM handling, and implements __GFP_NOFAIL looping.

Page Cache And File-Backed Memory

filemap_get_entry() performs lockless XArray lookup under RCU, skips exceptional values, pins a folio, and reloads the slot to ensure it did not race with truncation or replacement. __filemap_get_folio_mpol() builds on that lookup with optional locking, accessed/write/stable handling, allocation, and filemap_add_folio().

Buffered reads use filemap_get_pages() to gather cache folios, trigger readahead, allocate missing folios, update stale pages, and retry on truncation races. filemap_fault() is the file-backed mmap fault path; it looks up or creates a folio, may drop mmap_lock and return a retry result, locks and reads the folio as needed, and returns a vm_fault_t. filemap_map_pages() maps already-resident folios into PTEs in batches. generic_perform_write() runs the generic buffered write loop: dirty throttling, filesystem write_begin, atomic copy from user iterator, filesystem write_end, and forward-progress handling.

Reclaim

Reclaim policy is centralized in struct scan_control. It carries the reclaim target, optional memcg, anon/file cost, may-writepage/may-unmap/may-swap permissions, memcg low-limit state, priority, order, reclaim index, GFP mask, and scan/reclaim counters.

shrink_inactive_list() isolates LRU folios, calls shrink_folio_list(), and moves survivors back. shrink_folio_list() locks folios, skips unevictable or unmappable entries, handles dirty/writeback state, checks references, demotes where possible, allocates swap for anonymous folios, unmaps, avoids pinned folios, writes dirty file pages when allowed, frees successful candidates, and returns failed candidates to LRU state.

Background reclaim runs in kswapd(). wakeup_kswapd() records pressure and wakes the daemon. balance_pgdat() chooses priority and reclaim index, ages active lists, reclaims memcg soft-limit pages, calls kswapd_shrink_node(), wakes direct reclaimers once watermarks improve, and decides when the daemon can sleep.

Slab Caches

The slab layer builds named fixed-size object caches. __kmem_cache_create_args() validates names, object sizes, flags, debugging, hardened usercopy ranges, mergeability, alignment, and cache aliasing under slab_mutex. create_cache() allocates the cache descriptor, calls allocator-specific creation hooks, sets the refcount, and links the cache globally. kmem_cache_destroy() waits for deferred RCU/free work, handles SLAB_TYPESAFE_BY_RCU, takes CPU and slab locks, decrements the refcount, shuts down allocator state, warns on live objects, unlinks sysfs/debugfs state, and releases the descriptor only when safe.

Concurrency And Lifetime

mmap_lock protects address-space layout. VMA paths distinguish read-side lookup from write-side mutation, and stack expansion can upgrade from a read lock to a write lock, mutate the VMA, and downgrade again. Page-fault handlers must treat lock-dropping results carefully: after __handle_mm_fault() returns, callers cannot assume the original VMA pointer remains valid if the lock was dropped.

Page-table mutation is protected by page-table locks and original-entry revalidation. Fault paths allocate and copy before publishing PTEs, then lock and recheck that the observed PTE still matches. File-backed faults preallocate PTE pages before taking folio locks to avoid reclaim/writeback deadlocks.

Page-cache lookup uses RCU plus folio references and slot revalidation. Reclaim isolates folios before expensive work and returns survivors to LRU lists. Allocator slow paths are constrained by caller context: reclaim, compaction, IO, reserve access, and no-fail looping are all derived from GFP state and task context.

Resource And Failure Model

The memory subsystem reports failure through typed kernel channels, not a single error code. vm_fault_t can signal OOM, SIGBUS, SIGSEGV, retry, fallback to smaller pages, storage IO, completed COW, or dropped locks. Mapping creation can fail because of invalid flags, length/offset overflow, address collision, locked-memory permission, file-mode mismatch, map-count limits, or allocation failure.

Physical allocation has a staged failure model: fast-path miss, kswapd wake, reserve retry, direct reclaim, compaction, retry decision, OOM, no-fail loop, or failure return. Reclaim has policy-limited failure: a folio may be referenced, locked elsewhere, dirty when writeback is disallowed, under writeback, pinned, unmappable, unevictable, or not worth retrying at the current priority.

Extension Points

Memory extension points are operation tables and policy flags:

  • VMA vm_ops, especially fault, provide file/special mapping behavior.
  • Filesystem address-space operations provide read_folio, write_begin, and write_end.
  • GFP flags express caller allocation policy.
  • Memcg and cpuset state constrain reclaim and allocation.
  • Slab flags, constructors, hardened usercopy ranges, and allocator-specific hooks customize object-cache behavior.
  • Architecture page-table helpers define the unsafe hardware boundary for PTE, PMD, PUD, and TLB behavior.

C Implementation Walkthrough

The C implementation is structured around explicit state, labels, bitfields, and lock-coupled helper calls rather than object-oriented wrappers.

In mm/mmap.c, do_mmap() starts at line 336. It immediately asserts the write-held mmap_lock at line 347, then validates zero length, length alignment, offset overflow, and map-count pressure at lines 349-380. It asks for an unmapped range at lines 405-410, checks MAP_FIXED_NOREPLACE intersection at lines 412-414, validates MAP_LOCKED at lines 417-422, splits file-backed from anonymous mapping logic at lines 424-543, handles MAP_NORESERVE at lines 546-558, and calls mmap_region() at line 560.

In mm/memory.c, handle_mm_fault() runs the top-level permission, memcg, hugetlb, normal-fault, and accounting flow at lines 6644-6716. __handle_mm_fault() constructs struct vm_fault and walks page-table levels at lines 6411-6515. handle_pte_fault() is the PTE dispatcher at lines 6328-6408. Anonymous fault allocation and zero-page handling are at lines 5282-5365; file fault callback dispatch is at lines 5393-5428; COW and write-protect decisions are at lines 3836-3852 and 4240-4315.

In mm/page_alloc.c, the buddy allocator is described at lines 913-934. __free_one_page() begins at line 936, merges compatible buddies at lines 954-1005, and reinserts the merged block at lines 1007-1019. Allocation slowpath behavior is concentrated in __alloc_pages_slowpath() at lines 4724-5023. The main zoned buddy entry is documented as the allocator heart at lines 5265-5267 and implemented at lines 5268-5331.

In mm/filemap.c, the lockless page-cache protocol is documented at lines 1862-1880. filemap_get_entry() performs RCU/XArray lookup and folio revalidation at lines 1882-1923. filemap_get_pages() runs buffered read batching, readahead, creation, update, and retry behavior at lines 2677-2744. filemap_fault() implements file-backed mmap fault behavior at lines 3523-3704, and generic_perform_write() runs the buffered write loop at lines 4335-4415.

In mm/vmscan.c, struct scan_control is defined at lines 74-180. shrink_folio_list() begins at line 1058, classifies folios at lines 1078-1132, implements writeback cases at lines 1143-1230, runs reference, demotion, swap, unmap, pinned, dirty-file, and writepage decisions at lines 1233-1441, and completes demotion/free/move-back handling near lines 1553-1594. balance_pgdat() runs background node balancing at lines 7056-7290; kswapd() itself runs at lines 7391-7476.

In mm/slab_common.c, cache creation validates caller-visible cache properties, allocator flags, merging, and hardened usercopy constraints before publishing a cache. Cache destruction waits for deferred work, handles RCU-safe caches, and unlinks the cache only after allocator shutdown and live-object checks.

File-By-File Implementation Analysis

This section is the deeper implementation layer. It describes each core file as code: what the important functions accept, what they mutate, where they branch, which labels are used for retry/error paths, and why the C is shaped the way it is.

include/linux/mm_types.h: State Carried Through The VM

This header is where the memory subsystem’s central nouns are declared. The important design choice is that Linux does not pass “an address” through the VM as a naked scalar. It carries the address with the VMA, original PTE/PMD state, page-table lock, fault flags, folio/page return value, and allocation policy. That is what makes page-fault code resumable after races and retryable after lock dropping.

The vm_area_struct is the unit of virtual-address policy. It answers: which range is this, what permissions does it have, is it file-backed, what callbacks does it use, and what anonymous-memory metadata is attached. The mm_struct is the address-space owner: it owns the VMA index, page tables, counters, locks, and process-level state.

struct vm_fault is the implementation hinge. It is not merely an error context; it is a mutable work packet. Fault helpers fill in fields as they walk from address-space level to page-table level:

struct vm_fault vmf = {
    .vma = vma,
    .address = address & PAGE_MASK,
    .real_address = address,
    .flags = flags,
    .pgoff = linear_page_index(vma, address),
    .gfp_mask = __get_fault_gfp_mask(vma),
};

That initializer in mm/memory.c lines 6420-6427 shows the data model in use: the raw CPU fault address is normalized to a page address, the file/anonymous offset is derived from the VMA, and the allocation mask is derived from fault context before any page-table allocation is attempted.

include/linux/mm.h: The Public VM Contract

This header exports the contracts consumed by architecture code, file systems, device mappings, GUP, and other MM users. The most important shape is the VMA operation table. File-backed memory does not hard-code ext4, tmpfs, or device behavior into mm/memory.c; it calls through VMA and address-space operation tables.

The conceptual contract is:

struct vm_operations_struct {
    void (*open)(struct vm_area_struct *area);
    void (*close)(struct vm_area_struct *area);
    vm_fault_t (*fault)(struct vm_fault *vmf);
    vm_fault_t (*map_pages)(struct vm_fault *vmf,
                            pgoff_t start_pgoff, pgoff_t end_pgoff);
    ...
};

The exact struct includes more callbacks, but these fields explain the pattern: VMA lifecycle and page-fault behavior are delegated. mm/memory.c owns common page-table installation and COW policy; filesystems/devices own “how do I materialize backing data for this VMA?”

mm/mmap.c: VMA Creation, Lookup, Fork Copy, And Unmap

do_mmap() is the core mapping constructor. It accepts an optional struct file *, requested address/length/protection/flags, derived VM flags, page offset, output population length, and optional userfaultfd list. It mutates vm_flags, may normalize addr, may rewrite pgoff, and eventually asks mmap_region() to edit the VMA tree.

The opening shape matters:

*populate = 0;
mmap_assert_write_locked(mm);
if (!len)
    return -EINVAL;
len = PAGE_ALIGN(len);
if (!len)
    return -ENOMEM;
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
    return -EOVERFLOW;
if (mm->map_count > get_sysctl_max_map_count())
    return -ENOMEM;

This is mm/mmap.c lines 345-380. The function front-loads policy checks before any tree mutation: the caller must hold the write side of mmap_lock; a zero-length mapping fails with -EINVAL; a length that wraps to zero when page-aligned fails with -ENOMEM; file offset plus length must not wrap (-EOVERFLOW); and the process cannot exceed the configured VMA count (-ENOMEM).
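
The two wraparound checks are easy to misread, so here is a minimal userspace sketch of the same ordering. It assumes 4 KiB pages and 64-bit lengths; `sketch_page_align()` and `sketch_validate_mapping()` are hypothetical helpers written for this note, and the lock assertion is left out because there is no mmap_lock in userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages. The kernel takes this from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Round len up to a page multiple; wraps to 0 when len is within a page of UINT64_MAX. */
static uint64_t sketch_page_align(uint64_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Mirrors the order of the early checks. Returns 0 or a negative errno. */
static int sketch_validate_mapping(uint64_t len, uint64_t pgoff,
                                   unsigned int map_count,
                                   unsigned int max_map_count)
{
    if (len == 0)
        return -EINVAL;                 /* zero-length request */
    len = sketch_page_align(len);
    if (len == 0)
        return -ENOMEM;                 /* alignment wrapped to zero */
    if (pgoff + (len >> SKETCH_PAGE_SHIFT) < pgoff)
        return -EOVERFLOW;              /* offset plus length wrapped */
    if (map_count > max_map_count)
        return -ENOMEM;                 /* too many mappings already */
    return 0;
}
```

A length of UINT64_MAX aligns to zero and is rejected with -ENOMEM, while a one-page mapping at page offset UINT64_MAX wraps the offset sum and is rejected with -EOVERFLOW.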

Next it derives final VMA flags and finds an address:

vm_flags |= calc_vm_prot_bits(prot, pkey) |
            calc_vm_flag_bits(file, flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
if (IS_ERR_VALUE(addr))
    return addr;
if (flags & MAP_FIXED_NOREPLACE) {
    if (find_vma_intersection(mm, addr, addr + len))
        return -EEXIST;
}

This is lines 402-414. The key point is that “protection” and “capability” are separate. VM_READ means currently readable; VM_MAYREAD means it may be made readable later. The address selection is delegated because architecture, randomization, top-down/bottom-up layout, huge pages, and file constraints all affect placement.
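
The protection/capability split can be sketched as a later protection change that is only allowed when the matching VM_MAY* bit is set. This is an illustrative model, not kernel code: the `SKETCH_VM_*` values and `sketch_change_protection()` are assumptions made for this note, with each capability bit placed four bits above its access bit so that one shift compares them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed bit layout for the sketch: capability bit = access bit << 4. */
#define SKETCH_VM_READ      0x01u
#define SKETCH_VM_WRITE     0x02u
#define SKETCH_VM_EXEC      0x04u
#define SKETCH_VM_MAYREAD   0x10u
#define SKETCH_VM_MAYWRITE  0x20u
#define SKETCH_VM_MAYEXEC   0x40u
#define SKETCH_VM_ACCESS    (SKETCH_VM_READ | SKETCH_VM_WRITE | SKETCH_VM_EXEC)

/*
 * Try to switch a mapping to `new_access`. Every requested access bit
 * needs its capability bit; otherwise the flags are left untouched.
 */
static int sketch_change_protection(unsigned int *vm_flags,
                                    unsigned int new_access)
{
    unsigned int allowed;

    assert(vm_flags != NULL);
    allowed = (*vm_flags >> 4) & SKETCH_VM_ACCESS;
    if (new_access & SKETCH_VM_ACCESS & ~allowed)
        return -EACCES;
    *vm_flags = (*vm_flags & ~SKETCH_VM_ACCESS) |
                (new_access & SKETCH_VM_ACCESS);
    return 0;
}
```

A mapping created with read access and without the write capability can later gain execute access if it has the execute capability, but a request for write access fails and leaves the flags unchanged.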

The file-backed branch enforces the backing object’s rules:

if (prot & PROT_WRITE) {
    if (!(file->f_mode & FMODE_WRITE))
        return -EACCES;
    if (IS_SWAPFILE(file->f_mapping->host))
        return -ETXTBSY;
}
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
    return -EACCES;
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
    vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

That is lines 450-466 inside the MAP_SHARED/MAP_SHARED_VALIDATE path. The C code encodes Unix semantics directly: a writable shared mapping requires a file opened for writing; a swapfile cannot be mapped shared and writable (the attempt fails with -ETXTBSY); append-only files cannot be bypassed with mmap writes; and a non-writable file can still be mapped shared, but VM_MAYWRITE and VM_SHARED are cleared so it can never become a writable shared mapping.

Anonymous mappings use different policy:

case MAP_PRIVATE:
    /*
     * Set pgoff according to addr for anon_vma.
     */
    pgoff = addr >> PAGE_SHIFT;
    break;

That is lines 535-540. Anonymous private mappings have no file offset, so Linux chooses an offset derived from the address. That value becomes part of anon-vma and reverse-mapping logic.

The end of do_mmap() is the actual mutation point:

addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
if (!IS_ERR_VALUE(addr) &&
    ((vm_flags & VM_LOCKED) ||
     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
    *populate = len;
return addr;

Lines 560-565 show the split between creating metadata and faulting/populating pages. mmap() usually creates address-space policy; it does not necessarily allocate every page immediately.

dup_mmap() is the fork-side counterpart. It has to clone the VMA tree and then make every child VMA consistent with files, anon-vma state, userfaultfd, and page tables. Its important sequence is: lock old and new address spaces, duplicate the maple tree, iterate VMAs, skip VM_DONTCOPY, duplicate the VMA descriptor, fork anon-vma metadata, call VMA open, insert file interval-tree state, then call copy_page_range(). This ordering is why fork cleanup is complicated: after any partial success, there may be tree entries, file references, anon-vma references, and copied page tables to unwind.

mm/memory.c: Fault Dispatch, Page-Table Walk, Anonymous/File/COW

handle_mm_fault() is the top-level common fault entry. Architecture code has already found the VMA and acquired either the VMA lock or mmap_lock; this function is responsible for policy, common dispatch, memcg user-fault state, and accounting.

The high-level function is short because the complexity is delegated:

ret = sanitize_fault_flags(vma, &flags);
if (ret)
    goto out;
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                               flags & FAULT_FLAG_INSTRUCTION,
                               flags & FAULT_FLAG_REMOTE)) {
    ret = VM_FAULT_SIGSEGV;
    goto out;
}
if (flags & FAULT_FLAG_USER)
    mem_cgroup_enter_user_fault();
if (unlikely(is_vm_hugetlb_page(vma)))
    ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
    ret = __handle_mm_fault(vma, address, flags);

That is from mm/memory.c lines 6661-6686. The function first rejects impossible flag combinations and architecture permission failures. Then it enters memcg OOM handling only for user faults. Then it routes either to hugetlb or normal page-table handling.

The most important lifetime warning follows:

/*
* Warning: It is no longer safe to dereference vma-> after this point,
* because mmap_lock might have been dropped by __handle_mm_fault(), so
* vma might be destroyed from underneath us.
*/

This is lines 6688-6692. The lesson generalizes across the VM: return values are not just success or failure. Some return values encode “the lock was dropped; your pointer may be stale.”

__handle_mm_fault() constructs the vm_fault, walks page-table levels, and tries huge mappings before falling back:

    pgd = pgd_offset(mm, address);
    p4d = p4d_alloc(mm, pgd, address);
    if (!p4d)
        return VM_FAULT_OOM;
    vmf.pud = pud_alloc(mm, p4d, address);
    if (!vmf.pud)
        return VM_FAULT_OOM;
    ...
    vmf.pmd = pmd_alloc(mm, vmf.pud, address);
    if (!vmf.pmd)
        return VM_FAULT_OOM;
    ...
fallback:
    return handle_pte_fault(&vmf);

This is lines 6434-6471 and 6516-6517. The walk is allocation-capable: a page fault can allocate page-table pages before it ever allocates a user page.

Huge-page handling is opportunistic. If a PUD/PMD is empty and the VMA allows a huge fault, Linux tries to create the huge mapping. If that returns VM_FAULT_FALLBACK, the code resumes at the smaller page-table level. This means huge pages are an optimization path, not a semantic requirement.

handle_pte_fault() is the dispatcher:

if (!vmf->pte)
    return do_pte_missing(vmf);
if (!pte_present(vmf->orig_pte))
    return do_swap_page(vmf);
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
    return do_numa_page(vmf);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
    update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
    goto unlock;
}
if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
    if (!pte_write(entry))
        return do_wp_page(vmf);
    else if (likely(vmf->flags & FAULT_FLAG_WRITE))
        entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);

This is lines 6378-6399. The dispatch order is exact: missing mapping, swap/migration/non-present entry, NUMA protection, then the present-PTE path, where write or unshare faults on a read-only entry go to do_wp_page() and everything else becomes an accessed/dirty update. Before modifying the PTE, it takes the page-table lock and checks pte_same() against the originally observed value. That protects against a parallel fault racing to install or change the same PTE.
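
The lock-then-revalidate step can be shown in isolation. The following is a single-threaded sketch under stated assumptions: `sketch_pt`, `sketch_install()`, and the boolean `locked` field are hypothetical stand-ins for the page table, the fault helper, and the page-table spinlock. It models only the decision "publish if the slot still holds what I observed, otherwise back off".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sketch_pte_t;

struct sketch_pt {
    bool locked;            /* stand-in for the page-table spinlock */
    sketch_pte_t slot;      /* the one entry this demo cares about */
};

enum sketch_result { SKETCH_INSTALLED, SKETCH_RACED };

/*
 * `orig` is the value read before any preparation work (allocation, copy).
 * The new entry is published only if the slot still holds `orig`.
 */
static enum sketch_result sketch_install(struct sketch_pt *pt,
                                         sketch_pte_t orig,
                                         sketch_pte_t new_entry)
{
    enum sketch_result ret = SKETCH_RACED;

    assert(!pt->locked);            /* lock is not recursive */
    pt->locked = true;
    if (pt->slot == orig) {         /* same role as the pte_same() recheck */
        pt->slot = new_entry;
        ret = SKETCH_INSTALLED;
    }
    pt->locked = false;
    return ret;
}
```

Two faults that both observed an empty slot will both prepare a new entry, but only the first to take the lock installs it; the second sees a changed slot and discards its work.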

The missing-PTE branch eventually splits anonymous from file-backed faults: anonymous read faults can map the zero page; anonymous write faults allocate a private folio; file/special mappings call the VMA fault operation. COW write faults go through do_wp_page() and may reuse an exclusive anonymous folio or allocate/copy through wp_page_copy(). The common pattern is always: prepare outside the PTE lock when possible, take the lock, revalidate the original PTE, publish the new entry, update MMU/TLB state, and release.

mm/page_alloc.c: Zoned Buddy Allocator And Slowpath Policy

__alloc_frozen_pages_noprof() is documented in the source as the heart of the zoned buddy allocator. Its structure is a classic Linux fast path plus slow path:

if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
    return NULL;
gfp &= gfp_allowed_mask;
gfp = current_gfp_context(gfp);
alloc_gfp = gfp;
if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
                         &alloc_gfp, &alloc_flags))
    return NULL;
alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
if (likely(page))
    goto out;
ac.nodemask = nodemask;
page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

This is mm/page_alloc.c lines 5276-5317. It validates order, masks GFP by system-wide allowed bits, applies scoped context like nofs/noio, prepares zone and cpuset state, tries a no-fragment fast allocation, then restores the caller nodemask and enters slowpath.

The slow path is not “try harder” as one vague step. It is a sequence of policy gates:

bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
bool can_compact = can_direct_reclaim && gfp_compaction_allowed(gfp_mask);
bool nofail = gfp_mask & __GFP_NOFAIL;
const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

Lines 4728-4731 derive the allocator’s legal actions from GFP. If direct reclaim is disallowed, the allocator must not sleep in reclaim. If compaction is disallowed, high-order allocations cannot depend on moving pages. If nofail is set, the function must loop unless the request is nonsensical.
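
The idea that the flag word fixes the set of legal recovery actions up front can be sketched as a pure function. Everything here is illustrative: the `SKETCH_GFP_*` bit values are invented, `SKETCH_COSTLY_ORDER` and `SKETCH_MAX_ORDER` are assumed constants, and tying compaction to an IO permission bit is an assumption about what gfp_compaction_allowed() checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real __GFP_* bits are defined by the kernel. */
#define SKETCH_GFP_DIRECT_RECLAIM 0x1u
#define SKETCH_GFP_IO             0x2u
#define SKETCH_GFP_NOFAIL         0x4u
#define SKETCH_COSTLY_ORDER       3u    /* assumed threshold */
#define SKETCH_MAX_ORDER          10u   /* assumed upper bound */

struct sketch_slowpath_policy {
    bool can_direct_reclaim;    /* caller may sleep in reclaim */
    bool can_compact;           /* caller may migrate pages */
    bool nofail;                /* caller cannot handle failure */
    bool costly_order;          /* request is large enough to give up early */
};

static struct sketch_slowpath_policy
sketch_derive_policy(unsigned int gfp, unsigned int order)
{
    struct sketch_slowpath_policy p;

    assert(order <= SKETCH_MAX_ORDER);
    p.can_direct_reclaim = (gfp & SKETCH_GFP_DIRECT_RECLAIM) != 0;
    /* Assumption: compaction also needs permission to start IO. */
    p.can_compact = p.can_direct_reclaim && (gfp & SKETCH_GFP_IO) != 0;
    p.nofail = (gfp & SKETCH_GFP_NOFAIL) != 0;
    p.costly_order = order > SKETCH_COSTLY_ORDER;
    return p;
}
```

Computing the policy once, before the retry loop, keeps each later branch a cheap boolean test rather than a repeated flag decode.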

The main retry loop wakes background reclaim and retries the freelist:

retry:
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

That is lines 4812-4823. It is important that kswapd is woken before expensive direct work: background reclaim might satisfy future allocations even if this allocation has to continue into slowpath.

Then it broadens policy if reserves or cpusets can be ignored:

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
    alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
                  (alloc_flags & ALLOC_KSWAPD);
if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
    ac->nodemask = NULL;
    ac->preferred_zoneref = first_zones_zonelist(...);
    if (can_retry_reserves) {
        can_retry_reserves = false;
        goto retry;
    }
}

Lines 4825-4850 show how Linux handles privileged/system allocations: it can ignore memory policy and watermarks once, then retry before doing heavier work.

If the caller cannot reclaim, the path is short:

if (!can_direct_reclaim) {
    if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
        (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }
    goto nopage;
}

Lines 4852-4865 demonstrate why GFP is semantically important: allocation from atomic or nonblocking context cannot simply block until memory appears.

The expensive part is direct reclaim and compaction:

if (!compact_first) {
    page = __alloc_pages_direct_reclaim(..., &did_some_progress);
    if (page)
        goto got_pg;
}
page = __alloc_pages_direct_compact(..., compact_priority, &compact_result);
if (page)
    goto got_pg;

Lines 4874-4886 show the two big recovery mechanisms. Reclaim frees pages; compaction moves pages to create high-order contiguous ranges. The retry logic afterward is guarded by __GFP_NORETRY, costly-order policy, should_reclaim_retry(), should_compact_retry(), cpuset/zonelist race checks, OOM handling, and nofail fallback.

The nofail tail is explicit:

if (unlikely(nofail)) {
    if (!can_direct_reclaim)
        goto fail;
    page = __alloc_pages_cpuset_fallback(gfp_mask, order,
                                         ALLOC_MIN_RESERVE, ac);
    if (page)
        goto got_pg;
    cond_resched();
    goto retry;
}

Lines 4992-5017 make clear that nofail is not magic. It still needs a context that can reclaim, it tries limited reserve fallback, yields, and retries.

mm/filemap.c: Page Cache, File Faults, And Buffered IO

The page cache is an indexed folio store rooted in address_space->i_pages. filemap_get_entry() is the primitive lookup:

    rcu_read_lock();
repeat:
    xas_reset(&xas);
    folio = xas_load(&xas);
    if (xas_retry(&xas, folio))
        goto repeat;
    if (!folio || xa_is_value(folio))
        goto out;
    if (!folio_try_get(folio))
        goto repeat;
    if (unlikely(folio != xas_reload(&xas))) {
        folio_put(folio);
        goto repeat;
    }
out:
    rcu_read_unlock();

This is mm/filemap.c lines 1899-1920. The sequence is the whole lockless protocol: load under RCU, handle retry entries, return NULL or shadow/swap value entries as-is without taking a reference, try to pin the folio, then reload the XArray slot to prove the pinned folio is still the indexed cache entry.
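
The pin-then-reload step can be sketched with C11 atomics. This is a simplified model, not the kernel protocol: a single atomic pointer stands in for the XArray slot, `sketch_folio`, `sketch_try_get()`, `sketch_put()`, and `sketch_lookup()` are hypothetical names, and RCU is omitted, so the sketch assumes the object's memory stays valid while it is inspected. Like the original, it retries when the pin fails and relies on the freeing side to clear the slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_folio {
    atomic_int refcount;    /* 0 means the folio is on its way to being freed */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool sketch_try_get(struct sketch_folio *f)
{
    int old = atomic_load(&f->refcount);

    while (old != 0) {
        /* On failure, `old` is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
            return true;
    }
    return false;
}

static void sketch_put(struct sketch_folio *f)
{
    int prev = atomic_fetch_sub(&f->refcount, 1);

    assert(prev > 0);
    (void)prev;
}

/* Return a pinned folio that is still indexed by `slot`, or NULL if empty. */
static struct sketch_folio *sketch_lookup(_Atomic(struct sketch_folio *) *slot)
{
    for (;;) {
        struct sketch_folio *f = atomic_load(slot);

        if (f == NULL)
            return NULL;
        if (!sketch_try_get(f))
            continue;               /* being freed: look again */
        if (f == atomic_load(slot))
            return f;               /* pinned and still the indexed entry */
        sketch_put(f);              /* slot changed under us: drop and retry */
    }
}
```

The reload matters because the pin can succeed on an object that was replaced in the slot between the first load and the increment; without the second load the caller could hold a valid reference to the wrong entry.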

__filemap_get_folio_mpol() layers policy on top. If FGP_LOCK is requested, it locks the folio and then verifies it was not truncated out of the mapping:

if (fgp_flags & FGP_LOCK) {
    ...
    if (unlikely(folio->mapping != mapping)) {
        folio_unlock(folio);
        folio_put(folio);
        goto repeat;
    }
    VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
}

Lines 1954-1970 show the same pattern as PTE faults: take reference/lock, then revalidate against the mapping because truncation can race with lookup.

The create path uses folio order policy and falls back to smaller orders:

do {
    gfp_t alloc_gfp = gfp;
    err = -ENOMEM;
    if (order > min_order)
        alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
    folio = filemap_alloc_folio(alloc_gfp, order, policy);
    if (!folio)
        continue;
    err = filemap_add_folio(mapping, folio, index, gfp);
    if (!err)
        break;
    folio_put(folio);
    folio = NULL;
} while (order-- > min_order);

Lines 2007-2028 show large-folio optimism without making high-order allocation mandatory. If the big folio cannot be allocated or inserted, the loop can try a smaller order.
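
The loop shape is worth isolating, because the post-decrement in the condition is what bounds the fallback at min_order. Here is a minimal sketch with a pluggable allocator; `sketch_alloc_fn`, `sketch_alloc_up_to()`, and `sketch_alloc_with_fallback()` are hypothetical names, and the allocator is reduced to a yes/no answer per order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical allocator hook: true if a block of 2^order pages is available. */
typedef bool (*sketch_alloc_fn)(unsigned int order, void *ctx);

/* Example policy: succeed only for orders up to the limit stored in ctx. */
static bool sketch_alloc_up_to(unsigned int order, void *ctx)
{
    const unsigned int *largest = ctx;

    return order <= *largest;
}

/*
 * Try `order` first, then each smaller order down to `min_order`.
 * Returns the order that succeeded, or -1 if every attempt failed.
 */
static int sketch_alloc_with_fallback(unsigned int order,
                                      unsigned int min_order,
                                      sketch_alloc_fn alloc, void *ctx)
{
    assert(order >= min_order);
    do {
        if (alloc(order, ctx))
            return (int)order;
        /* compare first, then decrement: min_order is the last order tried */
    } while (order-- > min_order);
    return -1;
}
```

With an allocator that can only supply order 2 or smaller, a request for order 4 with a minimum of 0 settles on order 2, while the same request with a minimum of 3 fails after trying orders 4 and 3.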

filemap_fault() is the mmap fault implementation for ordinary files. It first checks file size:

max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(index >= max_idx))
    return VM_FAULT_SIGBUS;

Lines 3557-3559 show why touching a mapping past EOF raises SIGBUS: the VMA may still cover the range, but the file has no data at that page index, for example after truncation.
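
A worked example makes the boundary concrete. This sketch assumes 4 KiB pages; `sketch_div_round_up()` and `sketch_fault_is_sigbus()` are hypothetical helpers that restate the check above for plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Ceiling division without the overflow risk of (n + d - 1) / d. */
static uint64_t sketch_div_round_up(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return n / d + (n % d != 0);
}

/* True if a fault at page `index` lies past the last page holding file data. */
static bool sketch_fault_is_sigbus(uint64_t i_size, uint64_t index)
{
    uint64_t max_idx = sketch_div_round_up(i_size, SKETCH_PAGE_SIZE);

    return index >= max_idx;
}
```

A 4097-byte file occupies two pages, so faults at index 0 and 1 proceed and index 2 is rejected. An empty file has max_idx 0, so even index 0 is rejected.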

Then it tries the cache:

folio = filemap_get_folio(mapping, index);
if (likely(!IS_ERR(folio))) {
    if (!(vmf->flags & FAULT_FLAG_TRIED))
        fpin = do_async_mmap_readahead(vmf, folio);
    if (unlikely(!folio_test_uptodate(folio))) {
        filemap_invalidate_lock_shared(mapping);
        mapping_locked = true;
    }
} else {
    count_vm_event(PGMAJFAULT);
    ret = VM_FAULT_MAJOR;
    fpin = do_sync_mmap_readahead(vmf);
    ...
    folio = __filemap_get_folio(mapping, index,
                                FGP_CREAT|FGP_FOR_MMAP,
                                vmf->gfp_mask);
}

This is lines 3566-3600. Cache hit: maybe async readahead. Cache miss: major fault accounting, synchronous mmap readahead, invalidate-lock coverage, and folio creation.

The lock/drop-retry path is the subtle part:

    if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
        goto out_retry;
    ...
    if (fpin) {
        folio_unlock(folio);
        goto out_retry;
    }
    ...
out_retry:
    if (!IS_ERR(folio))
        folio_put(folio);
    if (mapping_locked)
        filemap_invalidate_unlock_shared(mapping);
    if (fpin)
        fput(fpin);
    return ret | VM_FAULT_RETRY;

Lines 3608-3609, 3650-3652, and 3690-3702 explain the contract mentioned in handle_mm_fault(): file IO may require dropping mmap_lock; when that happens, the upper fault handler must re-find the VMA and retry.
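
The caller's side of that contract can be modelled in a few lines. This is a sketch under stated assumptions rather than any architecture's fault handler: the `SKETCH_*` bit values, `sketch_find_vma()`, `sketch_handle_fault()`, and `sketch_fault()` are all hypothetical, and the handler simply asks for one retry to stand in for a fault that had to drop the lock for IO.

```c
#include <assert.h>

/* Hypothetical bit values for the sketch. */
#define SKETCH_VM_FAULT_MAJOR 0x1u
#define SKETCH_VM_FAULT_RETRY 0x2u
#define SKETCH_FLAG_TRIED     0x1u

struct sketch_vma { int unused; };

struct sketch_mm {
    struct sketch_vma vma;
    int lookups;            /* how many times the VMA was looked up */
};

static struct sketch_vma *sketch_find_vma(struct sketch_mm *mm)
{
    mm->lookups++;
    return &mm->vma;
}

/* First attempt needs IO: report a major fault and ask the caller to retry. */
static unsigned int sketch_handle_fault(struct sketch_vma *vma,
                                        unsigned int flags)
{
    assert(vma != 0);
    if (!(flags & SKETCH_FLAG_TRIED))
        return SKETCH_VM_FAULT_MAJOR | SKETCH_VM_FAULT_RETRY;
    return 0;
}

static unsigned int sketch_fault(struct sketch_mm *mm)
{
    unsigned int flags = 0, major = 0;

    for (;;) {
        /* Look the VMA up on every pass; never reuse it across a retry. */
        struct sketch_vma *vma = sketch_find_vma(mm);
        unsigned int ret = sketch_handle_fault(vma, flags);

        major |= ret & SKETCH_VM_FAULT_MAJOR;
        if (!(ret & SKETCH_VM_FAULT_RETRY))
            return ret | major;
        flags |= SKETCH_FLAG_TRIED;     /* second pass must not retry again */
    }
}
```

The model performs two lookups for one fault and carries the major-fault bit across the retry, which is the accounting a caller would want to preserve.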

mm/vmscan.c: Reclaim Decision Engine

shrink_folio_list() is a long function because reclaim has to decide what is legal and profitable for every folio. The opening sets up private lists and policy:

struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
unsigned int nr_reclaimed = 0, nr_demoted = 0;
...
do_demote_pass = can_demote(pgdat->node_id, sc, memcg);

This is mm/vmscan.c lines 1065-1076. Reclaim does not immediately free every candidate; it separates folios to free, folios to return, and folios to demote to another memory tier.

The main loop isolates and locks one folio at a time:

folio = lru_to_folio(folio_list);
list_del(&folio->lru);
if (!folio_trylock(folio))
    goto keep;
nr_pages = folio_nr_pages(folio);
sc->nr_scanned += nr_pages;
if (unlikely(!folio_evictable(folio)))
    goto activate_locked;
if (!sc->may_unmap && folio_mapped(folio))
    goto keep_locked;

Lines 1088-1120 show the first filter: if reclaim cannot take the folio lock, it keeps the folio; if the folio is unevictable, it takes the activate path; if the folio is mapped and this reclaim context may not unmap, it is kept.
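The ordering of these checks matters, including the detail that scan accounting happens only after the lock is taken. A minimal Rust model of the excerpt, with hypothetical names (`FolioState`, `FirstFilter`, `first_filter`):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FirstFilter {
    Keep,           // lock not available: put the folio back untouched
    ActivateLocked, // unevictable: leave the reclaim candidate path
    KeepLocked,     // mapped, and this context may not unmap
    Continue,       // passed the first filter
}

#[derive(Debug, Clone, Copy)]
struct FolioState {
    lock_available: bool,
    evictable: bool,
    mapped: bool,
    nr_pages: u64,
}

/// Mirrors the check order of the excerpt; `nr_scanned` is only
/// charged once the folio has been locked.
fn first_filter(folio: &FolioState, may_unmap: bool, nr_scanned: &mut u64) -> FirstFilter {
    if !folio.lock_available {
        return FirstFilter::Keep;
    }
    *nr_scanned += folio.nr_pages;
    if !folio.evictable {
        return FirstFilter::ActivateLocked;
    }
    if !may_unmap && folio.mapped {
        return FirstFilter::KeepLocked;
    }
    FirstFilter::Continue
}
```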

Writeback handling is deliberately conservative. The source comments describe three cases, and the code reflects them:

if (folio_test_writeback(folio)) {
    mapping = folio_mapping(folio);
    if (current_is_kswapd() &&
            folio_test_reclaim(folio) &&
            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
        stat->nr_immediate += nr_pages;
        goto activate_locked;
    } else if (writeback_throttling_sane(sc) ||
            !folio_test_reclaim(folio) ||
            !may_enter_fs(folio, sc->gfp_mask) ||
            (mapping &&
             mapping_writeback_may_deadlock_on_reclaim(mapping))) {
        folio_set_reclaim(folio);
        stat->nr_writeback += nr_pages;
        goto activate_locked;
    } else {
        folio_unlock(folio);
        folio_wait_writeback(folio);
        list_add_tail(&folio->lru, folio_list);
        continue;
    }
}

This excerpt spans lines 1189-1230. Reclaim avoids indefinite stalls and filesystem deadlocks. Sometimes it marks the folio for immediate reclaim later; sometimes legacy memcg waits for writeback; often it activates the folio and keeps scanning for cheaper victims.
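The three branches reduce to a classification over a handful of booleans. The Rust sketch below restates the conditions from the excerpt; the struct and enum names are hypothetical, and each field stands in for the C predicate named in its comment.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum WritebackAction {
    CountImmediate, // first branch: count as immediate, activate
    TagAndActivate, // second branch: set reclaim flag, count, activate
    WaitAndRequeue, // third branch: unlock, wait, put back on the list
}

#[derive(Debug, Clone, Copy)]
struct WritebackCtx {
    is_kswapd: bool,            // current_is_kswapd()
    folio_reclaim_flag: bool,   // folio_test_reclaim()
    node_writeback_flag: bool,  // PGDAT_WRITEBACK set on the node
    throttling_sane: bool,      // writeback_throttling_sane()
    may_enter_fs: bool,         // may_enter_fs()
    mapping_may_deadlock: bool, // mapping_writeback_may_deadlock_on_reclaim()
}

fn classify_writeback(c: &WritebackCtx) -> WritebackAction {
    if c.is_kswapd && c.folio_reclaim_flag && c.node_writeback_flag {
        WritebackAction::CountImmediate
    } else if c.throttling_sane
        || !c.folio_reclaim_flag
        || !c.may_enter_fs
        || c.mapping_may_deadlock
    {
        WritebackAction::TagAndActivate
    } else {
        WritebackAction::WaitAndRequeue
    }
}
```

Only a context that has seen the folio before, may enter the filesystem, and lacks sane throttling ends up waiting, which is why the wait branch is the rare one.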

Reference checking decides whether the folio is still part of the working set:

if (!ignore_references)
    references = folio_check_references(folio, sc);
switch (references) {
case FOLIOREF_ACTIVATE:
    goto activate_locked;
case FOLIOREF_KEEP:
    stat->nr_ref_keep += nr_pages;
    goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
    ; /* try to reclaim the folio below */
}

Lines 1233-1245 show the “second chance” behavior. Reclaim is not only about freeing memory; it protects recently used pages from being discarded.
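In a Rust model the switch becomes an exhaustive match. The sketch below uses hypothetical names (`FolioRef`, `NextStep`, `next_step`) and makes two assumptions explicit: a skipped check is treated as plain "reclaim", and the clean-only variant is carried forward so that a later dirty check can refuse writeback for it.

```rust
/// Mirrors the FOLIOREF_* names in the excerpt.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FolioRef {
    Activate,
    Keep,
    Reclaim,
    ReclaimClean,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum NextStep {
    ActivateLocked,
    KeepLocked,
    TryReclaim { clean_only: bool },
}

/// `checked` is None when reference checking is skipped
/// (the ignore_references case); assumed to default to Reclaim.
fn next_step(checked: Option<FolioRef>, nr_pages: u64, nr_ref_keep: &mut u64) -> NextStep {
    match checked.unwrap_or(FolioRef::Reclaim) {
        FolioRef::Activate => NextStep::ActivateLocked,
        FolioRef::Keep => {
            *nr_ref_keep += nr_pages;
            NextStep::KeepLocked
        }
        FolioRef::Reclaim => NextStep::TryReclaim { clean_only: false },
        FolioRef::ReclaimClean => NextStep::TryReclaim { clean_only: true },
    }
}
```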

Anonymous pages need swap before reclaim:

if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
        !folio_test_swapcache(folio)) {
    if (!(sc->gfp_mask & __GFP_IO))
        goto keep_locked;
    if (folio_maybe_dma_pinned(folio))
        goto keep_locked;
    ...
    if (folio_alloc_swap(folio)) {
        ...
        goto activate_locked_split;
    }
    folio_mark_dirty(folio);
}

Lines 1263-1315 show that reclaim cannot simply drop anonymous memory. It must secure swap backing, avoid pinned memory, handle large folios, and mark the folio dirty so that an MADV_FREE race cannot leave it unwritten and corrupt data on a later swap-in.
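The precondition chain can be modeled so that the expensive step, allocating swap, runs only after the cheap refusals. This Rust sketch uses hypothetical names (`AnonFolio`, `SwapPrep`, `prepare_swap`) and omits the large-folio handling that the excerpt elides.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SwapPrep {
    NotApplicable,  // not an anonymous swap-backed folio outside swap cache
    KeepLocked,     // IO not allowed, or the folio may be DMA-pinned
    ActivateLocked, // swap allocation failed
    ReadyDirty,     // swap secured; folio marked dirty for writeout
}

#[derive(Debug, Clone, Copy)]
struct AnonFolio {
    anon: bool,
    swapbacked: bool,
    in_swapcache: bool,
    maybe_dma_pinned: bool,
}

/// `alloc_swap` is a hypothetical callback returning true on success.
/// It is only invoked once every cheaper check has passed.
fn prepare_swap(
    folio: &AnonFolio,
    io_allowed: bool,
    alloc_swap: impl FnOnce() -> bool,
) -> SwapPrep {
    if !(folio.anon && folio.swapbacked && !folio.in_swapcache) {
        return SwapPrep::NotApplicable;
    }
    if !io_allowed || folio.maybe_dma_pinned {
        return SwapPrep::KeepLocked;
    }
    if !alloc_swap() {
        return SwapPrep::ActivateLocked;
    }
    SwapPrep::ReadyDirty
}
```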

Mapped folios must be unmapped before freeing:

if (folio_mapped(folio)) {
    enum ttu_flags flags = TTU_BATCH_FLUSH;
    bool was_swapbacked = folio_test_swapbacked(folio);
    if (folio_test_pmd_mappable(folio))
        flags |= TTU_SPLIT_HUGE_PMD;
    if (folio_test_large(folio))
        flags |= TTU_SYNC;
    try_to_unmap(folio, flags);
    if (folio_mapped(folio)) {
        stat->nr_unmap_fail += nr_pages;
        ...
        goto activate_locked;
    }
}

Lines 1331-1359 show reverse-mapping in action. Reclaim asks every mapping of that folio to remove its PTEs. Large folios add synchronization because partial PTE races can leave subpages mapped.

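Dirty folios reach pageout() only when the reclaim context allows it, and in this excerpt dirty file-LRU folios never reach it: they are tagged for immediate reclaim and sent down the activate path instead.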

if (folio_test_dirty(folio)) {
    if (folio_is_file_lru(folio)) {
        node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE, nr_pages);
        if (!folio_test_reclaim(folio))
            folio_set_reclaim(folio);
        goto activate_locked;
    }
    if (references == FOLIOREF_RECLAIM_CLEAN)
        goto keep_locked;
    if (!may_enter_fs(folio, sc->gfp_mask))
        goto keep_locked;
    if (!sc->may_writepage)
        goto keep_locked;
    try_to_unmap_flush_dirty();
    switch (pageout(folio, mapping, &plug, folio_list)) {
    ...
    }
}

Lines 1372-1441 show that IO permission is a first-class part of reclaim. A nofs/noio allocation context must not deadlock by entering filesystem writeback from reclaim.
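The permission checks form a short-circuit chain in front of the pageout call. A Rust model of that chain, with hypothetical names (`DirtyCtx`, `DirtyAction`, `classify_dirty`) and the pageout result handling left out because the excerpt elides it:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DirtyAction {
    NotDirty,
    TagAndActivate, // dirty file-LRU folio: tag for reclaim, no pageout here
    KeepLocked,     // a permission check refused writeout
    Pageout,        // all checks passed
}

#[derive(Debug, Clone, Copy)]
struct DirtyCtx {
    dirty: bool,
    file_lru: bool,
    reclaim_clean_only: bool, // references == FOLIOREF_RECLAIM_CLEAN
    may_enter_fs: bool,
    may_writepage: bool,
}

fn classify_dirty(c: &DirtyCtx) -> DirtyAction {
    if !c.dirty {
        return DirtyAction::NotDirty;
    }
    if c.file_lru {
        return DirtyAction::TagAndActivate;
    }
    // Any single refusal keeps the folio; order follows the excerpt.
    if c.reclaim_clean_only || !c.may_enter_fs || !c.may_writepage {
        return DirtyAction::KeepLocked;
    }
    DirtyAction::Pageout
}
```

Modeling the permissions as plain inputs makes it clear that a no-IO context can still scan dirty folios; it just cannot act on them.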

mm/slab_common.c: Object Cache Lifecycle

The slab common layer is the allocator surface for fixed-size kernel objects. It exists because many kernel types are allocated frequently and need constructor hooks, alignment, debugging, hardened usercopy metadata, and allocator-specific backend setup.

The creation path validates user-visible and allocator-visible properties before publishing a cache. The common pattern is:

  1. Validate name, size, alignment, flags, usercopy range, and constructor.
  2. Normalize flags and reject impossible combinations.
  3. Under slab_mutex, try to merge with an existing compatible cache when allowed.
  4. Allocate and initialize a kmem_cache descriptor.
  5. Call allocator-specific creation hooks.
  6. Link the cache into global/sysfs/debugfs state only after backend creation succeeds.
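The six steps above can be sketched as a small registry in Rust. This is a simplified model, not the slab implementation: `SlabRegistry`, `CacheDesc`, and `CreateError` are hypothetical, the merge rule (same size and alignment) is a deliberate simplification, and a `Mutex` stands in for slab_mutex.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
struct CacheDesc {
    name: String,
    size: usize,
    align: usize,
    refcount: u32,
}

#[derive(Debug, PartialEq)]
enum CreateError {
    InvalidArgs,
    BackendFailed,
}

struct SlabRegistry {
    caches: Mutex<Vec<CacheDesc>>, // stands in for the list under slab_mutex
}

impl SlabRegistry {
    fn new() -> Self {
        SlabRegistry { caches: Mutex::new(Vec::new()) }
    }

    /// Returns the index of the created or merged cache.
    /// `backend_init` is a hypothetical allocator-specific hook.
    fn create(
        &self,
        name: &str,
        size: usize,
        align: usize,
        mergeable: bool,
        backend_init: impl FnOnce(&CacheDesc) -> bool,
    ) -> Result<usize, CreateError> {
        // Steps 1-2: validate before touching shared state.
        if name.is_empty() || size == 0 || !align.is_power_of_two() {
            return Err(CreateError::InvalidArgs);
        }
        let mut caches = self.caches.lock().unwrap();
        // Step 3: merge with a compatible cache when allowed.
        if mergeable {
            if let Some(i) = caches.iter().position(|c| c.size == size && c.align == align) {
                caches[i].refcount += 1;
                return Ok(i);
            }
        }
        // Steps 4-5: build the descriptor, then run the backend hook.
        let desc = CacheDesc { name: name.to_string(), size, align, refcount: 1 };
        if !backend_init(&desc) {
            return Err(CreateError::BackendFailed);
        }
        // Step 6: publish only after the backend succeeded.
        caches.push(desc);
        Ok(caches.len() - 1)
    }

    fn len(&self) -> usize {
        self.caches.lock().unwrap().len()
    }
}
```

Because the descriptor is pushed last, a failed backend hook leaves no half-initialized cache visible to other users of the registry.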

The destruction path is careful because cached objects may have RCU-delayed lifetime or in-flight frees. kmem_cache_destroy() has to flush deferred work, handle SLAB_TYPESAFE_BY_RCU, take the right locks, decrement references, invoke backend shutdown, warn if live objects remain, unlink external visibility, and free the descriptor only after no allocator user can discover it.

The conceptual contrast with the buddy allocator is important: the buddy allocator manages physical page blocks; slab manages typed object reuse on top of pages. Most kernel subsystems should not hand-roll object pools when slab can encode the object size, alignment, constructor, and debug policy centrally.

Cross-File Execution Traces

Anonymous Write Fault To A New Page

  1. Architecture fault code identifies the VMA and calls handle_mm_fault().
  2. handle_mm_fault() validates flags and routes to __handle_mm_fault().
  3. __handle_mm_fault() walks/allocates page-table levels and falls back to handle_pte_fault().
  4. handle_pte_fault() sees no PTE and calls do_pte_missing().
  5. do_pte_missing() identifies anonymous memory and calls do_anonymous_page().
  6. do_anonymous_page() allocates a folio using the page allocator, prepares anon-vma/rmap state, marks the folio uptodate, locks the PTE, revalidates the PTE is still missing, installs a writable PTE, updates MMU cache state, and unlocks.
  7. If allocation fails, the fault result carries VM_FAULT_OOM; if a race installed the PTE first, the handler drops its prepared state and retries or treats the race as resolved.
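Steps 6 and 7 hinge on one pattern: allocate first, then take the PTE lock and revalidate before installing. A minimal Rust sketch of that pattern, where `PteSlot`, `AnonFault`, and `anon_write_fault` are hypothetical and a `Mutex<Option<u64>>` stands in for one page-table entry under its lock:

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
enum AnonFault {
    Installed,
    RaceResolved, // another thread installed the entry first
    Oom,
}

/// One page-table slot; None models a missing PTE, Some(frame) a mapping.
struct PteSlot(Mutex<Option<u64>>);

impl PteSlot {
    fn new() -> Self {
        PteSlot(Mutex::new(None))
    }
    fn current(&self) -> Option<u64> {
        *self.0.lock().unwrap()
    }
}

/// `alloc_frame` is a hypothetical allocator callback; None means failure.
fn anon_write_fault(slot: &PteSlot, alloc_frame: impl FnOnce() -> Option<u64>) -> AnonFault {
    // Allocate before locking: allocation may block, the lock must not.
    let frame = match alloc_frame() {
        Some(frame) => frame,
        None => return AnonFault::Oom,
    };
    let mut pte = slot.0.lock().unwrap();
    // Revalidate under the lock; the earlier "missing" observation is stale.
    if pte.is_some() {
        return AnonFault::RaceResolved; // prepared frame is dropped here
    }
    *pte = Some(frame);
    AnonFault::Installed
}
```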

File-Backed Mmap Read Fault

  1. handle_mm_fault() routes normal VMA fault handling to __handle_mm_fault().
  2. Missing PTE dispatch reaches do_fault() because the VMA has vm_ops.
  3. The file VMA’s fault callback is filemap_fault().
  4. filemap_fault() checks file size and returns VM_FAULT_SIGBUS if the page offset is beyond EOF.
  5. It searches the page cache with filemap_get_folio().
  6. On cache hit, it may start async readahead and lock the folio.
  7. On cache miss, it accounts a major fault, performs sync readahead, creates a cache folio under invalidate-lock coverage, and may allocate pages through mm/page_alloc.c.
  8. If IO requires dropping mmap_lock, it returns VM_FAULT_RETRY; the upper fault path must re-find the VMA.
  9. Once the folio is uptodate and locked, filemap_fault() returns it with VM_FAULT_LOCKED set, and common fault code installs the PTE and unlocks the folio.

Memory Pressure During Allocation

  1. Caller requests pages with GFP flags.
  2. __alloc_frozen_pages_noprof() derives context, tries the freelist fast path, and falls into __alloc_pages_slowpath() on miss.
  3. Slowpath wakes kswapd, retries adjusted watermarks/reserves, and checks whether direct reclaim is legal.
  4. Direct reclaim enters vmscan with a scan_control.
  5. shrink_inactive_list() isolates candidate folios.
  6. shrink_folio_list() filters locked, unevictable, mapped, dirty, writeback, referenced, pinned, and non-IO-safe pages.
  7. Reclaim either frees pages, writes pages, demotes pages, activates pages, or returns them to LRU.
  8. The allocator retries the freelist, may compact memory, may invoke OOM, and may loop if __GFP_NOFAIL allows it.
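The control flow of this trace can be modeled as a loop over a toy zone. The Rust sketch below is illustrative only: `SimZone`, `AllocPolicy`, and `alloc_page` are hypothetical, compaction and OOM handling are left out, and reclaim is reduced to moving up to eight pages from a reclaimable pool to the free pool.

```rust
#[derive(Debug, Clone, Copy)]
struct AllocPolicy {
    may_direct_reclaim: bool,
    no_fail: bool, // models __GFP_NOFAIL: keep looping past max_retries
    max_retries: u32,
}

struct SimZone {
    free_pages: u64,
    reclaimable: u64,
    kswapd_wakeups: u32,
}

impl SimZone {
    fn try_freelist(&mut self) -> bool {
        if self.free_pages > 0 {
            self.free_pages -= 1;
            true
        } else {
            false
        }
    }

    /// Toy direct reclaim: frees at most 8 reclaimable pages per pass.
    fn direct_reclaim(&mut self) {
        let got = self.reclaimable.min(8);
        self.reclaimable -= got;
        self.free_pages += got;
    }
}

#[derive(Debug, PartialEq)]
enum AllocResult {
    FastPath,
    SlowPath { reclaim_passes: u32 },
    Failed,
}

fn alloc_page(zone: &mut SimZone, policy: AllocPolicy) -> AllocResult {
    if zone.try_freelist() {
        return AllocResult::FastPath;
    }
    zone.kswapd_wakeups += 1; // slowpath begins by waking background reclaim
    let mut passes = 0;
    loop {
        if zone.try_freelist() {
            return AllocResult::SlowPath { reclaim_passes: passes };
        }
        if !policy.may_direct_reclaim {
            return AllocResult::Failed;
        }
        if passes >= policy.max_retries && !policy.no_fail {
            return AllocResult::Failed;
        }
        zone.direct_reclaim();
        passes += 1;
    }
}
```

As in the trace, a no-fail policy on a zone with nothing left to reclaim would loop indefinitely in this model, so callers of the sketch should only combine `no_fail` with a zone that can make progress.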

Rust Translation

A Rust translation should preserve the same state boundaries:

  • AddressSpace for mm_struct, with MmapReadGuard and MmapWriteGuard.
  • Vma handles tied to address-space guard lifetimes.
  • FaultContext for struct vm_fault.
  • FaultResult bitflags or enum variants for vm_fault_t.
  • PageTableWalk and PteGuard types for mutation under page-table locks.
  • ZoneAllocator with order-indexed free lists and explicit AllocPolicy.
  • PageCache indexed by file offset, returning revalidated pinned folios.
  • ScanControl plus FolioReclaimState for reclaim.
  • SlabCache<T> for typed fixed-size object allocation.

Unsafe code should be narrow and hardware-facing: page-table writes, atomic PTE updates, TLB/cache operations, and architecture-specific memory ordering. The safe layer should encode lock ownership, VMA validity, folio lock state, and retry/drop-lock outcomes in types so stale handles are hard to misuse.
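One way the guard-lifetime idea could look in safe Rust is sketched below. It is an assumption-laden model rather than a proposed kernel API: `std::sync::RwLock` stands in for mmap_lock, `VmaInfo` is a hypothetical plain-data VMA, and only the read side is shown. The borrow checker ties every `Vma` handle to the guard that produced it, so a handle cannot outlive the lock.

```rust
use std::sync::{RwLock, RwLockReadGuard};

#[derive(Debug, Clone, Copy, PartialEq)]
struct VmaInfo {
    start: u64,
    end: u64, // exclusive
    writable: bool,
}

struct AddressSpace {
    vmas: RwLock<Vec<VmaInfo>>,
}

/// Holding this guard is the proof that the address space is read-locked.
struct MmapReadGuard<'a> {
    inner: RwLockReadGuard<'a, Vec<VmaInfo>>,
}

/// A VMA handle that borrows from the guard and cannot outlive it.
struct Vma<'g> {
    info: &'g VmaInfo,
}

impl AddressSpace {
    fn new(vmas: Vec<VmaInfo>) -> Self {
        AddressSpace { vmas: RwLock::new(vmas) }
    }

    fn read(&self) -> MmapReadGuard<'_> {
        MmapReadGuard { inner: self.vmas.read().unwrap() }
    }

    /// Write-side mutation, kept minimal for the sketch.
    fn map(&self, vma: VmaInfo) {
        self.vmas.write().unwrap().push(vma);
    }
}

impl<'a> MmapReadGuard<'a> {
    fn find_vma(&self, addr: u64) -> Option<Vma<'_>> {
        self.inner
            .iter()
            .find(|v| v.start <= addr && addr < v.end)
            .map(|info| Vma { info })
    }
}

impl Vma<'_> {
    fn permits_write(&self) -> bool {
        self.info.writable
    }
}
```

With this shape, a fault path that must drop the lock has to drop the guard, which statically invalidates every `Vma` it handed out and forces a fresh `find_vma` after the lock is retaken.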

AI-Native Translation

AI-native runtimes can borrow the same architecture for large context and tool memory:

  • Address spaces become tenant/session memory domains.
  • VMAs become typed context regions with permissions, backing store, and lazy materialization policy.
  • Page faults become cache misses with typed outcomes: synthesize, fetch, clone-on-write, retry, throttle, demote, or fail.
  • GFP policy becomes allocation intent: latency-sensitive, reclaimable, no-IO, no-wait, no-fail, or background.
  • Reclaim becomes pressure handling for conversation context, embedding caches, tool outputs, and derived artifacts.
  • Slab caches become typed pools for frequently allocated runtime objects.

The key lesson is that memory policy must be explicit. Hidden allocation and implicit cache growth make agent systems unpredictable under load; Linux keeps allocation intent, reclaim permission, lock dropping, and retry behavior visible at every important boundary.
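As one illustration of explicit policy in such a runtime, the Rust sketch below models a byte-budgeted context cache where the caller states whether an insert may trigger reclaim. Everything here is hypothetical (`ContextCache`, `Intent`, `Admit`, the least-recently-used victim rule); it is meant to show the shape of the idea, not a recommended design.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Intent {
    MayReclaim,
    NoReclaim, // analogous to a no-wait allocation: fail instead of evicting
}

#[derive(Debug, Clone, PartialEq)]
struct Entry {
    key: String,
    bytes: usize,
    reclaimable: bool,
    last_used: u64,
}

fn entry(key: &str, bytes: usize, reclaimable: bool, last_used: u64) -> Entry {
    Entry { key: key.to_string(), bytes, reclaimable, last_used }
}

#[derive(Debug, PartialEq)]
enum Admit {
    Stored { evicted: usize },
    Rejected,
}

struct ContextCache {
    budget: usize,
    used: usize,
    entries: Vec<Entry>,
}

impl ContextCache {
    fn new(budget: usize) -> Self {
        ContextCache { budget, used: 0, entries: Vec::new() }
    }

    fn insert(&mut self, new: Entry, intent: Intent) -> Admit {
        let reclaimable: usize =
            self.entries.iter().filter(|e| e.reclaimable).map(|e| e.bytes).sum();
        let over_budget = self.used + new.bytes > self.budget;
        // Decide up front so a rejected insert never evicts anything.
        if over_budget
            && (intent == Intent::NoReclaim
                || self.used - reclaimable + new.bytes > self.budget)
        {
            return Admit::Rejected;
        }
        let mut evicted = 0;
        while self.used + new.bytes > self.budget {
            // Victim: the least recently used reclaimable entry.
            let victim = self
                .entries
                .iter()
                .enumerate()
                .filter(|(_, e)| e.reclaimable)
                .min_by_key(|(_, e)| e.last_used)
                .map(|(i, _)| i);
            match victim {
                Some(i) => {
                    let gone = self.entries.remove(i);
                    self.used -= gone.bytes;
                    evicted += 1;
                }
                None => return Admit::Rejected, // not expected after the pre-check
            }
        }
        self.used += new.bytes;
        self.entries.push(new);
        Admit::Stored { evicted }
    }
}
```

The caller, not the cache, decides whether pressure may cost it an eviction, which mirrors how GFP flags keep reclaim permission visible at the allocation site.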

Evidence Table

| Source | Evidence |
| --- | --- |
| include/linux/mm_types.h | mm_struct, vm_area_struct, struct vm_fault, and vm_fault_t define the central VM state and fault result model. |
| include/linux/mm.h | Public MM APIs and VMA operation tables define the boundary used by architecture, file, and special-mapping code. |
| mm/mmap.c | do_mmap() lines 336-565 implement VMA creation policy; dup_mmap() lines 1731-1840 implements fork-time address-space copy. |
| mm/memory.c | handle_mm_fault() lines 6644-6716, __handle_mm_fault() lines 6411-6515, and handle_pte_fault() lines 6328-6408 implement core fault dispatch. |
| mm/page_alloc.c | Buddy allocator comments and code at lines 913-1019 plus allocation slowpath lines 4724-5023 show physical-page policy. |
| mm/filemap.c | Page-cache lookup lines 1862-1923, read path lines 2677-2744, mmap fault lines 3523-3704, and write path lines 4335-4415 show file-backed memory. |
| mm/vmscan.c | scan_control lines 74-180, shrink_folio_list() lines 1058-1594, and balance_pgdat() lines 7056-7290 show reclaim policy. |
| mm/slab_common.c | Cache creation and destruction paths define fixed-size object cache lifecycle. |

Source Notes

  • file-notes/linux__include__linux__mm_types.h.md
  • file-notes/linux__include__linux__mm.h.md
  • file-notes/linux__mm__mmap.c.md
  • file-notes/linux__mm__memory.c.md
  • file-notes/linux__mm__page_alloc.c.md
  • file-notes/linux__mm__filemap.c.md
  • file-notes/linux__mm__vmscan.c.md
  • file-notes/linux__mm__slab_common.c.md
  • file-notes/linux__Documentation__admin-guide__mm__concepts.rst.md
  • file-notes/linux__Documentation__core-api__memory-allocation.rst.md