[Linux Documentation] Memory Management: Concepts
This post walks through the Memory Management part of the Linux documentation.





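  1. Memory is a scarce resource.
  2. Physical memory may be contiguous or split into non-contiguous ranges.
  3. Different architectures, or different designs of the same architecture, have different addressable ranges.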



The virtual memory abstracts the details of physical memory from the application software, allows to keep only needed information in the physical memory (demand paging) and provides a mechanism for the protection and controlled sharing of data between processes.


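  1. It hides the details of physical memory from software and keeps only the needed information resident in physical memory (demand paging).
  2. It protects data and allows controlled sharing of data between processes.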


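Whenever the instruction the CPU is executing reads from or writes to memory, the virtual address has to be translated to a physical address. Memory is managed in pages. The page size depends on the architecture, and some architectures let you choose among several supported sizes. Each physical page can be mapped by one or more virtual pages, and these mappings are described by page tables. Page tables are organized hierarchically: a higher-level table holds the addresses of lower-level tables, and the lowest-level table holds the virtual-to-physical mapping of the actual pages (the familiar multi-level page table). The address of the top-level table is kept in a register.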
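To make the hierarchy concrete, here is a minimal sketch of a page table walk. It assumes an x86-64 style layout (4 levels, 9 index bits per level, 4 KiB pages), and it uses nested Python dicts as stand-ins for page tables; the function names are made up for illustration and do not correspond to kernel code.

```python
# Assumed layout: 4 levels, 9 index bits per level, 4 KiB pages.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
LEVELS = 4


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., lowest-level index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def translate(top_table, vaddr):
    """Walk the nested dicts; return a physical address, or None on a missing entry."""
    indices, offset = split_virtual_address(vaddr)
    entry = top_table
    for index in indices:
        entry = entry.get(index)
        if entry is None:
            return None  # no mapping: real hardware would raise a page fault
    # After the last level, entry is the physical frame number.
    return (entry << PAGE_SHIFT) | offset


# Hypothetical mapping: one virtual page mapped to physical frame 0x1234.
top = {1: {2: {3: {4: 0x1234}}}}
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x56
print(hex(translate(top, vaddr)))
```

Each level consumes 9 bits of the address as an index, and only the lowest level yields a frame number, which is why a single translation costs several memory accesses when nothing is cached.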


Usually TLB is pretty scarce resource and applications with large memory working set will experience performance hit because of TLB misses

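The TLB (Translation Lookaside Buffer) caches translations so that the page tables do not have to be walked on every access. It is an even scarcer resource than memory, and every TLB miss costs a page table walk, which hurts performance. Many modern CPU architectures allow an entry in a higher-level page table to map memory directly, which saves several lookups (page tables are hierarchical, and normally only the lowest-level table records the final virtual-to-physical mapping). Pages mapped directly by a table above the lowest level are collectively called huge pages (why "huge"? Because they are much larger than the base page, which is typically 4 KB; on x86, for example, they are 2 MB or 1 GB).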

Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.
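A quick calculation shows why huge pages relieve TLB pressure. The sketch below assumes a hypothetical TLB with 64 entries; real TLB sizes vary by CPU and are not given in the text.

```python
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB


def tlb_reach(entries, page_size):
    """Bytes of memory that can be covered without a TLB miss."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Number of pages (and thus translations) a working set requires."""
    return -(-working_set // page_size)  # ceiling division


TLB_ENTRIES = 64  # hypothetical value, for illustration only

for label, size in (("4 KiB", 4 * KiB), ("2 MiB", 2 * MiB)):
    print(label,
          "reach:", tlb_reach(TLB_ENTRIES, size) // KiB, "KiB,",
          "pages for a 1 GiB working set:", pages_needed(GiB, size))
```

With the same number of entries, 2 MiB pages cover 512 times more memory than 4 KiB pages, so a large working set needs far fewer distinct translations.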


There are two mechanisms in Linux that enable mapping of the physical memory with the huge pages. The first one is HugeTLB filesystem, or hugetlbfs. It is a pseudo filesystem that uses RAM as its backing store. For the files created in this filesystem the data resides in the memory and mapped using huge pages. The hugetlbfs is described at HugeTLB Pages. Another, more recent, mechanism that enables use of the huge pages is called Transparent HugePages, or THP. Unlike the hugetlbfs that requires users and/or system administrators to configure what parts of the system memory should and can be mapped by the huge pages, THP manages such mappings transparently to the user and hence the name. See Transparent Hugepage Support for more details about THP.
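On a Linux machine, the current THP policy can usually be read from `/sys/kernel/mm/transparent_hugepage/enabled`, where the active mode is shown in square brackets (for example `always [madvise] never`). The following is a small sketch of parsing that format; the helper names are made up, and the reader simply returns None on systems where the file does not exist.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Extract the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED_PATH):
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return active_thp_mode(f.read())
    except OSError:
        return None


print(active_thp_mode("always [madvise] never\n"))
print(read_thp_mode())
```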



Often hardware poses restrictions on how different physical memory ranges can be accessed. In some cases, devices cannot perform DMA to all the addressable memory. In other cases, the size of the physical memory exceeds the maximal addressable size of virtual memory and special actions are required to access portions of the memory. Linux groups memory pages into zones according to their possible usage. For example, ZONE_DMA will contain memory that can be used by devices for DMA, ZONE_HIGHMEM will contain memory that is not permanently mapped into kernel’s address space and ZONE_NORMAL will contain normally addressed pages. The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.
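As an illustration of grouping pages into zones, the sketch below classifies a physical address under one assumed layout, the classic 32-bit x86 split (ZONE_DMA below 16 MiB, ZONE_NORMAL below 896 MiB, ZONE_HIGHMEM above). As the quote says, the actual layout is hardware dependent, so these boundaries are assumptions and not universal values.

```python
MiB = 1024 * 1024

# Assumed boundaries (classic 32-bit x86); other platforms differ.
ZONE_LIMITS = [
    ("ZONE_DMA", 16 * MiB),
    ("ZONE_NORMAL", 896 * MiB),
]


def zone_for(phys_addr):
    """Return the name of the zone that a physical address falls into."""
    for name, limit in ZONE_LIMITS:
        if phys_addr < limit:
            return name
    return "ZONE_HIGHMEM"


for addr in (1 * MiB, 100 * MiB, 2048 * MiB):
    print(hex(addr), zone_for(addr))
```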



Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems. In such systems the memory is arranged into banks that have different access latency depending on the “distance” from the processor. Each bank is referred to as a node and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters.
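The notion of "distance" can be sketched with a small table. The matrix below is hypothetical (10 for local access, larger values for more remote nodes, in the style that NUMA tools commonly report), and the function only illustrates the idea of preferring nearer nodes; it is not the kernel's allocation code.

```python
def fallback_order(distances, local_node):
    """Order nodes from nearest to farthest, as seen from local_node."""
    row = distances[local_node]
    return sorted(range(len(row)), key=lambda node: (row[node], node))


# Hypothetical 3-node machine.
DISTANCES = [
    [10, 20, 30],
    [20, 10, 20],
    [30, 20, 10],
]

print(fallback_order(DISTANCES, 0))
print(fallback_order(DISTANCES, 2))
```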


Page cache

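The page cache exists because storage devices are much slower than memory. Data read from a file is kept in the page cache so that later reads do not have to go to the device, and data written to a file is placed in the page cache first. Writes are special: a page that has been modified in memory and still has to be updated on the storage device is marked dirty, and when the kernel decides to reuse that page, it first synchronizes the page contents to the storage device.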
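The read, write and write-back behaviour can be modelled with a toy cache. This is a teaching sketch only: `ToyPageCache` and its methods are invented here, and a plain dict stands in for the storage device.

```python
class ToyPageCache:
    """Toy model of a page cache in front of a slow backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # dict: page index -> bytes
        self.pages = {}                     # cached pages
        self.dirty = set()                  # indices modified in memory
        self.device_reads = 0               # how often the "device" was read

    def read(self, index):
        if index not in self.pages:         # cache miss: go to the device
            self.pages[index] = self.backing_store.get(index, b"")
            self.device_reads += 1
        return self.pages[index]

    def write(self, index, data):
        self.pages[index] = data            # the write lands in memory only
        self.dirty.add(index)

    def evict(self, index):
        if index in self.dirty:             # dirty pages are written back first
            self.backing_store[index] = self.pages[index]
            self.dirty.discard(index)
        self.pages.pop(index, None)


disk = {0: b"old"}
cache = ToyPageCache(disk)
cache.read(0)
cache.read(0)                # second read is served from the cache
cache.write(0, b"new")       # disk still holds b"old" at this point
cache.evict(0)               # write-back happens here
print(cache.device_reads, disk[0])
```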

Anonymous Memory

The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out
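From user space, an anonymous mapping can be observed with Python's `mmap` module: passing -1 as the file descriptor requests a mapping with no file behind it. The sketch below only shows what the program sees (zeroes before the first write, the written data afterwards); whether the kernel served the read from a shared zero page is not visible from here, and `anonymous_demo` is a name made up for this example.

```python
import mmap


def anonymous_demo(size=mmap.PAGESIZE):
    """Map anonymous memory, read it, write to it, and read it again."""
    region = mmap.mmap(-1, size)      # -1: not backed by any file
    try:
        before = region[:8]           # untouched memory reads as zeroes
        region[:5] = b"hello"         # first write needs a real page
        after = region[:8]
    finally:
        region.close()
    return before, after


before, after = anonymous_demo()
print(before, after)
```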



Throughout the system lifetime, a physical page can be used for storing different types of data. It can be kernel internal data structures, DMA’able buffers for device drivers use, data read from a filesystem, memory allocated by user space processes etc. Depending on the page usage it is treated differently by the Linux memory management. The pages that can be freed at any time, either because they cache the data available elsewhere, for instance, on a hard disk, or because they can be swapped out, again, to the hard disk, are called reclaimable. The most notable categories of the reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA buffers cannot be repurposed, and they remain pinned until freed by their user. Such pages are called unreclaimable. However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed. For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from the main memory when system is under memory pressure.

The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system.

When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (low watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
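The two thresholds described above can be summarized as a small decision function. This is a simplified sketch that follows the wording of the quote, assuming min watermark < low watermark; the numbers in the example are invented, and the real allocator takes more factors into account.

```python
def reclaim_action(free_pages, low_watermark, min_watermark):
    """Decide what an allocation request triggers, given the free page count."""
    if free_pages <= min_watermark:
        return "direct reclaim"        # synchronous: the allocation stalls
    if free_pages <= low_watermark:
        return "wake kswapd"           # asynchronous background reclaim
    return "allocate immediately"


# Hypothetical watermarks, in pages.
LOW, MIN = 2000, 1000
for free in (50000, 1500, 800):
    print(free, "->", reclaim_action(free, LOW, MIN))
```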





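Summary: This passage explains page reclaim in Linux memory management. While the system runs, a physical page can hold many kinds of data, including kernel data, DMA buffers and filesystem data. Depending on how they are used, pages are either reclaimable or unreclaimable. Reclaimable pages, most notably page cache and anonymous memory, can be freed or repurposed when the system needs them. Under light load most memory is free and allocation requests are satisfied immediately. As load grows and free pages fall to the low watermark, the kswapd daemon is woken to scan pages asynchronously, either freeing them or writing dirty pages back to storage. If memory usage keeps growing and reaches the min watermark, direct reclaim kicks in: the allocation stalls until enough pages have been reclaimed to satisfy it. Note that pages holding kernel data structures and DMA buffers are normally unreclaimable and stay pinned until their user frees them. Even so, some of them can be reclaimed in certain circumstances, for example in-memory caches of filesystem metadata, which can be re-read from the storage device. In short, reclaim lets the system recover and reuse reclaimable pages when memory is tight, so that memory is used more efficiently and allocation requests can still be met.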


As the system runs, tasks allocate and free the memory and it becomes fragmented. Although with virtual memory it is possible to present scattered physical pages as virtually contiguous range, sometimes it is necessary to allocate large physically contiguous memory areas. Such need may arise, for instance, when a device driver requires a large buffer for DMA, or when THP allocates a huge page. Memory compaction addresses the fragmentation issue. This mechanism moves occupied pages from the lower part of a memory zone to free pages in the upper part of the zone. When a compaction scan is finished free pages are grouped together at the beginning of the zone and allocations of large physically contiguous areas become possible. Like reclaim, the compaction may happen asynchronously in the kcompactd daemon or synchronously as a result of a memory allocation request.
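The movement described above can be illustrated with a toy model of a zone as a list of page slots. It uses two cursors, one looking for occupied pages from the low end and one looking for free slots from the high end; this is a sketch of the idea only, and the real implementation deals with migration types, unmovable pages and much more.

```python
def compact(zone):
    """Toy compaction. zone is a list; None marks a free page slot.

    Occupied pages are moved from the low end into free slots at the
    high end, so the free slots end up grouped at the start of the zone.
    """
    zone = list(zone)                 # work on a copy
    migrate = 0                       # scans upward for occupied pages
    free = len(zone) - 1              # scans downward for free slots
    while True:
        while migrate < free and zone[migrate] is None:
            migrate += 1
        while free > migrate and zone[free] is not None:
            free -= 1
        if migrate >= free:           # the two scanners met: done
            break
        zone[free], zone[migrate] = zone[migrate], None
    return zone


fragmented = ["a", None, "b", None, None, "c"]
print(compact(fragmented))
```

Before compaction the largest free run in this example is two slots; afterwards three contiguous slots are free, which is what makes a larger physically contiguous allocation possible.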


OOM Killer

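On a heavily loaded machine, memory may be exhausted and the kernel may be unable to reclaim enough memory to keep running. To protect the rest of the system, the kernel invokes the OOM (Out of Memory) killer. The OOM killer selects a task to sacrifice for the sake of overall system health. The selected task is killed in the hope that, once it exits, enough memory will be freed to resume normal operation.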
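The text does not say how the victim is chosen. As a rough illustration, the sketch below scores each task by the memory it uses plus an adjustment derived from its `oom_score_adj` value (a per-process setting from -1000 to 1000, where -1000 exempts the task). This is a deliberately simplified toy, loosely modelled on that idea; the task names and numbers are invented, and the kernel's actual heuristic considers more inputs.

```python
def pick_victim(tasks, total_pages):
    """Pick the task with the highest toy badness score.

    tasks maps a task name to (pages_used, oom_score_adj).
    Returns None if no task is eligible.
    """
    best_name, best_score = None, 0
    for name, (pages_used, oom_score_adj) in tasks.items():
        if oom_score_adj == -1000:    # explicitly exempt from OOM kill
            continue
        # The adjustment is scaled to a fraction of total memory.
        score = pages_used + oom_score_adj * total_pages // 1000
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical tasks on a machine with 10000 pages.
TASKS = {
    "db": (5000, 0),
    "cache": (3000, 500),
    "sshd": (9000, -1000),
}
print(pick_victim(TASKS, 10000))
```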