| Article | Content |
|---|---|
| Linux Memory Management: Bootmem Takes the Stage First | Bootmem, the boot-time memory allocator |
| Linux Memory Management: The Buddy System Arrives | Buddy System memory allocator |
| Linux Memory Management: Slab Makes Its Entrance | Slab memory allocator |
| Linux Memory Management: Principles of Memory Allocation and Reclaim | Principles of memory allocation and memory reclaim |
This is the fourth article in the source-code analysis series.
The analysis is split into four major modules: memory management, device management, system startup, and miscellaneous parts.
Memory management is covered in three parts — Bootmem, Buddy System, and Slab. Beyond memory initialization, it naturally also includes memory allocation and memory reclaim.
A few TODOs remain and will be filled in later.
Contents
- Reclaim Memory
- Basic Concept
- Mem Allocator
- Fast-path allocation
- get_page_from_freelist
- zone_watermark_ok
- buffered_rmqueue
- Slow-path allocation
- try_to_free_pages
- Scan_Control
- Mem Shrink
- shrink_zone
- get_scan_ratio
- shrink_list
- shrink_active_list
Reclaim Memory
Basic Concept
When a Linux system is under memory pressure, it performs memory reclaim on every zone that is under pressure.
Memory reclaim mainly targets anonymous pages and file pages:

- For anonymous pages, reclaim picks out infrequently used pages, writes them to the swap partition, and releases them to the buddy system as free page frames.
- For file pages, if the page is clean there is no need to write it back: the free page is released to the buddy system directly. A dirty page, by contrast, is first written back to disk and then released to the buddy system.

The downside is that this puts enormous pressure on I/O. Therefore, each zone is generally given a threshold line: memory reclaim runs only when the number of free page frames falls below that line, and is skipped otherwise.
Memory reclaim operates per zone. A zone normally has three watermark lines:

- watermark[WMARK_MIN]: used by the slow-path allocation that follows a failed fast-path allocation; if allocation still fails at this threshold, direct reclaim and fast memory reclaim kick in.
- watermark[WMARK_LOW]: the low threshold, the default for fast-path allocation. If a zone's free page count drops below this threshold during allocation, the system performs fast memory reclaim on the zone.
- watermark[WMARK_HIGH]: the high threshold, the free-page count at which the zone is considered comfortable. When reclaiming from a zone, the goal is usually to raise its free page count up to this value.
liuzixuan@liuzixuan-ubuntu ~ # cat /proc/zoneinfo
Node 0, zone Normal
pages free 5179
min 4189
low 5236
high 6283
Reclaim for a zone mainly targets three things: slab, the pages on the lru lists, and buffer_head. The lru lists manage the memory pages used by process address spaces, covering three kinds of pages: anonymous pages, file pages, and shmem pages.

The precondition for a page to be reclaimable is page->_count == 0.
Mem Allocator
For allocation, alloc_page and alloc_pages generally both end up calling __alloc_pages_nodemask->__alloc_pages_internal:
static inline struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
}
__alloc_pages_internal normally performs one fast-path allocation at the low threshold via get_page_from_freelist, followed by one slow-path allocation using the min threshold:
struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
// ...
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
Fast-path allocation
The fast-path allocation function get_page_from_freelist walks the zonelist and uses the low threshold to pick a suitable zone to allocate from. If a zone has fallen below the low threshold, fast memory reclaim runs first and allocation is retried afterwards. Its parameters:

- gfp_mask: the gfp mask used for the allocation
- order: the order of the physical allocation
- zonelist: the node's zonelist array of zones
- alloc_flags: the translated allocation flags
- high_zoneidx: the highest zone the allocation may use

alloc_flags are the flags the buddy allocator uses internally; they decide several allocation behaviors:
/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN WMARK_MIN
#define ALLOC_WMARK_LOW WMARK_LOW
#define ALLOC_WMARK_HIGH WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
/*
* Only MMU archs have async oom victim reclaim - aka oom_reaper so we
* cannot assume a reduced access to memory reserves is sufficient for
* !MMU
*/
#ifdef CONFIG_MMU
#define ALLOC_OOM 0x08
#else
#define ALLOC_OOM ALLOC_NO_WATERMARKS
#endif
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#ifdef CONFIG_ZONE_DMA32
#define ALLOC_NOFRAGMENT 0x100 /* avoid mixing pageblock types */
#else
#define ALLOC_NOFRAGMENT 0x0
#endif
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
The flags mean the following:

- ALLOC_WMARK_XXX: watermark-related allocation behavior
- ALLOC_NO_WATERMARKS: do not check watermarks at all
- ALLOC_OOM: allow triggering the OOM killer when memory is short
- ALLOC_HARDER: whether the MIGRATE_HIGHATOMIC reserve from page migration may be used
- ALLOC_HIGH: same effect as __GFP_HIGH
- ALLOC_CPUSET: whether CPUSET controls restrict the allocation
- ALLOC_CMA: allow allocating from CMA areas
- ALLOC_NOFRAGMENT: when set, use the no_fallback policy under memory pressure — do not allocate from remote nodes, i.e. do not create external fragmentation
- ALLOC_KSWAPD: allow waking kswapd when memory is short
get_page_from_freelist is the buddy allocator's first attempt to allocate memory. The core idea: when memory is sufficient, take physical pages from the requested order's freelist in the zone.
get_page_from_freelist
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
struct zoneref *z;
struct page *page = NULL;
int classzone_idx;
struct zone *zone, *preferred_zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
Get the zone's index:
classzone_idx = zone_idx(preferred_zone);
Now look at the zonelist_scan label, which walks the zonelist looking for a zone with enough free pages:
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask) {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
int ret;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min;
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low;
else
mark = zone->pages_high;
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto try_this_zone;
if (zone_reclaim_mode == 0)
goto this_zone_full;
ret = zone_reclaim(zone, gfp_mask, order);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
goto try_next_zone;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
goto this_zone_full;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto this_zone_full;
}
}
The for_each_zone_zonelist_nodemask macro expands to:
for (z = first_zones_zonelist(zonelist, high_zoneidx, nodemask, &zone);
zone;
z = next_zones_zonelist(++z, high_zoneidx, nodemask, &zone)) // fetch the next zone in the zonelist
static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
struct zone **zone)
{
return next_zones_zonelist(zonelist->_zonerefs, highest_zoneidx, nodes,
zone);
}
struct zoneref *next_zones_zonelist(struct zoneref *z,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
struct zone **zone)
{
/*
* Find the next suitable zone to use for the allocation.
* Only filter based on nodemask if it's set
*/
if (likely(nodes == NULL))
while (zonelist_zone_idx(z) > highest_zoneidx)
z++;
else
while (zonelist_zone_idx(z) > highest_zoneidx ||
(z->zone && !zref_in_nodemask(z, nodes)))
z++;
*zone = zonelist_zone(z); // get the zone behind this zoneref
return z;
}
if (NUMA_BUILD && zlc_active &&
// the node containing z->zone does not allow allocation, or the zone is already full
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
// cpuset checking is enabled and this zone may not allocate for this CPU set
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
zlc_zone_worth_trying checks whether the node the zone lives on allows allocation and whether the zone is already full:
static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
nodemask_t *allowednodes)
{
struct zonelist_cache *zlc; /* cached zonelist speedup info */
int i; /* index of *z in zonelist zones */
int n; /* node that zone *z is on */
zlc = zonelist->zlcache_ptr; // get the zonelist_cache pointer
if (!zlc)
return 1;
i = z - zonelist->_zonerefs; // index into the _zonerefs array
n = zlc->z_to_n[i];
/* This zone is worth trying if it is allowed but not full */
return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
}
struct zonelist_cache {
unsigned short z_to_n[MAX_ZONES_PER_ZONELIST]; /* zone->nid */
DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zone full? */
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
The check boils down to two functions: node_isset and test_bit.
#define node_isset(node, nodemask) test_bit((node), (nodemask).bits)
static inline int test_bit(int nr, const volatile unsigned long *addr)
{
return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}
The cpuset_zone_allowed_softwall function works the same way and is not covered here.
Select the watermark for the fast path:
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
int ret;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min; // use the min threshold
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low; // use the low threshold
else
mark = zone->pages_high; // use the high threshold
zone_watermark_ok checks whether the zone has enough pages to allocate:
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto try_this_zone;
Now the watermark check itself, where mark may be any of low, min, or high. As the function shows, allocating 2^order pages must satisfy two conditions:

- besides the pages being allocated, the zone must still have at least min free page frames
- besides the pages being allocated, for each order o below the requested order, the blocks of order at least o must together hold at least min/2^o free page frames
/*
* Return 1 if free pages are above 'mark'. This takes into account the order
* of the allocation.
*/
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags)
{
/* free_pages may go negative - that's OK */
long min = mark;
// free page count, vm_stat[NR_FREE_PAGES],
// minus the pages being allocated, (1 << order)
long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
int o;
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
// lowmem_reserve is the number of page frames kept in reserve
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
// excluding the pages being allocated, the free pages from order k up to order 10 must total at least min/(2^k)
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
free_pages -= z->free_area[o].nr_free << o;
/* Require fewer higher order pages to be free */
min >>= 1;
if (free_pages <= min)
return 0;
}
return 1;
}
Once the zone_watermark_ok check passes, control jumps straight to try_this_zone to allocate the page frames, i.e. the buffered_rmqueue function.
buffered_rmqueue
This is the buddy system's core function for allocating pages from a given zone:
static struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, int order, gfp_t gfp_flags)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
int migratetype = allocflags_to_migratetype(gfp_flags); // convert gfp flags into a migrate type
again:
cpu = get_cpu();
if (likely(order == 0)) { // a single page: allocate from the per-CPU (pcp) lists of hot/cold pages
struct per_cpu_pages *pcp;
// get this CPU's per-CPU page cache for the zone
pcp = &zone_pcp(zone, cpu)->pcp;
local_irq_save(flags);
// the list is empty - most likely the migrate type cached last time differs from this request
if (!pcp->count) {
// take pages from the buddy system and refill the per-CPU cache
pcp->count = rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list, migratetype);
// if the list is still empty, the buddy system has no pages either - allocation fails
if (unlikely(!pcp->count))
goto failed;
}
/* Find a page of the appropriate migrate type */
// if the page need not be hardware-cache hot (note: not the per-CPU page cache), take the last node of the list
if (cold) {
list_for_each_entry_reverse(page, &pcp->list, lru)
if (page_private(page) == migratetype)
break;
} else { // if cache hotness matters, take the first page of the list: it was freed to the per-CPU cache most recently, so it is hotter
list_for_each_entry(page, &pcp->list, lru)
if (page_private(page) == migratetype)
break;
}
/* Allocate more to the pcp list if necessary */
if (unlikely(&page->lru == &pcp->list)) {
pcp->count += rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list, migratetype);
page = list_entry(pcp->list.next, struct page, lru);
}
// take the page off the per-CPU cache list and decrement the cache count
list_del(&page->lru);
pcp->count--;
// multiple pages requested: skip the per-CPU page cache and allocate from the system directly
} else { // allocate from the free list of the requested migratetype
spin_lock_irqsave(&zone->lock, flags); // disable interrupts and take the zone lock
page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock); // release the lock first; interrupts are re-enabled after the statistics below
if (!page)
goto failed;
}
// event statistics, used for debugging
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone);
local_irq_restore(flags); // re-enable interrupts
put_cpu();
VM_BUG_ON(bad_range(zone, page));
if (prep_new_page(page, order, gfp_flags))
goto again;
return page;
failed:
local_irq_restore(flags);
put_cpu();
return NULL;
}
The __rmqueue function handles two cases:

- fast allocation, __rmqueue_smallest: allocate directly from the free list of the requested migrate type
- fallback allocation, __rmqueue_fallback: when the requested migrate type's list is short of memory, use the fallback lists
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
struct page *page;
// fast allocation
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page))
page = __rmqueue_fallback(zone, order, migratetype);
return page;
}
That completes the happy path. If the zone_watermark_ok check fails, execution continues downward: reaching this point means the zone needs page reclaim. In other words, a failed watermark check means there are no usable pages, so some amount of memory is reclaimed here:
ret = zone_reclaim(zone, gfp_mask, order);
zone_reclaim performs some page reclaim. It returns true only when 2^order page frames were reclaimed; even if some pages were reclaimed, falling short of that count still returns false.
int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
int node_id;
int ret;
// both below the configured minimums
if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return ZONE_RECLAIM_FULL;
if (zone_is_all_unreclaimable(zone)) // the zone is flagged as unreclaimable
return ZONE_RECLAIM_FULL;
// if __GFP_WAIT is not set (wait is 0), do not go any further
// if PF_MEMALLOC is set, the caller is itself the memory-reclaim path, so do not recurse
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
return ZONE_RECLAIM_NOSCAN;
node_id = zone_to_nid(zone); // node id of this zone
// the zone's node is not local to this CPU
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return ZONE_RECLAIM_NOSCAN;
// another task is already reclaiming this zone
if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
return ZONE_RECLAIM_NOSCAN;
ret = __zone_reclaim(zone, gfp_mask, order); // reclaim pages from this zone
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); // drop the reclaim lock
if (!ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
return ret;
}
Note PF_MEMALLOC and __GFP_WAIT here: PF_MEMALLOC is a process flag that, in general, nothing outside the memory-management subsystem should use.
Execution continues into __zone_reclaim:
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
/* Minimum pages needed in order to stay on node */
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = { // controls the scan
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.swap_cluster_max = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
.isolate_pages = isolate_pages_global,
};
unsigned long slab_reclaimable;
disable_swap_token();
cond_resched();
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP
* and we also need to be able to write out pages for RECLAIM_WRITE
* and RECLAIM_SWAP.
*/
p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
priority = ZONE_RECLAIM_PRIORITY;
do {
note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc); // reclaim memory
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
}
slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (slab_reclaimable > zone->min_slab_pages) {
while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
sc.nr_reclaimed += slab_reclaimable -
zone_page_state(zone, NR_SLAB_RECLAIMABLE);
}
p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
return sc.nr_reclaimed >= nr_pages;
}
shrink_zone and shrink_slab are explained later.
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
goto try_next_zone;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
goto this_zone_full;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto this_zone_full;
}
In the ideal case, allocation begins:
try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
if (page)
break;
The zone is full:
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
Try the next zone:
try_next_zone:
if (NUMA_BUILD && !did_zlc_setup) {
/* we do zlc_setup after the first zone is tried */
allowednodes = zlc_setup(zonelist, alloc_flags);
zlc_active = 1;
did_zlc_setup = 1;
}
}
Then loop one more time:
if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
zlc_active = 0;
goto zonelist_scan;
}
Slow-path allocation
Slow-path allocation: if fast-path allocation fails — that is, no zone in the zonelist yielded memory — the slow path allocates using the min threshold, combining:

- asynchronous memory compaction
- direct memory reclaim
- light synchronous memory compaction (falling back to OOM if needed)
First, wake each node's kswapd kernel thread:
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
wakeup_kswapd(zone, order); // wake each node's kswapd kernel thread
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
void wakeup_kswapd(struct zone *zone, int order)
{
pg_data_t *pgdat;
if (!populated_zone(zone))
return;
pgdat = zone->zone_pgdat;
// check the watermark
if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) // permission check
return;
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
wake_up_interruptible(&pgdat->kswapd_wait);
}
Then lower the bar and retry the fast path using the min threshold:
alloc_flags = ALLOC_WMARK_MIN;
if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
if (wait)
alloc_flags |= ALLOC_CPUSET;
The flags here:

- ALLOC_HARDER: try harder to allocate memory
- ALLOC_HIGH: the caller set __GFP_HIGH, high priority
- ALLOC_CPUSET: check the cpuset for whether the allocation is permitted
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags);
if (page)
goto got_pg;
This stage is sometimes described as "five sword strokes". The first stroke is the casual one: allocate against the low threshold by calling get_page_from_freelist directly. The second stroke is this one: allocate against the min threshold, adding ALLOC_WMARK_MIN together with the ALLOC_HARDER and ALLOC_HIGH flags.
Next comes a call with no watermark check at all: alloc_flags is set to ALLOC_NO_WATERMARKS:
rebalance:
if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
&& !in_interrupt()) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); // allocate without checking watermarks
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
congestion_wait(WRITE, HZ/50);
goto nofail_alloc;
}
}
goto nopage;
}
try_to_free_pages
Obtain pages by synchronously freeing memory; the main function is try_to_free_pages:
cpuset_update_task_memory_state();
p->flags |= PF_MEMALLOC;
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
did_some_progress = try_to_free_pages(zonelist, order,
gfp_mask, nodemask);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
p->flags &= ~PF_MEMALLOC;
Its body:
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
struct scan_control sc = { // scan control structure
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
};
return do_try_to_free_pages(zonelist, &sc);
}
One concept to keep straight: the previous approaches allocate memory, and when memory is short they can take page frames from other nodes (zone_reclaim does reclaim too, but in essence it is about finding memory on other nodes), whereas from here on the search happens directly on the local node.

do_try_to_free_pages is full of shrink_zone and shrink_slab calls; it is the main page-reclaim logic:
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
int priority;
unsigned long ret = 0;
unsigned long total_scanned = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
delayacct_freepages_start();
if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
/*
* mem_cgroup will not do shrink_slab.
*/
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
lru_pages += zone_lru_pages(zone);
}
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
*/
if (scanning_global_lru(sc)) {
shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}
}
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->swap_cluster_max) {
ret = sc->nr_reclaimed;
goto out;
}
/*
* Try to write back as many pages as we just scanned. This
* tends to cause slow streaming writers to write data to the
* disk smoothly, at the dirtying rate, which is nice. But
* that's undesirable in laptop mode, where we *want* lumpy
* writeout. So in laptop mode, write out the whole world.
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
wakeup_pdflush(laptop_mode ? 0 : total_scanned);
sc->may_writepage = 1;
}
/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
out:
/*
* Now that we've scanned all the zones at this priority level, note
* that level within the zone so that the next thread which performs
* scanning of this zone will immediately start out at this priority
* level. This affects only the decision whether or not to bring
* mapped pages onto the inactive list.
*/
if (priority < 0)
priority = 0;
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
zone->prev_priority = priority;
}
} else
mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
delayacct_freepages_end();
return ret;
}
if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, alloc_flags); // allocate again after reclaiming
if (page)
goto got_pg;
TODO: this part is rather involved; revisit later.
The last resort is the OOM mechanism: when there truly are no pages left to allocate, kill some process and take its pages (somewhat brutal) — the so-called out-of-memory killer:
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
}
/*
* Go through the zonelist yet one more time, keep
* very high watermark here, this is only to catch
* a parallel oom killing, we must fail if we're still
* under heavy pressure.
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET); // a feint: demanding ALLOC_WMARK_HIGH here obviously cannot succeed
if (page) {
clear_zonelist_oom(zonelist, gfp_mask);
goto got_pg;
}
/* The OOM killer will not help higher order allocs so fail */
if (order > PAGE_ALLOC_COSTLY_ORDER) {
clear_zonelist_oom(zonelist, gfp_mask);
goto nopage;
}
out_of_memory(zonelist, gfp_mask, order); // free a process's memory
clear_zonelist_oom(zonelist, gfp_mask);
goto restart;
Scan_Control
The scan control structure holds the variables and parameters of one memory-reclaim or memory-compaction run; some results are stored here too. It is used mainly for memory reclaim and memory compaction.
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned; // number of page frames scanned so far
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed; // number of page frames reclaimed so far
/* This context's GFP mask */
gfp_t gfp_mask; // allocation flags used by the request
int may_writepage; // whether writeback is allowed
/* Can mapped pages be reclaimed? */
int may_unmap; // whether unmap is allowed, i.e. clearing every page-table entry that maps the page
/* Can pages be swapped as part of reclaim? */
int may_swap; // whether swapping is allowed
/* This context's SWAP_CLUSTER_MAX. If freeing memory for
* suspend, we effectively ignore SWAP_CLUSTER_MAX.
* In this context, it doesn't matter that we scan the
* whole list at once. */
int swap_cluster_max;
int swappiness;
int all_unreclaimable;
int order; // order of the allocation; scanning happens only because an allocation ran short of memory
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup; // target memcg; NULL when reclaiming a whole zone
/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
* are scanned.
*/
nodemask_t *nodemask; // mask of nodes allowed to be scanned
/* Pluggable isolate pages callback */
unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
};
Memory compression: when memory is tight, frequently writing memory out to disk via I/O not only shortens flash lifetime but also badly hurts system performance, so memory-compression techniques were introduced. The mainstream ones are:

- zSwap: a compressed swap cache; it generally compresses anonymous pages
- zRam: uses memory to emulate a block device; it generally compresses anonymous pages
- zCache: generally compresses file pages

Anonymous pages are pages not backed by a file — a process's heap, stack, and so on. They cannot be exchanged with a disk file, but they can still be swapped via a dedicated swap partition or a swap file on disk.

As for active and inactive pages: whether a page is active is generally judged by whether applications access it often. If the page's referenced bit is not set, the page is inactive and should be moved to the inactive list; if the bit is set, the page was accessed recently and should be moved to the active list. Over time, the least active pages collect at the tail of the inactive list. Under memory pressure the kernel evicts these pages first: since they were rarely used from creation until eviction, by the LRU principle evicting them harms the system the least.
Mem Shrink
Memory reclaim usually means reclaiming a zone's memory, though it may also mean reclaiming from a particular memcg.

- Each pass aims to reclaim 2^(order+1) page frames — enough to satisfy the current allocation while reclaiming a bit extra. If the inactive lru lists do not hold enough pages to meet this target, the condition is dropped.
- Reclaiming a zone often goes together with compacting it, so zone reclaim continues until enough free page frames exist for memory compaction to proceed.
From the reclaim functions above, three functions matter most: shrink_zone, shrink_list, and shrink_slab.
shrink_zone
static void shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long swap_cluster_max = sc->swap_cluster_max;
get_scan_ratio
get_scan_ratio(zone, sc, percent);
Generally, when physical memory runs short there are two options:

- push some anonymous pages out to the swap partition
- flush the page cache's data back to disk, or simply drop it

Between these two, get_scan_ratio decides the weight given to swapping:
/*
* With swappiness at 100, anonymous and file have the same priority.
* This scanning priority is essentially the inverse of IO cost.
*/
anon_prio = sc->swappiness;
file_prio = 200 - sc->swappiness;
First, check whether swap is switched off entirely:
static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
unsigned long *percent)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
percent[0] = 0;
percent[1] = 100;
return;
}
Count the anonymous pages and the page cache pages:
anon = zone_nr_pages(zone, sc, LRU_ACTIVE_ANON) +
zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
file = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
If the number of free pages plus page cache pages is below the high threshold, scan anonymous pages exclusively, pushing everything toward swap:
if (scanning_global_lru(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
if (unlikely(file + free <= zone->pages_high)) {
percent[0] = 100;
percent[1] = 0;
return;
}
}
Compute the ratios:
anon_prio = sc->swappiness;
file_prio = 200 - sc->swappiness;
/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/
ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
/* Normalize to percentages */
percent[0] = 100 * ap / (ap + fp + 1);
percent[1] = 100 - percent[0];
An intermediate step follows:
for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;
scan = zone_nr_pages(zone, sc, l);
if (priority) {
scan >>= priority;
scan = (scan * percent[file]) / 100;
}
if (scanning_global_lru(sc)) {
zone->lru[l].nr_scan += scan;
nr[l] = zone->lru[l].nr_scan;
if (nr[l] >= swap_cluster_max)
zone->lru[l].nr_scan = 0;
else
nr[l] = 0;
} else
nr[l] = scan;
}
The for_each_evictable_lru macro expands to:
for (l = 0; l <= LRU_ACTIVE_FILE; l++)
where l is of type enum lru_list:
enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE,
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
#ifdef CONFIG_UNEVICTABLE_LRU
LRU_UNEVICTABLE,
#else
LRU_UNEVICTABLE = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
#endif
NR_LRU_LISTS
};
Note that LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, the last of the reclaimable lists; this loop therefore walks all the evictable lru lists.
shrink_list
The lru lists are traversed in the order LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE:
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l], swap_cluster_max);
nr[l] -= nr_to_scan;
// reclaim from this lru list type
nr_reclaimed += shrink_list(l, nr_to_scan,
zone, sc, priority);
}
}
/*
* On large memory systems, scan >> priority can become
* really large. This is fine for the starting priority;
* we want to put equal scanning pressure on each zone.
* However, if the VM has a harder time of freeing pages,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
if (nr_reclaimed > swap_cluster_max &&
priority < DEF_PRIORITY && !current_is_kswapd())
break;
}
Remember the shrinker_list that vfs_cache_init registered for the dentry caches? This is where that reclaim happens. The main cases:

- LRU_ACTIVE_FILE: active file pages — call shrink_active_list to process the active lru list
- LRU_ACTIVE_ANON: active anonymous pages, when inactive anonymous pages are too few — also shrink_active_list on the active lru list
- everything else: shrink_inactive_list on the inactive lru lists
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);
if (lru == LRU_ACTIVE_FILE) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}
Record the amount reclaimed:
sc->nr_reclaimed = nr_reclaimed;
When inactive anonymous pages are too few, shrink_active_list is called to rebalance:
/*
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
if (inactive_anon_is_low(zone, sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
If too many dirty pages are being written back, sleep a while here:
throttle_vm_writeout(sc->gfp_mask);
shrink_active_list
Let's pick one of these functions to examine:
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
struct scan_control *sc, int priority, int file)
{
unsigned long pgmoved;
int pgdeactivate = 0;
unsigned long pgscanned;
LIST_HEAD(l_hold); /* The pages which were snipped off */
LIST_HEAD(l_inactive);
struct page *page;
struct pagevec pvec;
enum lru_list lru;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
sc->mem_cgroup, 1, file);
/*
* zone->pages_scanned is used for detect zone's oom
* mem_cgroup remembers nr_scan by itself.
*/
if (scanning_global_lru(sc)) {
zone->pages_scanned += pgscanned;
}
reclaim_stat->recent_scanned[!!file] += pgmoved;
if (file)
__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
else
__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
spin_unlock_irq(&zone->lru_lock);
pgmoved = 0;
while (!list_empty(&l_hold)) {
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
if (unlikely(!page_evictable(page, NULL))) {
putback_lru_page(page);
continue;
}
/* page_referenced clears PageReferenced */
if (page_mapping_inuse(page) &&
page_referenced(page, 0, sc->mem_cgroup))
pgmoved++;
list_add(&page->lru, &l_inactive);
}
/*
* Move the pages to the [file or anon] inactive list.
*/
pagevec_init(&pvec, 1);
lru = LRU_BASE + file * LRU_FILE;
spin_lock_irq(&zone->lru_lock);
/*
* Count referenced pages from currently used mappings as
* rotated, even though they are moved to the inactive list.
* This helps balance scan pressure between file and anonymous
* pages in get_scan_ratio.
*/
reclaim_stat->recent_rotated[!!file] += pgmoved;
pgmoved = 0;
while (!list_empty(&l_inactive)) {
page = lru_to_page(&l_inactive);
prefetchw_prev_lru_page(page, &l_inactive, flags);
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
ClearPageActive(page);
list_move(&page->lru, &zone->lru[lru].list);
mem_cgroup_add_lru_list(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
spin_unlock_irq(&zone->lru_lock);
pgdeactivate += pgmoved;
pgmoved = 0;
if (buffer_heads_over_limit)
pagevec_strip(&pvec);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
pgdeactivate += pgmoved;
__count_zone_vm_events(PGREFILL, zone, pgscanned);
__count_vm_events(PGDEACTIVATE, pgdeactivate);
spin_unlock_irq(&zone->lru_lock);
if (buffer_heads_over_limit)
pagevec_strip(&pvec);
pagevec_release(&pvec);
}
Fast memory reclaim
This happens inside get_page_from_freelist(): while walking the zonelist, each zone is checked before allocation. If the zone's free memory after the allocation would be below threshold + reserved page frames, the zone undergoes fast memory reclaim.
The threshold can be any of min/low/high.
Direct memory reclaim
Direct reclaim happens in the slow path. The slow path first wakes the kswapd kernel threads on all nodes, then calls get_page_from_freelist to try to take contiguous page frames from the zonelist's zones at the min threshold. If that fails, the zones are compacted asynchronously and get_page_from_freelist is retried at the min threshold; if that still fails, direct memory reclaim runs.
kswapd memory reclaim
kswapd->balance_pgdat()->kswapd_shrink_zone()->shrink_zone()
During allocation, whenever get_page_from_freelist() cannot obtain contiguous page frames from the zonelist's zones at the low threshold, and the allocation's gfp_mask does not carry __GFP_NO_KSWAPD, the kswapd kernel thread is woken and performs kswapd memory reclaim.