从零到负一

【LM13】 Initialization of the slab system

2023/12/22

After a gap of several months, this series finally reaches the slab system. It took me a long time to figure it out myself; this note covers an introduction to the slab system, its main structures, and its initialization process.

Why the slab system is needed

We analyzed the buddy system earlier, and it has problems of its own: it can only manage memory in units of 2 ^ order pages; every allocation and free walks a fairly long path, which hurts performance and can pollute the CPU caches and the TLB; and every allocation is page-aligned, so a small request may only use the first few or few dozen bytes of a page, which raises the probability that different objects map to the same cache line and degrades cache performance. The slab system was created to solve these problems. In essence, it uses the buddy system to build a set of object memory pools, and with these pools the shortcomings above can be handled nicely.

  1. The slab system can hand out allocations smaller than 2 ^ order pages from physical pages it has already obtained;
  2. Because the slab system works from memory pools, its allocation and free paths are much shorter, which improves speed and reduces cache and TLB pollution;
  3. Objects returned by the slab system carry a colouring offset, which greatly reduces the chance of different objects sharing a cache line.

Related structures

The slab allocator is a system made up of several cooperating structures (there is no single structure named slab), which makes it a little hard to grasp at first. Based on my own understanding, I have drawn a simplified diagram of the slab system.

With this diagram in mind, let's look at these structures.

kmem_cache

/*
* Definitions unique to the original Linux SLAB allocator.
*/

struct kmem_cache {
struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
// These are initialized in setup_cpu_cache() and enable_cpucache()
unsigned int batchcount;
unsigned int limit;
unsigned int shared;

unsigned int size;
struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

slab_flags_t flags; /* constant flags */
unsigned int num; /* # of objs per slab */

/* 3) cache_grow/shrink */
/* order of pgs per slab (2^n) */
unsigned int gfporder;

/* force GFP flags, e.g. GFP_DMA */
gfp_t allocflags;

size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
struct kmem_cache *freelist_cache;
unsigned int freelist_size;

/* constructor func */
void (*ctor)(void *obj);

/* 4) cache creation/removal */
const char *name;
struct list_head list;
int refcount;
int object_size;
int align;

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
unsigned long num_active;
unsigned long num_allocations;
unsigned long high_mark;
unsigned long grown;
unsigned long reaped;
unsigned long errors;
unsigned long max_freeable;
unsigned long node_allocs;
unsigned long node_frees;
unsigned long node_overflow;
atomic_t allochit;
atomic_t allocmiss;
atomic_t freehit;
atomic_t freemiss;
#ifdef CONFIG_DEBUG_SLAB_LEAK
atomic_t store_user_clean;
#endif

/*
* If debugging is enabled, then the allocator can add additional
* fields and/or padding to every object. 'size' contains the total
* object size including these internal fields, while 'obj_offset'
* and 'object_size' contain the offset to the user object and its
* size.
*/
int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

#ifdef CONFIG_MEMCG
struct memcg_cache_params memcg_params;
#endif
#ifdef CONFIG_KASAN
struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
unsigned int *random_seq;
#endif

unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */

struct kmem_cache_node *node[MAX_NUMNODES];
};
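Note that cpu_cache is a __percpu pointer: each CPU reaches its own array_cache (described next) through it. In mm/slab.c the accessor is just a thin wrapper, roughly:

// ./mm/slab.c (for reference)
static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
{
	return this_cpu_ptr(cachep->cpu_cache);
}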

array_cache

/*
* struct array_cache
*
* Purpose:
* - LIFO ordering, to hand out cache-warm objects from _alloc
* - reduce the number of linked list operations
* - reduce spinlock operations
*
* The limit is stored in the per-cpu structure to reduce the data cache
* footprint.
*
*/
// During early init the per-CPU cache is set up in setup_cpu_cache(), with limit and batchcount both equal to 1.
// At that stage the shared cache has not been created yet.
// Once initialization finishes, the per-CPU cache and the shared cache are re-initialized in
// kmem_cache_init_late()->enable_cpucache(); limit, batchcount, etc. are then derived from the cache's object size and similar factors.
// Whether the shared cache is created depends on the slab cache's size and the number of CPUs.
struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
unsigned int touched;
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
*/
};
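To see why this LIFO array helps, here is a minimal sketch of the per-CPU allocation fast path. This is my own simplification, not the kernel code; the real logic lives in ____cache_alloc() and also handles refilling and statistics:

// Simplified fast path: pop the most recently freed (cache-warm) object
static void *fastpath_alloc(struct array_cache *ac)
{
	if (ac->avail) {
		ac->touched = 1;
		return ac->entry[--ac->avail];   /* LIFO pop: no list walk, no spinlock */
	}
	/* empty: the slow path refills entry[] with batchcount objects
	 * taken from the kmem_cache_node lists */
	return NULL;
}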

kmem_cache_node

/*
* The slab lists for all objects.
*/
struct kmem_cache_node {
spinlock_t list_lock; // spin_lock_init - all of these fields are initialized in kmem_cache_node_init()

#ifdef CONFIG_SLAB
struct list_head slabs_partial; // INIT_LIST_HEAD /* partial list first, better asm code */
struct list_head slabs_full; // INIT_LIST_HEAD
struct list_head slabs_free; // INIT_LIST_HEAD
unsigned long total_slabs; // 0 /* length of all slab lists */
unsigned long free_slabs; // 0 /* length of free slab list only */
unsigned long free_objects; // 0
unsigned int free_limit;
unsigned int colour_next; // 0 /* Per-node cache coloring */
struct array_cache *shared; // NULL /* shared per node */
struct alien_cache **alien; // NULL /* on other nodes */
unsigned long next_reap; /* updated without locking */
int free_touched; // 0 /* updated without locking */
#endif

#ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
struct list_head full;
#endif
#endif
};
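As a quick orientation (my own summary, not a comment from the source), a slab page managed by a kmem_cache_node moves between the three lists like this as objects are allocated and freed:

// slabs_free    --(first object allocated)-->      slabs_partial
// slabs_partial --(last free object allocated)-->  slabs_full
// slabs_full    --(one object freed)-->            slabs_partial
// slabs_partial --(all objects freed)-->           slabs_free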

Initialization of the slab system

With the structures covered, let's now see how the slab system is initialized.

kmem_cache_init()

With the basic background on slab in place, the first function to look at is kmem_cache_init(). It touches on some details of slab cache allocation; I will not dig into those here - slab allocation and freeing will be discussed in detail in a later note.
kmem_cache_init() mainly solves a simple but important problem - how to create the first cache of type kmem_cache (which is also the first slab cache). Why is this a problem? Whenever we create a slab cache, we need the slab allocator to give us a kmem_cache object. When the very first slab cache is created, there is no kmem_cache cache anywhere in the kernel (so no kmem_cache object can be obtained), which means the kernel would first have to create a kmem_cache cache and then allocate the object from it. But creating that kmem_cache cache itself requires allocating a kmem_cache object via slab. We end up going in circles: we need a kmem_cache cache, yet creating it requires a kmem_cache cache, so the first one could never be created. kmem_cache_init() breaks this loop by using a statically allocated kmem_cache as the first cache. Let's now see how the function is implemented.
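For reference, the statically allocated pieces that kmem_cache_init() relies on look roughly like this in mm/slab.c (quoted from a kernel of this era; exact field values may differ between versions):

// ./mm/slab.c (approximate excerpt)
#define NUM_INIT_LISTS (2 * MAX_NUMNODES)
static struct kmem_cache_node __initdata init_kmem_cache_node[NUM_INIT_LISTS];

static struct kmem_cache kmem_cache_boot = {
	.batchcount = 1,
	.limit = BOOT_CPUCACHE_ENTRIES,
	.shared = 1,
	.size = sizeof(struct kmem_cache),
	.name = "kmem_cache",
};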

// ./mm/slab.c

/*
* Initialisation. Called after the page allocator have been initialised and
* before smp_init().
*/
void __init kmem_cache_init(void)
{
int i;

// kmem_cache_boot is the statically defined slab cache of type kmem_cache
// kmem_cache is a global variable
kmem_cache = &kmem_cache_boot;

if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
use_alien_caches = 0;

// The kernel statically defines NUM_INIT_LISTS kmem_cache_node structures; here NUM_INIT_LISTS equals 2
for (i = 0; i < NUM_INIT_LISTS; i++)
// Initialize the statically defined kmem_cache_node
kmem_cache_node_init(&init_kmem_cache_node[i]);

/*
* Fragmentation resistance on low memory - only use bigger
* page orders on machines with more than 32MB of memory if
* not overridden on the command line.
*/
if (!slab_max_order_set && totalram_pages() > (32 << 20) >> PAGE_SHIFT)
slab_max_order = SLAB_MAX_ORDER_HI;

/* Bootstrap is tricky, because several objects are allocated
* from caches that do not exist yet:
* 1) initialize the kmem_cache cache: it contains the struct
* kmem_cache structures of all caches, except kmem_cache itself:
* kmem_cache is statically allocated.
* Initially an __init data area is used for the head array and the
* kmem_cache_node structures, it's replaced with a kmalloc allocated
* array at the end of the bootstrap.
* 2) Create the first kmalloc cache.
* The struct kmem_cache for the new cache is allocated normally.
* An __init data area is used for the head array.
* 3) Create the remaining kmalloc caches, with minimally sized
* head arrays.
* 4) Replace the __init data head arrays for kmem_cache and the first
* kmalloc cache with kmalloc allocated arrays.
* 5) Replace the __init data for kmem_cache_node for kmem_cache and
* the other cache's with kmalloc allocated memory.
* 6) Resize the head arrays of the kmalloc caches to their final sizes.
*/

/*
* struct kmem_cache size depends on nr_node_ids & nr_cpu_ids
*/
// -------------------------------------------------------------------------------------------------- (1)
// Finish initializing the statically defined kmem_cache
create_boot_cache(kmem_cache,
"kmem_cache",
// On a UMA system the size of kmem_cache is fixed, because there is only one node
// On a NUMA system the size of kmem_cache has to be computed as follows
offsetof(struct kmem_cache, node) + nr_node_ids * sizeof(struct kmem_cache_node *),
SLAB_HWCACHE_ALIGN,
0,
0);

// Link the newly created kmem_cache onto slab_caches; it is the first slab cache (of type kmem_cache) on that list
list_add(&kmem_cache->list, &slab_caches);

// Empty function when CONFIG_MEMCG_KMEM is not defined
memcg_link_cache(kmem_cache);

// Update slab_state; so far 1) the kmem_cache slab cache has been created
slab_state = PARTIAL;

/*
* Initialize the caches that provide memory for the kmem_cache_node
* structures first. Without this, further allocations will bug.
*/
// -------------------------------------------------------------------------------------------------- (2)
// Create a generic (kmalloc) cache whose size fits struct kmem_cache_node. This requires allocating an object
// from the kmem_cache cache created above; the allocation details are not covered here.
//
// 1. #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
// 2. kmalloc_info provides the name and size of the generic cache;
//
// Rough call flow:
// 1. kmem_cache_zalloc() obtains a kmem_cache object from kmem_cache (the cache created at (1));
// 2. create_boot_cache() initializes this kmem_cache, with the size, name, etc. taken from kmalloc_info[];
// 3. the generic cache is linked into slab_caches and finally stored in kmalloc_caches.
//
// Note that no backing memory is allocated for this generic cache yet; pages will be obtained from the buddy system during later allocations
kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache(kmalloc_info[INDEX_NODE].name, // name
kmalloc_size(INDEX_NODE), // size
ARCH_KMALLOC_FLAGS, // flag
0, // useroffset
kmalloc_size(INDEX_NODE)); // usersize

// Update slab_state; so far 1) the kmem_cache slab cache has been created;
// 2) a generic slab cache sized for struct kmem_cache_node has been created
slab_state = PARTIAL_NODE;

// Patch up size_index[] for a few special cases
setup_kmalloc_cache_index_table();

slab_early_init = 0;

/* 5) Replace the bootstrap kmem_cache_node */
{
int nid;
// Step (2) created a generic cache usable for kmem_cache_node; here we mainly do the following:
// 1. allocate kmem_cache_node objects from that generic cache;
// 2. copy the contents of init_kmem_cache_node[] into those objects;
// 3. point kmem_cache->node at the new objects.
//
// Afterwards, kmem_cache and kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] both have dynamically allocated kmem_cache_node structures
for_each_online_node(nid) {
init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);

init_list(kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE],
&init_kmem_cache_node[SIZE_NODE + nid], nid);
}
}

// -------------------------------------------------------------------------------------------------- (3)
// Create all remaining generic caches (in the same way as (2): ultimately via create_kmalloc_cache()) and set slab_state = UP
// Up to this point, shared is still NULL in every kmem_cache_node
create_kmalloc_caches(ARCH_KMALLOC_FLAGS);
}

Once kmem_cache_init() finishes, the slab system is ready for normal use. The inline comments explain the function in detail, and the important parts are marked (1), (2), (3). One more point worth adding: after (1) completes, the kmem_cache cache has been "created", but created here only means the kmem_cache structure has been initialized - no memory has actually been allocated for it yet. Only when the first object is allocated does the slab allocator call into the buddy system for pages; once the pages arrive, the slab code splits them into individual objects that can then be allocated and freed.
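To make the "no memory until the first allocation" point concrete, here is a hypothetical usage sketch using the standard slab API (names like my_obj are made up; this is not code from the kernel or from this walkthrough). Pages are only taken from the buddy system inside the first kmem_cache_alloc():

// Hypothetical kernel-side usage of a slab cache
struct my_obj {
	int id;
	char buf[60];
};

static struct kmem_cache *my_cachep;

static int my_init(void)
{
	struct my_obj *p;

	/* Creates and initializes the kmem_cache; no slab pages are allocated yet */
	my_cachep = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
			0, SLAB_HWCACHE_ALIGN, NULL);
	if (!my_cachep)
		return -ENOMEM;

	/* First allocation: the cache grows, i.e. pages come from the buddy
	 * system and are carved into objects */
	p = kmem_cache_alloc(my_cachep, GFP_KERNEL);
	if (p)
		kmem_cache_free(my_cachep, p);
	return 0;
}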
In the code comments above I have already explained parts (2) and (3), so here we only look at the function at (1).

create_boot_cache()

This function first performs some simple initialization on the cache being created and then calls __kmem_cache_create() to finish the job. Let's first see which functions end up calling __kmem_cache_create().

// As the call trees below show, creating any new slab cache goes through __kmem_cache_create()
// In the next note we will see that kmem_cache_create() also ends up calling __kmem_cache_create()

create_boot_cache()
---> __kmem_cache_create()

create_kmalloc_cache()
---> create_boot_cache()
---> __kmem_cache_create()

create_kmalloc_caches()
---> create_kmalloc_cache()
---> create_boot_cache()
---> __kmem_cache_create()

kmem_cache_create()
---> kmem_cache_create_usercopy()
---> create_cache()
---> __kmem_cache_create()

Next, let's step into create_boot_cache().

// ./mm/slab_common.c

/* Create a cache during boot when no slab services are available yet */
void __init create_boot_cache(struct kmem_cache *s, const char *name,
unsigned int size, slab_flags_t flags,
unsigned int useroffset, unsigned int usersize)
{
int err;

s->name = name;
s->size = s->object_size = size;
s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size);
s->useroffset = useroffset;
s->usersize = usersize;

// Empty function when CONFIG_MEMCG_KMEM is not defined
slab_init_memcg_params(s);

err = __kmem_cache_create(s, flags);

if (err)
panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n", name, size, err);

s->refcount = -1; /* Exempt from merging for now */
}
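A small detail worth knowing here is how calculate_alignment() turns SLAB_HWCACHE_ALIGN into a concrete alignment. Roughly (paraphrased from mm/slab_common.c; details may vary by version), it halves the cache line size while the object still fits in half of it, so small objects are not forced onto full cache-line boundaries:

// ./mm/slab_common.c (approximate excerpt)
static unsigned int calculate_alignment(slab_flags_t flags,
		unsigned int align, unsigned int size)
{
	if (flags & SLAB_HWCACHE_ALIGN) {
		unsigned int ralign = cache_line_size();

		/* shrink the alignment while the object fits into half of it */
		while (size <= ralign / 2)
			ralign /= 2;
		align = max(align, ralign);
	}

	if (align < ARCH_SLAB_MINALIGN)
		align = ARCH_SLAB_MINALIGN;

	return ALIGN(align, sizeof(void *));
}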

__kmem_cache_create()

This function is fairly long and takes some time to work through. It is a general-purpose function: besides being used here, kmem_cache_create(), the generic API for creating slab caches, also ends up calling it.

//  ./mm/slab.c

/**
* __kmem_cache_create - Create a cache.
* @cachep: cache management descriptor
* @flags: SLAB flags
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache.
*
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/
int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
{
size_t ralign = BYTES_PER_WORD;
gfp_t gfp;
int err;
unsigned int size = cachep->size;

// Set the debug flags (SLAB_RED_ZONE, SLAB_STORE_USER, SLAB_POISON) depending on various conditions
#if DEBUG
#if FORCED_DEBUG
/*
* Enable redzoning and last user accounting, except for caches with
* large objects, if the increased size would increase the object size
* above the next power of two: caches with object sizes just above a
* power of two have a significant amount of internal fragmentation.
*/
if (size < 4096 || fls(size - 1) == fls(size - 1 + REDZONE_ALIGN + 2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
if (!(flags & SLAB_TYPESAFE_BY_RCU))
flags |= SLAB_POISON;
#endif
#endif

// --------------------------------------------------------------------------------------------- (1)
// Align the object size to a word boundary
/*
* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/
size = ALIGN(size, BYTES_PER_WORD);

// Update the object alignment
if (flags & SLAB_RED_ZONE) {
// #define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long))
ralign = REDZONE_ALIGN;
/* If redzoning, ensure that the second redzone is suitably
* aligned, by adjusting the object size accordingly. */
// Adjust the object size alignment again
size = ALIGN(size, REDZONE_ALIGN);
}

/* 3) caller mandated alignment */
if (ralign < cachep->align) {
ralign = cachep->align;
}
/* disable debug if necessary */
if (ralign > __alignof__(unsigned long long))
flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
/*
* 4) Store it.
*/
// Store the final cache alignment
cachep->align = ralign;

// Initialize and update the slab colouring offset
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < cachep->align)
cachep->colour_off = cachep->align;

// Check whether the slab allocator is already up (slab_state >= UP)
if (slab_is_available())
gfp = GFP_KERNEL;
else
gfp = GFP_NOWAIT;

// Update obj_offset and size according to the debug flags
#if DEBUG
/*
* Both debugging options require word-alignment which is calculated
* into align above.
*/
if (flags & SLAB_RED_ZONE) {
/* add space for red zone words */
cachep->obj_offset += sizeof(unsigned long long);
size += 2 * sizeof(unsigned long long);
}
if (flags & SLAB_STORE_USER) {
/* user store requires one word storage behind the end of
* the real object. But if the second red zone needs to be
* aligned to 64 bits, we must allow that much space.
*/
if (flags & SLAB_RED_ZONE)
size += REDZONE_ALIGN;
else
size += BYTES_PER_WORD;
}
#endif

// Empty function when CONFIG_KASAN is not defined
kasan_cache_create(cachep, &size, &flags);

// Align the object size once more
size = ALIGN(size, cachep->align);

/*
* We should restrict the number of objects in a slab to implement
* byte sized index. Refer comment on SLAB_OBJ_MIN_SIZE definition.
*/
// See the comment on SLAB_OBJ_MIN_SIZE. In short: if a single byte is used as the
// freelist index, only 2 ^ 8 objects can be addressed, so in that case the object
// size must not be too small, otherwise one slab page could hold more than
// 2 ^ 8 objects
if (FREELIST_BYTE_INDEX && size < SLAB_OBJ_MIN_SIZE)
size = ALIGN(SLAB_OBJ_MIN_SIZE, cachep->align);

// --------------------------------------------------------------------------------------------- (2)
#if DEBUG
/*
* To activate debug pagealloc, off-slab management is necessary
* requirement. In early phase of initialization, small sized slab
* doesn't get initialized so it would not be possible. So, we need
* to check size >= 256. It guarantees that all necessary small
* sized slab is initialized in current slab initialization sequence.
*/
if (size >= 256 && cachep->object_size > cache_line_size()) {
if (size < PAGE_SIZE || size % PAGE_SIZE == 0) {
size_t tmp_size = ALIGN(size, PAGE_SIZE);
if (set_off_slab_cache(cachep, tmp_size, flags)) {
flags |= CFLGS_OFF_SLAB;
cachep->obj_offset += tmp_size - size;
size = tmp_size;
goto done;
}
}
}
#endif

// Case 1: all freelist_idx_t entries fit inside a single object
// Rejected when:
// if (cachep->num * sizeof(freelist_idx_t) > cachep->object_size) return false;
if (set_objfreelist_slab_cache(cachep, size, flags)) {
flags |= CFLGS_OBJFREELIST_SLAB;
goto done;
}

// Case 2: all freelist_idx_t entries live in an external (off-slab) cache
// Rejected when:
// if (left >= cachep->num * sizeof(freelist_idx_t)) return false;
if (set_off_slab_cache(cachep, size, flags)) {
flags |= CFLGS_OFF_SLAB;
goto done;
}

// Case 3: the basic case - the freelist_idx_t array lives inside the slab itself
if (set_on_slab_cache(cachep, size, flags))
goto done;

return -E2BIG;

// --------------------------------------------------------------------------------------------- (3)
done:
// Update the cache's flags, size, etc.
cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
cachep->flags = flags;
cachep->allocflags = __GFP_COMP;
if (flags & SLAB_CACHE_DMA)
cachep->allocflags |= GFP_DMA;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);

#if DEBUG
/*
* If we're going to use the generic kernel_map_pages()
* poisoning, then it's going to smash the contents of
* the redzone and userword anyhow, so switch them off.
*/
// With CONFIG_PAGE_POISONING enabled, a known pattern is written into every object;
// that pattern would clobber the redzone and user-tracking words, so (SLAB_RED_ZONE | SLAB_STORE_USER) must be cleared
if (IS_ENABLED(CONFIG_PAGE_POISONING) && (cachep->flags & SLAB_POISON) && is_debug_pagealloc_cache(cachep))
cachep->flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif

// Pick a generic slab cache to hold the off-slab freelist
if (OFF_SLAB(cachep)) {
cachep->freelist_cache = kmalloc_slab(cachep->freelist_size, 0u);
}

// Set up the per-CPU cache
err = setup_cpu_cache(cachep, gfp);
if (err) {
__kmem_cache_release(cachep);
return err;
}

return 0;
}

This function is long, so I split it into three parts:
Part 1 is everything before (2): it mainly sets up the object size, alignment and flags; in DEBUG mode it may add RED_ZONE and POISON areas to each object.
Part 2 lies between (2) and (3): it decides how and where the freelist is stored and sets the corresponding flags;
Part 3 is the tail of the function: it updates a few members of kmem_cache and sets up the per-CPU cache.

Of the three, parts 2 and 3 matter most, so let's start with part 2.
It revolves around three functions - set_objfreelist_slab_cache(), set_off_slab_cache() and set_on_slab_cache(). They are very similar, so analyzing the first one is enough; the sketch below shows where the freelist ends up in each case.
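Here is a rough picture (my own drawing, based on the three cases above) of where the freelist ends up in each layout:

// OBJFREELIST_SLAB: the freelist borrows the space of one free object
//   [ obj | obj | ... | obj (holds freelist_idx_t[]) ]
//
// OFF_SLAB: the freelist lives in a separate generic (kmalloc) cache
//   [ obj | obj | ... | obj ]        freelist_idx_t[] -> another slab page
//
// ON_SLAB: the freelist sits at the end of the same slab page(s)
//   [ obj | obj | ... | obj | freelist_idx_t[] | left_over ]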

set_objfreelist_slab_cache()

// ./mm/slab.c

static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
return false;

// ----------------------------------------------------------------------- (1)
left = calculate_slab_order(cachep, size, flags | CFLGS_OBJFREELIST_SLAB);
if (!cachep->num)
return false;

// ----------------------------------------------------------------------- (2)
// If all freelist_idx_t entries cannot fit inside one object, this layout cannot be used
if (cachep->num * sizeof(freelist_idx_t) > cachep->object_size)
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

(1) This function mainly does two things: 1) it finds the smallest page order that satisfies the constraints, which the buddy system will use when allocating pages for this slab; 2) it computes how much space is left over in those pages once the objects (and the freelist) have been laid out. Let's look at it now.

calculate_slab_order()

// ./mm/slab.c

/**
* calculate_slab_order - calculate size (page order) of slabs
* @cachep: pointer to the cache that is being created
* @size: size of objects to be created in this cache.
* @flags: slab allocation flags
*
* Also calculates the number of objects per slab.
*
* This could be made much more intelligent. For now, try to avoid using
* high order pages for slabs. When the gfp() functions are more friendly
* towards high-order requests, this should be changed.
*/
static size_t calculate_slab_order(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left_over = 0;
int gfporder;

for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
unsigned int num;
size_t remainder;
// Compute how many objects a slab of this order can hold at most
num = cache_estimate(gfporder, size, flags, &remainder);
if (!num)
continue;

/* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
// A slab cannot hold more than SLAB_OBJ_MAX_NUM objects; stop increasing the order here
if (num > SLAB_OBJ_MAX_NUM)
break;

// For the OFF_SLAB case, also check whether a generic cache of a suitable size exists
// During bootstrap, CFLGS_OFF_SLAB is not set in flags
if (flags & CFLGS_OFF_SLAB) {
struct kmem_cache *freelist_cache;
size_t freelist_size;

freelist_size = num * sizeof(freelist_idx_t);
freelist_cache = kmalloc_slab(freelist_size, 0u);
if (!freelist_cache)
continue;

/*
* Needed to avoid possible looping condition
* in cache_grow_begin()
*/
if (OFF_SLAB(freelist_cache))
continue;

/* check if off slab has enough benefit */
if (freelist_cache->size > cachep->size / 2)
continue;
}

// The current order is acceptable; record it in the kmem_cache first
/* Found something acceptable - save it away */
cachep->num = num;
cachep->gfporder = gfporder;
left_over = remainder;

/*
* A VFS-reclaimable slab tends to have most allocations
* as GFP_NOFS and we really don't want to have to be allocating
* higher-order pages when we are unable to shrink dcache.
*/
if (flags & SLAB_RECLAIM_ACCOUNT)
break;

/*
* Large number of objects is good, but very large slabs are
* currently bad for the gfp()s.
*/
if (gfporder >= slab_max_order)
break;

/*
* Acceptable internal fragmentation?
*/
if (left_over * 8 <= (PAGE_SIZE << gfporder))
break;
}
return left_over;
}

/*
* Calculate the number of objects and left-over bytes for a given buffer size.
*/
static unsigned int cache_estimate(unsigned long gfporder, size_t buffer_size,
slab_flags_t flags, size_t *left_over)
{
unsigned int num;
size_t slab_size = PAGE_SIZE << gfporder;

/*
* The slab management structure can be either off the slab or
* on it. For the latter case, the memory allocated for a
* slab is used for:
*
* - @buffer_size bytes for each object
* - One freelist_idx_t for each object
*
* We don't need to consider alignment of freelist because
* freelist will be at the end of slab page. The objects will be
* at the correct alignment.
*
* If the slab management structure is off the slab, then the
* alignment will already be calculated into the size. Because
* the slabs are all pages aligned, the objects will be at the
* correct alignment when allocated.
*/
// In these two cases the space taken by freelist_idx_t can be ignored. In the first case the whole
// freelist lives inside one object, which counts as an ordinary object; in the second case the freelist lives in other pages.
if (flags & (CFLGS_OBJFREELIST_SLAB | CFLGS_OFF_SLAB)) {
num = slab_size / buffer_size;
*left_over = slab_size % buffer_size;
// Otherwise each object is paired with one freelist_idx_t, so the two are sized together as a unit
} else {
num = slab_size / (buffer_size + sizeof(freelist_idx_t));
*left_over = slab_size %
(buffer_size + sizeof(freelist_idx_t));
}

return num;
}
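To make cache_estimate() concrete, here is a tiny user-space sketch with made-up numbers (4 KiB page, gfporder = 0, a hypothetical 128-byte object, 1-byte freelist_idx_t, on-slab freelist):

#include <stdio.h>

int main(void)
{
	unsigned long slab_size = 4096;   /* PAGE_SIZE << gfporder, with gfporder = 0 */
	unsigned long buffer_size = 128;  /* hypothetical object size */
	unsigned long idx_size = 1;       /* sizeof(freelist_idx_t) for small slabs */

	/* on-slab case: each object is paired with one freelist index */
	unsigned long num = slab_size / (buffer_size + idx_size);        /* 31 objects */
	unsigned long left_over = slab_size % (buffer_size + idx_size);  /* 97 bytes   */

	printf("num=%lu left_over=%lu\n", num, left_over);
	return 0;
}

calculate_slab_order() would then record num = 31 and, with a colour_off of 64 bytes, set cachep->colour = 97 / 64 = 1.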

With the groundwork above, the next two functions are easy to follow - the comments alone should be enough.

set_off_slab_cache()

// ./mm/slab.c

static bool set_off_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

/*
* Always use on-slab management when SLAB_NOLEAKTRACE
* to avoid recursive calls into kmemleak.
*/
if (flags & SLAB_NOLEAKTRACE)
return false;

/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
left = calculate_slab_order(cachep, size, flags | CFLGS_OFF_SLAB);
if (!cachep->num)
return false;

/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
// If the pages holding the objects have enough room left to also hold all freelist_idx_t entries, this layout is not used
if (left >= cachep->num * sizeof(freelist_idx_t))
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

set_on_slab_cache()

// ./mm/slab.c

static bool set_on_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

left = calculate_slab_order(cachep, size, flags);
// This is the last resort; any non-zero number of objects is acceptable
if (!cachep->num)
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

Back to __kmem_cache_create()

At this point we can return to the third part of __kmem_cache_create(); here is that code again.

    // --------------------------------------------------------------------------------------------- (3)
done:
// Update the cache's flags, size, etc.
cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
cachep->flags = flags;
cachep->allocflags = __GFP_COMP;
if (flags & SLAB_CACHE_DMA)
cachep->allocflags |= GFP_DMA;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);

#if DEBUG
/*
* If we're going to use the generic kernel_map_pages()
* poisoning, then it's going to smash the contents of
* the redzone and userword anyhow, so switch them off.
*/
// With CONFIG_PAGE_POISONING enabled, a known pattern is written into every object;
// that pattern would clobber the redzone and user-tracking words, so (SLAB_RED_ZONE | SLAB_STORE_USER) must be cleared
if (IS_ENABLED(CONFIG_PAGE_POISONING) && (cachep->flags & SLAB_POISON) && is_debug_pagealloc_cache(cachep))
cachep->flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif

// Pick a generic slab cache to hold the off-slab freelist
// During bootstrap this branch is never taken, because (2) above has already decided that CFLGS_OFF_SLAB is not set
if (OFF_SLAB(cachep)) {
// Look up a generic cache in kmalloc_caches[] based on freelist_size
cachep->freelist_cache = kmalloc_slab(cachep->freelist_size, 0u);
}

// Set up the per-CPU cache; discussed in detail below
err = setup_cpu_cache(cachep, gfp);
if (err) {
__kmem_cache_release(cachep);
return err;
}

return 0;

The comments in the code above make things clear enough; the only function left to look at is setup_cpu_cache().

setup_cpu_cache()

As its name suggests, this function mainly sets up the per-CPU cache.
This part is fairly simple; besides that, the function also fills in a few cpu_cache-related fields of kmem_cache.

// ./mm/slab.c
// __kmem_cache_create()->setup_cpu_cache()

static int __ref setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
{
if (slab_state >= FULL)
// --------------------------------------------------------------------- (1)
// Once slab initialization has fully finished, the per-CPU cache can be initialized properly
return enable_cpucache(cachep, gfp);

// Allocate the cpu_cache object; this is a per-CPU allocation (not a slab allocation) with room for just one entry
cachep->cpu_cache = alloc_kmem_cache_cpus(cachep, 1, 1);
if (!cachep->cpu_cache)
return 1;

// Depending on slab_state, point cachep->node[] at entries of init_kmem_cache_node[]
if (slab_state == DOWN) {
/* Creation of first cache (kmem_cache). */
set_up_node(kmem_cache, CACHE_CACHE);
} else if (slab_state == PARTIAL) {
/* For kmem_cache_node */
set_up_node(cachep, SIZE_NODE);
} else {
int node;

for_each_online_node(node) {
cachep->node[node] = kmalloc_node(
sizeof(struct kmem_cache_node), gfp, node);
BUG_ON(!cachep->node[node]);
kmem_cache_node_init(cachep->node[node]);
}
}

// Set up the cache reaping (reclaim) timing
cachep->node[numa_mem_id()]->next_reap =
jiffies + REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;

// Initialize the per-CPU array_cache fields
cpu_cache_get(cachep)->avail = 0;
cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
cpu_cache_get(cachep)->batchcount = 1;
cpu_cache_get(cachep)->touched = 0;
// Initialize the corresponding kmem_cache fields
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
return 0;
}

The function at (1) does a lot, and I will not analyze it in detail here. In short, based on the slab cache's object size, the number of CPUs, and so on, it computes ac->limit, ac->batchcount, shared (the shared field of kmem_cache), etc.; at the same time, limit and batchcount in kmem_cache are initialized to the same values as their per-CPU counterparts. Besides initializing the per-CPU caches, it also initializes the kmem_cache_node structures, where one important field is n->shared. The following snippet performs the initialization of n->shared - note how limit and batchcount are assigned.

// ./mm/slab.c

// setup_kmem_cache_node()
// The size of the shared cache is determined by cachep->shared, which is set in enable_cpucache():
// shared = 0; if (cachep->size <= PAGE_SIZE && num_possible_cpus() > 1) shared = 8;
// As the source shows, with only one CPU, shared stays 0
if (cachep->shared)
{
new_shared = alloc_arraycache(node,
cachep->shared * cachep->batchcount,
0xbaadf00d,
gfp);
}

// ###########################################################################
// alloc_arraycache()
// ###########################################################################
static struct array_cache *alloc_arraycache(int node, int entries, int batchcount, gfp_t gfp)
{
size_t memsize = sizeof(void *) * entries + sizeof(struct array_cache);
struct array_cache *ac = NULL;

ac = kmalloc_node(memsize, gfp, node);
init_arraycache(ac, entries, batchcount);
return ac;
}

// ###########################################################################
// init_arraycache()
// ###########################################################################
static void init_arraycache(struct array_cache *ac, int limit, int batch)
{
/*
* The array_cache structures contain pointers to free object.
* However, when such objects are allocated or transferred to another
* cache the pointers are not cleared and they could be counted as
* valid references during a kmemleak scan. Therefore, kmemleak must
* not scan such objects.
*/
kmemleak_no_scan(ac);
if (ac) {
ac->avail = 0;
ac->limit = limit;
ac->batchcount = batch;
ac->touched = 0;
}
}
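For reference, the size-based heuristic in enable_cpucache() that picks limit (and, from it, batchcount and shared) looks roughly like this in mm/slab.c of kernels from this period; treat the exact thresholds as approximate:

// enable_cpucache() (approximate excerpt)
if (cachep->size > 131072)
	limit = 1;
else if (cachep->size > PAGE_SIZE)
	limit = 8;
else if (cachep->size > 1024)
	limit = 24;
else if (cachep->size > 256)
	limit = 54;
else
	limit = 120;

shared = 0;
if (cachep->size <= PAGE_SIZE && num_possible_cpus() > 1)
	shared = 8;

// per-CPU and kmem_cache limit/batchcount are then applied via do_tune_cpucache()
err = do_tune_cpucache(cachep, limit, (limit + 1) / 2, shared, gfp);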

As the functions above show, until slab_state >= FULL neither the per-CPU caches nor kmem_cache_node->shared are fully initialized. So when does slab_state reach FULL? That happens in kmem_cache_init_late().

kmem_cache_init_late()

This function is called from start_kernel(), while kmem_cache_init() is called from start_kernel()->mm_init(). When it runs, it calls enable_cpucache() to re-initialize the per-CPU caches and the kmem_cache_node structures of every existing cache. After that, slab_state becomes FULL, and any slab cache created later is initialized in setup_cpu_cache() by calling enable_cpucache() directly.
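Roughly, the boot-time ordering looks like this (from init/main.c):

start_kernel()
---> mm_init()
     ---> kmem_cache_init()        // slab_state: DOWN -> PARTIAL -> PARTIAL_NODE -> UP
---> ... (much later)
---> kmem_cache_init_late()        // slab_state: UP -> FULL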

// ./mm/slab.c
// Every slab cache (both dedicated and generic) re-initializes its per-CPU caches, its shared caches and some kmem_cache fields

void __init kmem_cache_init_late(void)
{
struct kmem_cache *cachep;

/* 6) resize the head arrays to their final sizes */
mutex_lock(&slab_mutex);
list_for_each_entry(cachep, &slab_caches, list)
if (enable_cpucache(cachep, GFP_NOWAIT))
BUG();
mutex_unlock(&slab_mutex);

/* Done! */
slab_state = FULL;

#ifdef CONFIG_NUMA
/*
* Register a memory hotplug callback that initializes and frees
* node.
*/
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
#endif

/*
* The reap timers are started later, with a module init call: That part
* of the kernel is not yet operational.
*/
}

Summary

At this point both kmem_cache_init() and kmem_cache_init_late() have been covered. In the next note I will analyze how slab caches are created and how objects are allocated from them - another tough battle, let's go!
Finally, here is a simplified flow chart of slab cache initialization for reference. The red arrows indicate that create_kmalloc_cache() calls create_boot_cache(), and the blue arrows indicate that create_kmalloc_caches() calls create_kmalloc_cache().
