从零到负一

11. 【Slab】3 - Basic slab operations - kmem_cache_create()

2023/02/22

In the previous note, 10. 【slab】2 - Basic slab operations - kmem_cache_init(), I said that when we ran into kmem_cache_create() I would write a separate note about it. This is that note. There is a lot of material and it is rather scattered - my own write-up got messy too - so slow down, read carefully, and think it through.

kmem_cache_create() is mainly used to create dedicated caches, but during initialization (kmem_cache_init()) it is also used to set up the general caches. The function is fairly complex, so we will analyze it in several steps (a usage sketch follows the list below):

  1. Compute the alignment value;
  2. Obtain a cache descriptor;
  3. Compute the size of the cache's objects, the leftover space, and so on;
  4. Compute the slab colouring variables;
  5. Call enable_cpucache();
  6. Perform some final checks and insert the cache into cache_chain.
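
Before diving in, here is a minimal, hypothetical usage sketch of the API being dissected (2.6-era signature, matching the source reproduced below; my_obj, my_cache, and my_module_init are made-up names for illustration):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>

/* A made-up object type managed by a dedicated cache. */
struct my_obj {
    int id;
    char payload[60];
};

static kmem_cache_t *my_cache;

static int __init my_module_init(void)
{
    /* All six steps listed above happen inside this one call. */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (!my_cache)
        return -ENOMEM;

    /* Objects are later taken from (and returned to) the cache: */
    {
        struct my_obj *p = kmem_cache_alloc(my_cache, SLAB_KERNEL);
        if (p)
            kmem_cache_free(my_cache, p);
    }
    return 0;
}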

Obtaining the cache descriptor

This part covers steps 1 and 2 above and involves kmem_cache_alloc(), which is a huge topic on its own, so I plan to open a separate note for that function (yes, yet another note to write).
Apart from kmem_cache_alloc(), this part only does two simple jobs - rounding size up to a BYTES_PER_WORD boundary, and adjusting align (the buffer alignment) as conditions require. In short, it is all about alignment.
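
To make the rounding concrete, here is a minimal userspace sketch of the same bit-twiddling (the value 21 is just an example; BYTES_PER_WORD stands in for the kernel macro):

#include <stdio.h>

#define BYTES_PER_WORD sizeof(void *) /* 4 on a 32-bit machine */

int main(void)
{
    size_t size = 21; /* example object size */

    /* Same rounding as in kmem_cache_create() below. */
    if (size & (BYTES_PER_WORD - 1)) {
        size += (BYTES_PER_WORD - 1);
        size &= ~(BYTES_PER_WORD - 1);
    }
    printf("%zu\n", size); /* prints 24 */
    return 0;
}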

/**
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
* @align: The required alignment for the objects.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache
* and the @dtor is run before the pages are handed back.
*
* @name must be valid until the cache is destroyed. This implies that
* the module calling this has to destroy the cache before getting
* unloaded.
*
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_NO_REAP - Don't automatically reap this cache when we're under
* memory pressure.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/
kmem_cache_t *
kmem_cache_create(const char *name,
size_t size,
size_t align,
unsigned long flags,
void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
size_t left_over, slab_size, ralign;
kmem_cache_t *cachep = NULL;

/*
* Sanity checks... these are all serious usage bugs.
*/
if ((!name) ||
in_interrupt() ||
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor)) {
printk(KERN_ERR "%s: Early error in slab %s\n",
__FUNCTION__, name);
BUG();
}

if (flags & SLAB_DESTROY_BY_RCU)
BUG_ON(dtor);

/*
* Always checks flags, a caller might be expecting debug
* support which isn't available.
*/
if (flags & ~CREATE_MASK)
BUG();

/* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/
// Round size up to a multiple of BYTES_PER_WORD
if (size & (BYTES_PER_WORD-1)) {
size += (BYTES_PER_WORD-1);
size &= ~(BYTES_PER_WORD-1);
}

/* calculate out the final buffer alignment: */
/* 1) arch recommendation: can be overridden for debug */
if (flags & SLAB_HWCACHE_ALIGN) {
/* Default alignment: as specified by the arch code.
* Except if an object is really small, then squeeze multiple
* objects into one cacheline.
*/
ralign = cache_line_size();
while (size <= ralign/2)
ralign /= 2;
} else {
ralign = BYTES_PER_WORD;
}
/* 2) arch mandated alignment: disables debug if necessary */
if (ralign < ARCH_SLAB_MINALIGN) {
ralign = ARCH_SLAB_MINALIGN;
if (ralign > BYTES_PER_WORD)
flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
}
/* 3) caller mandated alignment: disables debug if necessary */
if (ralign < align) {
ralign = align;
if (ralign > BYTES_PER_WORD)
flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
}
/* 4) Store it. Note that the debug code below can reduce
* the alignment to BYTES_PER_WORD.
*/
// Write the final buffer alignment back into align
align = ralign;

/* Get cache's description obj. */
// Key step - obtain a kmem_cache_t object from the slab system itself
cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto opps;
memset(cachep, 0, sizeof(kmem_cache_t));

...
}

Determining the slab layout and computing its colouring variables

This part is fairly easy to follow. We first compute repeatedly under various conditions until the slab's internal layout is settled - whether the slab descriptor and the objects live on the same pages, what the slab offset is, how much space is left over per slab, and so on. With that information, the leftover space tells us how many colours the slab can have, the value of the colour offset, etc. (a small worked example follows).
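
As a concrete, made-up illustration of the colour arithmetic at the end of this block - one 4096-byte page holding eight 488-byte objects, with a 64-byte cache line:

#include <stdio.h>

int main(void)
{
    /* Hypothetical slab: one 4096-byte page, 8 objects of 488 bytes. */
    unsigned int left_over  = 4096 - 8 * 488;          /* 192 bytes unused  */
    unsigned int colour_off = 64;                      /* cache_line_size() */
    unsigned int colour     = left_over / colour_off;  /* number of colours */

    /* Successive slabs place their objects at offsets 0, 64, 128, 0, ... */
    printf("colours = %u\n", colour); /* prints: colours = 3 */
    return 0;
}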

    /* Determine if the slab management is 'on' or 'off' slab. */
if (size >= (PAGE_SIZE>>3))
/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
// The larger the object, the more it pays to keep the slab and object descriptors off-slab
flags |= CFLGS_OFF_SLAB;

// Align size once more, this time to align
size = ALIGN(size, align);

// Start working out the slab's internal layout
if ((flags & SLAB_RECLAIM_ACCOUNT) && size <= PAGE_SIZE) {
/*
* A VFS-reclaimable slab tends to have most allocations
* as GFP_NOFS and we really don't want to have to be allocating
* higher-order pages when we are unable to shrink dcache.
*/
cachep->gfporder = 0;
cache_estimate(cachep->gfporder, size, align, flags, &left_over, &cachep->num);
} else {
/*
* Calculate size (in pages) of slabs, and the num of objs per
* slab. This could be made much more intelligent. For now,
* try to avoid using high page-orders for slabs. When the
* gfp() funcs are more friendly towards high-order requests,
* this should be changed.
*/
do {
unsigned int break_flag = 0;
cal_wastage:
cache_estimate(cachep->gfporder, size, align, flags, &left_over, &cachep->num);
if (break_flag)
break;
if (cachep->gfporder >= MAX_GFP_ORDER)
break;
// If not even one object fits, the page order has to be increased
if (!cachep->num)
goto next;
// A special failure case: step the page order back down and recompute once
if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) {
/* This num of objs will cause problems. */
cachep->gfporder--;
break_flag++;
goto cal_wastage;
}

/*
* Large num of objs is good, but v. large slabs are
* currently bad for the gfp()s.
*/
if (cachep->gfporder >= slab_break_gfp_order)
break;

if ((left_over * 8) <= (PAGE_SIZE << cachep->gfporder))
break; /* Acceptable internal fragmentation. */
next:
cachep->gfporder++;
} while (1);
}

// If no object fits even after raising the order, report the error and free the cache descriptor allocated earlier
if (!cachep->num) {
printk("kmem_cache_create: couldn't create cache %s.\n", name);
// Free the cachep that was already allocated
kmem_cache_free(&cache_cache, cachep);
cachep = NULL;
goto opps;
}
slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab), align);

/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
flags &= ~CFLGS_OFF_SLAB;
left_over -= slab_size;
}

if (flags & CFLGS_OFF_SLAB) {
/* really off slab. No need for manual alignment */
// If the slab descriptor is off-slab, no extra alignment is needed
slab_size = cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
}

// From here on, compute the slab colouring variables
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < align)
cachep->colour_off = align;
cachep->colour = left_over / cachep->colour_off;
cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
spin_lock_init(&cachep->spinlock);
cachep->objsize = size;
/* NUMA */
INIT_LIST_HEAD(&cachep->lists.slabs_full);
INIT_LIST_HEAD(&cachep->lists.slabs_partial);
INIT_LIST_HEAD(&cachep->lists.slabs_free);

if (flags & CFLGS_OFF_SLAB)
// slabp_cache is the general cache that holds the slab descriptors
// This helper is simple: walk malloc_sizes and return the smallest general cache that fits slab_size
cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0);
cachep->ctor = ctor;
cachep->dtor = dtor;
cachep->name = name;

Initializing the CPU-local caches

Once the steps above are done, the cache descriptor and the slab layout (including colouring) are all in place; what remains is to initialize the CPU-local caches. The process is fairly fiddly, so take it slowly. For reference, the array_cache structure at the heart of it is sketched right below.
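
These are the array_cache descriptor and the two helpers used throughout the rest of this note, as defined in the same era of mm/slab.c (reproduced here so the code below is easier to follow); note how the array of object pointers lives directly behind the struct:

struct array_cache {
    unsigned int avail;      /* number of objects currently cached    */
    unsigned int limit;      /* maximum number of cached objects      */
    unsigned int batchcount; /* objects moved per refill/flush        */
    unsigned int touched;    /* set on allocation, read by the reaper */
};

/* The current CPU's local cache for this kmem_cache. */
static inline struct array_cache *ac_data(kmem_cache_t *cachep)
{
    return cachep->array[smp_processor_id()];
}

/* The object-pointer array stored immediately after the descriptor. */
static inline void **ac_entry(struct array_cache *ac)
{
    return (void **)(ac + 1);
}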

/* Don't let CPUs to come and go */
lock_cpu_hotplug();

// Once the slab system is fully initialized, allocate the CPU-local caches via enable_cpucache() -> that function is fairly involved and is covered separately below
if (g_cpucache_up == FULL) {
enable_cpucache(cachep);
} else {
// This part is fairly easy to follow. The global g_cpucache_up tracks whether slab initialization has finished: it never reads FULL before then,
// and it is NONE on the very first call, PARTIAL on every call after that
if (g_cpucache_up == NONE) {
/* Note: the first kmem_cache_create must create
* the cache that's used by kmalloc(24), otherwise
* the creation of further caches will BUG().
*/
cachep->array[smp_processor_id()] = &initarray_generic.cache;
g_cpucache_up = PARTIAL;
} else {
// On every call but the first, the general caches already exist for us to use
cachep->array[smp_processor_id()] = kmalloc(sizeof(struct arraycache_init),GFP_KERNEL);
}
BUG_ON(!ac_data(cachep));
ac_data(cachep)->avail = 0;
ac_data(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
ac_data(cachep)->batchcount = 1;
ac_data(cachep)->touched = 0;
// Keep these two cache-wide parameters consistent with the CPU-local cache
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
// TODO: I don't quite get this - is it [(local caches + the shared cache) * batchcount + the capacity of one slab]?
cachep->free_limit = (1 + num_online_cpus()) * cachep->batchcount + cachep->num;
}

cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep)%REAPTIMEOUT_LIST3;

enable_cpucache()

This function is used in two places: the first is kmem_cache_init(), covered in the previous note; the second is here, after kmem_cache_init() has already completed. Its job is to initialize the CPU-local caches and the shared cache, and almost all of the real work happens in do_tune_cpucache().

[Supplementary note] Whether we come here from kmem_cache_init() or from kmem_cache_create(), neither the local caches nor the shared cache has been fully initialized up to this point (in kmem_cache_init(), each general cache's local cache holds only a single object, and there is no shared cache at all).
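
For example, plugging a hypothetical 512-byte object size into the sizing heuristics in the code below (SMP kernel assumed) yields:

#include <stdio.h>

int main(void)
{
    /* objsize = 512 falls into the "> 256" bucket below, so limit = 54. */
    int limit      = 54;
    int batchcount = (limit + 1) / 2; /* 27, as passed to do_tune_cpucache() */
    int shared     = 8;               /* objsize <= PAGE_SIZE, CONFIG_SMP    */

    /* The shared array ends up sized batchcount * shared entries. */
    printf("local limit %d, batch %d, shared entries %d\n",
           limit, batchcount, batchcount * shared); /* 54, 27, 216 */
    return 0;
}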

static void enable_cpucache (kmem_cache_t *cachep)
{
int err;
int limit, shared;

/* The head array serves three purposes:
* - create a LIFO ordering, i.e. return objects that are cache-warm
* - reduce the number of spinlock operations.
* - reduce the number of linked list operations on the slab and
* bufctl chains: array operations are cheaper.
* The numbers are guessed, we should auto-tune as described by
* Bonwick.
*/
// Choose the local cache's maximum number of free objects based on the object size
if (cachep->objsize > 131072)
limit = 1;
else if (cachep->objsize > PAGE_SIZE)
limit = 8;
else if (cachep->objsize > 1024)
limit = 24;
else if (cachep->objsize > 256)
limit = 54;
else
limit = 120;

/* Cpu bound tasks (e.g. network routing) can exhibit cpu bound
* allocation behaviour: Most allocs on one cpu, most free operations
* on another cpu. For these cases, an efficient object passing between
* cpus is necessary. This is provided by a shared array. The array
* replaces Bonwick's magazine layer.
* On uniprocessor, it's functionally equivalent (but less efficient)
* to a larger limit. Thus disabled by default.
*/
shared = 0;
#ifdef CONFIG_SMP
if (cachep->objsize <= PAGE_SIZE)
shared = 8;
#endif

#if DEBUG
/* With debugging enabled, large batchcount lead to excessively
* long periods with disabled local interrupts. Limit the
* batchcount
*/
if (limit > 32)
limit = 32;
#endif
// Unless this is SMP, shared stays 0 and the shared cache holds no objects
err = do_tune_cpucache(cachep, // kmem_cache_t* cachep
limit, // int limit - max number of objects in the local cache
(limit + 1) / 2, // int batchcount - how many objects move per refill/flush; half of limit here
shared); // int shared
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
cachep->name, -err);
}

do_tune_cpucache()

Before looking at this function we need to understand a few helper functions and structures, such as ccupdate_struct and alloc_arraycache(). Note that I am not going to analyze kmalloc()/kfree() here; we go straight to the other helpers.

// Here we assume NR_CPUS = 1, i.e. a single core
struct ccupdate_struct {
kmem_cache_t *cachep;
struct array_cache *new[NR_CPUS];
};

// Allocate and return the memory for an array_cache
static struct array_cache *alloc_arraycache(int cpu, int entries, int batchcount)
{
// The memory needed covers the descriptor plus the array of object pointers
int memsize = sizeof(void *) * entries + sizeof(struct array_cache);
struct array_cache *nc = NULL;

if (cpu != -1) {
// On ARM or x86_64 this ends up calling kmem_cache_alloc() anyway
nc = kmem_cache_alloc_node(kmem_find_general_cachep(memsize, GFP_KERNEL), cpu_to_node(cpu));
}
// If cpu == -1 (when allocating the shared cache), or nc is still NULL for whatever reason, fall back to kmalloc()
if (!nc)
nc = kmalloc(memsize, GFP_KERNEL);
// Fill in the array_cache fields
if (nc) {
nc->avail = 0; // no object pointers have been stashed yet, hence 0
nc->limit = entries;
nc->batchcount = batchcount;
nc->touched = 0;
}
return nc;
}

// This function releases objects held in an array_cache and updates the slabs they belong to
static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects)
{
int i;

check_spinlock_acquired(cachep);

/* NUMA: move add into loop */
cachep->lists.free_objects += nr_objects;

// The objects in an array_cache may come from different slabs - e.g. in the shared cache
// TODO: can this also happen in a CPU-local cache?
for (i = 0; i < nr_objects; i++) {
void *objp = objpp[i];
struct slab *slabp;
unsigned int objnr;

slabp = GET_PAGE_SLAB(virt_to_page(objp));
list_del(&slabp->list);
// Work out which object (by index) this address is
objnr = (objp - slabp->s_mem) / cachep->objsize;
check_slabp(cachep, slabp);
// See note 12. 【Slab】4 - Basic slab operations - kmem_cache_alloc(), which analyzes how the object descriptors work
slab_bufctl(slabp)[objnr] = slabp->free;
slabp->free = objnr;
STATS_DEC_ACTIVE(cachep);
slabp->inuse--;
check_slabp(cachep, slabp);

/* fixup slab chains */
// Put the slab back on the appropriate list
if (slabp->inuse == 0) {
// When does this happen, i.e. when do we get free_objects > free_limit?
if (cachep->lists.free_objects > cachep->free_limit) {
// Drop one whole slab's worth of free objects from the count
cachep->lists.free_objects -= cachep->num;
// The destroy-related functions deserve a separate note - too much material!
slab_destroy(cachep, slabp);
} else {
list_add(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_free);
}
} else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
*/
list_add_tail(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_partial);
}
}
}

static void do_ccupdate_local(void *info)
{
struct ccupdate_struct *new = (struct ccupdate_struct *)info;
struct array_cache *old;

check_irq_off();
old = ac_data(new->cachep);

// The naming here is unfortunate - everything is called 'new'. This swaps the freshly allocated array with the one cachep currently points to
new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()];
new->new[smp_processor_id()] = old;
}

// This function allocates the CPU-local caches as well as the shared cache descriptor
static int do_tune_cpucache (kmem_cache_t* cachep, int limit, int batchcount, int shared)
{
struct ccupdate_struct new;
struct array_cache *new_shared;
int i;

memset(&new.new, 0, sizeof(new.new));
// A local variable temporarily holds the addresses of the newly allocated arrays
for (i = 0; i < NR_CPUS; i++) {
if (cpu_online(i)) {
// Get the array from a general cache
new.new[i] = alloc_arraycache(i, limit, batchcount);
if (!new.new[i]) {
for (i--; i >= 0; i--) kfree(new.new[i]);
return -ENOMEM;
}
} else {
new.new[i] = NULL;
}
}
new.cachep = cachep;

// This mainly invokes the do_ccupdate_local() callback, which installs the newly allocated local caches into cachep
smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

check_irq_on();
spin_lock_irq(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
// Same as the earlier TODO - still unclear what this value really means
cachep->free_limit = (1 + num_online_cpus()) * cachep->batchcount + cachep->num;
spin_unlock_irq(&cachep->spinlock);

for (i = 0; i < NR_CPUS; i++) {
struct array_cache *ccold = new.new[i];
// Entries that stayed NULL (e.g. the CPU is offline) had nothing swapped back, so there is nothing to free
if (!ccold)
continue;
spin_lock_irq(&cachep->spinlock);
// Release the objects still held in ccold
free_block(cachep, ac_entry(ccold), ccold->avail);
spin_unlock_irq(&cachep->spinlock);
kfree(ccold);
}
// This part allocates the shared cache, much like the above
// The shared cache is given an absurdly large batchcount (0xbaadf00d)
new_shared = alloc_arraycache(-1, batchcount * shared, 0xbaadf00d);
if (new_shared) {
struct array_cache *old;

spin_lock_irq(&cachep->spinlock);
old = cachep->lists.shared;
cachep->lists.shared = new_shared;
if (old)
free_block(cachep, ac_entry(old), old->avail);
spin_unlock_irq(&cachep->spinlock);
kfree(old);
}

return 0;
}
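
The slab_bufctl() juggling inside free_block() above is nothing more than a singly linked free list threaded through an index array. A minimal userspace sketch of the same technique (sizes and values are made up):

#include <stdio.h>

#define NUM_OBJS   4
#define BUFCTL_END 0xffffffffu

static unsigned int bufctl[NUM_OBJS]; /* plays the role of slab_bufctl(slabp) */
static unsigned int free_head;        /* plays the role of slabp->free        */

int main(void)
{
    unsigned int i, objnr;

    /* Initially every object is free: 0 -> 1 -> 2 -> 3 -> END. */
    for (i = 0; i < NUM_OBJS; i++)
        bufctl[i] = i + 1;
    bufctl[NUM_OBJS - 1] = BUFCTL_END;
    free_head = 0;

    /* Allocate: pop the head of the free list (as kmem_cache_alloc does). */
    objnr = free_head;
    free_head = bufctl[objnr];

    /* Free: push the object back, exactly as free_block() does. */
    bufctl[objnr] = free_head;
    free_head = objnr;

    printf("head = %u\n", free_head); /* prints: head = 0 */
    return 0;
}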

Final checks and inserting the cache into cache_chain

This part is simple and can be read straight from the code: walk cache_chain to warn about caches whose name memory has gone away (an unloaded module that never destroyed its cache) and to BUG() on duplicate names, then link the new cache into the list.

    /* Need the semaphore to access the chain. */
down(&cache_chain_sem);
{
struct list_head *p;
mm_segment_t old_fs;

old_fs = get_fs();
set_fs(KERNEL_DS);
list_for_each(p, &cache_chain) {
kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);
char tmp;
/* This happens when the module gets unloaded and doesn't
destroy its slab cache and noone else reuses the vmalloc
area of the module. Print a warning. */
if (__get_user(tmp,pc->name)) {
printk("SLAB: cache with size %d has lost its name\n",
pc->objsize);
continue;
}
if (!strcmp(pc->name,name)) {
printk("kmem_cache_create: duplicate cache %s\n",name);
up(&cache_chain_sem);
unlock_cpu_hotplug();
BUG();
}
}
set_fs(old_fs);
}

/* cache setup completed, link it into the list */
list_add(&cachep->next, &cache_chain);
up(&cache_chain_sem);
unlock_cpu_hotplug();
opps:
if (!cachep && (flags & SLAB_PANIC))
panic("kmem_cache_create(): failed to create slab `%s'\n", name);
return cachep;
}

Summary

And with that, a cache has been created. As the analysis above shows, the freshly created cache does not actually contain any slabs or objects yet; those are only allocated later, when they are really needed. That topic - along with cache destruction, slab freeing, and the rest - will be covered in future notes. This one ends here.
