从零到负一

【LM13】 Initialization of the slab system

2023/12/22

After a gap of several months, this series finally reaches the slab system. It took me a long time to figure it out myself; this note covers an introduction to the slab system, its main structures, and its initialization process.

Why the slab system is needed

We analyzed the buddy system earlier, and it has problems of its own: it can only manage memory in units of 2 ^ order pages; every allocation and free walks a fairly long path, which hurts performance and can pollute the CPU caches and the TLB; and every allocation is page-aligned, so a small request may only use the first few or few dozen bytes of a page, which raises the probability that different objects map to the same cache line and degrades cache performance. The slab system was created to solve these problems. In essence, it uses the buddy system to build a set of object memory pools, and with these pools the shortcomings above can be handled nicely.

  1. The slab system can hand out allocations smaller than 2 ^ order pages from physical pages it has already obtained;
  2. Because the slab system works from memory pools, its allocation and free paths are much shorter, which improves speed and reduces cache and TLB pollution;
  3. Objects returned by the slab system carry a colouring offset, which greatly reduces the chance of different objects sharing a cache line.

Related structures

The slab allocator is a system made up of several cooperating structures (there is no single structure named slab), which makes it a little hard to grasp at first. Based on my own understanding, I have drawn a simplified diagram of the slab system.

With this diagram in mind, let's look at these structures.

kmem_cache

/*
* Definitions unique to the original Linux SLAB allocator.
*/

struct kmem_cache {
struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
// These are initialized in setup_cpu_cache() and enable_cpucache()
unsigned int batchcount;
unsigned int limit;
unsigned int shared;

unsigned int size;
struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

slab_flags_t flags; /* constant flags */
unsigned int num; /* # of objs per slab */

/* 3) cache_grow/shrink */
/* order of pgs per slab (2^n) */
unsigned int gfporder;

/* force GFP flags, e.g. GFP_DMA */
gfp_t allocflags;

size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
struct kmem_cache *freelist_cache;
unsigned int freelist_size;

/* constructor func */
void (*ctor)(void *obj);

/* 4) cache creation/removal */
const char *name;
struct list_head list;
int refcount;
int object_size;
int align;

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
unsigned long num_active;
unsigned long num_allocations;
unsigned long high_mark;
unsigned long grown;
unsigned long reaped;
unsigned long errors;
unsigned long max_freeable;
unsigned long node_allocs;
unsigned long node_frees;
unsigned long node_overflow;
atomic_t allochit;
atomic_t allocmiss;
atomic_t freehit;
atomic_t freemiss;
#ifdef CONFIG_DEBUG_SLAB_LEAK
atomic_t store_user_clean;
#endif

/*
* If debugging is enabled, then the allocator can add additional
* fields and/or padding to every object. 'size' contains the total
* object size including these internal fields, while 'obj_offset'
* and 'object_size' contain the offset to the user object and its
* size.
*/
int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

#ifdef CONFIG_MEMCG
struct memcg_cache_params memcg_params;
#endif
#ifdef CONFIG_KASAN
struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
unsigned int *random_seq;
#endif

unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */

struct kmem_cache_node *node[MAX_NUMNODES];
};
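Note that cpu_cache is a __percpu pointer: each CPU reaches its own array_cache (described next) through it. In mm/slab.c the accessor is just a thin wrapper, roughly:

// ./mm/slab.c (for reference)
static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
{
	return this_cpu_ptr(cachep->cpu_cache);
}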

array_cache

/*
* struct array_cache
*
* Purpose:
* - LIFO ordering, to hand out cache-warm objects from _alloc
* - reduce the number of linked list operations
* - reduce spinlock operations
*
* The limit is stored in the per-cpu structure to reduce the data cache
* footprint.
*
*/
// During early init the per-CPU cache is set up in setup_cpu_cache(), with limit and batchcount both equal to 1.
// At that stage the shared cache has not been created yet.
// Once initialization finishes, the per-CPU cache and the shared cache are re-initialized in
// kmem_cache_init_late()->enable_cpucache(); limit, batchcount, etc. are then derived from the cache's object size and similar factors.
// Whether the shared cache is created depends on the slab cache's size and the number of CPUs.
struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
unsigned int touched;
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
*/
};
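To see why this LIFO array helps, here is a minimal sketch of the per-CPU allocation fast path. This is my own simplification, not the kernel code; the real logic lives in ____cache_alloc() and also handles refilling and statistics:

// Simplified fast path: pop the most recently freed (cache-warm) object
static void *fastpath_alloc(struct array_cache *ac)
{
	if (ac->avail) {
		ac->touched = 1;
		return ac->entry[--ac->avail];   /* LIFO pop: no list walk, no spinlock */
	}
	/* empty: the slow path refills entry[] with batchcount objects
	 * taken from the kmem_cache_node lists */
	return NULL;
}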

kmem_cache_node

/*
* The slab lists for all objects.
*/
struct kmem_cache_node {
spinlock_t list_lock; // spin_lock_init - all of these fields are initialized in kmem_cache_node_init()

#ifdef CONFIG_SLAB
struct list_head slabs_partial; // INIT_LIST_HEAD /* partial list first, better asm code */
struct list_head slabs_full; // INIT_LIST_HEAD
struct list_head slabs_free; // INIT_LIST_HEAD
unsigned long total_slabs; // 0 /* length of all slab lists */
unsigned long free_slabs; // 0 /* length of free slab list only */
unsigned long free_objects; // 0
unsigned int free_limit;
unsigned int colour_next; // 0 /* Per-node cache coloring */
struct array_cache *shared; // NULL /* shared per node */
struct alien_cache **alien; // NULL /* on other nodes */
unsigned long next_reap; /* updated without locking */
int free_touched; // 0 /* updated without locking */
#endif

#ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
struct list_head full;
#endif
#endif
};
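As a quick orientation (my own summary, not a comment from the source), a slab page managed by a kmem_cache_node moves between the three lists like this as objects are allocated and freed:

// slabs_free    --(first object allocated)-->      slabs_partial
// slabs_partial --(last free object allocated)-->  slabs_full
// slabs_full    --(one object freed)-->            slabs_partial
// slabs_partial --(all objects freed)-->           slabs_free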

Initialization of the slab system

With the structures covered, let's now see how the slab system is initialized.

kmem_cache_init()

With the basic background on slab in place, the first function to look at is kmem_cache_init(). It touches on some details of slab cache allocation; I will not dig into those here - slab allocation and freeing will be discussed in detail in a later note.
kmem_cache_init() mainly solves a simple but important problem - how to create the first cache of type kmem_cache (which is also the first slab cache). Why is this a problem? Whenever we create a slab cache, we need the slab allocator to give us a kmem_cache object. When the very first slab cache is created, there is no kmem_cache cache anywhere in the kernel (so no kmem_cache object can be obtained), which means the kernel would first have to create a kmem_cache cache and then allocate the object from it. But creating that kmem_cache cache itself requires allocating a kmem_cache object via slab. We end up going in circles: we need a kmem_cache cache, yet creating it requires a kmem_cache cache, so the first one could never be created. kmem_cache_init() breaks this loop by using a statically allocated kmem_cache as the first cache. Let's now see how the function is implemented.
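For reference, the statically allocated pieces that kmem_cache_init() relies on look roughly like this in mm/slab.c (quoted from a kernel of this era; exact field values may differ between versions):

// ./mm/slab.c (approximate excerpt)
#define NUM_INIT_LISTS (2 * MAX_NUMNODES)
static struct kmem_cache_node __initdata init_kmem_cache_node[NUM_INIT_LISTS];

static struct kmem_cache kmem_cache_boot = {
	.batchcount = 1,
	.limit = BOOT_CPUCACHE_ENTRIES,
	.shared = 1,
	.size = sizeof(struct kmem_cache),
	.name = "kmem_cache",
};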

// ./mm/slab.c

/*
* Initialisation. Called after the page allocator have been initialised and
* before smp_init().
*/
void __init kmem_cache_init(void)
{
int i;

// kmem_cache_boot is the statically defined slab cache of type kmem_cache
// kmem_cache is a global variable
kmem_cache = &kmem_cache_boot;

if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
use_alien_caches = 0;

// The kernel statically defines NUM_INIT_LISTS kmem_cache_node structures; here NUM_INIT_LISTS equals 2
for (i = 0; i < NUM_INIT_LISTS; i++)
// Initialize the statically defined kmem_cache_node
kmem_cache_node_init(&init_kmem_cache_node[i]);

/*
* Fragmentation resistance on low memory - only use bigger
* page orders on machines with more than 32MB of memory if
* not overridden on the command line.
*/
if (!slab_max_order_set && totalram_pages() > (32 << 20) >> PAGE_SHIFT)
slab_max_order = SLAB_MAX_ORDER_HI;

/* Bootstrap is tricky, because several objects are allocated
* from caches that do not exist yet:
* 1) initialize the kmem_cache cache: it contains the struct
* kmem_cache structures of all caches, except kmem_cache itself:
* kmem_cache is statically allocated.
* Initially an __init data area is used for the head array and the
* kmem_cache_node structures, it's replaced with a kmalloc allocated
* array at the end of the bootstrap.
* 2) Create the first kmalloc cache.
* The struct kmem_cache for the new cache is allocated normally.
* An __init data area is used for the head array.
* 3) Create the remaining kmalloc caches, with minimally sized
* head arrays.
* 4) Replace the __init data head arrays for kmem_cache and the first
* kmalloc cache with kmalloc allocated arrays.
* 5) Replace the __init data for kmem_cache_node for kmem_cache and
* the other cache's with kmalloc allocated memory.
* 6) Resize the head arrays of the kmalloc caches to their final sizes.
*/

/*
* struct kmem_cache size depends on nr_node_ids & nr_cpu_ids
*/
// -------------------------------------------------------------------------------------------------- (1)
// Finish initializing the statically defined kmem_cache
create_boot_cache(kmem_cache,
"kmem_cache",
// On a UMA system the size of kmem_cache is fixed, because there is only one node
// On a NUMA system the size of kmem_cache has to be computed as follows
offsetof(struct kmem_cache, node) + nr_node_ids * sizeof(struct kmem_cache_node *),
SLAB_HWCACHE_ALIGN,
0,
0);

// Link the newly created kmem_cache onto slab_caches; it is the first slab cache (of type kmem_cache) on that list
list_add(&kmem_cache->list, &slab_caches);

// Empty function when CONFIG_MEMCG_KMEM is not defined
memcg_link_cache(kmem_cache);

// Update slab_state; so far 1) the kmem_cache slab cache has been created
slab_state = PARTIAL;

/*
* Initialize the caches that provide memory for the kmem_cache_node
* structures first. Without this, further allocations will bug.
*/
// -------------------------------------------------------------------------------------------------- (2)
// Create a generic (kmalloc) cache whose size fits struct kmem_cache_node. This requires allocating an object
// from the kmem_cache cache created above; the allocation details are not covered here.
//
// 1. #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
// 2. kmalloc_info provides the name and size of the generic cache;
//
// Rough call flow:
// 1. kmem_cache_zalloc() obtains a kmem_cache object from kmem_cache (the cache created at (1));
// 2. create_boot_cache() initializes this kmem_cache, with the size, name, etc. taken from kmalloc_info[];
// 3. the generic cache is linked into slab_caches and finally stored in kmalloc_caches.
//
// Note that no backing memory is allocated for this generic cache yet; pages will be obtained from the buddy system during later allocations
kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache(kmalloc_info[INDEX_NODE].name, // name
kmalloc_size(INDEX_NODE), // size
ARCH_KMALLOC_FLAGS, // flag
0, // useroffset
kmalloc_size(INDEX_NODE)); // usersize

// Update slab_state; so far 1) the kmem_cache slab cache has been created;
// 2) a generic slab cache sized for struct kmem_cache_node has been created
slab_state = PARTIAL_NODE;

// Patch up size_index[] for a few special cases
setup_kmalloc_cache_index_table();

slab_early_init = 0;

/* 5) Replace the bootstrap kmem_cache_node */
{
int nid;
// Step (2) created a generic cache usable for kmem_cache_node; here we mainly do the following:
// 1. allocate kmem_cache_node objects from that generic cache;
// 2. copy the contents of init_kmem_cache_node[] into those objects;
// 3. point kmem_cache->node at the new objects.
//
// Afterwards, kmem_cache and kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] both have dynamically allocated kmem_cache_node structures
for_each_online_node(nid) {
init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);

init_list(kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE],
&init_kmem_cache_node[SIZE_NODE + nid], nid);
}
}

// -------------------------------------------------------------------------------------------------- (3)
// Create all remaining generic caches (in the same way as (2): ultimately via create_kmalloc_cache()) and set slab_state = UP
// Up to this point, shared is still NULL in every kmem_cache_node
create_kmalloc_caches(ARCH_KMALLOC_FLAGS);
}

Once kmem_cache_init() finishes, the slab system is ready for normal use. The inline comments explain the function in detail, and the important parts are marked (1), (2), (3). One more point worth adding: after (1) completes, the kmem_cache cache has been "created", but created here only means the kmem_cache structure has been initialized - no memory has actually been allocated for it yet. Only when the first object is allocated does the slab allocator call into the buddy system for pages; once the pages arrive, the slab code splits them into individual objects that can then be allocated and freed.
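To make the "no memory until the first allocation" point concrete, here is a hypothetical usage sketch using the standard slab API (names like my_obj are made up; this is not code from the kernel or from this walkthrough). Pages are only taken from the buddy system inside the first kmem_cache_alloc():

// Hypothetical kernel-side usage of a slab cache
struct my_obj {
	int id;
	char buf[60];
};

static struct kmem_cache *my_cachep;

static int my_init(void)
{
	struct my_obj *p;

	/* Creates and initializes the kmem_cache; no slab pages are allocated yet */
	my_cachep = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
			0, SLAB_HWCACHE_ALIGN, NULL);
	if (!my_cachep)
		return -ENOMEM;

	/* First allocation: the cache grows, i.e. pages come from the buddy
	 * system and are carved into objects */
	p = kmem_cache_alloc(my_cachep, GFP_KERNEL);
	if (p)
		kmem_cache_free(my_cachep, p);
	return 0;
}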
In the code comments above I have already explained parts (2) and (3), so here we only look at the function at (1).

create_boot_cache()

This function first performs some simple initialization on the cache being created and then calls __kmem_cache_create() to finish the job. Let's first see which functions end up calling __kmem_cache_create().

// As the call trees below show, creating any new slab cache goes through __kmem_cache_create()
// In the next note we will see that kmem_cache_create() also ends up calling __kmem_cache_create()

create_boot_cache()
---> __kmem_cache_create()

create_kmalloc_cache()
---> create_boot_cache()
---> __kmem_cache_create()

create_kmalloc_caches()
---> create_kmalloc_cache()
---> create_boot_cache()
---> __kmem_cache_create()

kmem_cache_create()
---> kmem_cache_create_usercopy()
---> create_cache()
---> __kmem_cache_create()

Next, let's step into create_boot_cache().

// ./mm/slab_common.c

/* Create a cache during boot when no slab services are available yet */
void __init create_boot_cache(struct kmem_cache *s, const char *name,
unsigned int size, slab_flags_t flags,
unsigned int useroffset, unsigned int usersize)
{
int err;

s->name = name;
s->size = s->object_size = size;
s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size);
s->useroffset = useroffset;
s->usersize = usersize;

// Empty function when CONFIG_MEMCG_KMEM is not defined
slab_init_memcg_params(s);

err = __kmem_cache_create(s, flags);

if (err)
panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n", name, size, err);

s->refcount = -1; /* Exempt from merging for now */
}
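A small detail worth knowing here is how calculate_alignment() turns SLAB_HWCACHE_ALIGN into a concrete alignment. Roughly (paraphrased from mm/slab_common.c; details may vary by version), it halves the cache line size while the object still fits in half of it, so small objects are not forced onto full cache-line boundaries:

// ./mm/slab_common.c (approximate excerpt)
static unsigned int calculate_alignment(slab_flags_t flags,
		unsigned int align, unsigned int size)
{
	if (flags & SLAB_HWCACHE_ALIGN) {
		unsigned int ralign = cache_line_size();

		/* shrink the alignment while the object fits into half of it */
		while (size <= ralign / 2)
			ralign /= 2;
		align = max(align, ralign);
	}

	if (align < ARCH_SLAB_MINALIGN)
		align = ARCH_SLAB_MINALIGN;

	return ALIGN(align, sizeof(void *));
}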

__kmem_cache_create()

This function is fairly long and takes some time to work through. It is a general-purpose function: besides being used here, kmem_cache_create(), the generic API for creating slab caches, also ends up calling it.

//  ./mm/slab.c

/**
* __kmem_cache_create - Create a cache.
* @cachep: cache management descriptor
* @flags: SLAB flags
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache.
*
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/
int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
{
size_t ralign = BYTES_PER_WORD;
gfp_t gfp;
int err;
unsigned int size = cachep->size;

// Set the debug flags (SLAB_RED_ZONE, SLAB_STORE_USER, SLAB_POISON) depending on various conditions
#if DEBUG
#if FORCED_DEBUG
/*
* Enable redzoning and last user accounting, except for caches with
* large objects, if the increased size would increase the object size
* above the next power of two: caches with object sizes just above a
* power of two have a significant amount of internal fragmentation.
*/
if (size < 4096 || fls(size - 1) == fls(size - 1 + REDZONE_ALIGN + 2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
if (!(flags & SLAB_TYPESAFE_BY_RCU))
flags |= SLAB_POISON;
#endif
#endif

// --------------------------------------------------------------------------------------------- (1)
// Align the object size to a word boundary
/*
* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/
size = ALIGN(size, BYTES_PER_WORD);

// Update the object alignment
if (flags & SLAB_RED_ZONE) {
// #define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long))
ralign = REDZONE_ALIGN;
/* If redzoning, ensure that the second redzone is suitably
* aligned, by adjusting the object size accordingly. */
// Adjust the object size alignment again
size = ALIGN(size, REDZONE_ALIGN);
}

/* 3) caller mandated alignment */
if (ralign < cachep->align) {
ralign = cachep->align;
}
/* disable debug if necessary */
if (ralign > __alignof__(unsigned long long))
flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
/*
* 4) Store it.
*/
// Store the final cache alignment
cachep->align = ralign;

// Initialize and update the slab colouring offset
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < cachep->align)
cachep->colour_off = cachep->align;

// Check whether the slab allocator is already up (slab_state >= UP)
if (slab_is_available())
gfp = GFP_KERNEL;
else
gfp = GFP_NOWAIT;

// Update obj_offset and size according to the debug flags
#if DEBUG
/*
* Both debugging options require word-alignment which is calculated
* into align above.
*/
if (flags & SLAB_RED_ZONE) {
/* add space for red zone words */
cachep->obj_offset += sizeof(unsigned long long);
size += 2 * sizeof(unsigned long long);
}
if (flags & SLAB_STORE_USER) {
/* user store requires one word storage behind the end of
* the real object. But if the second red zone needs to be
* aligned to 64 bits, we must allow that much space.
*/
if (flags & SLAB_RED_ZONE)
size += REDZONE_ALIGN;
else
size += BYTES_PER_WORD;
}
#endif

// Empty function when CONFIG_KASAN is not defined
kasan_cache_create(cachep, &size, &flags);

// Align the object size once more
size = ALIGN(size, cachep->align);

/*
* We should restrict the number of objects in a slab to implement
* byte sized index. Refer comment on SLAB_OBJ_MIN_SIZE definition.
*/
// See the comment on SLAB_OBJ_MIN_SIZE. In short: if a single byte is used as the
// freelist index, only 2 ^ 8 objects can be addressed, so in that case the object
// size must not be too small, otherwise one slab page could hold more than
// 2 ^ 8 objects
if (FREELIST_BYTE_INDEX && size < SLAB_OBJ_MIN_SIZE)
size = ALIGN(SLAB_OBJ_MIN_SIZE, cachep->align);

// --------------------------------------------------------------------------------------------- (2)
#if DEBUG
/*
* To activate debug pagealloc, off-slab management is necessary
* requirement. In early phase of initialization, small sized slab
* doesn't get initialized so it would not be possible. So, we need
* to check size >= 256. It guarantees that all necessary small
* sized slab is initialized in current slab initialization sequence.
*/
if (size >= 256 && cachep->object_size > cache_line_size()) {
if (size < PAGE_SIZE || size % PAGE_SIZE == 0) {
size_t tmp_size = ALIGN(size, PAGE_SIZE);
if (set_off_slab_cache(cachep, tmp_size, flags)) {
flags |= CFLGS_OFF_SLAB;
cachep->obj_offset += tmp_size - size;
size = tmp_size;
goto done;
}
}
}
#endif

// Case 1: all freelist_idx_t entries fit inside a single object
// Rejected when:
// if (cachep->num * sizeof(freelist_idx_t) > cachep->object_size) return false;
if (set_objfreelist_slab_cache(cachep, size, flags)) {
flags |= CFLGS_OBJFREELIST_SLAB;
goto done;
}

// Case 2: all freelist_idx_t entries live in an external (off-slab) cache
// Rejected when:
// if (left >= cachep->num * sizeof(freelist_idx_t)) return false;
if (set_off_slab_cache(cachep, size, flags)) {
flags |= CFLGS_OFF_SLAB;
goto done;
}

// Case 3: the basic case - the freelist_idx_t array lives inside the slab itself
if (set_on_slab_cache(cachep, size, flags))
goto done;

return -E2BIG;

// --------------------------------------------------------------------------------------------- (3)
done:
// Update the cache's flags, size, etc.
cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
cachep->flags = flags;
cachep->allocflags = __GFP_COMP;
if (flags & SLAB_CACHE_DMA)
cachep->allocflags |= GFP_DMA;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);

#if DEBUG
/*
* If we're going to use the generic kernel_map_pages()
* poisoning, then it's going to smash the contents of
* the redzone and userword anyhow, so switch them off.
*/
// With CONFIG_PAGE_POISONING enabled, a known pattern is written into every object;
// that pattern would clobber the redzone and user-tracking words, so (SLAB_RED_ZONE | SLAB_STORE_USER) must be cleared
if (IS_ENABLED(CONFIG_PAGE_POISONING) && (cachep->flags & SLAB_POISON) && is_debug_pagealloc_cache(cachep))
cachep->flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif

// Pick a generic slab cache to hold the off-slab freelist
if (OFF_SLAB(cachep)) {
cachep->freelist_cache = kmalloc_slab(cachep->freelist_size, 0u);
}

// Set up the per-CPU cache
err = setup_cpu_cache(cachep, gfp);
if (err) {
__kmem_cache_release(cachep);
return err;
}

return 0;
}

This function is long, so I split it into three parts:
Part 1 is everything before (2): it mainly sets up the object size, alignment and flags; in DEBUG mode it may add RED_ZONE and POISON areas to each object.
Part 2 lies between (2) and (3): it decides how and where the freelist is stored and sets the corresponding flags;
Part 3 is the tail of the function: it updates a few members of kmem_cache and sets up the per-CPU cache.

Of the three, parts 2 and 3 matter most, so let's start with part 2.
It revolves around three functions - set_objfreelist_slab_cache(), set_off_slab_cache() and set_on_slab_cache(). They are very similar, so analyzing the first one is enough; the sketch below shows where the freelist ends up in each case.
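Here is a rough picture (my own drawing, based on the three cases above) of where the freelist ends up in each layout:

// OBJFREELIST_SLAB: the freelist borrows the space of one free object
//   [ obj | obj | ... | obj (holds freelist_idx_t[]) ]
//
// OFF_SLAB: the freelist lives in a separate generic (kmalloc) cache
//   [ obj | obj | ... | obj ]        freelist_idx_t[] -> another slab page
//
// ON_SLAB: the freelist sits at the end of the same slab page(s)
//   [ obj | obj | ... | obj | freelist_idx_t[] | left_over ]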

set_objfreelist_slab_cache()

// ./mm/slab.c

static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
return false;

// ----------------------------------------------------------------------- (1)
left = calculate_slab_order(cachep, size, flags | CFLGS_OBJFREELIST_SLAB);
if (!cachep->num)
return false;

// ----------------------------------------------------------------------- (2)
// If all freelist_idx_t entries cannot fit inside one object, this layout cannot be used
if (cachep->num * sizeof(freelist_idx_t) > cachep->object_size)
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

(1) This function mainly does two things: 1) it finds the smallest page order that satisfies the constraints, which the buddy system will use when allocating pages for this slab; 2) it computes how much space is left over in those pages once the objects (and the freelist) have been laid out. Let's look at it now.

calculate_slab_order()

// ./mm/slab.c

/**
* calculate_slab_order - calculate size (page order) of slabs
* @cachep: pointer to the cache that is being created
* @size: size of objects to be created in this cache.
* @flags: slab allocation flags
*
* Also calculates the number of objects per slab.
*
* This could be made much more intelligent. For now, try to avoid using
* high order pages for slabs. When the gfp() functions are more friendly
* towards high-order requests, this should be changed.
*/
static size_t calculate_slab_order(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left_over = 0;
int gfporder;

for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
unsigned int num;
size_t remainder;
// Compute how many objects a slab of this order can hold at most
num = cache_estimate(gfporder, size, flags, &remainder);
if (!num)
continue;

/* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
// A slab cannot hold more than SLAB_OBJ_MAX_NUM objects; stop increasing the order here
if (num > SLAB_OBJ_MAX_NUM)
break;

// For the OFF_SLAB case, also check whether a generic cache of a suitable size exists
// During bootstrap, CFLGS_OFF_SLAB is not set in flags
if (flags & CFLGS_OFF_SLAB) {
struct kmem_cache *freelist_cache;
size_t freelist_size;

freelist_size = num * sizeof(freelist_idx_t);
freelist_cache = kmalloc_slab(freelist_size, 0u);
if (!freelist_cache)
continue;

/*
* Needed to avoid possible looping condition
* in cache_grow_begin()
*/
if (OFF_SLAB(freelist_cache))
continue;

/* check if off slab has enough benefit */
if (freelist_cache->size > cachep->size / 2)
continue;
}

// The current order is acceptable; record it in the kmem_cache first
/* Found something acceptable - save it away */
cachep->num = num;
cachep->gfporder = gfporder;
left_over = remainder;

/*
* A VFS-reclaimable slab tends to have most allocations
* as GFP_NOFS and we really don't want to have to be allocating
* higher-order pages when we are unable to shrink dcache.
*/
if (flags & SLAB_RECLAIM_ACCOUNT)
break;

/*
* Large number of objects is good, but very large slabs are
* currently bad for the gfp()s.
*/
if (gfporder >= slab_max_order)
break;

/*
* Acceptable internal fragmentation?
*/
if (left_over * 8 <= (PAGE_SIZE << gfporder))
break;
}
return left_over;
}

/*
* Calculate the number of objects and left-over bytes for a given buffer size.
*/
static unsigned int cache_estimate(unsigned long gfporder, size_t buffer_size,
slab_flags_t flags, size_t *left_over)
{
unsigned int num;
size_t slab_size = PAGE_SIZE << gfporder;

/*
* The slab management structure can be either off the slab or
* on it. For the latter case, the memory allocated for a
* slab is used for:
*
* - @buffer_size bytes for each object
* - One freelist_idx_t for each object
*
* We don't need to consider alignment of freelist because
* freelist will be at the end of slab page. The objects will be
* at the correct alignment.
*
* If the slab management structure is off the slab, then the
* alignment will already be calculated into the size. Because
* the slabs are all pages aligned, the objects will be at the
* correct alignment when allocated.
*/
// In these two cases the space taken by freelist_idx_t can be ignored. In the first case the whole
// freelist lives inside one object, which counts as an ordinary object; in the second case the freelist lives in other pages.
if (flags & (CFLGS_OBJFREELIST_SLAB | CFLGS_OFF_SLAB)) {
num = slab_size / buffer_size;
*left_over = slab_size % buffer_size;
// Otherwise each object is paired with one freelist_idx_t, so the two are sized together as a unit
} else {
num = slab_size / (buffer_size + sizeof(freelist_idx_t));
*left_over = slab_size %
(buffer_size + sizeof(freelist_idx_t));
}

return num;
}
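To make cache_estimate() concrete, here is a tiny user-space sketch with made-up numbers (4 KiB page, gfporder = 0, a hypothetical 128-byte object, 1-byte freelist_idx_t, on-slab freelist):

#include <stdio.h>

int main(void)
{
	unsigned long slab_size = 4096;   /* PAGE_SIZE << gfporder, with gfporder = 0 */
	unsigned long buffer_size = 128;  /* hypothetical object size */
	unsigned long idx_size = 1;       /* sizeof(freelist_idx_t) for small slabs */

	/* on-slab case: each object is paired with one freelist index */
	unsigned long num = slab_size / (buffer_size + idx_size);        /* 31 objects */
	unsigned long left_over = slab_size % (buffer_size + idx_size);  /* 97 bytes   */

	printf("num=%lu left_over=%lu\n", num, left_over);
	return 0;
}

calculate_slab_order() would then record num = 31 and, with a colour_off of 64 bytes, set cachep->colour = 97 / 64 = 1.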

With the groundwork above, the next two functions are easy to follow - the comments alone should be enough.

set_off_slab_cache()

// ./mm/slab.c

static bool set_off_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

/*
* Always use on-slab management when SLAB_NOLEAKTRACE
* to avoid recursive calls into kmemleak.
*/
if (flags & SLAB_NOLEAKTRACE)
return false;

/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
left = calculate_slab_order(cachep, size, flags | CFLGS_OFF_SLAB);
if (!cachep->num)
return false;

/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
// If the pages holding the objects have enough room left to also hold all freelist_idx_t entries, this layout is not used
if (left >= cachep->num * sizeof(freelist_idx_t))
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

set_on_slab_cache()

// ./mm/slab.c

static bool set_on_slab_cache(struct kmem_cache *cachep,
size_t size, slab_flags_t flags)
{
size_t left;

cachep->num = 0;

left = calculate_slab_order(cachep, size, flags);
// This is the last resort; any non-zero number of objects is acceptable
if (!cachep->num)
return false;

cachep->colour = left / cachep->colour_off;

return true;
}

Back to __kmem_cache_create()

At this point we can return to the third part of __kmem_cache_create(); here is that code again.

    // --------------------------------------------------------------------------------------------- (3)
done:
// Update the cache's flags, size, etc.
cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
cachep->flags = flags;
cachep->allocflags = __GFP_COMP;
if (flags & SLAB_CACHE_DMA)
cachep->allocflags |= GFP_DMA;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);

#if DEBUG
/*
* If we're going to use the generic kernel_map_pages()
* poisoning, then it's going to smash the contents of
* the redzone and userword anyhow, so switch them off.
*/
// With CONFIG_PAGE_POISONING enabled, a known pattern is written into every object;
// that pattern would clobber the redzone and user-tracking words, so (SLAB_RED_ZONE | SLAB_STORE_USER) must be cleared
if (IS_ENABLED(CONFIG_PAGE_POISONING) && (cachep->flags & SLAB_POISON) && is_debug_pagealloc_cache(cachep))
cachep->flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif

// Pick a generic slab cache to hold the off-slab freelist
// During bootstrap this branch is never taken, because (2) above has already decided that CFLGS_OFF_SLAB is not set
if (OFF_SLAB(cachep)) {
// Look up a generic cache in kmalloc_caches[] based on freelist_size
cachep->freelist_cache = kmalloc_slab(cachep->freelist_size, 0u);
}

// Set up the per-CPU cache; discussed in detail below
err = setup_cpu_cache(cachep, gfp);
if (err) {
__kmem_cache_release(cachep);
return err;
}

return 0;

The comments in the code above make things clear enough; the only function left to look at is setup_cpu_cache().

setup_cpu_cache()

As its name suggests, this function mainly sets up the per-CPU cache.
This part is fairly simple; besides that, the function also fills in a few cpu_cache-related fields of kmem_cache.

// ./mm/slab.c
// __kmem_cache_create()->setup_cpu_cache()

static int __ref setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
{
if (slab_state >= FULL)
// --------------------------------------------------------------------- (1)
// Once slab initialization has fully finished, the per-CPU cache can be initialized properly
return enable_cpucache(cachep, gfp);

// Allocate the cpu_cache object; this is a per-CPU allocation (not a slab allocation) with room for just one entry
cachep->cpu_cache = alloc_kmem_cache_cpus(cachep, 1, 1);
if (!cachep->cpu_cache)
return 1;

// Depending on slab_state, point cachep->node[] at entries of init_kmem_cache_node[]
if (slab_state == DOWN) {
/* Creation of first cache (kmem_cache). */
set_up_node(kmem_cache, CACHE_CACHE);
} else if (slab_state == PARTIAL) {
/* For kmem_cache_node */
set_up_node(cachep, SIZE_NODE);
} else {
int node;

for_each_online_node(node) {
cachep->node[node] = kmalloc_node(
sizeof(struct kmem_cache_node), gfp, node);
BUG_ON(!cachep->node[node]);
kmem_cache_node_init(cachep->node[node]);
}
}

// Set up the cache reaping (reclaim) timing
cachep->node[numa_mem_id()]->next_reap =
jiffies + REAPTIMEOUT_NODE +
((unsigned long)cachep) % REAPTIMEOUT_NODE;

// Initialize the per-CPU array_cache fields
cpu_cache_get(cachep)->avail = 0;
cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
cpu_cache_get(cachep)->batchcount = 1;
cpu_cache_get(cachep)->touched = 0;
// Initialize the corresponding kmem_cache fields
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
return 0;
}

The function at (1) does a lot, and I will not analyze it in detail here. In short, based on the slab cache's object size, the number of CPUs, and so on, it computes ac->limit, ac->batchcount, shared (the shared field of kmem_cache), etc.; at the same time, limit and batchcount in kmem_cache are initialized to the same values as their per-CPU counterparts. Besides initializing the per-CPU caches, it also initializes the kmem_cache_node structures, where one important field is n->shared. The following snippet performs the initialization of n->shared - note how limit and batchcount are assigned.

// ./mm/slab.c

// setup_kmem_cache_node()
// The size of the shared cache is determined by cachep->shared, which is set in enable_cpucache():
// shared = 0; if (cachep->size <= PAGE_SIZE && num_possible_cpus() > 1) shared = 8;
// As the source shows, with only one CPU, shared stays 0
if (cachep->shared)
{
new_shared = alloc_arraycache(node,
cachep->shared * cachep->batchcount,
0xbaadf00d,
gfp);
}

// ###########################################################################
// alloc_arraycache()
// ###########################################################################
static struct array_cache *alloc_arraycache(int node, int entries, int batchcount, gfp_t gfp)
{
size_t memsize = sizeof(void *) * entries + sizeof(struct array_cache);
struct array_cache *ac = NULL;

ac = kmalloc_node(memsize, gfp, node);
init_arraycache(ac, entries, batchcount);
return ac;
}

// ###########################################################################
// init_arraycache()
// ###########################################################################
static void init_arraycache(struct array_cache *ac, int limit, int batch)
{
/*
* The array_cache structures contain pointers to free object.
* However, when such objects are allocated or transferred to another
* cache the pointers are not cleared and they could be counted as
* valid references during a kmemleak scan. Therefore, kmemleak must
* not scan such objects.
*/
kmemleak_no_scan(ac);
if (ac) {
ac->avail = 0;
ac->limit = limit;
ac->batchcount = batch;
ac->touched = 0;
}
}
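For reference, the size-based heuristic in enable_cpucache() that picks limit (and, from it, batchcount and shared) looks roughly like this in mm/slab.c of kernels from this period; treat the exact thresholds as approximate:

// enable_cpucache() (approximate excerpt)
if (cachep->size > 131072)
	limit = 1;
else if (cachep->size > PAGE_SIZE)
	limit = 8;
else if (cachep->size > 1024)
	limit = 24;
else if (cachep->size > 256)
	limit = 54;
else
	limit = 120;

shared = 0;
if (cachep->size <= PAGE_SIZE && num_possible_cpus() > 1)
	shared = 8;

// per-CPU and kmem_cache limit/batchcount are then applied via do_tune_cpucache()
err = do_tune_cpucache(cachep, limit, (limit + 1) / 2, shared, gfp);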

As the functions above show, until slab_state >= FULL neither the per-CPU caches nor kmem_cache_node->shared are fully initialized. So when does slab_state reach FULL? That happens in kmem_cache_init_late().

kmem_cache_init_late()

This function is called from start_kernel(), while kmem_cache_init() is called from start_kernel()->mm_init(). When it runs, it calls enable_cpucache() to re-initialize the per-CPU caches and the kmem_cache_node structures of every existing cache. After that, slab_state becomes FULL, and any slab cache created later is initialized in setup_cpu_cache() by calling enable_cpucache() directly.
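Roughly, the boot-time ordering looks like this (from init/main.c):

start_kernel()
---> mm_init()
     ---> kmem_cache_init()        // slab_state: DOWN -> PARTIAL -> PARTIAL_NODE -> UP
---> ... (much later)
---> kmem_cache_init_late()        // slab_state: UP -> FULL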

// ./mm/slab.c
// Every slab cache (both dedicated and generic) re-initializes its per-CPU caches, its shared caches and some kmem_cache fields

void __init kmem_cache_init_late(void)
{
struct kmem_cache *cachep;

/* 6) resize the head arrays to their final sizes */
mutex_lock(&slab_mutex);
list_for_each_entry(cachep, &slab_caches, list)
if (enable_cpucache(cachep, GFP_NOWAIT))
BUG();
mutex_unlock(&slab_mutex);

/* Done! */
slab_state = FULL;

#ifdef CONFIG_NUMA
/*
* Register a memory hotplug callback that initializes and frees
* node.
*/
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
#endif

/*
* The reap timers are started later, with a module init call: That part
* of the kernel is not yet operational.
*/
}

Summary

At this point both kmem_cache_init() and kmem_cache_init_late() have been covered. In the next note I will analyze how slab caches are created and how objects are allocated from them - another tough battle, let's go!
Finally, here is a simplified flow chart of slab cache initialization for reference. The red arrows indicate that create_kmalloc_cache() calls create_boot_cache(), and the blue arrows indicate that create_kmalloc_caches() calls create_kmalloc_cache().
