从零到负一

11. 【Slab】3 - Basic slab operations - kmem_cache_create()

2023/02/22

In the previous note, 10. 【slab】2 - Basic slab operations - kmem_cache_init(), I said that when we ran into kmem_cache_create() I would write a separate note about it. This is that note. There is a lot of material and it is rather scattered - my own write-up got messy too - so slow down, read carefully, and think it through.

kmem_cache_create() is mainly used to create dedicated caches, but during initialization (kmem_cache_init()) it is also used to set up the general caches. The function is fairly complex, so we will analyze it in several steps (a usage sketch follows the list below):

  1. Compute the alignment value;
  2. Obtain a cache descriptor;
  3. Compute the size of the cache's objects, the leftover space, and so on;
  4. Compute the slab colouring variables;
  5. Call enable_cpucache();
  6. Perform some final checks and insert the cache into cache_chain.
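
Before diving in, here is a minimal, hypothetical usage sketch of the API being dissected (2.6-era signature, matching the source reproduced below; my_obj, my_cache, and my_module_init are made-up names for illustration):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>

/* A made-up object type managed by a dedicated cache. */
struct my_obj {
    int id;
    char payload[60];
};

static kmem_cache_t *my_cache;

static int __init my_module_init(void)
{
    /* All six steps listed above happen inside this one call. */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (!my_cache)
        return -ENOMEM;

    /* Objects are later taken from (and returned to) the cache: */
    {
        struct my_obj *p = kmem_cache_alloc(my_cache, SLAB_KERNEL);
        if (p)
            kmem_cache_free(my_cache, p);
    }
    return 0;
}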

Obtaining the cache descriptor

This part covers steps 1 and 2 above and involves kmem_cache_alloc(), which is a huge topic on its own, so I plan to open a separate note for that function (yes, yet another note to write).
Apart from kmem_cache_alloc(), this part only does two simple jobs - rounding size up to a BYTES_PER_WORD boundary, and adjusting align (the buffer alignment) as conditions require. In short, it is all about alignment.
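
To make the rounding concrete, here is a minimal userspace sketch of the same bit-twiddling (the value 21 is just an example; BYTES_PER_WORD stands in for the kernel macro):

#include <stdio.h>

#define BYTES_PER_WORD sizeof(void *) /* 4 on a 32-bit machine */

int main(void)
{
    size_t size = 21; /* example object size */

    /* Same rounding as in kmem_cache_create() below. */
    if (size & (BYTES_PER_WORD - 1)) {
        size += (BYTES_PER_WORD - 1);
        size &= ~(BYTES_PER_WORD - 1);
    }
    printf("%zu\n", size); /* prints 24 */
    return 0;
}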

/**
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
* @size: The size of objects to be created in this cache.
* @align: The required alignment for the objects.
* @flags: SLAB flags
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
*
* Returns a ptr to the cache on success, NULL on failure.
* Cannot be called within a int, but can be interrupted.
* The @ctor is run when new pages are allocated by the cache
* and the @dtor is run before the pages are handed back.
*
* @name must be valid until the cache is destroyed. This implies that
* the module calling this has to destroy the cache before getting
* unloaded.
*
* The flags are
*
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
* to catch references to uninitialised memory.
*
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
* for buffer overruns.
*
* %SLAB_NO_REAP - Don't automatically reap this cache when we're under
* memory pressure.
*
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
* cacheline. This can be beneficial if you're counting cycles as closely
* as davem.
*/
kmem_cache_t *
kmem_cache_create(const char *name,
size_t size,
size_t align,
unsigned long flags,
void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
size_t left_over, slab_size, ralign;
kmem_cache_t *cachep = NULL;

/*
* Sanity checks... these are all serious usage bugs.
*/
if ((!name) ||
in_interrupt() ||
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor)) {
printk(KERN_ERR "%s: Early error in slab %s\n",
__FUNCTION__, name);
BUG();
}

if (flags & SLAB_DESTROY_BY_RCU)
BUG_ON(dtor);

/*
* Always checks flags, a caller might be expecting debug
* support which isn't available.
*/
if (flags & ~CREATE_MASK)
BUG();

/* Check that size is in terms of words. This is needed to avoid
* unaligned accesses for some archs when redzoning is used, and makes
* sure any on-slab bufctl's are also correctly aligned.
*/
// Round size up to a multiple of BYTES_PER_WORD
if (size & (BYTES_PER_WORD-1)) {
size += (BYTES_PER_WORD-1);
size &= ~(BYTES_PER_WORD-1);
}

/* calculate out the final buffer alignment: */
/* 1) arch recommendation: can be overridden for debug */
if (flags & SLAB_HWCACHE_ALIGN) {
/* Default alignment: as specified by the arch code.
* Except if an object is really small, then squeeze multiple
* objects into one cacheline.
*/
ralign = cache_line_size();
while (size <= ralign/2)
ralign /= 2;
} else {
ralign = BYTES_PER_WORD;
}
/* 2) arch mandated alignment: disables debug if necessary */
if (ralign < ARCH_SLAB_MINALIGN) {
ralign = ARCH_SLAB_MINALIGN;
if (ralign > BYTES_PER_WORD)
flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
}
/* 3) caller mandated alignment: disables debug if necessary */
if (ralign < align) {
ralign = align;
if (ralign > BYTES_PER_WORD)
flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
}
/* 4) Store it. Note that the debug code below can reduce
* the alignment to BYTES_PER_WORD.
*/
// Write the final buffer alignment back into align
align = ralign;

/* Get cache's description obj. */
// Key step - obtain a kmem_cache_t object from the slab system itself
cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto opps;
memset(cachep, 0, sizeof(kmem_cache_t));

...
}

Determining the slab layout and computing its colouring variables

This part is fairly easy to follow. We first compute repeatedly under various conditions until the slab's internal layout is settled - whether the slab descriptor and the objects live on the same pages, what the slab offset is, how much space is left over per slab, and so on. With that information, the leftover space tells us how many colours the slab can have, the value of the colour offset, etc. (a small worked example follows).
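
As a concrete, made-up illustration of the colour arithmetic at the end of this block - one 4096-byte page holding eight 488-byte objects, with a 64-byte cache line:

#include <stdio.h>

int main(void)
{
    /* Hypothetical slab: one 4096-byte page, 8 objects of 488 bytes. */
    unsigned int left_over  = 4096 - 8 * 488;          /* 192 bytes unused  */
    unsigned int colour_off = 64;                      /* cache_line_size() */
    unsigned int colour     = left_over / colour_off;  /* number of colours */

    /* Successive slabs place their objects at offsets 0, 64, 128, 0, ... */
    printf("colours = %u\n", colour); /* prints: colours = 3 */
    return 0;
}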

    /* Determine if the slab management is 'on' or 'off' slab. */
if (size >= (PAGE_SIZE>>3))
/*
* Size is large, assume best to place the slab management obj
* off-slab (should allow better packing of objs).
*/
// The larger the object, the more it pays to keep the slab and object descriptors off-slab
flags |= CFLGS_OFF_SLAB;

// Align size once more, this time to align
size = ALIGN(size, align);

// Start working out the slab's internal layout
if ((flags & SLAB_RECLAIM_ACCOUNT) && size <= PAGE_SIZE) {
/*
* A VFS-reclaimable slab tends to have most allocations
* as GFP_NOFS and we really don't want to have to be allocating
* higher-order pages when we are unable to shrink dcache.
*/
cachep->gfporder = 0;
cache_estimate(cachep->gfporder, size, align, flags, &left_over, &cachep->num);
} else {
/*
* Calculate size (in pages) of slabs, and the num of objs per
* slab. This could be made much more intelligent. For now,
* try to avoid using high page-orders for slabs. When the
* gfp() funcs are more friendly towards high-order requests,
* this should be changed.
*/
do {
unsigned int break_flag = 0;
cal_wastage:
cache_estimate(cachep->gfporder, size, align, flags, &left_over, &cachep->num);
if (break_flag)
break;
if (cachep->gfporder >= MAX_GFP_ORDER)
break;
// If not even one object fits, the page order has to be increased
if (!cachep->num)
goto next;
// A special failure case: step the page order back down and recompute once
if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) {
/* This num of objs will cause problems. */
cachep->gfporder--;
break_flag++;
goto cal_wastage;
}

/*
* Large num of objs is good, but v. large slabs are
* currently bad for the gfp()s.
*/
if (cachep->gfporder >= slab_break_gfp_order)
break;

if ((left_over * 8) <= (PAGE_SIZE << cachep->gfporder))
break; /* Acceptable internal fragmentation. */
next:
cachep->gfporder++;
} while (1);
}

// If no object fits even after raising the order, report the error and free the cache descriptor allocated earlier
if (!cachep->num) {
printk("kmem_cache_create: couldn't create cache %s.\n", name);
// Free the cachep that was already allocated
kmem_cache_free(&cache_cache, cachep);
cachep = NULL;
goto opps;
}
slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab), align);

/*
* If the slab has been placed off-slab, and we have enough space then
* move it on-slab. This is at the expense of any extra colouring.
*/
if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
flags &= ~CFLGS_OFF_SLAB;
left_over -= slab_size;
}

if (flags & CFLGS_OFF_SLAB) {
/* really off slab. No need for manual alignment */
// If the slab descriptor is off-slab, no extra alignment is needed
slab_size = cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
}

// From here on, compute the slab colouring variables
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < align)
cachep->colour_off = align;
cachep->colour = left_over / cachep->colour_off;
cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
spin_lock_init(&cachep->spinlock);
cachep->objsize = size;
/* NUMA */
INIT_LIST_HEAD(&cachep->lists.slabs_full);
INIT_LIST_HEAD(&cachep->lists.slabs_partial);
INIT_LIST_HEAD(&cachep->lists.slabs_free);

if (flags & CFLGS_OFF_SLAB)
// slabp_cache is the general cache that holds the slab descriptors
// This helper is simple: walk malloc_sizes and return the smallest general cache that fits slab_size
cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0);
cachep->ctor = ctor;
cachep->dtor = dtor;
cachep->name = name;

Initializing the CPU-local caches

Once the steps above are done, the cache descriptor and the slab layout (including colouring) are all in place; what remains is to initialize the CPU-local caches. The process is fairly fiddly, so take it slowly. For reference, the array_cache structure at the heart of it is sketched right below.
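
These are the array_cache descriptor and the two helpers used throughout the rest of this note, as defined in the same era of mm/slab.c (reproduced here so the code below is easier to follow); note how the array of object pointers lives directly behind the struct:

struct array_cache {
    unsigned int avail;      /* number of objects currently cached    */
    unsigned int limit;      /* maximum number of cached objects      */
    unsigned int batchcount; /* objects moved per refill/flush        */
    unsigned int touched;    /* set on allocation, read by the reaper */
};

/* The current CPU's local cache for this kmem_cache. */
static inline struct array_cache *ac_data(kmem_cache_t *cachep)
{
    return cachep->array[smp_processor_id()];
}

/* The object-pointer array stored immediately after the descriptor. */
static inline void **ac_entry(struct array_cache *ac)
{
    return (void **)(ac + 1);
}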

/* Don't let CPUs to come and go */
lock_cpu_hotplug();

// Once the slab system is fully initialized, allocate the CPU-local caches via enable_cpucache() -> that function is fairly involved and is covered separately below
if (g_cpucache_up == FULL) {
enable_cpucache(cachep);
} else {
// This part is fairly easy to follow. The global g_cpucache_up tracks whether slab initialization has finished: it never reads FULL before then,
// and it is NONE on the very first call, PARTIAL on every call after that
if (g_cpucache_up == NONE) {
/* Note: the first kmem_cache_create must create
* the cache that's used by kmalloc(24), otherwise
* the creation of further caches will BUG().
*/
cachep->array[smp_processor_id()] = &initarray_generic.cache;
g_cpucache_up = PARTIAL;
} else {
// On every call but the first, the general caches already exist for us to use
cachep->array[smp_processor_id()] = kmalloc(sizeof(struct arraycache_init),GFP_KERNEL);
}
BUG_ON(!ac_data(cachep));
ac_data(cachep)->avail = 0;
ac_data(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
ac_data(cachep)->batchcount = 1;
ac_data(cachep)->touched = 0;
// Keep these two cache-wide parameters consistent with the CPU-local cache
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
// TODO: I don't quite get this - is it [(local caches + the shared cache) * batchcount + the capacity of one slab]?
cachep->free_limit = (1 + num_online_cpus()) * cachep->batchcount + cachep->num;
}

cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep)%REAPTIMEOUT_LIST3;

enable_cpucache()

This function is used in two places: the first is kmem_cache_init(), covered in the previous note; the second is here, after kmem_cache_init() has already completed. Its job is to initialize the CPU-local caches and the shared cache, and almost all of the real work happens in do_tune_cpucache().

[Supplementary note] Whether we come here from kmem_cache_init() or from kmem_cache_create(), neither the local caches nor the shared cache has been fully initialized up to this point (in kmem_cache_init(), each general cache's local cache holds only a single object, and there is no shared cache at all).
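
For example, plugging a hypothetical 512-byte object size into the sizing heuristics in the code below (SMP kernel assumed) yields:

#include <stdio.h>

int main(void)
{
    /* objsize = 512 falls into the "> 256" bucket below, so limit = 54. */
    int limit      = 54;
    int batchcount = (limit + 1) / 2; /* 27, as passed to do_tune_cpucache() */
    int shared     = 8;               /* objsize <= PAGE_SIZE, CONFIG_SMP    */

    /* The shared array ends up sized batchcount * shared entries. */
    printf("local limit %d, batch %d, shared entries %d\n",
           limit, batchcount, batchcount * shared); /* 54, 27, 216 */
    return 0;
}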

static void enable_cpucache (kmem_cache_t *cachep)
{
int err;
int limit, shared;

/* The head array serves three purposes:
* - create a LIFO ordering, i.e. return objects that are cache-warm
* - reduce the number of spinlock operations.
* - reduce the number of linked list operations on the slab and
* bufctl chains: array operations are cheaper.
* The numbers are guessed, we should auto-tune as described by
* Bonwick.
*/
// Choose the local cache's maximum number of free objects based on the object size
if (cachep->objsize > 131072)
limit = 1;
else if (cachep->objsize > PAGE_SIZE)
limit = 8;
else if (cachep->objsize > 1024)
limit = 24;
else if (cachep->objsize > 256)
limit = 54;
else
limit = 120;

/* Cpu bound tasks (e.g. network routing) can exhibit cpu bound
* allocation behaviour: Most allocs on one cpu, most free operations
* on another cpu. For these cases, an efficient object passing between
* cpus is necessary. This is provided by a shared array. The array
* replaces Bonwick's magazine layer.
* On uniprocessor, it's functionally equivalent (but less efficient)
* to a larger limit. Thus disabled by default.
*/
shared = 0;
#ifdef CONFIG_SMP
if (cachep->objsize <= PAGE_SIZE)
shared = 8;
#endif

#if DEBUG
/* With debugging enabled, large batchcount lead to excessively
* long periods with disabled local interrupts. Limit the
* batchcount
*/
if (limit > 32)
limit = 32;
#endif
// Unless this is SMP, shared stays 0 and the shared cache holds no objects
err = do_tune_cpucache(cachep, // kmem_cache_t* cachep
limit, // int limit - max number of objects in the local cache
(limit + 1) / 2, // int batchcount - how many objects move per refill/flush; half of limit here
shared); // int shared
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
cachep->name, -err);
}

do_tune_cpucache()

Before looking at this function we need to understand a few helper functions and structures, such as ccupdate_struct and alloc_arraycache(). Note that I am not going to analyze kmalloc()/kfree() here; we go straight to the other helpers.

// Here we assume NR_CPUS = 1, i.e. a single core
struct ccupdate_struct {
kmem_cache_t *cachep;
struct array_cache *new[NR_CPUS];
};

// Allocate and return the memory for an array_cache
static struct array_cache *alloc_arraycache(int cpu, int entries, int batchcount)
{
// The memory needed covers the descriptor plus the array of object pointers
int memsize = sizeof(void *) * entries + sizeof(struct array_cache);
struct array_cache *nc = NULL;

if (cpu != -1) {
// On ARM or x86_64 this ends up calling kmem_cache_alloc() anyway
nc = kmem_cache_alloc_node(kmem_find_general_cachep(memsize, GFP_KERNEL), cpu_to_node(cpu));
}
// If cpu == -1 (when allocating the shared cache), or nc is still NULL for whatever reason, fall back to kmalloc()
if (!nc)
nc = kmalloc(memsize, GFP_KERNEL);
// Fill in the array_cache fields
if (nc) {
nc->avail = 0; // no object pointers have been stashed yet, hence 0
nc->limit = entries;
nc->batchcount = batchcount;
nc->touched = 0;
}
return nc;
}

// This function releases objects held in an array_cache and updates the slabs they belong to
static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects)
{
int i;

check_spinlock_acquired(cachep);

/* NUMA: move add into loop */
cachep->lists.free_objects += nr_objects;

// The objects in an array_cache may come from different slabs - e.g. in the shared cache
// TODO: can this also happen in a CPU-local cache?
for (i = 0; i < nr_objects; i++) {
void *objp = objpp[i];
struct slab *slabp;
unsigned int objnr;

slabp = GET_PAGE_SLAB(virt_to_page(objp));
list_del(&slabp->list);
// Work out which object (by index) this address is
objnr = (objp - slabp->s_mem) / cachep->objsize;
check_slabp(cachep, slabp);
// See note 12. 【Slab】4 - Basic slab operations - kmem_cache_alloc(), which analyzes how the object descriptors work
slab_bufctl(slabp)[objnr] = slabp->free;
slabp->free = objnr;
STATS_DEC_ACTIVE(cachep);
slabp->inuse--;
check_slabp(cachep, slabp);

/* fixup slab chains */
// Put the slab back on the appropriate list
if (slabp->inuse == 0) {
// When does this happen, i.e. when do we get free_objects > free_limit?
if (cachep->lists.free_objects > cachep->free_limit) {
// Drop one whole slab's worth of free objects from the count
cachep->lists.free_objects -= cachep->num;
// The destroy-related functions deserve a separate note - too much material!
slab_destroy(cachep, slabp);
} else {
list_add(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_free);
}
} else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
*/
list_add_tail(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_partial);
}
}
}

static void do_ccupdate_local(void *info)
{
struct ccupdate_struct *new = (struct ccupdate_struct *)info;
struct array_cache *old;

check_irq_off();
old = ac_data(new->cachep);

// The naming here is unfortunate - everything is called 'new'. This swaps the freshly allocated array with the one cachep currently points to
new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()];
new->new[smp_processor_id()] = old;
}

// This function allocates the CPU-local caches as well as the shared cache descriptor
static int do_tune_cpucache (kmem_cache_t* cachep, int limit, int batchcount, int shared)
{
struct ccupdate_struct new;
struct array_cache *new_shared;
int i;

memset(&new.new, 0, sizeof(new.new));
// A local variable temporarily holds the addresses of the newly allocated arrays
for (i = 0; i < NR_CPUS; i++) {
if (cpu_online(i)) {
// Get the array from a general cache
new.new[i] = alloc_arraycache(i, limit, batchcount);
if (!new.new[i]) {
for (i--; i >= 0; i--) kfree(new.new[i]);
return -ENOMEM;
}
} else {
new.new[i] = NULL;
}
}
new.cachep = cachep;

// This mainly invokes the do_ccupdate_local() callback, which installs the newly allocated local caches into cachep
smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);

check_irq_on();
spin_lock_irq(&cachep->spinlock);
cachep->batchcount = batchcount;
cachep->limit = limit;
// Same as the earlier TODO - still unclear what this value really means
cachep->free_limit = (1 + num_online_cpus()) * cachep->batchcount + cachep->num;
spin_unlock_irq(&cachep->spinlock);

for (i = 0; i < NR_CPUS; i++) {
struct array_cache *ccold = new.new[i];
// Entries that stayed NULL (e.g. the CPU is offline) had nothing swapped back, so there is nothing to free
if (!ccold)
continue;
spin_lock_irq(&cachep->spinlock);
// Release the objects still held in ccold
free_block(cachep, ac_entry(ccold), ccold->avail);
spin_unlock_irq(&cachep->spinlock);
kfree(ccold);
}
// This part allocates the shared cache, much like the above
// The shared cache is given an absurdly large batchcount (0xbaadf00d)
new_shared = alloc_arraycache(-1, batchcount * shared, 0xbaadf00d);
if (new_shared) {
struct array_cache *old;

spin_lock_irq(&cachep->spinlock);
old = cachep->lists.shared;
cachep->lists.shared = new_shared;
if (old)
free_block(cachep, ac_entry(old), old->avail);
spin_unlock_irq(&cachep->spinlock);
kfree(old);
}

return 0;
}
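
The slab_bufctl() juggling inside free_block() above is nothing more than a singly linked free list threaded through an index array. A minimal userspace sketch of the same technique (sizes and values are made up):

#include <stdio.h>

#define NUM_OBJS   4
#define BUFCTL_END 0xffffffffu

static unsigned int bufctl[NUM_OBJS]; /* plays the role of slab_bufctl(slabp) */
static unsigned int free_head;        /* plays the role of slabp->free        */

int main(void)
{
    unsigned int i, objnr;

    /* Initially every object is free: 0 -> 1 -> 2 -> 3 -> END. */
    for (i = 0; i < NUM_OBJS; i++)
        bufctl[i] = i + 1;
    bufctl[NUM_OBJS - 1] = BUFCTL_END;
    free_head = 0;

    /* Allocate: pop the head of the free list (as kmem_cache_alloc does). */
    objnr = free_head;
    free_head = bufctl[objnr];

    /* Free: push the object back, exactly as free_block() does. */
    bufctl[objnr] = free_head;
    free_head = objnr;

    printf("head = %u\n", free_head); /* prints: head = 0 */
    return 0;
}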

Final checks and inserting the cache into cache_chain

This part is simple and can be read straight from the code: walk cache_chain to warn about caches whose name memory has gone away (an unloaded module that never destroyed its cache) and to BUG() on duplicate names, then link the new cache into the list.

    /* Need the semaphore to access the chain. */
down(&cache_chain_sem);
{
struct list_head *p;
mm_segment_t old_fs;

old_fs = get_fs();
set_fs(KERNEL_DS);
list_for_each(p, &cache_chain) {
kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);
char tmp;
/* This happens when the module gets unloaded and doesn't
destroy its slab cache and noone else reuses the vmalloc
area of the module. Print a warning. */
if (__get_user(tmp,pc->name)) {
printk("SLAB: cache with size %d has lost its name\n",
pc->objsize);
continue;
}
if (!strcmp(pc->name,name)) {
printk("kmem_cache_create: duplicate cache %s\n",name);
up(&cache_chain_sem);
unlock_cpu_hotplug();
BUG();
}
}
set_fs(old_fs);
}

/* cache setup completed, link it into the list */
list_add(&cachep->next, &cache_chain);
up(&cache_chain_sem);
unlock_cpu_hotplug();
opps:
if (!cachep && (flags & SLAB_PANIC))
panic("kmem_cache_create(): failed to create slab `%s'\n", name);
return cachep;
}

Summary

And with that, a cache has been created. As the analysis above shows, the freshly created cache does not actually contain any slabs or objects yet; those are only allocated later, when they are really needed. That topic - along with cache destruction, slab freeing, and the rest - will be covered in future notes. This one ends here.
