Flexible user-level implementation of SLAB-like free list

Ivan Matylitski, Minsk, Belarus

LVEE 2013

A project implementing SLAB-like objects free list is presented, targeted to be used in user-space code. The differences from another similar projects are flexibility and orientation to configuration features used at compilation and run-time stages. Code is written in pure C (C11) and allows end user to build library for embedded/low-power systems as well as for using in high-load server systems.

Introduction

Term “SLAB” and the mechanism was initially introduced by Jeff Bonwick in Solaris 2.4 kernel. For now, identical approach has been adopted at vast majority of Unix-like OSes kernels as well as at some not Unix-like systems. SLAB became the state of art in kernel-level object management because of noticeable effectiveness.

Synonym of word “slab” is “chunk”. This word is partially self-descriptive because it reflects how operations are performed inside SLAB structure.

The idea of slab-list (or just “slab”) is very close to the idea of free-list (or object pool). Free-lists are known to be used as a replacement of allocation-initialization-finalization-deallocation (AIFD) cycle for frequently used objects. An object can be placed into a special list of unused instances of particular class just after it is not needed anymore for future re-using.

SLAB works similarly to free-list. The main architectural difference is that SLAB is more self-contained and autonomous.

Firstly, SLAB represents object construction by itself according to parameters specified by user during SLAB-list initialization. Instance construction performs under the hood according to current needs (i.e. if no unused instanced have been found in internal lists).

Secondly, allocation and construction phases act under set of instances. It means that if no unused instances are available then logic will allocate several objects (number depends on particular policy embedded into the logic) at one fell swoop and will initialize them all at once. That’s where, probably, “slab” came from.

Thirdly, internal lists are lists of packages or magazines, which contain particular amount of class instances in array. In advance, all internal memory allocation-deallocation operations use mentioned packages as atomic unit.

And, finally, SLAB contains three lists:

free list – list of packages which don’t contain any currently used instance;
partial list – list of packages which contain both kinds of instances (used and unused ones);
full list – list of packages which contain only used objects and which are not available for further allocation.

About the project

The main goal of presented project is to provide user-space code with the same functionality, which has been available for kernel code.

Code is distributed under the terms of BSD license and published on GitHub:

https://github.com/buffovich/libreclaim – safe memory reclamation facility

https://github.com/buffovich/libmempool – SLAB itself

Implementation provides ability to choose:

particular back-end for memory allocation at compilation stage (ptmalloc, je_malloc, tcmalloc is currently available);
particular scheme of behavior in multi-threaded environment at run-time:
- local lists for each thread;
- global, per-SLAB lists;
scheme of protecting of lists from corruption under multi-threaded contention at run-time:
- simple thread-unaware scheme;
- coarse-grained locks for three lists;
- fine-grained locks;
- lock-less scheme.
optional cache-coloring

Use cases

As you may notice, mentioned SLAB implementation can be adopted for wide range of use cases. For example, if user needs scalable and fast solution with keeping in mind that objects won’t be transmitted to or shared with another threads then combination “per-thread list/thread-unaware” can be employed. If user needs the same level of scalability and thread-safety with keeping in mind that object exchanges are relatively rare then thread-unaware lists can be replaced with coarse-grained locked lists. If scalability is crucial and exchanges will be performed frequently then user can employ lock-less scheme or fine-grained locks.

Let’s consider all available options separately with their advantages and drawbacks:

cache coloring:
- introduce cache-awareness into SLAB and higher performance as a result;
- requires extra memory (do not suit for embedded solutions);
per-thread lists:
- play on local memory for thread and higher speed of memory access as a result;
- extra memory is needed (do not suit for devices with limited memory);
- particular thread cannot reuse objects freed by another thread (if threads created and terminated frequently and cost of object initialization is high then it will provide undesirable overhead);
- for single-core devices has no sense;
per-SLAB lists:
- low-memory footprint (good for embedded devices);
- unawareness about thread-local memory (higher costs of memory access for SMP systems);
simple thread-unaware scheme:
- fast and simple;
- not applicable in condition of concurrent access;
coarse-grained locks for three lists:
- safe in multi-threaded environment;
- simple and effective in sense of wasted CPU power and very fast under low or zero contention;
- slow and enforce extra re-scheduling under high contention;
- overall system progress isn’t guaranteed;
fine-grained locks:
- safe in multi-threaded environment;
- outperforms coarse-grained lock under high contention;
- increased memory and locking overhead because of issuing extra memory barriers (do not suit for cases with low level of contention)
- overall system progress isn’t guaranteed;
lock-less scheme:
- do not use locking under the hood (guarantee of system-wide progress, so it suits for realtime systems)
- re-scheduling is eliminated by design (sometimes can outperforms all lock-based schemes under very high contention on limited amount of cores, so it suits for some high-loaded and embedded systems)
- much more complex solution which can’t be more effective in sense of consumed CPU cycles (do not suit for low-power devices)
- not effective under low contention

Let’s consider available allocation backends and its use cases:

ptmalloc (standard libc6 allocator):
- has intermediate parameters (suits for all use cases);
- lock-based heap
je_malloc:
- relatively scalable because of using thread-local arenas for allocation;
- has lower memory footprint in comparing with other scalable allocators;
tcmalloc:
- very scalable and fast;
- noticeable amount of ancillary memory allocated just for internal usage;

Lock-less lists

It’s worth considering lock-less lists implementation separately. One of solved problems was safe memory reclamation. Scheme offered by Gidenstam, Papatriantafilou, Sundell, Tsigas ¹ is used for this purpose and it has its own sub-project libreclaim. This is an aggregate of two methods:

hazard pointers (M. Michael ²);
reference counting.

Deleted instances list and hazard pointers list are defined for each participating thread. These lists are available for each thread for reading.

Method was modified and simplified in some particular parts:

Original method requires deletion list and hazard list being available for each participating thread. Mentioned lists were integrated into per-thread structure. All structures are linked into double-linked list. Each thread set address of its structure in thread-specific data.

Double-linked list of thread structures is managed with help of RCU reclamation method. Writers are not block readers. But modification is exclusive. For tracking readers, thread structure has special numerical tag which is obtained from common counter. Counter is incremented by each thread with fetch-and-add primitive. Writing procedure is performed in context of thread creation and thread termination. Reading is performed each time thread needs to clean its deletion list. Writers can be readers as well during thread structure reclamation (during thread termination) so they are taken in account in common writers counter.

Lock-free double-linked list structure which is used in lock-less SLAB is constructed according to Sundell and Tsigas ³. They offered scheme where prev pointer plays auxiliary role. Hence we can employ single-word compare-and-swap for atomic insert/delete. One of the modifications of original algorithm is that deletion phase has two kinds:

moving into another list;
actual deletion and reclamation.

Moreover, actual deletion can be canceled if someone allocated object from chunk which is under reclamation. Also, all pointers are tagged according to the list they are supposed to be.

References

¹ Anders Gidenstam, Marina Papatriantafilou, Hakan Sundell, Phillipas Tsigas. Efficient and Reliable Lock-Free Memory Reclamation Based on Reference Counting. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

² Maged M. Michael. Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes. IBM Thomas J. Watson Research Center P.O. Box 218 Yorktown Heights NY 10598 USA

³ Hakan Sundell, Phillipas Tsigas. Lock-free deques and doubly linked lists. J. Parallel Distrib. Comput. 68 (2008) 1008–1020

Abstract licensed under Creative Commons Attribution-ShareAlike 3.0 license