A Scalable Memory Management Service Based on Microkernel OS in Multicore Environment
Abstract: New requirements on operating systems are arising as core counts grow exponentially, while overall operating system performance declines as a result of lock contention inside kernel functions. This paper addresses the scalability concerns of the microkernel architecture by providing flexible management of computing resources and an explicit data layout that avoids locking. To this end, a scalable memory management service (MMS) based on a microkernel operating system is proposed. To eliminate lock contention over page pools, physical memory is partitioned among servers. This raises new problems, namely load balancing and "distributed memory fragmentation". MMS follows a one-master, multiple-slave strategy: the master adjusts loads and routes requests with a global view of memory, while the slaves manage the distributed page zones and processes' virtual memory areas. Experimental results show that MMS achieves better scalability than Linux on a 32-core machine.
The industry's increasing demand for processing performance requires the development of multiprocessor technologies. Since a single processor's frequency has already met its limit, multicore hardware architectures and many-core technology are emerging rapidly. It is expected that future chips will carry hundreds to a thousand cores.
Nevertheless, such advancement brings new problems. When Linux, a traditional monolithic operating system, moved from the single-core era to symmetric multiprocessing (SMP) systems, it attached big kernel locks to protect shared kernel data. As core counts grow, this lock contention leads to declining scalability.
This paper highlights memory management using the buddy allocation algorithm. Despite optimizations of the operations over the page pool in Linux, results still show severe performance degradation.
Wentzlaff et al. offered the following suggestions to address the above scalability issues:
a) Factorizing operating system functions into a variety of services, which reduces the probability of lock contention.
b) Distributing shared data among the service processes, each individually isolated, to avoid locking.
However, when clients request service from multiple concurrent MMS processes, the scalability problem remains. We counter it by partitioning the operating system's physical memory and giving each server its own paging data structures within its own boundary, with which it handles allocation and freeing requests from clients. This raises new issues concerning load balance, distributed memory fragmentation, and the independence of physical memory management (PMA). We therefore introduce a master-slave design: physical memory is partitioned into several continuous areas, each manipulated by a corresponding server. Page allocation/freeing and the management of a process's virtual memory areas (VMAs) are assigned to the slave servers, while the master holds a global view of the OS's physical memory and is responsible for managing the slaves, adjusting and balancing their loads.
2. Microkernel architecture:
Our multicore operating system consists of a microkernel and multiple servers. The microkernel manages core resources, interrupt handling (e.g. the page-fault handler), inter-process communication, and process scheduling with time multiplexing. The remaining tasks are handled by servers executed as service processes bound to dedicated cores. Cores are classified as service cores, which run servers, and application cores, which run clients.
Microkernel: processes communicate through messages passed by the microkernel, based on shared-memory copying. For interrupt handling, the microkernel merely sends a message to the appropriate server according to the type of interrupt or exception; the remaining work is done by the servers. This asynchronous style of interrupt handling keeps the microkernel lightweight.
Servers: a server in our microkernel operating system operates like a web server. Specifically, it repeatedly fetches messages from its message buffers and handles the requests. The serverlib in Fig. 1 is a user-level library used by servers to manipulate their thread queues and message buffers. Each server has three ring message buffers, storing emergency messages, response messages and request messages. Fig. 2 shows the working of a server. The slaves manipulate the several continuous page zones into which physical memory is partitioned; a slave performs page allocation/freeing only from its local page zone. Each slave shares its page zone with the master, which thus gets an overall view of the whole physical memory in the operating system.
Fig. 3 shows the framework of MMS and the locations and views of the page zones and mm_struct in the different servers.
Since the master does not have enough work of its own, it also performs a small amount of page allocation/freeing in our implementation.
Page allocation: The buddy algorithm is employed for page allocation/freeing in our implementation. The free pages in a zone are organized in a multi-level linked list, where each list links a set of page blocks whose size is a power of two. For different requirements, we implement two page allocation modes: immediate allocation and delay allocation. Each mode has its advantages and disadvantages. Immediate allocation has lower memory utilization efficiency but higher execution efficiency. Delay allocation utilizes physical memory more effectively, since it allocates pages only when the application actually needs memory; nevertheless, it may incur large context-switch overhead caused by page faults and message passing.
Page freeing: Fig. 4 shows an example in which the pages in an area are freed.
Discussion: In our master-slave structure, there is no message communication among slaves, and the only lock contention is introduced by the locks protecting the servers' message buffers. Hence, ideally, the overall throughput tends to grow linearly with the number of slaves. The static routing policy also cuts down the overall message overhead, because the master routes an allocation request only when a client is forked or its fellow server changes. This makes the master unlikely to be a bottleneck of MMS.
3. MMS with Master-Slave structure:
This section discusses MMS, its service model and its lock-less design, then covers the issues arising during page allocation/freeing together with our solutions and analysis.
Introduction: MMS comprises many servers, each running on a dedicated service core. It manages physical memory and the VMAs of processes in the operating system. The PM server forks every application; it also initializes the PCB and performs other tasks without locking. For global pages, a distributed mode is used to avoid lock contention.
Service model: After hardware initialization completes, the first server initialized is picked as the master server, while the rest become slave servers.
Load balancing policy (LBP) and distributed memory fragmentation: We solve the two problems together by routing allocation requests. The routing policy tries to balance the free memory volume of the slaves while averting distributed memory fragments. We set two thresholds, HighMem and LowMem: HighMem is a warning watermark indicating that a slave's free memory is deficient, while LowMem indicates that its memory is abundant. When selecting a fellow server, the master keeps the estimated allocated-memory ratio between LowMem and HighMem as much as possible. Fig. 5 describes how MMS routes allocation requests. The policies are as follows:
a) Choosing the slave with the minimum load when there are slaves whose estimated loads are below LowMem.
b) Choosing the one with the maximum load below HighMem when all loads exceed LowMem.
4. Evaluation:
Experimental environment and methodology: In this section, we compare the scalability of MMS on the microkernel with that of the MM module in Linux, and investigate features of MMS such as overhead and load balancing. The experimental platform is a Dell PowerEdge R910 server with four chips, each carrying eight Intel Xeon E7-4820 2.00 GHz cores. The eight cores on a chip share an 18 MB L3 cache. The physical memory size is 100 GB, but our system uses only 4 GB.
A memory micro-benchmark in Linux calls mmap and munmap repeatedly to expand and shrink the address space as quickly as possible. The results are collected by executing one independent copy of the micro-benchmark on each core.
For the evaluation of MMS, we assign 4 cores to run parallel MM servers, 24 cores as application cores, and the remaining 4 cores to run the other servers. We deploy one MM server on each service core. To keep the servers busy, we deploy multiple clients on each application core. As with the memory micro-benchmark in Linux, each client issues allocation and freeing requests to stress MMS. Besides, in order to investigate the performance of real applications on our system, we port three programs from the SPEC CPU2006 benchmarks: Lbm, Sjeng and Bzip2. These applications perform plenty of computation and memory accesses but little I/O, so they are fit for testing memory allocation/freeing. They are executed repeatedly, like the micro-benchmark, but we modify their code further to test memory behavior more directly:
a) Removing I/O cost. The input data is read from prewritten memory rather than from files; output file writing and result printing are removed.
b) Stressing MM fully. Note that malloc/free optimize memory allocation/freeing to avoid trapping into the kernel, such that requests can usually be finished in user space. In order to exercise MM more frequently, we replace the two library functions with the corresponding system calls.
Results and analysis:
Scalability comparison: We set both LowMem and HighMem to 100% for simple load balancing, without considering distributed memory fragmentation, and the master partitions the memory zones evenly, since we are more concerned with raw performance in the scalability comparison test. Figs. 6 and 7 depict the throughput of the two page allocation modes of MMS, MMS-I (immediate) and MMS-D (delay), compared with Linux under different benchmarks. Performance is measured as the number of allocation/freeing pairs processed per million cycles. As shown in Fig. 6(a), (b) and (c), Linux reaches a bottleneck at 4-6 cores in most cases; beyond that, increasing the number of concurrent micro-benchmarks does not improve throughput further, since more time is spent contending for the locks protecting the global page pool. As MMS performs allocation on partitioned physical memory without locking, its throughput increases with the number of servers. In addition, the throughput of MMS is also affected by the number of clients: with few clients, insufficient requests may leave servers idle, but as the number of clients rises, the throughput of MMS-I gradually reaches a peak. We attribute the peak to the limits of serviceability and the locks over the servers' message buffers. MMS-I(1) exceeds Linux when the number of application cores is 8, 4 and 1 in Fig. 6(a), (b) and (c) respectively. The superiority of MMS-I is, however, partially attributable to the immediate allocation manner, because the overhead caused by large numbers of page faults is a critical reason for the significant performance degradation in Linux. After our modification of the three SPEC benchmarks, the primary bottleneck lies in memory allocation and freeing. Bzip2 allocates about 1600 pages in its initialization phase for storing compressed or decompressed results; in the compression phase, it handles 5000 bytes of input at a time, allocating 2 pages for temporary data beforehand and freeing them afterwards.
In Fig. 6(e), the throughput of Linux drops at 10 cores, and MMS-I exceeds Linux at 5 cores. Sjeng is a chess program. In Fig. 6(f), Sjeng achieves good scalability both in Linux and in our system, since the performance of a computation-intensive program is hardly impacted by the memory management subsystem. From Fig. 7, in contrast to MMS-I, we can observe that the overall performance of MMS-D degrades, because the lower efficiency of the delay allocation mode makes it spend much time allocating single pages one by one; the extra message overhead of MMS-D is also a factor. But since Linux also allocates pages in a delayed manner, the MMS-D vs. Linux results in Fig. 7 are the fairer and more convincing comparison. In Fig. 7(e), the throughput trends of MMS-D(3) and MMS-D(4) are close, since 3 servers are enough for Bzip2 in the 32-core case. In summary, MMS-D surpasses Linux in maximum throughput by 219%, 186% and 197% respectively for the micro-benchmark, and by 90% and 49% for the two memory-stress benchmarks, Lbm and Bzip2. As in Fig. 6(f), MMS-D shows almost the same performance trends as Linux for Sjeng.
Finally, we observe an abnormal phenomenon in Figs. 6 and 7: the throughput of MMS does not always increase steadily with the number of clients and servers, and sometimes even fluctuates a little. This may be caused by contention for shared hardware resources, such as the interconnection bus, last-level cache, prefetcher and memory controller.
b) Overhead: This experiment measures the extra overhead caused by MMS and the microkernel architecture. The whole overhead involves message passing, thread switching of servers, and the waste caused by idle service cores. Our goal is to check whether the master is the bottleneck and to evaluate the unavoidable overhead of our system. Table 1 shows the workload and overhead of operations in MMS.
By subtracting the useful time from the total time, we obtain the duration of the extra work. Overall, the overhead is positively correlated with the number of messages. The allocation workload ratio of MMS-I increases with the number of pages, since it demands more time for allocating larger page blocks. The master's workload accounts for only a small part, since it generates few extra messages owing to static routing. Hence, the master does not become a bottleneck of the whole system.
c) Load balance: In this evaluation, we test the effect of LBP. As presented in Section 3, the master's policy depends on the estimated load, which is the allocated-memory ratio a slave would reach if it handled all the messages currently in its buffers; in effect, the value predicts the future load. The consequence of load imbalance is that overloaded servers temporarily cannot provide service, and distributed memory fragmentation leaves the whole MMS unable to allocate large memory in time. An alternative, simpler allocation request routing method is the round-robin policy (RRP). RRP distributes loads evenly across all servers in general, but it may fail when some programs demand large and irregular amounts of memory; and programs with varied behaviors are a common case in real environments.
In order to assess the impact of the two methods on MMS serviceability, we conduct an experiment that stresses the amount of free memory. Sjeng is used in this test for two reasons: 1) it holds allocated memory for a long time during computation; 2) the sizes of the hash tables it requests can be regulated as needed.
In Table 2, the rate in parentheses is the LowMem value used in LBP. The overloaded time ratio is the average value, divided by 4. As can be seen, LBP with different LowMem values always achieves a very low overloaded time ratio, since the master tries to keep estimated loads from exceeding HighMem when routing requests; overload happens only when all servers are short of free memory. RRP is unable to perceive load status, so the master learns of overload only when an overloaded slave notifies it. The result means that, on average, servers cannot serve at least a (1 − HighMem) fraction of allocations during 12.76% of the period. Moreover, the exception frequency increases with LowMem. When LowMem and HighMem are both 95%, LBP becomes an absolute load balancing policy, such that the exception occurs most frequently due to distributed memory fragmentation.
5. Related Work:
So far, there have been a great many studies on identifying performance bottlenecks and improving the scalability of Linux on multicore systems. Cui et al. use micro-benchmarks to stress separate parts of the Linux kernel, and identify scalability problems in memory-mapped file creation and deletion, file descriptor operations, and System V semaphore operations. Veal and Foong evaluate the performance of Apache on an 8-core machine with a modified SPECweb2005 Support workload; their experimental results expose bottlenecks in the scheduling and directory lookup of Linux. During the long history of Linux development, the community and academia have both proposed approaches to scale Linux on multiprocessors. Read-copy-update (RCU) creates a copy before writing shared data, then modifies the copy and publishes the update later, so that read operations need not acquire locks. A lock-contention-aware scheduler monitors the percentage of lock waiting time of all tasks and migrates lock-intensive tasks to a special set of cores; it mitigates scaling problems by limiting the number of parallel streams that cause contention. Boyd-Wickizer et al. mitigate most scalability bottlenecks in the Linux kernel for a set of application benchmarks, MOSBENCH, using general techniques such as per-core data structures, lock-free protocols and the avoidance of unnecessary locking. They suggest that Linux 2.6.35 can still scale well on a 48-core machine and that there is no scalability reason to give up the traditional monolithic organization just yet. However, there is no evidence that the scalability problems of a monolithic kernel can be completely eliminated for other benchmarks, particularly with rapidly increasing core counts in the future.
Especially for scaling memory allocation, Linux improves parallelism by using per-CPU free lists: it maintains a small list of free pages for each core and draws from these lists when serving single-page requests. Hoard employs local heaps for each CPU, similarly to the per-CPU lists in Linux, plus a global heap to balance the local heaps. The deficiency of both methods is that the global heap must be accessed serially, so it becomes a source of contention. Some studies suggest that it is difficult to scale a monolithic kernel to multicore systems merely by optimizing code as core counts increase. Yuan et al. present an asymmetric operating system, GenerOS, which splits the Linux kernel into multiple modules that collaborate via message passing to complete kernel functions. It mitigates lock contention merely by restricting the scale of parallel streams in each module, and the servers still share global data in the kernel. Differing from GenerOS, we remove the lock contention among servers completely by distributing physical memory.
FOS is a microkernel-like system inspired by distributed systems. It adopts an all-distributed mode for scalability, and its servers are absolutely isolated, so it can easily be extended to unshared-memory multicore systems. FOS later evolved to implement various parallel servers called fleets; its memory management fleet uses a distributed buddy page allocator organized in a ring fashion. The major difference is that we use a master-slave structure to achieve more flexible features, such as load balancing with centralized policies and support for two page allocation modes for different demands. Furthermore, we allow the contention-less data structures (i.e. the mm_struct of MMS) to be shared for the convenience of shared-memory systems, rather than adopting an all-distributed mode.
Core counts have been growing explosively within the last decade, and traditional monolithic OSs are gradually becoming unsuitable for large multicore hardware. In this work, we propose a master-slave model and present a scalable memory management service based on a microkernel OS. It solves a series of new problems caused by transplanting the MM module to the microkernel architecture. We conduct a set of experiments on a 32-core machine to validate its superior scalability over Linux and its relatively low overhead. Note that the overall throughput of the servers is related to the quantity of requests from clients, so we plan to study the elasticity of services next. The difficulty of making OS functions elastic exists in microkernels but not in monolithic kernels, since a monolithic kernel inherently scales in proportion to demand. Our goal is to make the services able to grow and shrink adaptively with the working loads of the servers and the intensity of requests.