From: "Bulent Abali" <abali@us.ibm.com>
To: linux-mm@kvack.org
Subject: Linux MXT support code (long posting, part 2 of 2)
Date: Mon, 16 Apr 2001 09:20:08 -0400

In this posting I will describe the operation of the kernel patch and the mxt driver

http://oss.software.ibm.com/pub/mxt/patch-2.4.3-md7

The 2.4.3 kernel with MXT support code has been tested extensively with stress tools. If you have suggestions for a test, please write to me. Comments and contributions are always welcome.

Thanks
Bulent Abali
abali@us.ibm.com

The software design
-------------------

Memory compression is done entirely in hardware, transparent to the CPUs, I/O devices, peripherals and all software including apps, device drivers and the kernel. Then why do we need a kernel patch and a device driver to support MXT? Here are the two main reasons:

- The kernel patch, at boot time, looks for an MXT signature in the BIOS and doubles the size of the address space if the signature is found.

- The driver handles the corner case when the 2 to 1 compressibility assumption breaks down. Without the driver, compressed memory may run out when memory holds incompressible data.

In MXT systems the memory is overcommitted. There exists X amount of memory in the system but the system runs with 2X the address space. For example, you may have 512 MB installed but the system boots and runs as if it had 1 GB of memory. What happens when you fill the 1 GB memory with incompressible data, for example a huge zip file which cannot be further compressed into 512 MB by the MXT memory controller? This is handled by the MXT driver, which manages the compressed memory as I explain here.

One important point is that the kernel patch and the driver are unobtrusive: they can be compiled into any i386 kernel build, and will run on any hardware, MXT or not. If the system is not MXT-enabled, neither the patch nor the driver module will change the operation of the standard kernel.

Which files?
------------

The mxt driver module is in

    arch/i386/kernel/mxt.c and
    include/asm-i386/mxt.h

The kernel patch touches the following files, with the number of lines touched or added indicated in parentheses.

    arch/i386/kernel/setup.c  (84 lines)
    arch/i386/mm/init.c       (2 lines)
    mm/mmap.c                 (6 lines)
    mm/page_alloc.c           (10 lines)

Here are the additions to arch/i386/kernel/setup.c
--------------------------------------------------

In the standard kernel, setup_memory_region() determines the size of the system memory as obtained by BIOS E820 calls. If the box is MXT-enabled, we must double the size of the address space using setup_mxt_memory()

--- v2.4.3/linux/arch/i386/kernel/setup.c       Sun Mar 25 21:24:31 2001
+++ linux/arch/i386/kernel/setup.c      Tue Apr  3 09:17:23 2001

        setup_memory_region();
+#ifdef CONFIG_MXT
+       setup_mxt_memory();
+#endif

At boot time, the MXT boxes have an address space 2X the size of the installed memory. But the BIOS reports only 1X in the E820 call. This is for the safety of unaware users and will avoid costly support and help-desk calls. If the BIOS were to report the 2X address space in E820, and a user installed a vanilla operating system (not MXT-enabled) and then started using 1 GB of memory, he could run for a long time without knowing that there was only 512 MB in the system. And without the compressed memory management software, the system could eventually run out of compressed memory. So to avoid potential problems, we hide this additional memory from non-MXT-aware OSes by reporting only the physical memory size in the E820 call.

But MXT-aware kernels, for example one with this patch, look for an MXT signature in the BIOS EBDA area. The MXT BIOS leaves a special table in the EBDA area which can be searched for by MXT-aware kernels and drivers. The table format is mutually agreed upon by hardware vendors operating in the MXT space. The table basically indicates how much additional "memory", i.e. address space, exists due to MXT.

Here is the format of the MXT table

http://oss.software.ibm.com/developerworks/opensource/mxt/publications/mxt_boot_spec041201.pdf

In summary, the setup_mxt_memory() function looks for an MXT table in the EBDA area and, if one is found, makes add_memory_region() calls to increase the address space size from 1X to 2X. After this point, for all practical purposes, the memory size will appear as 2X the installed memory.

Now let's look at the compressed memory management code in
arch/i386/kernel/mxt.c
----------------------------------------------------------

Compressed memory management is actually quite simple. The code needs to perform the following tasks:

1-Measure compressed memory utilization.

2-If utilization exceeds predetermined thresholds, reclaim pages from other processes and clear the pages. This will reduce utilization because a 4 KB page filled with zeros occupies only 64 bytes in the compressed memory.

3-Steal CPU cycles from processes trying to push up compressed memory utilization.

The first task is accomplished using an architected register called the Sectors Used Register (SUR), found on the memory controller. SUR is memory mapped and can be conveniently read by the device driver as shown.

+       sectors = READ_CTRL(SUR);
+       numpages = SECTORS_TO_BYTES(sectors) >> PAGE_SHIFT;

This value differs from the amount of memory the kernel thinks the system is using. The kernel might see 800 MB being used, but if the contents are 2x compressible, then SUR will read only 400 MB.

The second task is accomplished by comparing the value of SUR to a set of predetermined thresholds. This is done by a combination of polling and interrupts. There exists a register called the Sectors Used Threshold Low Register (SUTLR). When the SUR value exceeds SUTLR, the memory controller will interrupt the driver to indicate that compressed memory utilization is too high. The SUTLR threshold is calculated at driver load time. It is typically greater than 90% of the installed memory size.
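
The SUR conversion and the SUTLR comparison can be sketched in plain user-space C. This is only an illustration under stated assumptions: the 256-byte sector size, the 4 KB page size, and all sketch_* names are hypothetical and are not taken from mxt.c; the real driver reads the registers through READ_CTRL() and uses its own SECTORS_TO_BYTES() macro.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real driver takes these
 * from its own headers. */
#define SKETCH_SECTOR_BYTES 256u  /* assumed compressed-memory sector size */
#define SKETCH_PAGE_SHIFT   12u   /* 4 KB pages */

/* Convert a sectors-used reading into the equivalent number of pages
 * of physical (compressed) memory. */
static inline uint64_t sketch_sectors_to_pages(uint64_t sectors)
{
    return (sectors * SKETCH_SECTOR_BYTES) >> SKETCH_PAGE_SHIFT;
}

/* Derive an interrupt threshold, in sectors, as a percentage of the
 * installed memory size. */
static inline uint64_t sketch_threshold_sectors(uint64_t installed_bytes,
                                                unsigned percent)
{
    assert(percent <= 100);
    return installed_bytes / SKETCH_SECTOR_BYTES * percent / 100;
}

/* Nonzero when the sectors-used reading exceeds the threshold, i.e. the
 * condition under which the controller would interrupt the driver. */
static inline int sketch_over_threshold(uint64_t sur, uint64_t sutlr)
{
    return sur > sutlr;
}
```

With 512 MB installed and a 90% threshold, sketch_threshold_sectors(512ULL << 20, 90) gives the sector count at which the interrupt would fire under these assumptions; a reading equivalent to 400 MB stays below it.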

For example, assume a system with 512 MB installed memory and 1 GB address space, and assume that 800 MB of memory is in use. When compressed memory utilization exceeds 0.9 x 512 = about 460 MB, the memory controller will interrupt the driver, which will start working to reduce the utilization. The driver will signal its "memory eater" threads, named cmp_eatmem(), to start allocating pages using the alloc_clear() function. The alloc_clear() function uses the kernel alloc_pages() function to get free pages and then uses the fast-page-op to clear the page contents. This will reduce compressed memory utilization.

+               page = alloc_pages(gfp_mask,0);
+               if ( ! page )
+                       break;
+               mxt_clear_page(page);
+               ++cleared;
+               ++(*held);
+               list_add( &page->list, head );

Allocating pages in this manner puts pressure on the memory manager, which must find free pages to deliver to the eatmem threads. So alloc_pages() and its friends kswapd and bdflush will start working to reclaim used pages from other processes. As it is reclaiming pages, alloc_pages() might shrink the page cache, buffer cache and so forth. It might also force processes' pages to be swapped out. Reclaimed pages are then delivered to the eatmem threads, which will hold on to them until the compressed memory pressure goes away.

There are actually several thresholds. As each threshold is crossed, the driver gets more aggressive in grabbing and holding more pages. The thresholds are found in struct mc_th and are named release, acquire, danger, intr.

    release < acquire < danger < intr

is implied.

Below the release threshold, eatmem threads release all the pages they have been holding.

Above the acquire threshold, eatmem threads allocate and clear some pages but they leave some free memory still available for allocation by other processes.

Above the danger threshold, eatmem threads allocate all the free pages in the system and even ask for more.

Above the intr threshold, the behaviour is the same as above danger, and additionally the cpu blocker threads start running (more on this later).
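
The four threshold zones can be summarized as a small classification function. This is a sketch, not the driver's code: the struct layout, the enum, and the sketch_* names are hypothetical, utilization is expressed as an integer percentage for simplicity, and the behaviour between release and acquire (keep whatever is currently held) is an assumption, since the posting does not spell it out.

```c
#include <assert.h>

/* Hypothetical mirror of the four thresholds named in the text, in
 * percent of installed memory. The real struct mc_th may differ. */
struct sketch_thresholds {
    unsigned release, acquire, danger, intr;
};

enum sketch_action {
    SKETCH_RELEASE_ALL,           /* below release: give back held pages   */
    SKETCH_HOLD,                  /* release..acquire: no change (assumed) */
    SKETCH_ACQUIRE_SOME,          /* above acquire: grab pages, leave some */
    SKETCH_ACQUIRE_ALL,           /* above danger: grab all free pages     */
    SKETCH_ACQUIRE_ALL_AND_BLOCK  /* above intr: also run cpu blockers     */
};

/* Map the current compressed memory utilization (percent) to the action
 * the eatmem threads would take in that zone. */
static inline enum sketch_action
sketch_classify(const struct sketch_thresholds *t, unsigned util)
{
    /* The ordering release < acquire < danger < intr is implied. */
    assert(t->release < t->acquire &&
           t->acquire < t->danger &&
           t->danger < t->intr);

    if (util > t->intr)
        return SKETCH_ACQUIRE_ALL_AND_BLOCK;
    if (util > t->danger)
        return SKETCH_ACQUIRE_ALL;
    if (util > t->acquire)
        return SKETCH_ACQUIRE_SOME;
    if (util < t->release)
        return SKETCH_RELEASE_ALL;
    return SKETCH_HOLD;
}
```

Checking the highest threshold first keeps each zone's test to a single comparison. With illustrative values such as release 80, acquire 88, danger 92 and intr 96 (only 88 and 92 appear in this posting; the other two are made up), a utilization of 90 falls in the "acquire some" zone.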

The danger threshold is the target compressed memory utilization. To summarize, we don't want compressed memory utilization to ever exceed danger. Between acquire and danger we allocate some pages as a guard but leave some free pages. Above danger we work very hard to reduce utilization below danger.

So how many pages must the eatmem threads allocate and clear? This is calculated in the mc_adjust_check() function. The driver uses linear extrapolation to determine the number of pages that must be removed from the system. Assume that maxmu is the target compressed memory utilization and mu is the current compressed memory utilization. And assume that usedpages is the amount of memory currently in use. Then the maximum desired used pages count is calculated as

    max_pages = (maxmu / mu) * usedpages;

For example, say this is a system with 512 MB installed and a 1 GB (2X) address space. Say the current compressed memory utilization is 90%, which is above acquire (88%) but below danger (92%). And assume that 700 MB of memory is currently in use. Therefore max_pages = (0.92/0.90) x 700 = 715 MB. So the eatmem threads grab and hold 1024-715 = 309 MB of pages. This leaves 15 MB of free memory.

Now assume that the contents of the 700 MB start becoming less compressible, such that current utilization becomes 0.94, exceeding the danger (0.92) threshold. Then the equation becomes max_pages = (0.92/0.94) x 700 = 685 MB. It means that the eatmem threads must allocate and clear more memory, a total of 1024-685 = 339 MB. Since 685 MB is smaller than 700 MB, the memory allocator alloc_pages() must free up 15 MB from somewhere; this might be the page cache, the buffer cache, or the swapping out of process pages. As pages are allocated by the eatmem threads, they are zeroed and the compressed memory utilization decreases.

CPU blockers
------------

CPU blockers are needed to stall processes that push compressed memory utilization above the highest threshold, intr. There is one CPU blocker thread per CPU.

As eatmem threads are allocating pages, they might be descheduled in the alloc_pages() function, because allocating pages might require yielding to kswapd or bdflush to free up some pages. For example

mm/page_alloc.c:
        wakeup_kswapd();
        if (gfp_mask & __GFP_WAIT) {
                __set_current_state(TASK_RUNNING);
                current->policy |= SCHED_YIELD;
                schedule();
        }

Furthermore, kswapd and bdflush may suspend waiting for disk I/O to complete (e.g. when swapping). When the alloc_pages() routine yields the CPU as shown above, the nasty process pushing up the compressed memory utilization may start executing and increase the utilization further. So, to avoid positive feedback like this, above the intr threshold the driver signals the cpu blockers (called cmp_idle()) to start spinning and wasting CPU cycles at a high priority.

Granted, this approach is heavy handed and one might do a better job in kernel/sched.c, but my motivation here was to change the fewest lines of code in the kernel. Also note that it is not always easy to identify the processes pushing up the compressed memory utilization in order to suspend them.

Furthermore, the system rarely gets above the intr threshold during normal use. When we're running stress tests to push up the utilization, over a one day period the system goes above the intr threshold maybe 5-6 times at most, and only for a few seconds at a time. So the cpu blockers almost never run under normal use. I believe this approach is a good tradeoff against code complexity.

Swap reservation mm/mmap.c
--------------------------

In MXT systems the memory is overcommitted. As I stated before, there exists X amount of memory in the system but the system runs with 2X the address space. In the worst case, 2X-1X=1X of memory may need to be swapped out to the swap disk if the contents of memory become totally incompressible. Therefore, swap space must exist to cover the worst case. So we "reserve" some swap space in vm_enough_memory() as shown below.

--- v2.4.3/linux/mm/mmap.c      Wed Mar 28 15:55:34 2001
+++ linux/mm/mmap.c     Tue Apr  3 09:17:23 2001

-       free += nr_swap_pages;
+       free += nr_swap_pages - swap_reserve;

This is not so wasteful. Disk space is cheap. It costs about $3/GB these days. If your system does not need swap, for example if it is a webserver which mostly uses the page cache, then you can override this reservation through the /proc file system

    echo 0 > /proc/sys/mxt/swap_rsrv

One thing worth mentioning is that swap_reserve is set to 0 on non-MXT systems. Therefore, the patch will not change the behaviour of vm_enough_memory() when the kernel is running on non-MXT systems.

In conclusion, the MXT support patch and driver help keep the compressed memory utilization down when needed. The discussion of cpu blockers, eatmem threads and swapping of overcommitted memory might have given you the impression that there is an ongoing fight to keep the compressed memory utilization down. However, these do not happen during normal use. Most applications are compressible by a factor of 2 or even better. The MXT driver code almost never gets activated. If you want to see it with your own eyes, try the graphical monitor xcompress on your system

http://oss.software.ibm.com/developerworks/opensource/mxt/tools/xcompress.tar.gz

Xcompress samples the memory and estimates compressibility. It produces a number between 0.0 and 1.0. Smaller is better. For example, 0.5 means that memory is compressible by a factor of 2 (1/0.5).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/