From: "Bulent Abali" <abali@us.ibm.com>
To: linux-mm@kvack.org
Subject: Linux MXT support code (long posting, part 1 of 2)
Date: Mon, 16 Apr 2001 09:18:28 -0400

Greetings,

I am posting here a description of Memory Expansion Technology (MXT) and the Linux MXT support patch for review by the linux-mm community. To save bandwidth I am not posting the patch itself, but would rather point you to the URL

http://oss.software.ibm.com/pub/mxt/patch-2.4.3-md7

This email is part 1 of 2 and describes the MXT hardware. Part 2 describes the software design of the Linux MXT support patch. Comments and contributions are welcome. Please feel free to send me email if you have specific questions. There are also web pages dedicated to MXT-related material at

http://oss.software.ibm.com/developerworks/opensource/mxt/

Thanks
Bulent Abali
abali@us.ibm.com

Intro
---------

MXT is a hardware technology for compressing main memory contents. MXT doubles the effective size of the main memory. For example, 512 MB of installed memory appears as 1 GB. This is done entirely in hardware, transparent to the CPUs, I/O devices, peripherals and all software including apps, device drivers and the kernel, with the exception of fewer than a hundred lines of code added to the base 2.4 kernel.

The Pinnacle memory controller chipset from ServerWorks Inc. incorporates MXT for x86 platforms. ServerWorks, IBM, another well-known computer company, and a motherboard company are working to create systems based on the Pinnacle chipset.

Motivation
----------

Memory seems to be cheap. Why bother with MXT to double the size of memory? Simply put, MXT saves money, and lots of it. Memory is not really cheap, especially when your system uses 512 MB or more. Try the on-line price configurators of Compaq, IBM, Dell, etc. to double the size of your system memory. Going from 512 MB to 1 GB adds about $1200 U.S. and going from 1 GB to 2 GB adds about $2500 to the cost of a typical low-end server.
Memory prices are very volatile and depend on the phase of the moon and whom you buy memory from, but in any case MXT will save money because the incremental cost of the MXT chipset is estimated to be in the $100 range.

Yet another consideration is the density of the DIMMs. A typical low-end system has 4 DIMM slots. If your system needs more memory, then you need to use higher density DIMMs, which cost tons of money. 512 MB and 1 GB DIMMs are a lot more expensive than 256 MB DIMMs on a price-per-megabyte basis. In another (rather extreme) example, there are no reasonably priced 2 GB DIMMs to get to an 8 GB system with 4 DIMM slots. You need to buy a "high-end" server with more than 4 DIMM slots, but those boxes cost a lot of money. MXT will give you the 8 GB effective size in only four slots with four 1 GB DIMMs.

In summary, MXT gives you a price/performance advantage. You can either raise performance through the 2x memory expansion OR you can use half the amount of memory you would normally have used and get the same kind of performance at a lower cost.

Hardware Description
--------------------

MXT is incorporated into the memory/cache controller chipset. The controller sits on the Intel 133 MHz front side bus and services memory requests just as any other memory controller does. The difference is that this memory controller has built-in compression/decompression circuitry. It compresses data before writing it to the main memory and decompresses it in the other direction. The compression block size is 1 kilobyte. Since the compression/decompression latency of 1 KB blocks is relatively long, the controller is augmented with a 32 MB (yes, megabyte) third level (L3) cache, which contains uncompressed data, to absorb this latency. Most memory accesses are served by the L3; benchmarks show that the L3 has a typical hit ratio of 98 percent or more. The L3 cache is made of double data rate (DDR) SDRAM. The main memory uses standard off-the-shelf 133 MHz SDRAM.
The MXT compression scheme stores compressed data blocks in the (compressed) main memory in a variable length format. The unit of storage in the main memory is a 256 byte sector. Depending on its compressibility, a 1 KB block in the L3 cache may occupy 0 to 4 sectors in the main memory. The 1 KB block address is translated to a main memory address by a lookup in the Compression Translation Table (CTT). The CTT is in a reserved location in the main memory and is maintained by the memory controller. Each 1 KB block's real address (i.e. the address on the Pentium III front side bus) maps to one CTT entry, which is 16 bytes long. A CTT entry contains four address pointers, each pointing to a 256 byte sector in the (compressed) main memory. For example, a 1 KB data block which compresses by a factor of two will occupy only two sectors in the compressed memory (512 bytes) and the CTT entry will contain two addresses pointing to those sectors. The remaining two pointers will be NULL. If the block is very compressible, for example if it contains all zeros, it will use no sectors at all. Instead, the compressed data will be stored in the CTT entry itself (which is 16 bytes long). So in the best case, for example, a 4 KB page filled with zeros will occupy only 64 bytes in the memory!

One good thing about this is that all of the operations are performed by the memory controller with no software intervention at all. Standard hardware parts, peripherals and software run as-is with no changes.

There is a set of nifty hardware functions called "fast page operations". The memory controller exposes them via memory mapped control registers. A fast-page-op clears or moves a 4 KB page almost instantly, about 8 to 10 times faster than a Pentium III using memset or memcpy. This is possible because fast-page-ops do not really move bulk data as the processor does. Instead, fast-page-ops update the pointers in the CTT entries.
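The sector and CTT arithmetic above can be checked with a small toy model. This is only a sketch under stated assumptions: the sizes (1 KB block, 256 byte sector, 16 byte entry, four pointers) come from the description, but the names CTTEntry, make_entry, page_footprint and fast_clear are made up for illustration, the real CTT entry layout is not modeled, and a compressed length of 0 simply stands in for "stored inline in the CTT entry".

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 1024       # compression block size, from the text
SECTOR_SIZE = 256       # storage unit in compressed main memory, from the text
CTT_ENTRY_SIZE = 16     # bytes per CTT entry, from the text
PAGE_SIZE = 4096        # i386 page size
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE
PTRS_PER_ENTRY = BLOCK_SIZE // SECTOR_SIZE   # four sector pointers per entry


@dataclass
class CTTEntry:
    """Toy CTT entry: up to four sector addresses, None standing in for NULL."""
    sectors: List[Optional[int]] = field(
        default_factory=lambda: [None] * PTRS_PER_ENTRY)

    def sectors_used(self) -> int:
        return sum(p is not None for p in self.sectors)


def make_entry(compressed_len: int, next_free: int) -> CTTEntry:
    """Build an entry for a 1 KB block that compressed to compressed_len bytes.

    compressed_len == 0 models a block held inline in the entry (all zeros).
    next_free is a made-up address for the first sector handed out.
    """
    if not 0 <= compressed_len <= BLOCK_SIZE:
        raise ValueError("compressed length must be within one block")
    n = -(-compressed_len // SECTOR_SIZE)        # ceiling division
    ptrs = [next_free + i * SECTOR_SIZE for i in range(n)]
    return CTTEntry(ptrs + [None] * (PTRS_PER_ENTRY - n))


def page_footprint(entries: List[CTTEntry]) -> int:
    """Physical bytes behind one 4 KB page: its CTT entries plus its sectors."""
    if len(entries) != BLOCKS_PER_PAGE:
        raise ValueError("a 4 KB page maps to four CTT entries")
    used = sum(e.sectors_used() for e in entries)
    return len(entries) * CTT_ENTRY_SIZE + used * SECTOR_SIZE


def fast_clear(entries: List[CTTEntry]) -> int:
    """Model a fast page clear: drop every pointer, return the sectors freed."""
    freed = sum(e.sectors_used() for e in entries)
    for e in entries:
        e.sectors = [None] * PTRS_PER_ENTRY
    return freed


# A page whose four blocks each compress by a factor of two.
half = [make_entry(512, 0x10000 + i * BLOCK_SIZE) for i in range(BLOCKS_PER_PAGE)]
print(page_footprint(half))   # 2112 = 4 entries * 16 + 8 sectors * 256
print(fast_clear(half))       # 8 sectors released, no data copied
print(page_footprint(half))   # 64, the all-zero page case from the text
```

The model counts the CTT entries as part of the footprint, which is why an all-zero page comes out at 64 bytes and an incompressible page at slightly more than 4 KB.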
Clearing a 4 KB page merely sets a few bits in each 16 byte CTT entry and replaces the sector pointers with a compression code-word which says "this 4 KB page contains all zeros" :-) Likewise, moving a 4 KB page merely updates a few CTT entries. We benchmarked the fast zero page op in the 2.2 kernel by re-#defining clear_bigpage(), so that copy_cow_page() in mm/memory.c used fast zeroing instead of memset. We then observed that the kernel could fork large memory processes 2.5 times faster, because the kernel must clear cow pages before handing them out to the processes. There are quite a few places where the 2.4 kernels can be sped up using fast-page-ops. However, I will leave that discussion for another time.

Performance Impact and Compressibility
--------------------------------------

We ran a few CPU benchmarks to measure the impact of the L3 cache. We found that the L3 has a negligible performance impact due to its relatively large size (32 MB). When the working set of a benchmark fit in the L3, it ran slightly faster on MXT boxes than on standard boxes, due to the Double Data Rate (DDR) SDRAM used in the L3 cache. If the working set of the benchmark didn't fit in the L3 cache, then the MXT system ran slightly slower. However, in either case, the performance impact of the L3 was negligible. There are a few reports detailing these benchmarks at URLs [1,2].

Benchmarks with large memory requirements benefited from MXT significantly since the memory size was doubled. A database benchmark took 66% less time on an MXT system with 1 GB of installed memory (2 GB effective) than on a standard system with 1 GB of memory. We also ran a well known webserver benchmark on an MXT box with the TUX 2.0 webserver. We observed that the webserver throughput doubled due to the 2x effective memory size. See URL [3] below for a chart summarizing this benchmarking exercise and for some cost comparisons.

So, how compressible are the main memory contents?
We have overwhelming data indicating that main memory contents are generally 2x compressible or better. We sampled the memory contents of quite a few systems running different apps and benchmarks and found that they compress better than 2x. We mirrored the contents of about a dozen web sites and found that 2x compression is generally the rule. See the report [2] below for this data.

Those in the know immediately raise the issue of the compressed GIF and JPEG files that exist on a typical webserver today. Since graphics files are already compressed, they jump to the conclusion that the memory contents of a typical web server cannot be compressed further. Measurements do not support this conclusion. It all depends on the file sizes. GIFs and JPEGs found on web servers are typically small files, smaller than the 4 KB page size on i386. We think that due to fragmentation in memory (since files are memory mapped) even a 100 byte file occupies 4 KB and ends up being very compressible. To verify this we populated a file system with hundreds of thousands of incompressible 100 byte files. We then copied these files to /dev/null to bring them into the page cache. We measured the compressibility of the main memory to be 4.4x (compared to the 1.0 expected compressibility!). So, it appears that when these small files are brought into memory they take up more than 100 bytes because of their associated filesystem structures, page fragmentation, and all the other crud that comes off the disk due to the minimum 1 KB disk transfer size. The report in URL [2] explains this observation in more detail. Perhaps someone with filesystem expertise has a better explanation.

If you want to see the compressibility of your own system, try the graphical monitor xcompress available at URL [4]. Xcompress opens /dev/mem, samples the main memory contents and estimates system compressibility, as if this were an MXT box. It produces a number between 0.0 and 1.0. Smaller is better.
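The small-file effect and the "smaller is better" ratio can be illustrated with a rough sketch. Everything here is an assumption made for illustration: zlib stands in for the MXT hardware compressor (it is a different algorithm), the function names compressed_fraction and page_holding_small_file are made up, this is not how xcompress is implemented, and the page is modeled as 100 random bytes followed by zero padding, which ignores the filesystem structures and other disk contents mentioned above. The numbers it prints are therefore not comparable to the 4.4x measurement.

```python
import os
import zlib

PAGE_SIZE = 4096    # i386 page size
BLOCK_SIZE = 1024   # MXT compression block size, from the text


def compressed_fraction(memory: bytes) -> float:
    """Compressed size / original size, taken over 1 KB blocks (0.0 to 1.0)."""
    if not memory or len(memory) % BLOCK_SIZE:
        raise ValueError("need a non-empty multiple of the block size")
    compressed = 0
    for off in range(0, len(memory), BLOCK_SIZE):
        block = memory[off:off + BLOCK_SIZE]
        # A block that does not compress is counted at full size, never more.
        compressed += min(len(zlib.compress(block)), BLOCK_SIZE)
    return compressed / len(memory)


def page_holding_small_file(file_len: int = 100) -> bytes:
    """One page-cache page: incompressible file bytes, then zero padding."""
    return os.urandom(file_len) + bytes(PAGE_SIZE - file_len)


# 256 pages, each holding one incompressible 100 byte file.
pages = b"".join(page_holding_small_file() for _ in range(256))
ratio = compressed_fraction(pages)
print(f"fraction {ratio:.3f}, about {1 / ratio:.1f}x compressible")

# The same amount of purely random data stays close to 1.0.
print(f"random data fraction {compressed_fraction(os.urandom(len(pages))):.3f}")
```

Even though the file contents themselves do not compress at all, the mostly empty pages that hold them do, which is the point of the experiment.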
For example, 0.5 means that memory is compressible by a factor of 2 (1/0.5). Please post your results here or send them to me and I will post them on the website for everybody to see.

[1] http://oss.software.ibm.com/developerworks/opensource/mxt/
[2] http://oss.software.ibm.com/developerworks/opensource/mxt/publications/mxtperformance.pdf
[3] http://oss.software.ibm.com/developerworks/opensource/mxt/publications/mxtpriceperf.pdf
[4] http://oss.software.ibm.com/developerworks/opensource/mxt/tools/xcompress.tar.gz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/