Date: Wed, 3 Nov 1999 17:25:34 +0100 (CET) From: Ingo Molnar <mingo@chiara.csoma.elte.hu> To: Neil Conway <njcmail@fusrs5a.culham.ukaea.org.uk> Subject: The 64GB memory thing On Wed, 3 Nov 1999, Neil Conway wrote: > The recent thread about >4GB surprised me, as I didn't even think >2GB > was very stable yet. Am I wrong? Are people out there using 4GB > boxes with decent stability? I presume it's a 2.3 feature, yes? the 64GB stuff got included recently. It's a significant rewrite of the lowlevel x86 MM and generic MM layer, here is a short description about it: my 'HIGHMEM patch' went into the 2.3 kernel starting at pre4-2.3.23. This means a heavily rewritten VM subsystem (to deal with pte's bigger than machine word size), and a much rewritten x86 memory and boot architecture. In fact there is no bigmem anymore, it has been replaced by (the i think more correct term) 'high memory'. It utilizes 3-level page tables on PPro+ CPUs called 'Physical Address Extension' (PAE) mode. In PAE mode the CPU uses a completely different and incompatible page table structure, which is 3-level and has 64-bit page table entries and cover up to 64GB physical RAM. Virtual space is still unchanged, 4GB. Highmem is completely transparent to user-space. There is a new 'High Memory Support' option under 'Processor type and features': High Memory Support ( ) off ( ) 4GB (X) 64GB 'off' is up to 1GB RAM, utilizing 2-level page tables and no highmem support. '4GB' is utilizing 2-level page tables and high memory support for any <4GB physical RAM that cannot be permanently mapped by the kernel. '64GB' mode utilizes 3-level page tables (for everything). The theoretical limit of high memory on IA32 boxes is 16 TB - there is lots of space in 64-bit PAE pte's, although current CPUs only support up to 64GB RAM. (the biggest current chipsets supports up to 32GB RAM) about the structure of the patch/feature itself, kernel internals: pgtable.h got split up into pgtable-2level.h and pgtable-3level.h, which should be the only 'global #ifdef' distincting 3-level from 2-level page tables on x86. There were lots of assumptions throughout the arch/i386 tree that assumed 2-level page tables, these places all had to be fixed and converted to 'generic 3-level page table code'. There are only a few CONFIG_X86_PAE #ifdefs left, i intend to cut down the number of these even more, to keep the x86 lowlevel MM/boot code easy to maintain. the generic kernel was almost safe wrt. 3-level page tables, but nevertheless it had bugs which only triggered in PAE mode. For example, one pgd entry in PAE mode covers 1GB of virtual memory, and some loops which iterated through virtual memory had buggy exit conditions and broke in subtle ways when they were running in the upper-most 1GB of virtual memory. (ie. kernel space) There were about 20 of such buggy loops throughout the MM code. the much bigger generic change was that pte's got 64-bit, although the architecture itself was still 32-bit. Lots of VM-internal code had to be reworked to never assume that 'sizeof(pte_t) == sizeof(unsigned long)'. Examples are the swapping code, IPC shared memory. Bigger than machine-word ptes were not supported by Linux previously. also i guess many of you have noticed the new mm/bootmem.c allocator - this was necessery because on my 8GB box mem_map is more than 100 MB (!), and the 'naive' boot-time allocation we did in earlier kernels simply did not work on 'slightly noncontinous' physical maps like my box has. (at 64MB there is an ACPI area which caused problems) [this short description should give you a scope of the changes, and i/we are still fixing some of the impact in 2.3.25. (Christoph just posted his problems with IPC shared memory)] Backporting to 2.2: while the bigmem patch was small and simple and got backported to 2.2, the highmem patch is basically impossible to be backported in a maintainable way as it touches some 60 files all over the kernel. 64 GB PAE mode works just fine on my 8GB RAM, 8-way Xeon box: 11:25pm up 5 min, 2 users, load average: 7.78, 4.30, 1.77 30 processes: 21 sleeping, 9 running, 0 zombie, 0 stopped CPU states: 0.0% user, 7.2% system, 92.8% nice, 0.0% idle Mem: 8241152K av,7720960K used, 520192K free, 0K shrd, 2168K buff Swap: 0K av, 0K used, 0K free 9756K cached PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND 215 root 19 19 1000M 1.0G 232 R N 1.0G 11.6 12.4 0:11 db_serv 180 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:31 db_serv 182 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:30 db_serv 183 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:30 db_serv 184 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:30 db_serv 185 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:30 db_serv 186 root 19 19 1000M 1.0G 232 R N 1.0G 11.5 12.4 0:29 db_serv 247 root 19 19 500M 500M 232 R N 0 11.5 6.2 0:04 db_serv2 181 root 1 0 996 996 828 R 0 7.2 0.0 0:16 top 177 root 0 0 984 984 768 S 0 0.1 0.0 0:00 bash 1 root 0 0 476 476 408 S 0 0.0 0.0 0:00 init (these are 8x ~1G-RSS processes using up all 8GB physical RAM, one running on each CPU) future plans: right now high memory is seriously underused on typical servers due to the page cache still being in low memory. On 2.2 with bigmem the lowmem:highmem ratio is around 5:1, this means that the 'effective' size of my 8GB box in 2.2 is only ~2.4GB. The exception are workloads where most memory is allocated as shared memory or user memory, but this is not the case for a typical web or fileserver. On my box the pagecache is already in high memory (and we are ready to add 64-bit DMA to device drivers), and the lowmem:highmem ratio is up to 1:10. This means that 8GB RAM is already fully utilized on a typical server workload. Ingo ps. to have correct memory statistics (top, vmstat, free) with >4GB RAM you need the newest procps package (or i can send the patches). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/