
Date:   Mon, 16 Aug 1999 18:29:30 +0200 (CEST)
From:   Andrea Arcangeli <andrea@suse.de>
To:     Linus Torvalds <torvalds@transmeta.com>,
Subject: [bigmem-patch] 4GB with Linux on IA32

NOTE: all the code in the patch included in this email has been
co-developed by SuSE and Siemens (specifically by me at SuSE and Gerhard
Wichert at Siemens).

Object of the patch:

	Allow you to use close to 4giga of memory as anonymous and shm
	memory on IA32.

Performance degradation:

	Close to zero.

Missing feature:

	The page/buffer cache (and so all shared/private non-anonymous
	mappings) can only grow up to about 4giga-PAGE_OFFSET bytes of
	RAM (2giga if CONFIG_2G is selected, 1giga if CONFIG_1G is
	selected).

Implementation details:

	Basically we allow GFP to return addresses without a valid
	virtual->physical mapping. Such pages are the bigmem pages, and they
	have a valid page_map just like the regular pages. The bigmem
	pages have the PG_BIGMEM bitflag set in the page->flags field.
	The bigmem pages are completely equivalent to the regular pages,
	with the only difference that we can't access them by just touching
	the virtual address returned by GFP. So to do COW or clear_page with
	bigmem pages we first need to create a proper virt-to-phys mapping
	in the fixmap area, and then we read or write the phys-page
	through the virt-fixmap area. After the COW or
	after the shm/anonymous allocation the physical page is mapped
	in the userspace pte, and there is no difference for
	userspace between bigmem and regular pages.
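
	The mechanism above can be modelled in plain userspace C. This is
	only a toy sketch, not code from the patch: PG_BIGMEM is the flag
	named above, but its bit value, the two-slot fixmap, and every
	sketch_* name are assumptions made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define PG_BIGMEM 0x1u   /* flag name from the text; the bit value is made up */
#define NR_PAGES  4

/* Hypothetical stand-in for one physical frame plus its page->flags. */
struct page {
    unsigned int  flags;
    unsigned char phys[PAGE_SIZE];   /* models the frame contents */
};

struct page mem_map[NR_PAGES];

/* Two hypothetical fixmap slots, so a copy can map source and destination. */
enum { SLOT_SRC, SLOT_DST, NR_SLOTS };
struct page *fixmap_pte[NR_SLOTS];   /* which frame each slot points at */
unsigned long fixmap_remaps;         /* counts "rewrite pte + invlpg" pairs */

/* Return an address through which the page can be read or written. */
unsigned char *sketch_kmap(struct page *p, int slot)
{
    if (!(p->flags & PG_BIGMEM))
        return p->phys;              /* regular page: direct mapping is valid */
    assert(slot >= 0 && slot < NR_SLOTS);
    fixmap_pte[slot] = p;            /* bigmem page: repoint the fixmap pte */
    fixmap_remaps++;                 /* and flush the stale TLB entry (invlpg) */
    return fixmap_pte[slot]->phys;
}

/* clear_page for either kind of page. */
void sketch_clear_page(struct page *p)
{
    memset(sketch_kmap(p, SLOT_DST), 0, PAGE_SIZE);
}

/* The copy step of COW for either kind of page. */
void sketch_cow_copy(struct page *dst, struct page *src)
{
    unsigned char *from = sketch_kmap(src, SLOT_SRC);
    unsigned char *to   = sketch_kmap(dst, SLOT_DST);
    memcpy(to, from, PAGE_SIZE);
}
```

	In this model a regular page costs only the flag check, while each
	bigmem page touched costs one extra remap, which is the shape of
	the overhead described for the page-fault handler.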

	The only tiny performance degradation is in the
	page-fault handler: a check for the bigmem page and, if the page is a
	bigmem page, a remap of the fixmap pte and an invlpg of the fixmap
	virtual address. I believe this little performance degradation
	will not even be noticeable.
	And once the allocation is complete there won't be any
	performance degradation at all.

	The reason we don't allow the bigmem pages to live in cache is
	that the cache must be read and written by the buffer/page
	cache code and by the block device lowlevel code. Allowing
	the bigmemory to live in the buffer/page/swap cache would be
	possible, but we would have to change lots of common kernel code.
	Since we can already grow the cache up to 2giga of ram (with
	CONFIG_2G) and at the same time we may be running with 2giga of
	ram allocated in shm or malloced memory, I don't believe that it
	is worth changing all such code, adding further complexity and
	performance degradation to the I/O layer.

	To solve the swapout/swapin of bigmem pages I remap the bigmem page
	into a regular page, or I replace the swapped-in regular page
	with a bigmem page when necessary. At the same time I alloc a
	page, I also release another page. If a page is not available
	in the freelist then the swapout returns as unsuccessful
	and we continue trying to swapout or unmap some other page in the
	process space. The swapout/swapin of the bigmem pages will be a
	bit slower than the swapin/swapout of regular pages, but since I/O is
	almost always far slower than memory I believe that even this
	swapin/swapout performance degradation won't be an issue at all
	(almost certainly so if the swap-blockdevice is DMA driven ;).
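
	The swapout side of this can be sketched as a small userspace
	model. It is an illustration under assumptions, not the patch's
	code: the pool array stands in for the freelist, and struct
	sketch_page and the sketch_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical page descriptor; data[] stands in for the page contents. */
struct sketch_page {
    bool bigmem;
    bool in_use;
    unsigned char data[64];
};

/* Take the first free page of the requested kind, or NULL if none is free. */
struct sketch_page *sketch_alloc(struct sketch_page *pool, size_t n,
                                 bool want_bigmem)
{
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].in_use && pool[i].bigmem == want_bigmem) {
            pool[i].in_use = true;
            return &pool[i];
        }
    }
    return NULL;
}

/*
 * Before I/O, move a bigmem victim into a regular page that the block
 * layer can reach. Returns the page to write out, or NULL when no
 * regular page is free, in which case the caller tries another victim.
 */
struct sketch_page *sketch_prepare_swapout(struct sketch_page *pool, size_t n,
                                           struct sketch_page *victim)
{
    assert(victim->in_use);
    if (!victim->bigmem)
        return victim;               /* regular page: nothing to do */

    struct sketch_page *reg = sketch_alloc(pool, n, false);
    if (reg == NULL)
        return NULL;                 /* swapout reported as unsuccessful */

    memcpy(reg->data, victim->data, sizeof reg->data);
    victim->in_use = false;          /* one page allocated, one released */
    return reg;
}
```

	The swapin direction would be the mirror image (read into a regular
	page, then move the data to a bigmem page); when exactly that
	replacement happens is not spelled out above, so it is left out of
	the sketch.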

How to use the patch:

1)	Grab and extract the 2.3.13 kernel.
	(ftp://ftp.kernel.org/pub/linux/kernel/v2.3/linux-2.3.13.tar.gz)
2)	Apply the attached patch on top of it.
3)	Configure the kernel with CONFIG_BIGMEM enabled.
4)	Recompile, install the new binary kernel image, reboot and enjoy ;).

CONFIG_1G/CONFIG_2G settings:

o	If you want to allow a task to grow up to 3giga of shm or
	anonymous virtual memory then select CONFIG_1G. (the remaining giga
	of ram can still be used by the other tasks of course)
	NOTE: Selecting CONFIG_1G will allow you to alloc only 1giga of
	ram as cache, and therefore as private/shared mmaps.
o	If you want to alloc up to 2giga of ram in cache then select
	CONFIG_2G. But then the maximum virtual size of a task will be
	limited to 2giga. (the other 2giga of RAM can be used by other
	processes as usual)
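
The trade-off between the two settings is just arithmetic on
PAGE_OFFSET. A minimal sketch, assuming PAGE_OFFSET sits at exactly
4giga minus the kernel's share (which is what the "4giga-PAGE_OFFSET"
figure above implies); the sketch_* names are made up, and the results
are upper bounds, since the real limits are only "close to" them:

```c
#include <assert.h>
#include <stdint.h>

#define GIGA (UINT64_C(1) << 30)

/* PAGE_OFFSET implied by giving the kernel kernel_giga of the 4giga space. */
uint64_t sketch_page_offset(unsigned kernel_giga)
{
    assert(kernel_giga >= 1 && kernel_giga <= 3);
    return 4 * GIGA - (uint64_t)kernel_giga * GIGA;
}

/* Upper bound for page/buffer cache: what the kernel can map directly. */
uint64_t sketch_max_cache_bytes(uint64_t page_offset)
{
    return 4 * GIGA - page_offset;
}

/* Upper bound for the virtual size of one task: everything below PAGE_OFFSET. */
uint64_t sketch_max_task_bytes(uint64_t page_offset)
{
    return page_offset;
}
```

With CONFIG_1G, sketch_page_offset(1) gives 0xC0000000, so one task can
reach 3giga and the cache 1giga; with CONFIG_2G, sketch_page_offset(2)
gives 0x80000000 and both limits are 2giga.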

Testing:

	I personally did most of the testing with 32mbyte of ram 8-). To
	test the bigmem code on lowmemory machines you simply need to
	set CONFIG_BIGMEM and recompile, since if your machine has less than
	1giga of ram, then part of your memory will be considered as
	bigmemory even if it could have a valid virtual-physical mapping
	inside the 4mbyte kernel pagetables. So even if you have a lowmemory
	machine you'll be able to test the code equally well.

	BTW, the patch has some debugging code enabled, so if you are
	going to run precise benchmarks please #undef KMAP_DEBUG in
	include/asm-i386/bigmem.h .

	Of course the code has also been tested on an amazing 4giga
	machine:

>2.3.13 with bigmem patch is running on the 4GB machine.
>Meminfo after boot shows
>
>        total:    used:    free:  shared: buffers:  cached:
>Mem:  4079742976 118927360 3960815616        0 12242944 66252800
>Swap: 134209536        0 134209536
>MemTotal:   3984124 kB
>MemFree:    3867984 kB
>MemShared:        0 kB
>Buffers:      11956 kB
>Cached:       64700 kB
>BigTotal:   3128320 kB
>BigFree:    3114564 kB
>SwapTotal:   131064 kB
>SwapFree:    131064 kB
>
>and after launching 55 "animate dna.miff" (did you ever do this on a linux
>machine? ;))
>
>        total:    used:    free:  shared: buffers:  cached:
>Mem:  4079742976 3357941760 721801216        0 12255232 68075520
>Swap: 134209536        0 134209536
>MemTotal:   3984124 kB
>MemFree:     704884 kB
>MemShared:        0 kB
>Buffers:      11968 kB
>Cached:       66480 kB
>BigTotal:   3128320 kB
>BigFree:          0 kB
>SwapTotal:   131064 kB
>SwapFree:    131064 kB
>
>Gerhard.
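
The two meminfo dumps can be read with a little arithmetic. A small
sketch, assuming that MemTotal and MemFree include the bigmem pages
(the numbers are consistent with that, but it is an assumption); the
struct and the sketch_* helpers are made-up names:

```c
#include <assert.h>

/* The four meminfo fields used here, in kB. */
struct sketch_meminfo {
    unsigned long mem_total_kb;
    unsigned long mem_free_kb;
    unsigned long big_total_kb;
    unsigned long big_free_kb;
};

/* Memory that is not bigmem, i.e. the regular pages. */
unsigned long sketch_regular_total_kb(const struct sketch_meminfo *m)
{
    assert(m->mem_total_kb >= m->big_total_kb);
    return m->mem_total_kb - m->big_total_kb;
}

/* Free memory that is not bigmem. */
unsigned long sketch_regular_free_kb(const struct sketch_meminfo *m)
{
    assert(m->mem_free_kb >= m->big_free_kb);
    return m->mem_free_kb - m->big_free_kb;
}

/* Bigmem currently in use. */
unsigned long sketch_big_used_kb(const struct sketch_meminfo *m)
{
    assert(m->big_total_kb >= m->big_free_kb);
    return m->big_total_kb - m->big_free_kb;
}
```

Under that assumption the machine has 855804 kB of regular memory.
After the 55 animate processes all 3128320 kB of bigmem are in use
while 704884 kB of regular memory are still free (753420 kB were free
after boot), so the anonymous memory went to the bigmem pages first.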

IMHO it would be nice if our bigmem patch were included in the
official 2.3.x. It doesn't need heavy common code changes and we could
clean up the code still more by moving some code out of the #ifdef
CONFIG_BIGMEM.

Comments, questions, or incremental patches are welcome of course :).

Thanks.

Andrea

