Date: Thu, 29 Jun 2000 14:00:39 +0100
From: "Stephen C. Tweedie" <sct@redhat.com>
To: Andrea Arcangeli <andrea@suse.de>
Subject: Re: [PATCH] 2.2.17pre7 VM enhancement Re: I/O performance on 2.4.0-test2

Hi,

On Thu, Jun 29, 2000 at 01:55:07PM +0200, Andrea Arcangeli wrote:

> I agree, the current swap_out design is too much fragile.
>
> btw, in such area we also have a subtle hack/magic: when we unmap a clean
> page we consider that a "fail" ;), while instead we really did some kind
> of progress.

I really think we need to avoid such hacks entirely and just fix the
design.  The thing is, fixing the design isn't actually all that hard.
Rik's multi-queue stuff is the place to start (this is not a coincidence
--- we spent quite a bit of time talking this through).

Aging process pages and unmapping them should be considered part of the
same job.  Removing pages from memory completely is a separate job.  I
can't emphasise this enough --- this separation just fixes so many
problems in our current VM that we really, really need it for 2.4.

Look at how such a separation affects the swap_out problem above.  We
now have two jobs to do --- the aging code needs to keep a certain
number of pages freeable on the last-chance list (whatever you happen to
call it), that number being dependent on current memory pressure.  That
list consists of nothing but unmapped, clean pages.  (A separate list
for unmapped, dirty pages is probably desirable for completely different
reasons.)

Do this and there is no longer any confusion in the swapper itself about
whether a page has been freed or not.  Either a foreground call to the
swapout code or a background kswapd loop can keep populating the
last-chance lists; it doesn't matter, because we decouple the concept of
swapout from the concept of freeing memory.  When we actually want to
free pages, we can *always* tell how much cheap page reclaim can be
done, just by looking at the length of the last-chance list.
We can play all sorts of games with this, easily.  For example, when the
real free page count gets too low, we can force all normal page
allocations to be done from the last-chance list instead of the free
list, allowing only GFP_ATOMIC allocations to use up genuine free pages.
That gives us proper flow control for non-atomic memory allocations
without all of the current races between one process freeing a page and
then trying to allocate it once try_to_free_page() has returned (right
now, an interrupt may have gobbled the page in the mean time because we
use the same list for the pages returned by swap_out as for
allocations).

I really think we need to forget about tuning the 2.4 VM until we have
such fundamental structures in place.  Until we have done that hard
work, we're fine-tuning a system which is ultimately fragile.  Any
structural changes will make the fine-tuning obsolete, so we need to get
the changes necessary for a robust VM in _first_, and then do the
performance fine-tuning.

One obvious consequence of doing this is that we need to separate out
mechanisms from policy.  With multiple queues in the VM for these
different jobs --- aging, cleaning, reclaiming --- we can separate out
the different mechanisms in the VM much more easily, which makes it far
easier to tune the policy for performance optimisations later on.  Right
now, to do policy tuning we end up playing with core mechanisms like the
flow control loops all over the place.  Nasty.

Cheers,
 Stephen

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/