From:	 Erich Focht <efocht@ess.nec.de>
To:	 LSE <lse-tech@lists.sourceforge.net>
Subject: Node Affine NUMA scheduler, updated
Date:	 Wed, 1 May 2002 12:22:05 +0200
Cc:	 "linux-kernel" <linux-kernel@vger.kernel.org>

Hi,

An updated patch for the node affine NUMA scheduler extension, based on the
O(1) scheduler, can be found at
http://home.arcor.de/efocht/sched/Nod15_O1-2.4.18.patch

Detailed information on the implementation is at 
http://home.arcor.de/efocht/sched .

What's new:
The topology information has been updated and supports the following ccNUMA 
platforms:
 - IBM NUMA-Q - i386 (thanks to Matt Dobson),
 - SGI SN1/2 - ia64 (thanks to Jesse Barnes),
 - NEC AzusA - ia64
No other i386 platforms have been tested yet.

The topology info now uses the notions of logical and physical node, and the 
variables are protected by a rw_lock. This was a must for the integration 
with the cpu-hotplug patch.
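
Just to illustrate what that means (this is a sketch with assumed names and
sizes, not the code from the patch), a hotplug-safe logical-to-physical node
lookup would take the rw_lock for reading:

  /* Illustrative sketch only -- MAX_NUMNODES, the array and the lock
   * name are assumptions, not the patch's actual definitions. */
  #include <linux/spinlock.h>

  #define MAX_NUMNODES 16                 /* assumed upper limit */

  static rwlock_t node_map_lock = RW_LOCK_UNLOCKED;
  static int logical_to_physical[MAX_NUMNODES];

  static inline int node_log_to_phys(int lnode)
  {
          int pnode;

          read_lock(&node_map_lock);
          pnode = logical_to_physical[lnode];
          read_unlock(&node_map_lock);
          return pnode;
  }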

There are two configuration variables which control how the scheduler 
works:
 - CONFIG_NUMA_SCHED : switches on the pooling scheduler (otherwise it behaves 
like the O(1) scheduler, though the code looks different).
 - CONFIG_NODE_AFFINE_SCHED: tasks remember their homenode and are attracted 
back to it.
For platforms with a big node-level cache it might be better to only configure 
CONFIG_NUMA_SCHED=y and leave CONFIG_NODE_AFFINE_SCHED undefined. That is the 
right choice if the penalty for thrashing the node-level cache is bigger than 
the benefit of running on the node where the memory is allocated.
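
For such a platform the relevant part of the .config would simply be:

  CONFIG_NUMA_SCHED=y
  # CONFIG_NODE_AFFINE_SCHED is not set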

I added a variable node_policy to the task structure; it is inheritable and 
decides on the initial load balancing. There is a prctl interface to change 
it from userland; a utility called nodpol is available on the web page. The 
possible values for node_policy are:
  0 (default) : select homenode in do_exec(),
  1           : select homenode in do_fork() only if CLONE_VM is unset,
  2           : select homenode in do_fork() (always).
It's mainly meant for experiments and benchmarks. Some benchmarks (e.g. AIM7, 
which simulates large loads) only fork but don't exec, so the default 
homenode selection mechanism doesn't apply and the load balance is bad right 
from the start. In real life one should just check whether multithreaded jobs 
need to be distributed across multiple nodes or should rather take their 
memory from one node, and change node_policy accordingly before starting 
them. The default behavior should be fine otherwise.
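
As an illustration of the prctl interface, a tiny wrapper in the spirit of
nodpol could set node_policy and then exec the workload. PR_SET_NODE_POLICY
and its value below are placeholders; the real option name and number are
defined in the patch:

  /* Hedged sketch, not the real nodpol utility. */
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_NODE_POLICY
  #define PR_SET_NODE_POLICY 0x2000       /* placeholder, see the patch */
  #endif

  int main(int argc, char **argv)
  {
          if (argc < 2) {
                  fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                  return 1;
          }
          /* policy 2: select the homenode in do_fork(), always */
          if (prctl(PR_SET_NODE_POLICY, 2, 0, 0, 0) < 0)
                  perror("prctl");
          /* node_policy is inherited, so exec the workload now */
          execvp(argv[1], argv + 1);
          perror("execvp");
          return 1;
  }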

On the web page I included some results showing the performance increase with 
the node affine scheduler and demonstrating its functionality. Basically it 
works fine for medium and high loads but has some trouble with low loads. 
This is due to the fact that a task running alone on a CPU of a remote node 
cannot be stolen by CPUs on the homenode: load_balance() is called in such 
places that the only mechanism for moving a currently running task 
(migration_thread) cannot be used. Any ideas (besides a signal) are welcome. 
The initial load balancing could be improved, too; a better measure for load 
would help.
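
To make the low-load problem concrete: load_balance() never pulls a task that
is executing at that moment, so a task running alone on a remote CPU always
fails the migration test, and only migration_thread could move it. A
simplified sketch of such a test (names approximate, not the kernel's exact
macro):

  /* Simplified illustration of the migration criterion, not the exact
   * kernel code. */
  struct task_sketch {
          int running;                    /* currently executing on a CPU */
          unsigned long cpus_allowed;     /* allowed-CPU bitmask */
          int cache_hot;                  /* cache still warm on its CPU */
  };

  static int can_pull_task(struct task_sketch *p, int this_cpu)
  {
          if (p->running)                 /* a lone remote task fails here */
                  return 0;
          if (!(p->cpus_allowed & (1UL << this_cpu)))
                  return 0;
          if (p->cache_hot)               /* freshly-run tasks are skipped too */
                  return 0;
          return 1;
  }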

Thanks in advance for your feedback. I'm especially curious about results for 
the affinity_test on platforms other than NEC AzusA.

Best regards,
Erich
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/