From: Erich Focht <efocht@ess.nec.de>
To: LSE <lse-tech@lists.sourceforge.net>
Subject: Node Affine NUMA scheduler, updated
Date: Wed, 1 May 2002 12:22:05 +0200
Cc: "linux-kernel" <linux-kernel@vger.kernel.org>

Hi,

an updated patch for the node affine NUMA scheduler extension based on
the O(1) scheduler can be found at
  http://home.arcor.de/efocht/sched/Nod15_O1-2.4.18.patch
Detailed information on the implementation is at
  http://home.arcor.de/efocht/sched .

What's new:

The topology information has been updated and now supports the following
ccNUMA platforms:
 - IBM NUMA-Q   (i386, thanks to Matt Dobson)
 - SGI SN1/2    (ia64, thanks to Jesse Barnes)
 - NEC AzusA    (ia64)
No other i386 platforms have been tested yet.

The topology info now uses the notions of logical and physical node, and
the variables are protected by a rw_lock. This was a must for the
integration with the cpu-hotplug patch.

There are two configuration options that control how the scheduler works:
 - CONFIG_NUMA_SCHED : switches on the pooling scheduler (otherwise it
   behaves like the O(1) scheduler, though it looks different).
 - CONFIG_NODE_AFFINE_SCHED : tasks remember their homenode and are
   attracted back to it.
For platforms with a big node-level cache it might be better to configure
only CONFIG_NUMA_SCHED=y and leave CONFIG_NODE_AFFINE_SCHED undefined.
This is preferable when the penalty for thrashing the node-level cache is
bigger than the benefit of running on the right node (where the memory is
allocated).

I added a variable node_policy to the task structure which is inheritable
and decides on the initial load balancing. There is a prctl interface to
change it from userland; a utility called nodpol is available on the web
page. The possible values for node_policy are:
  0 (default) : select homenode in do_exec(),
  1           : select homenode in do_fork() only if CLONE_VM is unset,
  2           : select homenode in do_fork() (always).
It is mainly meant for experiments and benchmarks. Some benchmarks (e.g.
AIM7, which simulates large loads) only fork but don't exec, so the
default homenode selection mechanism doesn't apply and the load balance
is bad right from the start. In real life one should simply check whether
multithreaded jobs need to be distributed across multiple nodes or are
better off taking their memory from a single node, and change the
node_policy accordingly before starting them. The default behavior should
be fine otherwise.

On the web page I included some results showing the performance increase
with the node affine scheduler and demonstrating its functionality.
Basically it works fine for medium and high loads but has some trouble
with low loads. This is because a task running alone on a CPU of a remote
node cannot be stolen by CPUs on the homenode: load_balance() is called
in places where the only mechanism for moving a currently running task
(the migration_thread) cannot be used. Any ideas (besides a signal) are
welcome. The initial load balancing can be improved, too; a better
measure of load would help.

Thanks in advance for your feedback. I'm especially curious about results
for the affinity_test on platforms other than the NEC AzusA.

Best regards,
Erich
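
[A rough userland sketch of the node_policy interface described above. This
is not the nodpol utility from the web page: the prctl command number
PR_SET_NODE_POLICY below is a made-up placeholder, since the real constant
is defined only in the patch itself. It only illustrates the general shape
of setting an inheritable node_policy before exec'ing a job.]

    /*
     * Hedged sketch, not the actual nodpol tool: sets a node_policy value
     * (0, 1 or 2, as listed in the announcement) via prctl, then execs the
     * given command.  PR_SET_NODE_POLICY is a placeholder constant; the
     * real command number comes from the node affine scheduler patch.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_NODE_POLICY
    #define PR_SET_NODE_POLICY 0x4e01   /* placeholder value, see the patch */
    #endif

    int main(int argc, char **argv)
    {
        long policy;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <policy 0|1|2> <command> [args...]\n",
                    argv[0]);
            return 1;
        }

        /* 0: homenode chosen in do_exec(), 1: in do_fork() unless CLONE_VM
         * is set, 2: always in do_fork() -- values as described above. */
        policy = strtol(argv[1], NULL, 0);

        if (prctl(PR_SET_NODE_POLICY, policy) < 0) {
            perror("prctl");
            return 1;
        }

        /* node_policy is inheritable, so the exec'd job keeps it. */
        execvp(argv[2], &argv[2]);
        perror("execvp");
        return 1;
    }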