Date:	Sat, 29 Jan 2000 17:38:47 +0100 (CET)
From:	Rik van Riel <riel@nl.linux.org>
To:	Linux Kernel <linux-kernel@vger.rutgers.edu>
Subject: [PATCH] 2.3.41 scheduler change

Hi,

since the biggest penalty in scheduling is repopulating the
cache with (user) pages from the newly scheduled process, I've
written a scheduler patch for 2.3.41 that tries to optimise
for that.

The patch doesn't use p->counter for priority purposes, which
means we should incur fewer cache misses in the following
(quite common?) situation:

- you have a number of cpus, N (N = 1, 2, whatever)
- you have N+1 processes using a fair amount of CPU
- you have vi, pine, mutt, whatever using very little CPU

When you press a key in vi and the currently running process
gets descheduled, it will have a lower priority than the other
runnable process (because it has used part of its timeslice),
and immediately after vi the _other_ process gets scheduled.
This can result in unneeded cache thrashing.
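
To make that concrete, here is a toy model (not the kernel code;
one CPU, SCHED_OTHER only, no MM bonus, made-up task values) of
the stock goodness() calculation, weight = counter + priority:

	#include <stdio.h>

	/* Toy model of stock 2.3.41 goodness() for SCHED_OTHER
	   on one CPU, without the same-MM bonus. */
	struct task { const char *name; int counter, priority; };

	static int goodness(const struct task *t)
	{
		if (!t->counter)	/* timeslice used up */
			return 0;
		return t->counter + t->priority;
	}

	int main(void)
	{
		/* Two CPU hogs; A was running and burned half its slice. */
		struct task a = { "hog A", 10, 20 };	/* cache-warm */
		struct task b = { "hog B", 20, 20 };	/* cache-cold, full slice */

		/* vi wakes, handles the keypress, sleeps; we reschedule: */
		printf("goodness(%s) = %d\n", a.name, goodness(&a));	/* 30 */
		printf("goodness(%s) = %d\n", b.name, goodness(&b));	/* 40 */
		/* B wins only because A spent some ticks, so B's pages
		   get faulted into the cache even though A's working
		   set is already there. */
		return 0;
	}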

My patch calculates a slower-moving priority, so that short-term
CPU use (within one timeslice) doesn't influence priority. This
means the _same_ process will be rescheduled after vi goes back
to sleep, which should reduce cache thrashing and increase
efficiency.

I haven't done many performance tests with this scheduler yet,
but xmms seems to use about 20% less CPU with this patch than
without ... however, this is on a dual Pentium with a shared
L2 cache. I suspect that a dual Celeron or Xeon will see the
biggest benefit from this patch.

I'm interested in any comments on, and measurements of, this
patch. p->counter and p->priority have been changed into shorts
because the short->int conversion is most probably much cheaper
than an extra cache miss ... any numbers on this are very much
welcome too.
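
As a quick sanity check of the layout claim (assuming i386 ABI
sizes: short = 2 bytes, long = 4; field names mirror the patch):

	#include <stdio.h>

	/* The two shorts plus the new long occupy the same 8 bytes
	   the two longs did, so dynpriority is added without pushing
	   the other hot fields off the "intel cache line". */
	struct before { long counter, priority; };
	struct after  { short counter, priority; long dynpriority; };

	int main(void)
	{
		printf("before: %zu bytes\n", sizeof(struct before)); /* 8 on i386 */
		printf("after:  %zu bytes\n", sizeof(struct after));  /* 8 on i386 */
		return 0;
	}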

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.


--- linux-2.3.41/fs/proc/array.c.orig	Sat Jan 29 03:55:49 2000
+++ linux-2.3.41/fs/proc/array.c	Sat Jan 29 14:47:58 2000
@@ -317,7 +317,7 @@
 
 	/* scale priority and nice values from timeslices to -20..20 */
 	/* to make it look like a "normal" Unix priority/nice value  */
-	priority = task->counter;
+	priority = (task->dynpriority >> (SHIFT_PRIORITY - 1));
 	priority = 20 - (priority * 10 + DEF_PRIORITY / 2) / DEF_PRIORITY;
 	nice = task->priority;
 	nice = 20 - (nice * 20 + DEF_PRIORITY / 2) / DEF_PRIORITY;
--- linux-2.3.41/kernel/sched.c.orig	Sat Jan 29 02:52:17 2000
+++ linux-2.3.41/kernel/sched.c	Sat Jan 29 15:42:46 2000
@@ -110,7 +110,7 @@
 
 static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
 {
-	int weight;
+	int weight = 0;
 
 	/*
 	 * Realtime process, select the first one on the
@@ -123,16 +123,17 @@
 	}
 
 	/*
-	 * Give the process a first-approximation goodness value
-	 * according to the number of clock-ticks it has left.
-	 *
-	 * Don't do any other calculations if the time slice is
-	 * over..
+	 * Don't do any calculations if the time slice is over..
 	 */
-	weight = p->counter;
-	if (!weight)
+	if (!p->counter)
 		goto out;
 			
+	/*
+	 * Our basic priority; we can get some bonuses later.
+	 * All of this is done to be cache friendly.
+	 */
+	weight = p->dynpriority >> SHIFT_PRIORITY;
+
 #ifdef __SMP__
 	/* Give a largish advantage to the same processor...   */
 	/* (this is equivalent to penalizing other processors) */
@@ -141,9 +142,8 @@
 #endif
 
 	/* .. and a slight advantage to the current MM */
-	if (p->mm == this_mm)
+	if (p->active_mm == this_mm)
 		weight += 1;
-	weight += p->priority;
 
 out:
 	return weight;
@@ -173,7 +173,7 @@
  */
 static inline int preemption_goodness(struct task_struct * prev, struct task_struct * p, int cpu)
 {
-	return goodness(p, cpu, prev->mm) - goodness(prev, cpu, prev->mm);
+	return goodness(p, cpu, prev->active_mm) - goodness(prev, cpu, prev->active_mm);
 }
 
 /*
@@ -609,7 +609,11 @@
 		spin_unlock_irq(&runqueue_lock);
 		read_lock(&tasklist_lock);
 		for_each_task(p)
-			p->counter = (p->counter >> 1) + p->priority;
+			/* we won't dirty the cacheline when we don't have to */
+			if (p->dynpriority != (p->priority << SHIFT_PRIORITY) || p->counter != p->priority) {
+				p->dynpriority = p->dynpriority - (p->dynpriority >> SHIFT_PRIORITY) + p->counter;
+				p->counter = p->priority;
+			}
 		read_unlock(&tasklist_lock);
 		spin_lock_irq(&runqueue_lock);
 	}
--- linux-2.3.41/kernel/fork.c.orig	Sat Jan 29 13:04:43 2000
+++ linux-2.3.41/kernel/fork.c	Sat Jan 29 15:19:08 2000
@@ -714,6 +714,8 @@
 	 */
 	current->counter >>= 1;
 	p->counter = current->counter;
+	current->dynpriority -= (current->dynpriority >> SHIFT_PRIORITY);
+	p->dynpriority = current->dynpriority;
 
 	/*
 	 * Ok, add it to the run-queues and make it
--- linux-2.3.41/include/linux/sched.h.orig	Sat Jan 29 02:46:03 2000
+++ linux-2.3.41/include/linux/sched.h	Sat Jan 29 04:46:01 2000
@@ -273,8 +273,9 @@
 	cycles_t avg_slice;
 	int lock_depth;		/* Lock depth. We can context switch in and out of holding a syscall kernel lock... */	
 /* begin intel cache line */
-	long counter;
-	long priority;
+	short counter;
+	short priority;
+	long dynpriority;
 	unsigned long policy;
 /* memory management info */
 	struct mm_struct *mm, *active_mm;
@@ -385,6 +386,7 @@
 #define _STK_LIM	(8*1024*1024)
 
 #define DEF_PRIORITY	(20*HZ/100)	/* 200 ms time slices */
+#define SHIFT_PRIORITY	6		/* priority calculation falloff */
 
 /*
  *  INIT_TASK is used to set up the first task table, touch at
@@ -393,7 +395,7 @@
 #define INIT_TASK(name) \
 /* state etc */	{ 0,0,0,KERNEL_DS,&default_exec_domain,0, \
 /* avg_slice */	0, -1, \
-/* counter */	DEF_PRIORITY,DEF_PRIORITY,SCHED_OTHER, \
+/* counter */	DEF_PRIORITY,DEF_PRIORITY,(DEF_PRIORITY << SHIFT_PRIORITY),SCHED_OTHER, \
 /* mm */	NULL, &init_mm, \
 /* has_cpu */	0,0, \
 /* run_list */	LIST_HEAD_INIT(init_task.run_list), \
--- linux-2.3.41/arch/i386/kernel/process.c.orig	Sat Jan 29 13:31:30 2000
+++ linux-2.3.41/arch/i386/kernel/process.c	Sat Jan 29 13:33:08 2000
@@ -90,6 +90,7 @@
 	init_idle();
 	current->priority = 0;
 	current->counter = -100;
+	current->dynpriority = 0;
 
 	while (1) {
 		void (*idle)(void) = acpi_idle;

