[LWN Logo]
[LWN.net]
From:	 Hubertus Franke <frankeh@watson.ibm.com>
To:	 "Rajan Ravindran" <rajancr@us.ibm.com>,
	 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: [PATCH] get_pid() performance fix
Date:	 Thu, 14 Mar 2002 18:18:35 -0500
Cc:	 linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net

I implemented an alternative version of getpid, that for large thread counts
( > 210000), provides "significantly" better performance as shown in attached
performance analysis. This is particulary viable for PID_MAX=32768.

-- Hubertus Franke <frankeh@watson.ibm.com>

---------------------------------------------------------------------------------

Currently the getpid algorithm works as follows:
At any given time an interval of [ last_pid .. next_safe ) is known to hold   
  
unused pids. Initially the interval is set to [0 .. 32767]

Hence to allocate a new pid the following is sufficient:

       if (++last_pid < next_safe) return last_pid;

However, if we move out of the safe interval, the next safe interval needs 
to be established first.
This is currently done by a repetive search 

repeat:
     foralltasks(p) {
        if (p uses lastpid) { last_pid++; goto repeat; }
	/* narrow [ last_pid .. next_safe ) */
	if (p->pids in [ last_pid .. next_safe ) ) next_safe = p->pid
     }

Particulary for large number of tasks, this can lead to frequent exercise of
the repeat resulting in a O(N^2) algorithm. We call this : <algo-0>.

Instead I have provided an alternative mechanism that at the time
of determining the next interval marks a bitmask by walking the tasklist
once [ O(N) ] and then finding the proper bit offsets to mark the next free
interval starting from last_pid. The bitmap requires 4096 bytes.
This is <algo-1>.

An optimization to this to keep the last bitmap instead of clearing it
with every search. Only if we fail to obtain a free range, then we have
to go back and clear the bitmap and redo the search one more time.
This is <algo-2>.

I dragged the various algorithms into a userlevel test program to figure
out where the cut off points are with PID_MAX=32768. In this testprogram
I maintain A tasks, and for 10 rounds (delete D random tasks and 
reallocate D tasks again) resulting in T=10*D total measured allocations.

Si states how many interval searches where needed for algo-i.
Gi states the average overhead per get_pid for algo-i in usecs.

Based on that one should use the current algorithm until ~ 22K tasks and
beyond that use algo-2. Only the last 15 tasks are a bit faster under algo-1.
We can safely ignore that case.

Based on that providing an adaptive implementation seems the right choice.
The patch for 2.5.7-pre1 is attached.

testprogram is attached:  
executed program example: getpid -c 2 -s 10 -e 100 -D 10 -t <0,1> 
	0 is old 1 is new algo 2.

   A     D      T |      S0           G0 |    S1        G1 |    S2        G2 
----------------------------------------------------------------------------
   10    10    80 |       1         0.34 |     1      0.59 |    1       0.81 
   20    10   100 |       1         0.30 |     1      0.49 |    1       0.64 
   30    10   100 |       1         0.29 |     1      0.55 |    1       0.65 
   40    10   100 |       1         0.35 |     1      0.51 |    1       0.65 
   50    10   100 |       1         0.35 |     1      0.54 |    1       0.67 
   60    10   100 |       2         0.38 |    21      1.95 |    2       0.79 
   70    10   100 |       1         0.39 |     1      0.59 |    1       0.76 
   80    10   100 |       1         0.41 |     1      0.62 |    1       0.76 
  100    50   500 |       2         0.22 |    63      1.26 |    2       0.30 
  150    50   500 |       3         0.24 |    12      0.56 |    4       0.36 
  200    50   500 |       3         0.27 |    56      2.26 |    5       0.46 
  250    50   500 |       2         0.26 |   119      5.63 |    6       0.54 
  300    50   500 |       3         0.32 |   148      8.73 |    9       0.76 
  350    50   500 |       5         0.45 |   168     11.51 |    6       0.76 
  400    50   500 |       4         0.44 |    90      7.28 |   10       1.10 
  450    50   500 |       6         0.61 |   143     13.08 |    7       0.97 
  500    50   500 |       6         0.65 |   100     10.47 |    7       1.06 
  550    50   500 |       5         0.63 |    71      8.10 |    9       1.34 
  600    50   500 |       7         0.86 |   115     14.32 |   14       2.04 
  650    50   500 |       8         1.00 |   112     15.08 |   13       2.07 
  700    50   500 |       8         1.06 |   127     18.12 |   10       1.79 
  750    50   500 |       8         1.26 |    62      9.73 |   15       2.73 
  800    50   500 |      11         1.68 |    92     15.14 |   12       2.42 
  850    50   500 |      14         2.03 |    78     13.73 |   13       2.67 
  900    50   500 |      21         3.17 |   102     18.74 |   27       5.18 
 1000  1000  9980 |       1         0.18 |     4      0.19 |    1       0.18 
 2000  1000 10000 |      76         1.22 |  3604     53.03 |  322       4.81 
 3000  1000 10000 |     161         3.84 |  4502    112.24 |  606      15.49 
 4000  1000 10000 |     359        11.17 |  4912    183.37 |  901      33.76 
 5000  1000 10000 |     539        23.33 |  4949    257.35 | 1165      59.91 
 6000  1000 10000 |     724        43.42 |  4918    349.59 | 1498     104.36 
 7000  1000 10000 |    1026        85.38 |  4886    447.58 | 1835     165.08 
 8000  1000 10000 |    1228       126.45 |  4870    565.29 | 2084     234.73 
 9000  1000 10000 |    1516       193.62 |  4826    699.85 | 2489     354.27 
10000  1000 10000 |    1818       289.32 |  4910    843.32 | 2763     472.47 
11000  1000 10000 |    2093       389.33 |  5005   1023.08 | 3095     629.70 
12000  1000 10000 |    2305       506.23 |  5095   1194.71 | 3277     773.06 
13000  1000 10000 |    2680       683.66 |  5289   1424.81 | 3711    1003.67 
14000  1000 10000 |    2959       853.10 |  5358   1602.05 | 3878    1172.70 
15000  1000 10000 |    3167      1037.79 |  5539   1835.64 | 4301    1436.40 
16000  1000 10000 |    3466      1272.80 |  5669   2087.03 | 4485    1635.48 
17000  1000 10000 |    3743      1539.06 |  5932   2338.50 | 4844    1924.27 
18000  1000 10000 |    4069      1869.63 |  6097   2613.60 | 5218    2232.52 
19000  1000 10000 |    4293      2183.98 |  6242   2866.34 | 5501    2519.60 
20000  1000 10000 |    4616      2607.10 |  6508   3175.90 | 5770    2823.98 
21000  1000 10000 |    4974      3119.34 |  6700   3460.95 | 6161    3183.73 
22000  1000 10000 |    5177      3609.28 |  6944   3788.19 | 6389    3492.97 
23000  1000 10000 |    5483      4214.03 |  7183   4136.25 | 6665    3823.38 
24000  1000 10000 |    5838      4971.60 |  7404   4460.62 | 6982    4199.61 
25000  1000 10000 |    6183      5880.92 |  7736   4891.80 | 7209    4546.18 
26000  1000 10000 |    6413      6829.07 |  7890   5210.85 | 7533    4939.12 
27000  1000 10000 |    6748      8132.96 |  8148   5598.19 | 7959    5442.25 
28000  1000 10000 |    7139     10065.52 |  8445   6047.42 | 8140    5767.13 
29000  1000 10000 |    7638     12967.20 |  8736   6475.23 | 8501    6250.86 
30000  1000 10000 |    8178     16991.05 |  8994   6907.40 | 8911    6791.97 
32000    50   500 |     482     26446.69 |   488   7405.63 |  487    7494.39 
32050    50   500 |     488     34769.89 |   488   7463.11 |  486    7541.61 
32100    50   500 |     489     44564.86 |   493   7593.99 |  486    7589.02 
32150    50   500 |     486     58150.58 |   487   7549.96 |  492    7731.18 
32200    50   500 |     490     64875.38 |   495   7721.82 |  497    7854.59 
32250    50   500 |     491     81790.21 |   491   7697.57 |  490    7795.12 
32300    50   500 |     489     88975.62 |   493   7763.04 |  495    7909.77 
32350    50   500 |     489    115797.38 |   492   7782.34 |  495    7967.86 
32400    50   500 |     490    120958.50 |   497   7898.45 |  496    8018.98 
32450    50   500 |     492    147541.84 |   493   7874.27 |  492    7982.34 
32500    50   500 |     493    175498.39 |   495   7940.18 |  495    8060.97 
32550    50   500 |     492    207229.29 |   496   7973.88 |  498    8134.02 
32600    50   500 |     495    267057.05 |   498   8028.86 |  498    8171.97 
32650    50   500 |     492    375722.28 |   500   8088.30 |  498    8213.85 
32700    50   500 |     497    528321.07 |   500   8110.51 |  499    8267.67 
32751     1    10 |      10    259785.80 |    10   7851.50 |   10    8549.30 
32752     1    10 |      10   1121285.60 |    10   7846.30 |   10    8556.10 
32753     1    10 |      10    383729.50 |    10   7848.60 |   10    8562.20 
32754     1    10 |      10   1061467.50 |    10   7849.80 |   10    8564.40 
32755     1    10 |      10    612726.50 |    10   7853.00 |   10    8553.90 
32756     1    10 |      10   1725559.90 |    10   7851.90 |   10    8553.00 
32757     1    10 |      10   1679818.50 |    10   7847.80 |   10    8552.10 
32758     1    10 |      10   2991838.60 |    10   7865.70 |   10    8557.20 
32759     1    10 |      10   883388.90  |    10   7859.40 |   10    8562.00 
32760     1    10 |      10   4830405.90 |    10   7850.50 |   10    9336.60 
32761     1    10 |      10   7105809.20 |    10   7863.90 |   10    9337.20 
32762     1    10 |      10   7919703.40 |    10   7867.10 |   10    9340.70 
32763     1    10 |      10   1537522.50 |    10   7869.40 |   10    9340.70 
32764     1    10 |      10   6173019.20 |    10   7866.60 |   10    9340.00 
32765     1    10 |      10   8104105.00 |    10   7876.20 |   10   10112.80 
32766     1    10 |      10  16145415.40 |    10   7880.80 |   10   10893.50 
32767     1    10 |      10  16135267.10 |    10   7878.60 |   10   11674.40 


Other variants are possible, for instance if 4096 bytes is too much
(hell I don't know how that could be), one can break it up into smaller
search chunks (e.g. 256 bytes).

Another alternative is to allocate the page on the first occasion of
getting into get_pid_my_map....

In the following I give a comparative result between algo-2 and
algo-2 with a max interval size of 256. The times are very comparative.
Also the search count values are identical, but in 2 cases suggesting
that a interval size particular for large thread counts of 256 is certainly 
sufficient, but it brings some small overhead. Question to answer is 
wether setting aside an extra page is such a crime..... 

   A     D      T |    S2        G2 |  S2-256     G2-256
-------------------------------------------------------
   10    10    80 |    1       0.81 |      1       0.84
   20    10   100 |    1       0.64 |      1       0.67
   30    10   100 |    1       0.65 |      1       0.68
   40    10   100 |    1       0.65 |      1       0.69
   50    10   100 |    1       0.67 |      1       0.71
   60    10   100 |    2       0.79 |      2       0.82
   70    10   100 |    1       0.76 |      1       0.76
   80    10   100 |    1       0.76 |      1       0.79
  100    50   500 |    2       0.30 |      2       0.31
  150    50   500 |    4       0.36 |      5       0.39  <=
  200    50   500 |    5       0.46 |      5       0.46
  250    50   500 |    6       0.54 |      6       0.55
  300    50   500 |    9       0.76 |      9       0.76
  350    50   500 |    6       0.76 |      6       0.75
  400    50   500 |   10       1.10 |     10       1.10
  450    50   500 |    7       0.97 |      7       0.97
  500    50   500 |    7       1.06 |      7       1.06
  550    50   500 |    9       1.34 |      9       1.35
  600    50   500 |   14       2.04 |     14       2.06
  650    50   500 |   13       2.07 |     13       2.09
  700    50   500 |   10       1.79 |     10       1.82
  750    50   500 |   15       2.73 |     15       2.69
  800    50   500 |   12       2.42 |     12       2.38
  850    50   500 |   13       2.67 |     13       2.66
  900    50   500 |   27       5.18 |     27       5.25
 1000  1000  9980 |    1       0.18 |      3       0.19 <=
 2000  1000 10000 |  322       4.81 |    322       4.84
 3000  1000 10000 |  606      15.49 |    606      15.55
 4000  1000 10000 |  901      33.76 |    901      34.42
 5000  1000 10000 | 1165      59.91 |   1165      62.35
 6000  1000 10000 | 1498     104.36 |   1498     105.55
 7000  1000 10000 | 1835     165.08 |   1835     174.82
 8000  1000 10000 | 2084     234.73 |   2084     244.18
 9000  1000 10000 | 2489     354.27 |   2489     372.11
10000  1000 10000 | 2763     472.47 |   2763     486.73
11000  1000 10000 | 3095     629.70 |   3095     648.31
12000  1000 10000 | 3277     773.06 |   3277     784.75
13000  1000 10000 | 3711    1003.67 |   3711    1006.94
14000  1000 10000 | 3878    1172.70 |   3878    1168.71
15000  1000 10000 | 4301    1436.40 |   4301    1429.89
16000  1000 10000 | 4485    1635.48 |   4485    1620.90
17000  1000 10000 | 4844    1924.27 |   4844    1904.92
18000  1000 10000 | 5218    2232.52 |   5218    2218.80
19000  1000 10000 | 5501    2519.60 |   5501    2508.83
20000  1000 10000 | 5770    2823.98 |   5770    2895.66
21000  1000 10000 | 6161    3183.73 |   6161    3307.54
22000  1000 10000 | 6389    3492.97 |   6389    3620.53
23000  1000 10000 | 6665    3823.38 |   6665    3995.63
24000  1000 10000 | 6982    4199.61 |   6982    4347.95
25000  1000 10000 | 7209    4546.18 |   7209    4701.95
26000  1000 10000 | 7533    4939.12 |   7533    5088.13
27000  1000 10000 | 7959    5442.25 |   7959    5599.85
28000  1000 10000 | 8140    5767.13 |   8140    5817.86
29000  1000 10000 | 8501    6250.86 |   8501    6250.30
30000  1000 10000 | 8911    6791.97 |   8911    6788.51
32000    50   500 |  487    7494.39 |    487    7493.47
32050    50   500 |  486    7541.61 |    486    7541.05
32100    50   500 |  486    7589.02 |    486    7586.12
32150    50   500 |  492    7731.18 |    492    7728.76
32200    50   500 |  497    7854.59 |    497    7854.94
32250    50   500 |  490    7795.12 |    490    7783.10
32300    50   500 |  495    7909.77 |    495    7902.70
32350    50   500 |  495    7967.86 |    495    7946.20
32400    50   500 |  496    8018.98 |    496    7999.34
32450    50   500 |  492    7982.34 |    492    7962.93
32500    50   500 |  495    8060.97 |    495    8048.18
32550    50   500 |  498    8134.02 |    498    8122.08
32600    50   500 |  498    8171.97 |    498    8169.34
32650    50   500 |  498    8213.85 |    498    8209.95
32700    50   500 |  499    8267.67 |    499    8266.13
32751     1    10 |   10    8549.30 |     10    8629.00
32752     1    10 |   10    8556.10 |     10    8636.30
32753     1    10 |   10    8562.20 |     10    8632.00
32754     1    10 |   10    8564.40 |     10    8633.40
32755     1    10 |   10    8553.90 |     10    8635.40
32756     1    10 |   10    8553.00 |     10    8637.60
32757     1    10 |   10    8552.10 |     10    8640.90
32758     1    10 |   10    8557.20 |     10    8644.90
32759     1    10 |   10    8562.00 |     10    8644.10
32760     1    10 |   10    9336.60 |     10    9436.10
32761     1    10 |   10    9337.20 |     10    9435.60
32762     1    10 |   10    9340.70 |     10    9439.10
32763     1    10 |   10    9340.70 |     10    9433.60
32764     1    10 |   10    9340.00 |     10    9440.60
32765     1    10 |   10   10112.80 |     10   10228.40
32766     1    10 |   10   10893.50 |     10   11023.50
32767     1    10 |   10   11674.40 |     10   11813.70


-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)


diff -urbN linux-2.5.7-pre1/kernel/fork.c linux-2.5.7-pre1pid/kernel/fork.c
--- linux-2.5.7-pre1/kernel/fork.c	Thu Mar  7 21:18:12 2002
+++ linux-2.5.7-pre1pid/kernel/fork.c	Thu Mar 14 16:03:23 2002
@@ -125,24 +125,101 @@
 /* Protects next_safe and last_pid. */
 spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
 
+/* this should be provided in every architecture */
+#ifndef SHIFT_PER_LONG
+#if BITS_PER_LONG == 64
+#   define SHIFT_PER_LONG 6
+#elif BITS_PER_LONG == 32
+#   define SHIFT_PER_LONG 5
+#else
+#error "SHIFT_PER_LONG"
+#endif
+#endif
+
+#define RESERVED_PIDS       (300)
+#define GETPID_THRESHOLD    (21000)  /* when to switch to a mapped algo */
+#define PID_MAP_SIZE        (PID_MAX >> SHIFT_PER_LONG)
+static unsigned long pid_map[PID_MAP_SIZE];
+static int next_safe = PID_MAX;
+
+static inline void mark_pid(int pid)
+{
+	pid_map[(pid) >> SHIFT_PER_LONG] |= (1 << ((pid) & (BITS_PER_LONG-1)));
+}
+
+static int get_pid_by_map(int last_pid) 
+{
+        struct task_struct *p;
+        int i;
+	int again = 1;
+	unsigned long mask;
+	int fpos;
+
+repeat:
+	for_each_task(p) {
+		mark_pid(p->pid);
+		mark_pid(p->pgrp);
+		mark_pid(p->tgid);
+		mark_pid(p->session);
+	}
+
+	/* find next free pid */
+	i = (last_pid >> SHIFT_PER_LONG);
+	mask = pid_map[i] | (((last_pid & (BITS_PER_LONG-1)) << 1) - 1);
+	
+	while ((mask == ~0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+	
+	if (i == PID_MAP_SIZE) { 
+		if (again) {
+			/* we didn't find any pid , sweep and try again */
+			again = 0;
+			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
+			last_pid = RESERVED_PIDS;
+			goto repeat;
+		}
+		next_safe = RESERVED_PIDS;
+		return 0; 
+	}
+
+	fpos = ffz(mask);
+	i &= (PID_MAX-1);
+	last_pid = (i << SHIFT_PER_LONG) + fpos;
+
+	/* find next save pid */
+	mask &= ~((1 << fpos) - 1);
+
+	while ((mask == 0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+
+	if (i==PID_MAP_SIZE) 
+		next_safe = PID_MAX;
+	else 
+		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
+	return last_pid;
+}
+
 static int get_pid(unsigned long flags)
 {
-	static int next_safe = PID_MAX;
 	struct task_struct *p;
-	int pid;
+	int pid,beginpid;
 
 	if (flags & CLONE_PID)
 		return current->pid;
 
 	spin_lock(&lastpid_lock);
+	beginpid = last_pid;
 	if((++last_pid) & 0xffff8000) {
-		last_pid = 300;		/* Skip daemons etc. */
+		last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
 		goto inside;
 	}
 	if(last_pid >= next_safe) {
 inside:
 		next_safe = PID_MAX;
 		read_lock(&tasklist_lock);
+		if (nr_threads > GETPID_THRESHOLD) {
+			last_pid = get_pid_by_map(last_pid);
+		} else {
 	repeat:
 		for_each_task(p) {
 			if(p->pid == last_pid	||
@@ -151,9 +228,11 @@
 			   p->session == last_pid) {
 				if(++last_pid >= next_safe) {
 					if(last_pid & 0xffff8000)
-						last_pid = 300;
+							last_pid = RESERVED_PIDS;
 					next_safe = PID_MAX;
 				}
+					if(unlikely(last_pid == beginpid))
+						goto nomorepids;
 				goto repeat;
 			}
 			if(p->pid > last_pid && next_safe > p->pid)
@@ -162,6 +241,9 @@
 				next_safe = p->pgrp;
 			if(p->session > last_pid && next_safe > p->session)
 				next_safe = p->session;
+				if(p->tgid > last_pid && next_safe > p->tgid)
+					next_safe = p->tgid;
+			}
 		}
 		read_unlock(&tasklist_lock);
 	}
@@ -169,6 +251,10 @@
 	spin_unlock(&lastpid_lock);
 
 	return pid;
+nomorepids:
+	read_unlock(&tasklist_lock);
+	spin_unlock(&lastpid_lock);
+	return 0;
 }
 
 static inline int dup_mmap(struct mm_struct * mm)
[2. Test Program --- text/x-c; getpid1.c]...