From:	 Ingo Molnar <mingo@elte.hu>
To:	 <linux-kernel@vger.kernel.org>
Subject: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
Date:	 Tue, 2 Oct 2001 00:16:06 +0200 (CEST)
Cc:	 Linus Torvalds <torvalds@transmeta.com>,
	 Alan Cox <alan@lxorguk.ukuu.org.uk>,
	 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	 Andrea Arcangeli <andrea@suse.de>,
	 Simon Kirby <sim@netnation.com>


to sum things up, we have three main problem areas that are connected to
hardirq and softirq processing:

- a little utility written by Simon Kirby proved that no matter how much
  softirq throttling we do, it's easy to lock up a pretty powerful Linux
  box via a high rate of network interrupts, even from relatively
  low-powered clients. 2.4.6, 2.4.7, 2.4.10 all lock up. Alexey said as
  well that it's still easy to lock up low-powered Linux routers via more
  or less normal traffic.

- prior to 2.4.7 we used to 'leak' softirq handling => we ended up missing
  softirqs in a number of circumstances. Stock 2.4.10 still has a number
  of places that do this too.

- a number of people have reported gigabit performance problems (some
  people reported a 10-20% drop in performance under load) since
  ksoftirqd was added - which was added to fix some of the 2.4.6-
  softirq-handling latency problems.

we also have another problem that often pops up when the BIOS goes bad or
a device driver makes some mistake:

- Linux often 'locks up' if it gets into an 'interrupt storm' - when
  interrupt sources send a very high rate of interrupts. This can be
  seen as boot-time hangs and module-insert-time hangs as well.

the attached patch, while a bit radical, is i believe a robust solution to
all four problems. It gives gigabit performance back, avoids the lockups
and attempts to reach as short softirq-processing latency as possible.

the new mechanism:

- the irq handling code has been extended to support 'soft mitigation',
  ie. to mitigate the rate of hardware interrupts, without support from
  the actual hardware. There is a reasonable default, but the value can
  also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.

the method is the following. We count the number of interrupts serviced,
and if within a jiffy there are more than max_rate interrupts, the code
disables the IRQ source and marks it as IRQ_MITIGATED. On the next timer
interrupt the irq_rate_check() function is called, which makes sure that
'blocked' irqs are restarted & handled properly. The interrupt is disabled
in the interrupt controller, which has the nice side-effect of fixing and
blocking interrupt storms. (The support code for 'soft mitigation' is
designed to be very lightweight, it's a decrement and a test in the IRQ
handling hot path.)
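
(to make the counting scheme concrete, here is a minimal userspace sketch
of it. It is not the patch code: struct sim_irq, sim_rate_check(),
sim_do_irq(), sim_run() and SIM_HZ are hypothetical names invented for this
illustration, and the model unmasks the line on every tick unconditionally,
which simplifies what irq_rate_check() does. The interrupt that uses up the
budget is not counted as handled here; in the patch it is marked
IRQ_PENDING and left to the re-enable path.)

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_HZ 100	/* assumed timer frequency (ticks per second) */

/* Hypothetical userspace model of one IRQ line; not the kernel's irq_desc_t. */
struct sim_irq {
	unsigned int max_rate;	/* allowed interrupts per second */
	unsigned int count;	/* budget left in the current tick */
	bool mitigated;		/* models the IRQ_MITIGATED flag */
	unsigned long handled;	/* interrupts that reached the handler */
};

/* Once per simulated timer tick: unmask the line and refill the budget. */
static void sim_rate_check(struct sim_irq *irq)
{
	assert(irq->max_rate >= SIM_HZ);	/* budget must be at least 1 */
	irq->mitigated = false;
	irq->count = irq->max_rate / SIM_HZ;
}

/* Once per simulated hardware interrupt. Returns true if it was handled. */
static bool sim_do_irq(struct sim_irq *irq)
{
	if (irq->mitigated)
		return false;		/* line is masked until the next tick */
	if (--irq->count == 0) {	/* one decrement and one test */
		irq->mitigated = true;	/* budget used up: mask the line */
		return false;
	}
	irq->handled++;
	return true;
}

/* Drive `ticks` timer ticks with `per_tick` interrupt attempts in each. */
static unsigned long sim_run(struct sim_irq *irq, unsigned int ticks,
			     unsigned int per_tick)
{
	irq->handled = 0;
	for (unsigned int t = 0; t < ticks; t++) {
		sim_rate_check(irq);
		for (unsigned int i = 0; i < per_tick; i++)
			sim_do_irq(irq);
	}
	return irq->handled;
}
```

in this model, max_rate = 1000 with SIM_HZ = 100 gives a per-tick budget of
10, so a flood of 300 attempts per tick gets 9 of them through immediately
in each tick, while a source that stays under the budget is never masked.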

(note that in case of shared interrupts, another 'innocent' device might
stay disabled for some short amount of time as well - but this is not an
issue because this mitigation does not make that device inoperable, it
just delays its interrupt by up to 10 msecs. Plus, modern systems have
properly distributed interrupts.)

- softirq code got simplified significantly. The concept is to 'handle all
  pending softirqs' - just as the hardware IRQ code 'handles all hardware
  interrupts that were passed to it'. Since most of the time there is a
  direct relationship between softirq work and hardirq work, the
  mitigation of hardirqs mitigates softirq load as well.

- ksoftirqd is gone, there is never any softirq pending while
  softirq-unaware code is executing.

- the tasklet code needed some cleanup along the way, and it also won some
  restart-on-enable and restart-on-unlock properties that it lacked
  before. (but which are desired.)

due to these changes, the linecount in softirq.c got smaller by 25%.
[i dropped the unwakeup change - but that one could be useful in the VM,
to eg. unwakeup bdflush or kswapd.]
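
(the 'handle all pending softirqs' idea can be sketched in userspace as
follows. This is only an illustration under stated assumptions: sim_pending,
sim_vec[], sim_raise(), sim_do_softirq() and the two example handlers are
hypothetical names, there is no interrupt masking or per-CPU state, and the
softirq numbers are picked for the example. What it does share with the
rewritten do_softirq() is the shape of the loop: reset the pending mask, run
the handlers, and restart for as long as anything was raised in the
meantime.)

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NR_SOFTIRQS	32
#define SIM_TX_SOFTIRQ	1	/* example numbers, chosen for this sketch */
#define SIM_RX_SOFTIRQ	2

typedef void (*sim_action_t)(void);

static uint32_t sim_pending;			/* models softirq_pending(cpu) */
static sim_action_t sim_vec[SIM_NR_SOFTIRQS];	/* models softirq_vec[] */
static unsigned int sim_rx_runs, sim_tx_runs;

/* Mark a softirq as pending; it runs on the next pass of the loop. */
static void sim_raise(unsigned int nr)
{
	assert(nr < SIM_NR_SOFTIRQS);
	sim_pending |= UINT32_C(1) << nr;
}

/* Example handlers: RX work raises TX work, which forces one more pass. */
static void sim_rx_action(void)
{
	sim_rx_runs++;
	sim_raise(SIM_TX_SOFTIRQ);
}

static void sim_tx_action(void)
{
	sim_tx_runs++;
}

/* Run handlers until nothing is pending; returns the number of calls. */
static unsigned int sim_do_softirq(void)
{
	unsigned int calls = 0;

	while (sim_pending) {
		uint32_t pending = sim_pending;

		sim_pending = 0;	/* reset the mask before running handlers */
		for (unsigned int nr = 0; pending; nr++, pending >>= 1) {
			if ((pending & 1) && sim_vec[nr]) {
				sim_vec[nr]();
				calls++;
			}
		}
	}
	return calls;
}
```

note that nothing inside this loop bounds the amount of work done; in the
design described above that bound comes from mitigating the hardirqs that
generate the softirq work in the first place.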

- drivers can optionally use the set_irq_rate(irq, new_rate) call to
  change the current IRQ rate. Drivers are the ones who know best what
  kind of loads to expect from the hardware, so they might want to
  influence this value. Also, drivers that implement IRQ mitigation
  themselves in hardware, can effectively disable the soft-mitigation code
  by using a very high rate value.
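
(a small userspace sketch of how a requested rate turns into a per-tick
budget. It mirrors the clamping done by set_irq_rate() in the patch below,
but sim_set_irq_rate(), sim_irq_rate[], SIM_HZ and SIM_NR_IRQS are
hypothetical names for this illustration only, and SIM_HZ = 100 is an
assumption about the timer frequency.)

```c
#include <assert.h>

#define SIM_HZ		100	/* assumed timer ticks per second */
#define SIM_NR_IRQS	16	/* arbitrary table size for the sketch */

static unsigned int sim_irq_rate[SIM_NR_IRQS];	/* interrupts per second */

/*
 * Store the requested rate for an IRQ, raising it to 2*SIM_HZ if it is
 * lower than that, and return the resulting budget per timer tick.
 */
static unsigned int sim_set_irq_rate(unsigned int irq, unsigned int rate)
{
	assert(irq < SIM_NR_IRQS);
	if (rate < 2 * SIM_HZ)
		rate = 2 * SIM_HZ;
	sim_irq_rate[irq] = rate;
	return rate / SIM_HZ;
}
```

with these assumptions a request for 1000 interrupts/sec yields a budget of
10 per tick, and a request for 50 interrupts/sec is raised to 200, ie. 2 per
tick.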

what is the concept behind all this? Simplicity, and concept. We were
clearly heading in the wrong direction: putting more complexity into the
core softirq code to handle some really extreme and unusual cases. Also,
softirqs were slowly morphing into something process-ish - but in Linux we
already have a concept of processes, so we'd have two dueling concepts.
(We still have tasklets, which are not really processes - they are
single-threaded paths of execution.)

with this patch, softirqs can again be what they should be: lightweight
'interrupt code' that processes hard-IRQ events but still does this with
interrupts enabled, to allow for low hard-IRQ latencies. Anything that is
conceptually heavyweight IMO does not belong in softirqs, it should be
moved into process contexts. That will take care of CPU-time usage
accounting and CPU-time-limiting and priority issues as well.

(the patch also imports the latency and softirq-restart fixes from my
previous softirq patches.)

i've tested the patch on UP, SMP, XT-PIC and APIC systems, it
correctly limits network interrupt rates (and other device interrupt
rates) to the given limit. I've done stress-testing as well. The patch is
against 2.4.11-pre1, but it applies just fine to the -ac tree as well.

with a high irq-rate limit set, ping flooding has this effect on the
test-system:

 [root@mars /root]# vmstat 1
    procs                      memory    swap          io
  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in
  0  0  0      0 877024   1140  11364   0   0    12     0 30960
  0  0  0      0 877024   1140  11364   0   0     0     0 30950
  0  0  0      0 877024   1140  11364   0   0     0     0 30520

ie. 30k interrupts/sec. With the max_rate set to 1000 interrupts/sec:

 [root@mars /root]# echo 1000 > /proc/irq/21/max_rate
 [root@mars /root]# vmstat 1
    procs                      memory    swap          io
  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in
  0  0  0      0 877004   1144  11372   0   0     0     0 1112
  0  0  0      0 877004   1144  11372   0   0     0     0 1111
  0  0  0      0 877004   1144  11372   0   0     0     0 1111

so it works just fine here. Interactive tasks are still snappy over the
same interface.

Comments, reports, suggestions and testing feedback are more than welcome,

	Ingo

--- linux/kernel/ksyms.c.orig	Mon Oct  1 21:52:32 2001
+++ linux/kernel/ksyms.c	Mon Oct  1 21:52:43 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
--- linux/kernel/softirq.c.orig	Mon Oct  1 21:52:32 2001
+++ linux/kernel/softirq.c	Mon Oct  1 21:53:52 2001
@@ -44,26 +44,11 @@
 
 static struct softirq_action softirq_vec[32] __cacheline_aligned;
 
-/*
- * we cannot loop indefinitely here to avoid userspace starvation,
- * but we also don't want to introduce a worst case 1/HZ latency
- * to the pending events, so lets the scheduler to balance
- * the softirq load for us.
- */
-static inline void wakeup_softirqd(unsigned cpu)
-{
-	struct task_struct * tsk = ksoftirqd_task(cpu);
-
-	if (tsk && tsk->state != TASK_RUNNING)
-		wake_up_process(tsk);
-}
-
 asmlinkage void do_softirq()
 {
 	int cpu = smp_processor_id();
 	__u32 pending;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -75,7 +60,6 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
 restart:
 		/* Reset the pending bitmask before enabling irqs */
@@ -95,152 +79,130 @@
 		local_irq_disable();
 
 		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending)
 			goto restart;
-		}
 		__local_bh_enable();
-
-		if (pending)
-			wakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
 struct tasklet_head tasklet_hi_vec[NR_CPUS] __cacheline_aligned;
 
-void __tasklet_schedule(struct tasklet_struct *t)
+static inline void __tasklet_enable(struct tasklet_struct *t,
+					struct tasklet_head *vec, int softirq)
 {
 	int cpu = smp_processor_id();
-	unsigned long flags;
 
-	local_irq_save(flags);
-	t->next = tasklet_vec[cpu].list;
-	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
-	local_irq_restore(flags);
+	smp_mb__before_atomic_dec();
+	if (!atomic_dec_and_test(&t->count))
+		return;
+
+	local_irq_disable();
+	/*
+	 * Being able to clear the SCHED bit from 1 to 0 means
+	 * we got the right to handle this tasklet.
+	 * Setting it from 0 to 1 means we can queue it.
+	 */
+	if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) && !t->next) {
+		if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
+
+			t->next = (vec + cpu)->list;
+			(vec + cpu)->list = t;
+			__cpu_raise_softirq(cpu, softirq);
+		}
+	}
+	local_irq_enable();
+	rerun_softirqs(cpu);
 }
 
-void __tasklet_hi_schedule(struct tasklet_struct *t)
+void tasklet_enable(struct tasklet_struct *t)
+{
+	__tasklet_enable(t, tasklet_vec, TASKLET_SOFTIRQ);
+}
+
+void tasklet_hi_enable(struct tasklet_struct *t)
+{
+	__tasklet_enable(t, tasklet_hi_vec, HI_SOFTIRQ);
+}
+
+static inline void __tasklet_sched(struct tasklet_struct *t,
+					struct tasklet_head *vec, int softirq)
 {
 	int cpu = smp_processor_id();
 	unsigned long flags;
 
 	local_irq_save(flags);
-	t->next = tasklet_hi_vec[cpu].list;
-	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	t->next = (vec + cpu)->list;
+	(vec + cpu)->list = t;
+	__cpu_raise_softirq(cpu, softirq);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
-static void tasklet_action(struct softirq_action *a)
+void __tasklet_schedule(struct tasklet_struct *t)
 {
-	int cpu = smp_processor_id();
-	struct tasklet_struct *list;
-
-	local_irq_disable();
-	list = tasklet_vec[cpu].list;
-	tasklet_vec[cpu].list = NULL;
-	local_irq_enable();
-
-	while (list) {
-		struct tasklet_struct *t = list;
-
-		list = list->next;
-
-		if (tasklet_trylock(t)) {
-			if (!atomic_read(&t->count)) {
-				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
-					BUG();
-				t->func(t->data);
-				tasklet_unlock(t);
-				continue;
-			}
-			tasklet_unlock(t);
-		}
+	__tasklet_sched(t, tasklet_vec, TASKLET_SOFTIRQ);
+}
 
-		local_irq_disable();
-		t->next = tasklet_vec[cpu].list;
-		tasklet_vec[cpu].list = t;
-		__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
-		local_irq_enable();
-	}
+void __tasklet_hi_schedule(struct tasklet_struct *t)
+{
+	__tasklet_sched(t, tasklet_hi_vec, HI_SOFTIRQ);
 }
 
-static void tasklet_hi_action(struct softirq_action *a)
+static inline void __tasklet_action(struct softirq_action *a,
+					struct tasklet_head *vec)
 {
 	int cpu = smp_processor_id();
 	struct tasklet_struct *list;
 
 	local_irq_disable();
-	list = tasklet_hi_vec[cpu].list;
-	tasklet_hi_vec[cpu].list = NULL;
+	list = (vec + cpu)->list;
+	(vec + cpu)->list = NULL;
 	local_irq_enable();
 
 	while (list) {
 		struct tasklet_struct *t = list;
 
 		list = list->next;
+		t->next = NULL;
 
-		if (tasklet_trylock(t)) {
-			if (!atomic_read(&t->count)) {
-				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
-					BUG();
-				t->func(t->data);
-				tasklet_unlock(t);
-				continue;
-			}
+repeat:
+		if (!tasklet_trylock(t))
+			continue;
+		if (atomic_read(&t->count)) {
 			tasklet_unlock(t);
+			continue;
 		}
-
-		local_irq_disable();
-		t->next = tasklet_hi_vec[cpu].list;
-		tasklet_hi_vec[cpu].list = t;
-		__cpu_raise_softirq(cpu, HI_SOFTIRQ);
-		local_irq_enable();
+		if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) {
+			t->func(t->data);
+			tasklet_unlock(t);
+			if (test_bit(TASKLET_STATE_SCHED, &t->state))
+				goto repeat;
+			continue;
+		}
+		tasklet_unlock(t);
 	}
 }
 
+static void tasklet_action(struct softirq_action *a)
+{
+	__tasklet_action(a, tasklet_vec);
+}
+
+static void tasklet_hi_action(struct softirq_action *a)
+{
+	__tasklet_action(a, tasklet_hi_vec);
+}
 
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
@@ -268,8 +230,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +285,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -358,61 +318,3 @@
 			f(data);
 	}
 }
-
-static int ksoftirqd(void * __bind_cpu)
-{
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
-
-	daemonize();
-	current->nice = 19;
-	sigfillset(&current->blocked);
-
-	/* Migrate to the right CPU */
-	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
-
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
-
-	__set_current_state(TASK_INTERRUPTIBLE);
-	mb();
-
-	ksoftirqd_task(cpu) = current;
-
-	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-
-		while (softirq_pending(cpu)) {
-			do_softirq();
-			if (current->need_resched)
-				schedule();
-		}
-
-		__set_current_state(TASK_INTERRUPTIBLE);
-	}
-}
-
-static __init int spawn_ksoftirqd(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
-				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
-			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
-	}
-
-	return 0;
-}
-
-__initcall(spawn_ksoftirqd);
--- linux/kernel/timer.c.orig	Tue Aug 21 14:26:19 2001
+++ linux/kernel/timer.c	Mon Oct  1 21:52:43 2001
@@ -674,6 +674,7 @@
 void do_timer(struct pt_regs *regs)
 {
 	(*(unsigned long *)&jiffies)++;
+	irq_rate_check();
 #ifndef CONFIG_SMP
 	/* SMP process accounting uses the local APIC timer */
 
--- linux/include/linux/netdevice.h.orig	Mon Oct  1 21:52:28 2001
+++ linux/include/linux/netdevice.h	Mon Oct  1 23:07:44 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
--- linux/include/linux/interrupt.h.orig	Mon Oct  1 21:52:32 2001
+++ linux/include/linux/interrupt.h	Mon Oct  1 23:07:33 2001
@@ -74,9 +74,15 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+extern void show_stack(unsigned long* esp);
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0)
 
 
 
@@ -182,18 +188,8 @@
 	smp_mb();
 }
 
-static inline void tasklet_enable(struct tasklet_struct *t)
-{
-	smp_mb__before_atomic_dec();
-	atomic_dec(&t->count);
-}
-
-static inline void tasklet_hi_enable(struct tasklet_struct *t)
-{
-	smp_mb__before_atomic_dec();
-	atomic_dec(&t->count);
-}
-
+extern void tasklet_enable(struct tasklet_struct *t);
+extern void tasklet_hi_enable(struct tasklet_struct *t);
 extern void tasklet_kill(struct tasklet_struct *t);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
@@ -263,5 +259,6 @@
 extern unsigned long probe_irq_on(void);	/* returns 0 on failure */
 extern int probe_irq_off(unsigned long);	/* returns 0 or negative on failure */
 extern unsigned int probe_irq_mask(unsigned long);	/* returns mask of ISA interrupts */
+extern void irq_rate_check(void);
 
 #endif
--- linux/include/linux/irq.h.orig	Mon Oct  1 21:52:32 2001
+++ linux/include/linux/irq.h	Mon Oct  1 23:07:19 2001
@@ -31,6 +31,7 @@
 #define IRQ_LEVEL	64	/* IRQ level triggered */
 #define IRQ_MASKED	128	/* IRQ masked - shouldn't be seen again */
 #define IRQ_PER_CPU	256	/* IRQ is per CPU */
+#define IRQ_MITIGATED	512	/* IRQ got rate-limited */
 
 /*
  * Interrupt controller descriptor. This is all we need
@@ -62,6 +63,7 @@
 	struct irqaction *action;	/* IRQ action list */
 	unsigned int depth;		/* nested irq disables */
 	spinlock_t lock;
+	unsigned int count;
 } ____cacheline_aligned irq_desc_t;
 
 extern irq_desc_t irq_desc [NR_IRQS];
--- linux/include/asm-i386/irq.h.orig	Mon Oct  1 23:06:53 2001
+++ linux/include/asm-i386/irq.h	Mon Oct  1 23:07:06 2001
@@ -33,6 +33,7 @@
 extern void disable_irq(unsigned int);
 extern void disable_irq_nosync(unsigned int);
 extern void enable_irq(unsigned int);
+extern void set_irq_rate(unsigned int irq, unsigned int rate);
 
 #ifdef CONFIG_X86_LOCAL_APIC
 #define ARCH_HAS_NMI_WATCHDOG		/* See include/linux/nmi.h */
--- linux/include/asm-mips/softirq.h.orig	Mon Oct  1 21:52:32 2001
+++ linux/include/asm-mips/softirq.h	Mon Oct  1 21:52:43 2001
@@ -40,6 +40,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-#define __cpu_raise_softirq(cpu, nr)	set_bit(nr, &softirq_pending(cpu))
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/include/asm-mips64/softirq.h.orig	Mon Oct  1 21:52:32 2001
+++ linux/include/asm-mips64/softirq.h	Mon Oct  1 21:52:43 2001
@@ -39,19 +39,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
-	unsigned int *m = (unsigned int *) &softirq_pending(cpu);
-	unsigned int temp;
-
-	__asm__ __volatile__(
-		"1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
-		"or\t%0, %2\n\t"
-		"sc\t%0, %1\n\t"
-		"beqz\t%0, 1b"
-		: "=&r" (temp), "=m" (*m)
-		: "ir" (1UL << nr), "m" (*m)
-		: "memory");
-}
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig	Mon Oct  1 21:52:32 2001
+++ linux/net/core/dev.c	Mon Oct  1 21:52:43 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;
--- linux/arch/i386/kernel/irq.c.orig	Mon Oct  1 21:52:28 2001
+++ linux/arch/i386/kernel/irq.c	Mon Oct  1 23:06:26 2001
@@ -18,6 +18,7 @@
  */
 
 #include <linux/config.h>
+#include <linux/compiler.h>
 #include <linux/ptrace.h>
 #include <linux/errno.h>
 #include <linux/signal.h>
@@ -68,7 +69,24 @@
 irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned =
 	{ [0 ... NR_IRQS-1] = { 0, &no_irq_type, NULL, 0, SPIN_LOCK_UNLOCKED}};
 
-static void register_irq_proc (unsigned int irq);
+#define DEFAULT_IRQ_RATE 20000
+
+/*
+ * Maximum number of interrupts allowed, per second.
+ * Individual values can be set via echoing the new
+ * decimal value into /proc/irq/IRQ/max_rate.
+ */
+static unsigned int irq_rate [NR_IRQS] =
+		{ [0 ... NR_IRQS-1] = DEFAULT_IRQ_RATE };
+
+/*
+ * Print warnings only once. We reset it to 1 if rate
+ * limit has been changed.
+ */
+static unsigned int rate_warning [NR_IRQS] =
+		{ [0 ... NR_IRQS-1] = 1 };
+
+static void register_irq_proc(unsigned int irq);
 
 /*
  * Special irq handlers.
@@ -230,35 +248,8 @@
 	show_stack(NULL);
 	printk("\n");
 }
-	
-#define MAXCOUNT 100000000
 
-/*
- * I had a lockup scenario where a tight loop doing
- * spin_unlock()/spin_lock() on CPU#1 was racing with
- * spin_lock() on CPU#0. CPU#0 should have noticed spin_unlock(), but
- * apparently the spin_unlock() information did not make it
- * through to CPU#0 ... nasty, is this by design, do we have to limit
- * 'memory update oscillation frequency' artificially like here?
- *
- * Such 'high frequency update' races can be avoided by careful design, but
- * some of our major constructs like spinlocks use similar techniques,
- * it would be nice to clarify this issue. Set this define to 0 if you
- * want to check whether your system freezes.  I suspect the delay done
- * by SYNC_OTHER_CORES() is in correlation with 'snooping latency', but
- * i thought that such things are guaranteed by design, since we use
- * the 'LOCK' prefix.
- */
-#define SUSPECTED_CPU_OR_CHIPSET_BUG_WORKAROUND 0
-
-#if SUSPECTED_CPU_OR_CHIPSET_BUG_WORKAROUND
-# define SYNC_OTHER_CORES(x) udelay(x+1)
-#else
-/*
- * We have to allow irqs to arrive between __sti and __cli
- */
-# define SYNC_OTHER_CORES(x) __asm__ __volatile__ ("nop")
-#endif
+#define MAXCOUNT 100000000
 
 static inline void wait_on_irq(int cpu)
 {
@@ -276,7 +267,7 @@
 				break;
 
 		/* Duh, we have to loop. Release the lock to avoid deadlocks */
-		clear_bit(0,&global_irq_lock);
+		clear_bit(0, &global_irq_lock);
 
 		for (;;) {
 			if (!--count) {
@@ -284,7 +275,8 @@
 				count = ~0;
 			}
 			__sti();
-			SYNC_OTHER_CORES(cpu);
+			/* Allow irqs to arrive */
+			__asm__ __volatile__ ("nop");
 			__cli();
 			if (irqs_running())
 				continue;
@@ -467,6 +459,13 @@
  * controller lock. 
  */
  
+inline void __disable_irq(irq_desc_t *desc, unsigned int irq)
+{
+	if (!desc->depth++) {
+		desc->status |= IRQ_DISABLED;
+		desc->handler->disable(irq);
+	}
+}
 /**
  *	disable_irq_nosync - disable an irq without waiting
  *	@irq: Interrupt to disable
@@ -485,10 +484,7 @@
 	unsigned long flags;
 
 	spin_lock_irqsave(&desc->lock, flags);
-	if (!desc->depth++) {
-		desc->status |= IRQ_DISABLED;
-		desc->handler->disable(irq);
-	}
+	__disable_irq(desc, irq);
 	spin_unlock_irqrestore(&desc->lock, flags);
 }
 
@@ -516,23 +512,8 @@
 	}
 }
 
-/**
- *	enable_irq - enable handling of an irq
- *	@irq: Interrupt to enable
- *
- *	Undoes the effect of one call to disable_irq().  If this
- *	matches the last disable, processing of interrupts on this
- *	IRQ line is re-enabled.
- *
- *	This function may be called from IRQ context.
- */
- 
-void enable_irq(unsigned int irq)
+static inline void __enable_irq(irq_desc_t *desc, unsigned int irq)
 {
-	irq_desc_t *desc = irq_desc + irq;
-	unsigned long flags;
-
-	spin_lock_irqsave(&desc->lock, flags);
 	switch (desc->depth) {
 	case 1: {
 		unsigned int status = desc->status & ~IRQ_DISABLED;
@@ -551,9 +532,69 @@
 		printk("enable_irq(%u) unbalanced from %p\n", irq,
 		       __builtin_return_address(0));
 	}
+}
+
+/**
+ *	enable_irq - enable handling of an irq
+ *	@irq: Interrupt to enable
+ *
+ *	Undoes the effect of one call to disable_irq().  If this
+ *	matches the last disable, processing of interrupts on this
+ *	IRQ line is re-enabled.
+ *
+ *	This function may be called from IRQ context.
+ */
+ 
+void enable_irq(unsigned int irq)
+{
+	irq_desc_t *desc = irq_desc + irq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&desc->lock, flags);
+	__enable_irq(desc, irq);
 	spin_unlock_irqrestore(&desc->lock, flags);
 }
 
+void set_irq_rate(unsigned int irq, unsigned int rate)
+{
+	if (rate < 2*HZ)
+		rate = 2*HZ;
+	if (irq_rate[irq] != rate)
+		rate_warning[irq] = 1;
+	irq_rate[irq] = rate;
+}
+
+static inline void __handle_mitigated(irq_desc_t *desc, unsigned int irq)
+{
+	desc->status &= ~IRQ_MITIGATED;
+	__enable_irq(desc, irq);
+}
+
+/*
+ * This function, provided by every architecture, resets
+ * the irq-limit counters in every jiffy. Overhead is
+ * fairly small, since it gets the spinlock only if the IRQ
+ * got mitigated.
+ */
+
+void irq_rate_check(void)
+{
+	unsigned long flags;
+	irq_desc_t *desc;
+	int i;
+
+	for (i = 0; i < NR_IRQS; i++) {
+		desc = irq_desc + i;
+		if (desc->count <= 1) {
+			spin_lock_irqsave(&desc->lock, flags);
+			if (desc->status & IRQ_MITIGATED)
+				__handle_mitigated(desc, i);
+			spin_unlock_irqrestore(&desc->lock, flags);
+		}
+		desc->count = irq_rate[i] / HZ;
+	}
+}
+
 /*
  * do_IRQ handles all normal device IRQ's (the special
  * SMP cross-CPU interrupts have their own specific
@@ -585,6 +626,13 @@
 	   WAITING is used by probe to mark irqs that are being tested
 	   */
 	status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
+	/*
+	 * One decrement and one branch (test for zero) into
+	 * an unlikely-predicted branch. It cannot be cheaper
+	 * than this.
+	 */
+	if (unlikely(!--desc->count))
+		goto mitigate;
 	status |= IRQ_PENDING; /* we _want_ to handle it */
 
 	/*
@@ -639,6 +687,27 @@
 	if (softirq_pending(cpu))
 		do_softirq();
 	return 1;
+
+mitigate:
+	/*
+	 * We take a slightly longer path here to not put
+	 * overhead into the IRQ hotpath:
+	 */
+	desc->count = 1;
+	if (status & IRQ_MITIGATED)
+		goto out;
+	/*
+	 * Disable interrupt source. It will be re-enabled
+	 * by the next timer interrupt - and possibly be
+	 * restarted if needed.
+	 */
+	desc->status |= IRQ_MITIGATED | IRQ_PENDING;
+	__disable_irq(desc, irq);
+	if (rate_warning[irq]) {
+		printk(KERN_WARNING "Rate limit of %d irqs/sec exceeded for IRQ%d! Throttling irq source.\n", irq_rate[irq], irq);
+		rate_warning[irq] = 0;
+	}
+	goto out;
 }
 
 /**
@@ -809,7 +878,7 @@
 	 * something may have generated an irq long ago and we want to
 	 * flush such a longstanding irq before considering it as spurious. 
 	 */
-	for (i = NR_IRQS-1; i > 0; i--)  {
+	for (i = NR_IRQS-1; i > 0; i--) {
 		desc = irq_desc + i;
 
 		spin_lock_irq(&desc->lock);
@@ -1030,9 +1099,49 @@
 static struct proc_dir_entry * root_irq_dir;
 static struct proc_dir_entry * irq_dir [NR_IRQS];
 
+#define DEC_DIGITS 9
+
+/*
+ * Parses from 0 to 999999999. More than enough for IRQ purposes.
+ */
+static unsigned int parse_dec_value(const char *buffer,
+		unsigned long count, unsigned long *ret)
+{
+	unsigned char decnum [DEC_DIGITS];
+	unsigned long value;
+	int i;
+
+	if (!count)
+		return -EINVAL;
+	if (count > DEC_DIGITS)
+		count = DEC_DIGITS;
+	if (copy_from_user(decnum, buffer, count))
+		return -EFAULT;
+
+	/*
+	 * Parse the first 9 characters as a decimal string,
+	 * any non-decimal char is end-of-string.
+	 */
+	value = 0;
+
+	for (i = 0; i < count; i++) {
+		unsigned int c = decnum[i];
+
+		switch (c) {
+			case '0' ... '9': c -= '0'; break;
+		default:
+			goto out;
+		}
+		value = value * 10 + c;
+	}
+out:
+	*ret = value;
+	return 0;
+}
+
 #define HEX_DIGITS 8
 
-static unsigned int parse_hex_value (const char *buffer,
+static unsigned int parse_hex_value(const char *buffer,
 		unsigned long count, unsigned long *ret)
 {
 	unsigned char hexnum [HEX_DIGITS];
@@ -1071,18 +1180,17 @@
 
 #if CONFIG_SMP
 
-static struct proc_dir_entry * smp_affinity_entry [NR_IRQS];
-
 static unsigned long irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = ~0UL };
-static int irq_affinity_read_proc (char *page, char **start, off_t off,
+
+static int irq_affinity_read_proc(char *page, char **start, off_t off,
 			int count, int *eof, void *data)
 {
-	if (count < HEX_DIGITS+1)
+	if (count <= HEX_DIGITS)
 		return -EINVAL;
 	return sprintf (page, "%08lx\n", irq_affinity[(long)data]);
 }
 
-static int irq_affinity_write_proc (struct file *file, const char *buffer,
+static int irq_affinity_write_proc(struct file *file, const char *buffer,
 					unsigned long count, void *data)
 {
 	int irq = (long) data, full_count = count, err;
@@ -1109,16 +1217,16 @@
 
 #endif
 
-static int prof_cpu_mask_read_proc (char *page, char **start, off_t off,
+static int prof_cpu_mask_read_proc(char *page, char **start, off_t off,
 			int count, int *eof, void *data)
 {
 	unsigned long *mask = (unsigned long *) data;
-	if (count < HEX_DIGITS+1)
+	if (count <= HEX_DIGITS)
 		return -EINVAL;
 	return sprintf (page, "%08lx\n", *mask);
 }
 
-static int prof_cpu_mask_write_proc (struct file *file, const char *buffer,
+static int prof_cpu_mask_write_proc(struct file *file, const char *buffer,
 					unsigned long count, void *data)
 {
 	unsigned long *mask = (unsigned long *) data, full_count = count, err;
@@ -1132,10 +1240,45 @@
 	return full_count;
 }
 
+static int irq_rate_read_proc(char *page, char **start, off_t off,
+			int count, int *eof, void *data)
+{
+	int irq = (int) data;
+	if (count <= DEC_DIGITS)
+		return -EINVAL;
+	return sprintf (page, "%d\n", irq_rate[irq]);
+}
+
+static int irq_rate_write_proc(struct file *file, const char *buffer,
+					unsigned long count, void *data)
+{
+	int irq = (int) data;
+	unsigned long full_count = count, err;
+	unsigned long new_value;
+
+	/* do not allow the timer interrupt to be rate-limited ... :-| */
+	if (!irq)
+		return -EINVAL;
+	err = parse_dec_value(buffer, count, &new_value);
+	if (err)
+		return err;
+
+	/*
+	 * Do not allow a frequency to be lower than 1 interrupt
+	 * per jiffy.
+	 */
+	if (!new_value)
+		return -EINVAL;
+
+	set_irq_rate(irq, new_value);
+	return full_count;
+}
+
 #define MAX_NAMELEN 10
 
-static void register_irq_proc (unsigned int irq)
+static void register_irq_proc(unsigned int irq)
 {
+	struct proc_dir_entry *entry;
 	char name [MAX_NAMELEN];
 
 	if (!root_irq_dir || (irq_desc[irq].handler == &no_irq_type) ||
@@ -1148,28 +1291,32 @@
 	/* create /proc/irq/1234 */
 	irq_dir[irq] = proc_mkdir(name, root_irq_dir);
 
-#if CONFIG_SMP
-	{
-		struct proc_dir_entry *entry;
+	/* create /proc/irq/1234/max_rate */
+	entry = create_proc_entry("max_rate", 0600, irq_dir[irq]);
 
-		/* create /proc/irq/1234/smp_affinity */
-		entry = create_proc_entry("smp_affinity", 0600, irq_dir[irq]);
+	if (entry) {
+		entry->nlink = 1;
+		entry->data = (void *)irq;
+		entry->read_proc = irq_rate_read_proc;
+		entry->write_proc = irq_rate_write_proc;
+	}
 
-		if (entry) {
-			entry->nlink = 1;
-			entry->data = (void *)(long)irq;
-			entry->read_proc = irq_affinity_read_proc;
-			entry->write_proc = irq_affinity_write_proc;
-		}
+#if CONFIG_SMP
+	/* create /proc/irq/1234/smp_affinity */
+	entry = create_proc_entry("smp_affinity", 0600, irq_dir[irq]);
 
-		smp_affinity_entry[irq] = entry;
+	if (entry) {
+		entry->nlink = 1;
+		entry->data = (void *)(long)irq;
+		entry->read_proc = irq_affinity_read_proc;
+		entry->write_proc = irq_affinity_write_proc;
 	}
 #endif
 }
 
 unsigned long prof_cpu_mask = -1;
 
-void init_irq_proc (void)
+void init_irq_proc(void)
 {
 	struct proc_dir_entry *entry;
 	int i;
@@ -1181,7 +1328,7 @@
 	entry = create_proc_entry("prof_cpu_mask", 0600, root_irq_dir);
 
 	if (!entry)
-	    return;
+		return;
 
 	entry->nlink = 1;
 	entry->data = (void *)&prof_cpu_mask;