From:	schwidefsky@de.ibm.com
To:	linux-kernel@vger.kernel.org
Date:	Mon, 9 Apr 2001 17:54:37 +0200
Subject: No 100 HZ timer !

Hi,
seems like my first try with the complete patch hasn't made it through
to the mailing list. This is the second try with only the common part of
the patch. Here we go (again):
---

I have a suggestion that might seem unusual at first, but it is important
for Linux on S/390. We are facing the problem that we want to start many
(> 1000) Linux images on a big S/390 machine. Every image has its own
100 HZ timer on every processor the image uses (normally one). On a single
image system the processor use of the 100 HZ timer is not a big deal, but
with > 1000 images you need a lot of processing power just to execute the
100 HZ timers. You quickly end up at 100% CPU use just for the timer
interrupts of otherwise idle images. Therefore I had a go at the timer
stuff, and now I have a system running without the 100 HZ timer. Unluckily
I need to make changes to common code, and I want your opinion on it.

The first problem was how to get rid of the jiffies variable. The solution
is simple: a macro that calculates the jiffies value from the TOD clock:
  #define jiffies ({ \
          uint64_t __ticks; \
          /* store the 64-bit TOD clock */ \
          asm ("STCK %0" : "=m" (__ticks) ); \
          /* bit 51 of the TOD clock is one microsecond, */ \
          /* so >> 12 yields microseconds since boot */ \
          __ticks = (__ticks - init_timer_cc) >> 12; \
          /* divide by microseconds per jiffy */ \
          do_div(__ticks, (1000000/HZ)); \
          ((unsigned long) __ticks); \
  })
With this define you are independent of the jiffies variable, which is no
longer needed, so I ifdef'ed its definition. There are some places where a
local variable is named jiffies. These must not be replaced by the macro,
so I renamed them to _jiffies. A kernel compiled with only this change
works as always.
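Just to illustrate the idea (the attached patch is authoritative), the
guard around the old variable could look like this, using the config
option introduced at the end of this mail:
  #ifndef CONFIG_NO_HZ_TIMER
  unsigned long volatile jiffies;  /* only needed with the 100 HZ timer */
  #endif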

The second problem is that you need to be able to find out when the next
timer event is due to happen. You'll find a new function "next_timer_event"
in the patch which traverses tv1-tv5 and returns the timer_list of the next
timer event. It is used in timer_bh to indicate to the backend when the
next interrupt should happen; a simplified sketch of such a lookup follows.
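Since the patch is attached rather than inlined, here is a rough sketch of
what such a lookup amounts to, assuming the 2.4 timer wheel from
kernel/timer.c (tvecs[], NOOF_TVECS, TVR_SIZE, TVN_SIZE) and that the
caller holds timerlist_lock; the real function can be smarter than this
brute-force scan for the minimum expiry:
  /* sketch: return the soonest pending timer, or NULL if none */
  static struct timer_list *next_timer_event(void)
  {
          struct timer_list *timer, *nte = NULL;
          struct list_head *head, *curr;
          int i, j, size;

          for (i = 0; i < NOOF_TVECS; i++) {
                  /* tv1 has TVR_SIZE slots, tv2-tv5 have TVN_SIZE */
                  size = (i == 0) ? TVR_SIZE : TVN_SIZE;
                  for (j = 0; j < size; j++) {
                          head = tvecs[i]->vec + j;
                          for (curr = head->next; curr != head;
                               curr = curr->next) {
                                  timer = list_entry(curr,
                                                     struct timer_list, list);
                                  if (nte == NULL ||
                                      time_before(timer->expires,
                                                  nte->expires))
                                          nte = timer;
                          }
                  }
          }
          return nte;
  }
This leads us to the notifier functions.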
Each time a new timer is added, a timer is modified, or a timer expires
the architecture backend needs to reset its timeout value. That is what the
"timer_notify" callback is used for. The implementation on S/390 uses the
clock comparator and looks like this:
  static void s390_timer_notify(unsigned long expires)
  {
          /* convert the expiry time from jiffies to a TOD clock value */
          S390_lowcore.timer_event =
                  ((__u64) expires*CLK_TICKS_PER_JIFFY) + init_timer_cc;
          /* set the clock comparator; the cpu is interrupted as soon
             as the TOD clock passes this value */
          asm volatile ("SCKC %0" : : "m" (S390_lowcore.timer_event));
  }
This causes an interrupt on the cpu which executed s390_timer_notify after
"expires" has passed. That means that timer events are spread over the cpus
in the system. Modified or deleted timer events do not cause a deletion
notification, so a cpu might be erroneously interrupted too early because
of a timer event that has been modified or deleted. But that doesn't do
any harm, it is just unnecessary work.

There is a second callback "itimer_notify" that is used to get the per
process timers right. We use the cpu timer for this purpose:
  void set_cpu_timer(void)
  {
          unsigned long min_ticks;
          __u64 time_slice;

          if (current->pid != 0 && current->need_resched == 0) {
                  /* take the smallest remaining tick count of the time
                     slice and the profiling and virtual interval timers */
                  min_ticks = current->counter;
                  if (current->it_prof_value != 0 &&
                      current->it_prof_value < min_ticks)
                          min_ticks = current->it_prof_value;
                  if (current->it_virt_value != 0 &&
                      current->it_virt_value < min_ticks)
                          min_ticks = current->it_virt_value;
                  /* arm the one-shot cpu timer with that interval */
                  time_slice = (__u64) min_ticks*CLK_TICKS_PER_JIFFY;
                  asm volatile ("spt %0" : : "m" (time_slice));
          }
  }
The cpu timer is a one-shot timer that interrupts after the specified
amount of time has passed. It is not 100% accurate, because VM can
schedule the virtual processor before the "spt" has been done, but it is
good enough for per process timers.

The remaining changes to common code parts deal with the problem that many
ticks may be accounted at once. For example, without the 100 HZ timer it is
possible that a process runs for half a second in user space. With the next
interrupt all the ticks between the last update and the interrupt have to
be added to the tick counters. This is why update_wall_time and do_it_prof
have changed and update_process_times2 has been introduced; a sketch of
the latter follows.
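Again just as an illustration (the parameter names are assumed, the
attached patch is authoritative): a multi-tick variant of
update_process_times could look roughly like this, where "ticks" is the
number of elapsed jiffies and "user" how many of them were spent in user
space:
  /* sketch: account "ticks" jiffies at once instead of one per call */
  static void update_process_times2(int ticks, int user)
  {
          struct task_struct *p = current;
          int system = ticks - user;

          /* charge user/system time and check the interval timers */
          update_one_process(p, user, system, smp_processor_id());
          if (p->pid) {
                  p->counter -= ticks;
                  if (p->counter <= 0) {
                          p->counter = 0;
                          p->need_resched = 1;
                  }
          }
  }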

That leaves three problems: 1) you need to check on every system entry
whether a tick or more has passed and do the update if necessary, 2) you
need to keep track of the elapsed time in user space and in kernel space,
and 3) you need to check tq_timer every time the system is left and set up
a timer event for the next timer tick if there is work to do on the timer
queue. These three problems are related and have to be solved in
architecture dependent code. A nice thing we get for free is that the
user/kernel elapsed time measurement gets much more accurate.
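To make 1) concrete, here is a rough sketch of the check an architecture
backend might do on every system entry; last_update_cc and the helper
elapsed_user_ticks() are assumed names, and update_wall_time takes the
number of elapsed ticks as in the modified common code:
  static inline void account_elapsed_ticks(void)
  {
          __u64 now;
          unsigned long ticks = 0;

          asm volatile ("STCK %0" : "=m" (now));
          /* count the whole jiffies since the last update */
          while (now - last_update_cc >= CLK_TICKS_PER_JIFFY) {
                  last_update_cc += CLK_TICKS_PER_JIFFY;
                  ticks++;
          }
          if (ticks) {
                  /* account all elapsed ticks in one go */
                  update_wall_time(ticks);
                  update_process_times2(ticks, elapsed_user_ticks(ticks));
          }
  }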

The number of interrupts in an idle system due to timer activity drops
from 100 per second on every cpu to about 5-6 on all (!) cpus if this
patch is used. Exactly what we want to have.

All this new timer code is only used if the config option
CONFIG_NO_HZ_TIMER is set. Without it everything works as always,
especially for architectures that will not use it.

Now what do you think?

(See attached file: timer_common)

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com
