In a June 1992 posting to the linux-activists mailing list, Linus Torvalds described the original Linux scheduler, noting: "the scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes." A year later, Linus posted a more detailed description of the scheduler, noting: "the linux scheduling algorithm is one of the simplest ones possible". Comments in the original 254-line sched.c file read, "'schedule()' is the scheduler function. This is GOOD CODE! There probably won't be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc). The one thing you might take a look at is the signal-handler code here."
Comments in the current 6,709-line sched.c file show the first changes being made in 1996 by Dave Grothe, "to fix bugs in semaphores and make semaphores SMP safe". Two years later, Andrea Arcangeli is credited with implementing "schedule_timeout() and related stuff". It was not until 2002, ten years after Linus' original code was written, that the scheduler received a complete rewrite: a "new ultra-scalable O(1) scheduler by Ingo Molnar: hybrid priority-list and round-robin design with an array-switch method of distributing timeslices and per-CPU runqueues." Con Kolivas is credited with "interactivity tuning" in 2003, and Nick Piggin added "scheduler domains" in 2004. The most recent rewrite of the scheduler arrived in April 2007, again from Ingo Molnar, this time in the form of his Completely Fair Scheduler.
From: Linus Benedict Torvalds [email blocked]
Subject: Re: linux scheduling
Date: June 15, 1992 - 4:13pm

In article Paul Nakada writes:
>
>Could someone (linus?) please give a high level description of the
>scheduling mechanism in Linux?

The scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes.

Simply put: Linux gets a timer-interrupt at every jiffy (100 Hz), and decrements the current->counter. If current->counter is zero, and the process isn't running in kernel mode, the scheduler() is called. Also, when a process is woken up and it's counter is greater than the current process, a flag is set, which forces the re-scheduling.

The scheduling is pretty simple: when schedule() is called, the process that has the highest non-zero counter and is runnable is run. If no runnable process exists, task ("swapper") gets to run, but it currently doesn't do anything. In the future, the swapper task might do page-aging or similar.

If all runnable processes have a zero counter, the schedule() algorithm goes through all processes once more, giving them a new counter-value of "priority+counter/2". It might seem useless to add the counter/2, as counter is 0, but in fact this is done to /all/ processes, including sleeping ones, which thereby get higher counter-values for when they wake up.

The above algorithm seems to be pretty fair, and works well while being very easy to understand. It also gives much snappier response than many systems I've seen. And it's all done without any queues at all, which I find clean and simple. In fact, the scedule() algorithm isn't more than a couple of lines in C, and going to sleep or waking up is just a matter of setting the process status and calling schedule().

>a) a process calls a kernel trap;
>b) process enters kernel mode, current processor state saved.

In fact, the current processor state is never saved anywhere as in "normal" unices. A kernel trap under linux is very much like a system call: registers get saved on the stack as needed, but we don't mess around with updating task-structures etc.

>c) pop return address from stack and save

No, let it lie on the stack: when the process returns it gets restored. Why save it anywhere else?

>d) kernel initiates disk i/o and places process on i/o queue
>e) get next active process
>f) push return address on stack and return from handler to new process
>g) process times out
> etc etc etc.

It's all much simpler than this: the process, running in kernel mode, does what it wants, and if it wants to sleep, it just does

	current->state = TASK_INTERRUPTIBLE;
	schedule();

assuming it has made sure it will wake up some way (due to signals or the kernel time-out feature or by putting the task-struct address in a wait-variable). The other processes run happily in between, and when something causes the process to wake up, it just continues after the schedule() call. When it's done all the things the system call (or page-exception or whetever) required it to do, it just does a return, and it's back in user mode.

As to how schedule() switches tasks, it's builtin in the 386, so it's actually just a couple machine-code instructions. It's slightly more complicated than just a jump through a task-segment, as it checks that it's not the same task, and for minimal 387-saving by testing the "last_task_used_math" variable and doing a clts if possible, but it's still less than 10 insns. To understand what /really/ goes on, you'd better read a 386 book, but it's not too complicated if you just care about the general idea.

>why do i ask this? I've been studying the scheduling code for a
>couple days, trying to absorb enough to take a stab at improving
>response time under heavy disk i/o. I can understand the parts as
>separate modules, but it's not obvious to me how they work together as
>a system. I imagine it's second nature to Linus and other veteran
>kernel hackers, but all I've got is an OS class from a few years back.

I doubt the scheduler is the problem when it comes to response-time under heavy disk i/o. It's probably at least partly because of the way buffer-cache blocks are locked, which sometimes forces the buffer algorithms to sleep on them unnecessarily. Thus if a buffer-request comes through when blocks are locked due to actual IO, the request sometimes blocks just to make sure there are no races.

I'd take a look at fs/buffer.c before doing anything else. But be /very/ careful when changing buffer.c: it's very easy to add a race-condition if you don't think of all the possible things that can happen due to interrupts etc while a buffer is searced for.

		Linus
From: [email blocked] (Linus Torvalds)
Subject: Re: LINUX Scheduling
Date: 27 Jun 1993 15:19:05 +0300

In article Gordon Russell writes:
> I have been working on a scheduler for my lightweight
>threads algorithm (it uses pre-emptive round-robin at the moment),
>and I am interested in finding out how the
>LINUX kernal does its scheduling. Could someone post/mail me
>the algorithm? If anyone knows other scheduling algorithms I would be
>delighted to here about them.

The linux scheduling algorithm is one of the simplest ones possible: what essentially happens is roughly:

 - every task gets a fixed number of ticks (100 Hz), defaulting to 15. At each clocktick the timer interrupt decreases the current task counter, and when the counter reaches zero, it sets the 'need_resched' flag. Note that this doesn't actually result in a task switch yet.

 - at every return to user mode, the low-level routines check the 'need_resched' flag, and call the scheduler if it is set, otherwise just do a normal return.

 - the scheduler just checks all processes, selects the one that has the highest tick-number, and lets it run. If all runnable processes have zero ticks remaining, the scheduler does an aging algorithm and tries again.

 - the aging algorithm just does a divide-by-two on the tick counter (0 for all runnable processes, but may be non-zero for sleeping processes), and adds the base timer value (15 by default, but changed by 'nice' etc) to the result.

The above is the basic logic - very simple, and results in round-robin behaviour when all tasks are CPU-bound. If something sleeps for a longer time, it will get a higher tick-counter due to the aging mechanism, so that when it wakes up it can have up to 2*default ticks to do any work (although I doubt the aging actually makes much difference).

In addition to the above, the linux 'wake_up()' functions also set the 'need_resched' flag if the task awakened has a higher tick counter than the currently running task. This is done to get better IO throughput and response times.

Note that linux does not keep any state information in the scheduler: this makes the actual scheduling a bit inefficient, but gives more freedom in the sleep/wakup code to do some things that would otherwise be hard to do cleanly. The linux scheduler can be bogged down under heaveir load, but the logic hasn't changed for some time, mainly due to:

 - it's simple, and *very* flexible. Anybody interested in why things are done as they are, should probably take a look at the "add_wait_queue()" and "remove_wait_queue()" functions, especially the way they are used by select() and the tty routines. It can be confusing to see a process set it's own state to TASK_INTERRUPTIBLE (ie sleeping), yet still continue to run, but there are very good reasons for it (races).

 - it actually gives reasonable performance and good responsetimes under most circumstances, and has little overhead due to various priority calculations.

Jim Wissner actually did some testing on the algorithm and implemented a more clever version (with priority queues), but he also said that the simple version seemed to work very well in almost all circumstances.

Hope that gave you some idea about it,

		Linus
sched.c header, August 15th, 2007:
/*
 *  kernel/sched.c
 *
 *  Kernel scheduler and related syscalls
 *
 *  Copyright (C) 1991-2002  Linus Torvalds
 *
 *  1996-12-23  Modified by Dave Grothe to fix bugs in semaphores and
 *              make semaphores SMP safe
 *  1998-11-19  Implemented schedule_timeout() and related stuff
 *              by Andrea Arcangeli
 *  2002-01-04  New ultra-scalable O(1) scheduler by Ingo Molnar:
 *              hybrid priority-list and round-robin design with
 *              an array-switch method of distributing timeslices
 *              and per-CPU runqueues.  Cleanups and useful suggestions
 *              by Davide Libenzi, preemptible kernel bits by Robert Love.
 *  2003-09-03  Interactivity tuning by Con Kolivas.
 *  2004-04-02  Scheduler domains code by Nick Piggin
 *  2007-04-15  Work begun on replacing all interactivity tuning with a
 *              fair scheduling design by Con Kolivas.
 *  2007-05-05  Load balancing (smp-nice) and other improvements
 *              by Peter Williams
 *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
 *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
 */