Linux: Kernel Scheduling

Submitted by Jeremy
on April 3, 2002 - 7:59am

A recent thread on the lkml discussed the addition of another scheduler priority. Currently, referring to 'linux/include/schedule.h' in 2.4 (follows), there are three such priorities available. SCHED_FIFO, SCHED_RR and SCHED_OTHER. The additional priority in question is SCHED_IDLE.

This first two, SCHED_FIFO and SCHED_RR, are for real time processing. The former handles processes "first in, first out". Once a SCHED_FIFO process gets the CPU, it keeps it until finished, or until a higher priority process comes along. The latter handles processes "round robin", sharing time slices among all SCHED_RR process of equal priority. The third, SCHED_OTHER, is the standard, aka "time-share processing".

As for the discussed potential SCHED_IDLE, it would schedule a process to run only when the CPU was idle, not busy with other tasks. In the following thread, Robert Love says, "There is just a lot more to SCHED_IDLE than 'make the task only run when nothing else wants to'", explaining that as simple as it sounds, a lot of work is involved in the implementation.

More information about scheduling can be found in O'Reilly's book Understanding the Linux Kernel. You can refer to the relevant chapter 10 online.


From linux/include/sched.h:
/*
* Scheduling policies
*/
#define SCHED_OTHER 0
#define SCHED_FIFO 1
#define SCHED_RR 2

/*
* This is an additional bit set when we want to
* yield the CPU for one re-schedule..
*/
#define SCHED_YIELD 0x10

From: Nuno Miguel Rodrigues
To: linux-kernel mailing list
Subject: Scheduler priorities
Date: Tue, 26 Mar 2002 11:41:39 +0000 (WET)

Hi,

Does Linux support a fixed process scheduling priority, in the 2.4.x
releases?
If not, are there any plans to support it?

Thanks.

--
Nuno M. Rodrigues
Systems Administrator

From: Mike Fedyk
Subject: Re: Scheduler priorities
Date: Tue, 26 Mar 2002 09:55:11 -0800

Can you elaborate? What do you mean by "fixed process"? Minimum percentage
of CPU?

From: Frank Schaefer
Subject: Re: Scheduler priorities
Date: 27 Mar 2002 10:41:41 +0100

Hi,

I'd read this ...
a fixed prority of some process ...

From: Nuno Miguel Rodrigues
Subject: Re: Scheduler priorities
Date: Wed, 27 Mar 2002 13:06:01 +0000 (WET)

Indeed. In other words a priority that is not dynamically adjusted over
time. Like a real-time scheduling priority.

From: Robert Love
Subject: Re: Scheduler priorities
Date: 27 Mar 2002 08:41:51 -0500

Yes, Linux has two - SCHED_FIFO and SCHED_RR.

Robert Love

From: Wessel Dankers
Subject: Re: Scheduler priorities
Date: Wed, 27 Mar 2002 21:23:35 +0100

Any plans for a SCHED_IDLE?

From: Robert Love
Subject: Re: Scheduler priorities
Date: 27 Mar 2002 16:14:55 -0500

I think Ingo Molnar has mentioned lately doing one.

The problem is, it is not easy to implement right - there are priority
inversion issues to deal with ...

Robert Love

From: Wessel Dankers
Subject: Re: Scheduler priorities
Date: Thu, 28 Mar 2002 08:08:55 +0100

Well evidently it should be root-only, just like SCHED_RR and SCHED_FIFO.
If the priority inversion issues are worked out this restriction could be
removed. I remember discussing this problem with Rik van Riel.
The kernel-preempt patch seems to be able to detect when a process holds a
lock; perhaps the process scheduler can temporarily revert to SCHED_NORMAL
when this is the case? Preferably with a large nice value.

From: Robert Love
Subject: Re: Scheduler priorities
Date: 28 Mar 2002 02:29:34 -0500

The preempt-kernel patch does keep track of the lock count, but it does
not include semaphores and those are what we need to worry about.

I also don't think it is enough that SCHED_IDLE only be settable by
root. Regardless of what permissions it takes to set the scheduling
class, a "SCHED_IDLE" task should never be capable of harming an RT
task.

One solution I have come across is checking whether the task is
returning to kernel or user mode and acting appropriately. As needed,
the task can be scheduled as SCHED_NORMAL. This situation could even be
special-cased like ptrace and not impact normal scheduling. Perhaps
this is what Ingo had in mind ... I hope he is still interested and
presents some code.

I know all this because I tried to implement SCHED_IDLE about a year
ago. There were arguments against every approach, and SCHED_IDLE will
never be accepted until they are all satisfied. If we want it, it needs
to be done right.

Robert Love

From: Pavel Machek
Subject: Re: Scheduler priorities
Date: Fri, 29 Mar 2002 22:42:52 +0100

Hi!

> I also don't think it is enough that SCHED_IDLE only be settable by
> root. Regardless of what permissions it takes to set the scheduling
> class, a "SCHED_IDLE" task should never be capable of harming an RT
> task.

On each entry of kernel, promote SCHED_IDLE task to SCHED_NORMAL, and
demote it at exit. This can be done with 0 overhad on hot paths.

> One solution I have come across is checking whether the task is
> returning to kernel or user mode and acting appropriately. As needed,
> the task can be scheduled as SCHED_NORMAL. This situation could even be
> special-cased like ptrace and not impact normal scheduling. Perhaps
> this is what Ingo had in mind ... I hope he is still interested and
> presents some code.
>
> I know all this because I tried to implement SCHED_IDLE about a year
> ago. There were arguments against every approach, and SCHED_IDLE will
> never be accepted until they are all satisfied. If we want it, it needs
> to be done right.

What's the problem with "promote at enter" approach? Using ptrace
trick, it can be 0 overhead. [Was that your code that cleverly used
ptrace?] What is problem with it?
Pavel

From: Robert Love
Subject: Re: Scheduler priorities
Date: 29 Mar 2002 18:52:21 -0500

On Fri, 2002-03-29 at 16:42, Pavel Machek wrote:
> On each entry of kernel, promote SCHED_IDLE task to SCHED_NORMAL, and
> demote it at exit. This can be done with 0 overhad on hot paths.

Agreed.

> What's the problem with "promote at enter" approach? Using ptrace
> trick, it can be 0 overhead. [Was that your code that cleverly used
> ptrace?] What is problem with it?

There is no problem with the ptrace approach, it is good - I have
experimented with that solution myself. There is just a lot more to
SCHED_IDLE than "make the task only run when nothing else wants to" and
even the ptrace solution may involve a bit of work.

Robert Love

priority-inversion

jcopenha
on
April 3, 2002 - 12:27pm

rml :
The preempt-kernel patch does keep track of the lock count, but it does
not include semaphores and those are what we need to worry about.

Does this mean that it is possible for priority inversion to occur with the preempt patch??

Re: priority-inversion

rml
on
April 3, 2002 - 1:28pm

Does this mean that it is possible for priority inversion to occur with the preempt patch??

No, the comment was in reference to SCHED_IDLE, i.e. the topic of the post. When implementing a SCHED_IDLE class, there is a priority inversion issue. Basically, picture the problem where one grabs a semaphore, sleeps, and is blocked by the normal processes on a system. Then a RT task comes along, wants the semaphores, but can't get it. The SCHED_IDLE task needs to run to release the task, but it won't.

Someone suggested using the preemptive kernel's preempt_count to check if locks were held but preempt_count does not track semaphores which would be the type of lock we would typically return to userspace holding.

Robert Love

ok..

jcopenha
on
April 3, 2002 - 5:44pm

Ok.. so how is priority inversion prevented in a regularly running linux system? Just the fact that we have fairness and all tasks are eventually run?

Re: ok..

rml
on
April 3, 2002 - 6:52pm

Ok.. so how is priority inversion prevented in a regularly running linux system? Just the fact that we have fairness and all tasks are eventually run?

Yep - dead on.

The every-task-makes-headway rule is an important one and is the only real assurance we have that prevents priority inversion. With semaphores that can be held across schedules, it is possible a contended semaphore could be held by a lower priority process, inverting a higher priority process. The guarantee the lower priority process will eventually run is all we have.

Note that even though the lower priority process will eventually run, this is still a case of priority inversion. Every nonpreemptible kernel represents a cause of priority inversion since lower priority processes do not "answer" to higher priority processes while in the kernel. The ideal solution (i.e. no priority inversion ever) is preemptive kernel + priority-inheriting mutexes. This is the solution Solaris has ...

Robert Love

Priority inversion still happens, I think.

Anonymous
on
April 8, 2002 - 6:02am

The every-task-makes-headway rule is an important one and is the only real assurance we have that prevents priority inversion.

Just a minor nit: As I understand the term "priority inversion", it means only that a lower-priority task is blocking a higher-priority task in the following circumstance:


  • A low priority task grabs a resource (lock, semaphore, whatever)
  • A high priority task requests the resource and is blocked by the lower priority task.

By this definition (which I may have wrong), the essence of the priority inversion is that the low priority task has blocked the high priority task. What makes it really bad is that in a hard-priority system, a mid-priority task could become active, blocking the low-priority task (due to hard priorities), and by extension blocking the high-priority task. If the mid-priority task never blocks, both low and high priority tasks are completely taken out, which is the really bad case that nobody wants in their system. But the essense of the problem is that the low-priority task blocked the high-priority task to begin with -- that's the priority inversion.

Thus, by that definition, priority inversions can still happen (eg. a lower priority task can block a higher priority task) even if all processes make progress. The difference is that you eventually start making progress again -- no process ever gets completely starved out, just delayed a bit.

A SCHED_IDLE that truly only lets a task run when nothing else wants to allows the "really bad" case above to happen if anything in SCHED_OTHER wants to run when a SCHED_IDLE has a lock on something. And since most of the system lives in SCHED_OTHER, it's a system deadlock waiting to happen. :-)

So am I correct in my concept of the definition, or is the essence truly that the mid-priority task can block the high-priority task by way of the low-priority task?

On a different note: I agree completely that priority inheritance is the correct technique to provide no inversions ever. While the low-priority task can still block the high-priority task, it can only do so for as long as a similarly high-priority task could have, due to the inherited priority.

duh, I'm half awake. I read

Anonymous
on
April 8, 2002 - 6:04am

duh, I'm half awake. I read your first paragraph, and fired off the above post. I then read your second paragraph (which says essentially what I said above, in many fewer words). I'm going to go get coffee now. Happy hacking.

SCHED_OTHER timeslice and priority

Anonymous
on
March 3, 2004 - 6:20am

I want to ask you about SCHED_NORMAL.Is it the same as SCHED_OTHER in 2.4 (is it time-sharing?)?In 2.4 SCHED_OTHER timeslice is calculated by counter,I don't know how SCHED_NORMAL calculates its timeslice and its process priority (it doesn't use the variable counter).Can you give me the expression to calculate them ? Thanks

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.