"Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice," stated Ingo Molnar during a continuing discussion about the Completely Fair Scheduler. He went on to ask if anyone could point to specific code that illustrates the proper usage of sched_yield(). In response to a theory of how it could potentially optimize userland locking, Ingo challenged, "these are generic statements, but I'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in excess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and 'right' use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)" Ingo went on to explain:
"sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
"For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be 'difficult' because it's hard to determine 'smart' use is not tenable i believe and does not adequately refute my pretty plain-meaning 'it does not exist' assertion."
From: Ingo Molnar
Subject: Re: Network slowdown due to CFS
Date: Oct 1, 9:25 am 2007

* Jarek Poplawski <jarkao2@o2.pl> wrote:

> BTW, it looks like risky to criticise sched_yield too much: some
> people can misinterpret such discussions and stop using this at all,
> even where it's right.

Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice.

Fortunately, the sched_yield() API is already one of the most rarely used scheduler functionalities, so it does not really matter. [ In my experience a Linux scheduler is stabilizing pretty well when the discussion shifts to yield behavior, because that shows that everything else is pretty much fine ;-) ]

But, because you assert it that it's risky to "criticise sched_yield() too much", you sure must know at least one real example where it's right to use it (and cite the line and code where it's used, with specificity)?

Ingo

From: David Schwartz
Subject: RE: Network slowdown due to CFS
Date: Oct 1, 9:49 am 2007

> * Jarek Poplawski <jarkao2@o2.pl> wrote:
>
> > BTW, it looks like risky to criticise sched_yield too much: some
> > people can misinterpret such discussions and stop using this at all,
> > even where it's right.

> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.

It can occasionally be an optimization. You may have a case where you can do something very efficiently if a lock is not held, but you cannot afford to wait for the lock to be released. So you check the lock, if it's held, you yield and then check again. If that fails, you do it the less optimal way (for example, dispatching it to a thread that *can* afford to wait).

It is also sometimes used in the implementation of spinlock-type primitives. After spinning fails, yielding is tried.

I think it's also sometimes appropriate when a thread may monopolize a mutex. For example, consider a rarely-run task that cleans up some expensive structures. It may need to hold locks that are only held during this complex clean up.

One example I know of is a defragmenter for a multi-threaded memory allocator, and it has to lock whole pools. When it releases these locks, it calls yield before re-acquiring them to go back to work. The idea is to "go to the back of the line" if any threads are blocking on those mutexes.

There are certainly other ways to do these things, but I have seen cases where, IMO, yielding was the best solution. Doing nothing would have been okay too.

> Fortunately, the sched_yield() API is already one of the most rarely
> used scheduler functionalities, so it does not really matter. [ In my
> experience a Linux scheduler is stabilizing pretty well when the
> discussion shifts to yield behavior, because that shows that everything
> else is pretty much fine ;-) ]

Can you explain what the current sched_yield behavior *is* for CFS and what the tunable does to change it?

The desired behavior is for the current thread to not be rescheduled until every thread at the same static priority as this thread has had a chance to be scheduled.

Of course, it's not clear exactly what a "chance" is.

The semantics with respect to threads at other static priority levels is not clear. Ditto for SMP issues. It's also not clear whether threads that yield should be rewarded or punished for doing so.

DS

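As a rough illustration of the first pattern David describes (check the lock, yield once if it is held, check again, otherwise hand the work to a thread that can afford to wait), here is a minimal sketch. Python is used only to keep it short; `pool_lock`, `deferred`, and `run_fast_or_defer` are hypothetical names that do not come from the thread, and whether the single yield actually helps depends entirely on the scheduler behavior being debated here.

```python
import os
import queue
import threading

# Hypothetical names for illustration; none of them appear in the thread.
pool_lock = threading.Lock()   # guards some shared structure
deferred = queue.Queue()       # work handed to a thread that *can* afford to wait

# os.sched_yield() exists on Unix; fall back to a no-op elsewhere.
_yield = getattr(os, "sched_yield", lambda: None)


def run_fast_or_defer(work):
    """Try the fast path, yield once if the lock is held, then give up and defer."""
    for attempt in range(2):
        if pool_lock.acquire(blocking=False):   # check the lock without waiting
            try:
                return ("fast", work())
            finally:
                pool_lock.release()
        if attempt == 0:
            _yield()                            # give the lock holder a chance to run
    deferred.put(work)                          # the "less optimal way"
    return ("deferred", None)
```

Calling `run_fast_or_defer(lambda: 42)` while nobody holds the lock takes the fast path; if the lock is still held after the one yield, the callable ends up on `deferred` for some other thread to run.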
From: Ingo Molnar
Subject: Re: Network slowdown due to CFS
Date: Oct 1, 10:31 am 2007

* David Schwartz <davids@webmaster.com> wrote:

> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at
> > > all, even where it's right.
>
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
>
> It can occasionally be an optimization. You may have a case where you
> can do something very efficiently if a lock is not held, but you
> cannot afford to wait for the lock to be released. So you check the
> lock, if it's held, you yield and then check again. If that fails, you
> do it the less optimal way (for example, dispatching it to a thread
> that *can* afford to wait).

These are generic statements, but i'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in execess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and "right" use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)

Preferably that example should show that the best quality user-space lock implementation in a given scenario is best done via sched_yield(). Actual code and numbers. (And this isnt _that_ hard. I'm not asking for a full RDBMS implementation that must run through SQL99 spec suite. This is about a simple locking primitive, or a simple pointer to an existing codebase.)

> It is also sometimes used in the implementation of spinlock-type
> primitives. After spinning fails, yielding is tried.

(user-space spinlocks are broken beyond words for anything but perhaps SCHED_FIFO tasks.)

> One example I know of is a defragmenter for a multi-threaded memory
> allocator, and it has to lock whole pools. When it releases these
> locks, it calls yield before re-acquiring them to go back to work. The
> idea is to "go to the back of the line" if any threads are blocking on
> those mutexes.

at a quick glance this seems broken too - but if you show the specific code i might be able to point out the breakage in detail. (One underlying problem here appears to be fairness: a quick unlock/lock sequence may starve out other threads. yield wont solve that fundamental problem either, and it will introduce random latencies into apps using this memory allocator.)

> > Fortunately, the sched_yield() API is already one of the most rarely
> > used scheduler functionalities, so it does not really matter. [ In my
> > experience a Linux scheduler is stabilizing pretty well when the
> > discussion shifts to yield behavior, because that shows that everything
> > else is pretty much fine ;-) ]
>
> Can you explain what the current sched_yield behavior *is* for CFS and
> what the tunable does to change it?

sure. (and i described that flag on lkml before) The sched_yield flag does two things:

 - if 0 ("opportunistic mode"), then the task will reschedule to any other task that is in "bigger need for CPU time" than the currently running task, as indicated by CFS's ->wait_runtime metric. (or as indicated by the similar ->vruntime metric in sched-devel.git)

 - if 1 ("agressive mode"), then the task will be one-time requeued to the right end of the CFS rbtree. This means that for one instance, all other tasks will run before this task will run again - after that this task's natural ordering within the rbtree is restored.

> The desired behavior is for the current thread to not be rescheduled
> until every thread at the same static priority as this thread has had
> a chance to be scheduled.

do you realize that this "desired behavior" you just described is not achieved by the old scheduler, and that this random behavior _is_ the main problem here? If yield was well-specified then we could implement it in a well-specified way - even if the API was poor.

But fact is that it is _not_ well-specified, and apps grew upon a random scheduler implementation details in random ways. (in the lkml discussion about this topic, Linus offered a pretty sane theoretical definition for yield but it's not simple to implement [and no scheduler implements it at the moment] - nor will it map to the old scheduler's yield behavior so we'll end up breaking more apps.)

Ingo

From: David Schwartz
Subject: RE: Network slowdown due to CFS
Date: Oct 1, 11:23 am 2007

> These are generic statements, but i'm _really_ interested in the
> specifics. Real, specific code that i can look at. The typical Linux
> distro consists of in execess of 500 millions of lines of code, in tens
> of thousands of apps, so there really must be some good, valid and
> "right" use of sched_yield() somewhere in there, in some mainstream app,
> right? (because, as you might have guessed it, in the past decade of
> sched_yield() existence i _have_ seen my share of sched_yield()
> utilizing user-space code, and at the moment i'm not really impressed by
> those examples.)

Maybe, maybe not. Even if so, it would be very difficult to find. Simply grepping for sched_yield is not going to help because determining whether a given use of sched_yield is smart is not going to be easy.

> (user-space spinlocks are broken beyond words for anything but perhaps
> SCHED_FIFO tasks.)

User-space spinlocks are broken so spinlocks can only be implemented in kernel-space? Even if you use the kernel to schedule/unschedule the tasks, you still have to spin in user-space.

> > One example I know of is a defragmenter for a multi-threaded memory
> > allocator, and it has to lock whole pools. When it releases these
> > locks, it calls yield before re-acquiring them to go back to work. The
> > idea is to "go to the back of the line" if any threads are blocking on
> > those mutexes.

> at a quick glance this seems broken too - but if you show the specific
> code i might be able to point out the breakage in detail. (One
> underlying problem here appears to be fairness: a quick unlock/lock
> sequence may starve out other threads. yield wont solve that fundamental
> problem either, and it will introduce random latencies into apps using
> this memory allocator.)

You are assuming that random latencies are necessarily bad. Random latencies may be significantly better than predictable high latency.

> > Can you explain what the current sched_yield behavior *is* for CFS and
> > what the tunable does to change it?

> sure. (and i described that flag on lkml before) The sched_yield flag
> does two things:

> - if 0 ("opportunistic mode"), then the task will reschedule to any
> other task that is in "bigger need for CPU time" than the currently
> running task, as indicated by CFS's ->wait_runtime metric. (or as
> indicated by the similar ->vruntime metric in sched-devel.git)
>
> - if 1 ("agressive mode"), then the task will be one-time requeued to
> the right end of the CFS rbtree. This means that for one instance,
> all other tasks will run before this task will run again - after that
> this task's natural ordering within the rbtree is restored.

Thank you. Unfortunately, neither of these does what sched_yiled is really supposed to do. Opportunistic mode does too little and agressive mode does too much.

> > The desired behavior is for the current thread to not be rescheduled
> > until every thread at the same static priority as this thread has had
> > a chance to be scheduled.

> do you realize that this "desired behavior" you just described is not
> achieved by the old scheduler, and that this random behavior _is_ the
> main problem here? If yield was well-specified then we could implement
> it in a well-specified way - even if the API was poor.

> But fact is that it is _not_ well-specified, and apps grew upon a random
> scheduler implementation details in random ways. (in the lkml discussion
> about this topic, Linus offered a pretty sane theoretical definition for
> yield but it's not simple to implement [and no scheduler implements it
> at the moment] - nor will it map to the old scheduler's yield behavior
> so we'll end up breaking more apps.)

I don't have a problem with failing to emulate the old scheduler's behavior if we can show that the new behavior has saner semantics. Unfortunately, in this case, I think CFS' semantics are pretty bad. Neither of these is what sched_yield is supposed to do.

Note that I'm not saying this is a particularly big deal. And I'm not calling CFS' behavior a regression, since it's not really better or worse than the old behavior, simply different.

I'm not familiar enough with CFS' internals to help much on the implementation, but there may be some simple compromise yield that might work well enough. How about simply acting as if the task used up its timeslice and scheduling the next one? (Possibly with a slight reduction in penalty or reward for not really using all the time, if possible?)

DS

From: Ingo Molnar
Subject: Re: yield API
Date: Oct 1, 11:46 pm 2007

* David Schwartz <davids@webmaster.com> wrote:

> > These are generic statements, but i'm _really_ interested in the
> > specifics. Real, specific code that i can look at. The typical Linux
> > distro consists of in execess of 500 millions of lines of code, in
> > tens of thousands of apps, so there really must be some good, valid
> > and "right" use of sched_yield() somewhere in there, in some
> > mainstream app, right? (because, as you might have guessed it, in
> > the past decade of sched_yield() existence i _have_ seen my share of
> > sched_yield() utilizing user-space code, and at the moment i'm not
> > really impressed by those examples.)
>
> Maybe, maybe not. Even if so, it would be very difficult to find.
> Simply grepping for sched_yield is not going to help because
> determining whether a given use of sched_yield is smart is not going
> to be easy.

sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?

For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be "difficult" because it's hard to determine "smart" use is not tenable i believe and does not adequately refute my pretty plain-meaning "it does not exist" assertion. If then this is one more supporting proof for the fundamental weakness of the sched_yield() API. Rarely are we able to so universally condemn an API: real-life is usually more varied and even for theoretically poorly defined APIs _some_ sort of legitimate use does grow up.

APIs that are not in any real, meaningful use, despite a decade of presence are not really interesting to me personally. (especially in this case where we know exactly _why_ the API is used so rarely.)

Sure we'll continue to support it in the best possible way, with the usual kernel maintainance policy: without hurting other, more commonly used APIs. That was the principle we followed in previous schedulers too. And if anyone has a patch to make sched_yield() better than it is today, i'm of course interested in it.

Ingo

From: Chris Friesen
Subject: Re: Network slowdown due to CFS
Date: Oct 1, 9:55 am 2007

Ingo Molnar wrote:

> But, because you assert it that it's risky to "criticise sched_yield()
> too much", you sure must know at least one real example where it's right
> to use it (and cite the line and code where it's used, with
> specificity)?

It's fine to criticise sched_yield(). I agree that new apps should generally be written to use proper completion mechanisms or to wait for specific events.

However, there are closed-source and/or frozen-source apps where it's not practical to rewrite or rebuild the app. Does it make sense to break the behaviour of all of these?

Chris

From: Ingo Molnar
Subject: Re: Network slowdown due to CFS
Date: Oct 1, 10:09 am 2007

* Chris Friesen <cfriesen@nortel.com> wrote:

> Ingo Molnar wrote:
>
> >But, because you assert it that it's risky to "criticise sched_yield()
> >too much", you sure must know at least one real example where it's right
> >to use it (and cite the line and code where it's used, with
> >specificity)?
>
> It's fine to criticise sched_yield(). I agree that new apps should
> generally be written to use proper completion mechanisms or to wait
> for specific events.

yes.

> However, there are closed-source and/or frozen-source apps where it's
> not practical to rewrite or rebuild the app. Does it make sense to
> break the behaviour of all of these?

See the background and answers to that in:

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune the sysctl flag in one direction or another, depending on which behavior the app is expecting.

Ingo


So what *should* sched_yield() do?
As best as I can tell, David keeps saying "it's useful, but you're not implementing it right" without ever saying what "right" is. Did I miss it somewhere?
--
Program Intellivision and play Space Patrol!
He uses semantics similar to
He uses semantics similar to those defined by Linus:
Can you explain what the current sched_yield behavior *is* for CFS and what
the tunable does to change it?
The desired behavior is for the current thread to not be rescheduled until
every thread at the same static priority as this thread has had a chance to
be scheduled.
Of course, it's not clear exactly what a "chance" is.
The semantics with respect to threads at other static priority levels is not
clear. Ditto for SMP issues. It's also not clear whether threads that yield
should be rewarded or punished for doing so.
No definite semantics
sched_yield() has no definite semantics.
It means: "I have nothing useful to do right now, but execute me again in a few timeslices"
The duration of the delay is undefined, and load-dependent.
The expectation is that all other runnable tasks will have a timeslice before the yielding task is run again.
It is equivalent to a very short delay of variable duration.
The CFS seems to have trouble modeling it, because not consuming your timeslices tends to *increase* your likelihood of being picked to run next.
I suggested that the yielding task be temporarily set to the lowest priority, and all its time credits erased. Since the CFS ensures that no starvation occurs, this would have approximately the same effect as in the old scheduler.
What if it already is at the
What if it already is at the lowest priority?
I don't really know how CFS works specifically, but why is simulating that this task used a full 'timeslice' (or whatever CFS tracks) hard?
Except...
That's fine, except people complain that "Put it at the end of the ready queue" (which is one of the existing options, BTW) is too harsh. So... where should it put it?
--
Program Intellivision and play Space Patrol!
Except...
Who said that placing a yielding task at the end of the ready queue was "too harsh"?
No one I can think of.
Currently the CFS problem with sched_yield is that yielding tasks tend to get too much CPU, never too little. I surmise the CFS priority algorithm reschedules them much too often (because they never consume their whole allotted CPU share, as per their nice() priority).
There's no ready queue as such with the CFS; all the runnable tasks are sorted in a red-black tree, using a metric based on the number of nanoseconds they are entitled to run by their nice level.
CFS doesn't demote yielding tasks enough, causing them to hog the CPU by yielding and being rescheduled repeatedly within the same scheduling "epoch".
Could someone seeing a loss of network performance caused by iperf try to manually change the nice() level of iperf to the level of a background task, and see if it "fixes" the problem?
If so, the CFS just has to implement each yield() as a *temporary* lowering of priority.
Should work fine; the CFS does not allow starvation.
Right here:
Did you read the thread?
From: David Schwartz
Subject: RE: Network slowdown due to CFS
Date: Oct 1, 9:23 am 2007

[...]

> - if 0 ("opportunistic mode"), then the task will reschedule to any
> other task that is in "bigger need for CPU time" than the currently
> running task, as indicated by CFS's ->wait_runtime metric. (or as
> indicated by the similar ->vruntime metric in sched-devel.git)
>
> - if 1 ("agressive mode"), then the task will be one-time requeued to
> the right end of the CFS rbtree. This means that for one instance,
> all other tasks will run before this task will run again - after that
> this task's natural ordering within the rbtree is restored.

Thank you. Unfortunately, neither of these does what sched_yiled is really supposed to do. Opportunistic mode does too little and agressive mode does too much.

I seem to recall David wasn't the only person to make that comment.
At any rate, there are two modes at opposite ends of the spectrum, and there's call for something in between them. Exactly what in between no one seems to specify very well.
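To make the two modes easier to compare, here is a toy model of them as Ingo describes them in the thread. This is not kernel code: tasks are plain `(name, vruntime)` pairs, a smaller `vruntime` stands in for a "bigger need for CPU time", and the function name `next_after_yield` is invented for this sketch.

```python
def next_after_yield(current, runnable, aggressive):
    """Return the task that runs right after `current` calls yield.

    Toy model only: `current` and each entry of `runnable` are
    (name, vruntime) pairs, and a smaller vruntime means a bigger
    need for CPU time.
    """
    if not runnable:
        return current                      # nobody else to run
    neediest = min(runnable, key=lambda task: task[1])
    if aggressive:
        # Mode 1: the yielder is requeued once at the right end of the
        # tree, so some other task always runs first.
        return neediest
    # Mode 0: switch only if another task needs the CPU more than the yielder.
    return neediest if neediest[1] < current[1] else current
```

In this model, a yielder that is already the neediest task keeps the CPU in opportunistic mode (the "too little" complaint), while in aggressive mode it waits behind every other runnable task (the "too much" complaint).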
--
Program Intellivision and play Space Patrol!
My expectations of yield()
I would expect yield() in Java, and any other programming environment, to indicate to the system that I am quite happy for the current thread to give up its time slice to whatever other processes are ready to run - regardless of what their priority is relative to my thread.
To me this is 'obvious'...
My thread might normally have a high priority, but for now I want to let other possibly lower priority threads have a chance to run. This is simpler than explicitly lowering my priority, then explicitly raising it again later. Also, by using yield() this way, I don't get locked out by higher priority threads in operating systems that never give lower priority threads a chance to run when there is a higher priority thread able to run.
-Nivag
Then why not sleep?
If you have nothing to do, then why not sleep? There are APIs that say "Wake me up in about X microseconds."
--
Program Intellivision and play Space Patrol!
'cause I don't want to
A case where I used sched_yield() is in a cooperative multi-process application. I had ~10-20 processes, all of which were CPU-bound (AI related). By yielding, I ensured that all the CPU time was utilized, but that each process got a chance to run its main logic loop in more of a round-robin fashion, rather than process A getting 90 iterations, process B getting 20 iterations, process C getting 2 iterations, etc.
The amount of time each process should "sleep" would not be simple to figure out, and it would lead to under-utilization of the processor. Decreasing the time-slice was also not a good idea because occasionally the loop would need to perform significant computation. The check to see whether the significant computation was needed was quite cheap, and I wanted to make the checks in all processes happen as often as possible.
Yielding does have a place, but I don't think most mainstream apps are where yielding is useful. Not many people have many cooperating CPU-bound processes with little to no IO to wait on.
-BitShifter
I'm confused
If all the threads are CPU bound, then how do you waste CPU time? Getting 100% CPU utilization sounds easy in that case.
--
Program Intellivision and play Space Patrol!
He's not talking about CPU
He's not talking about CPU utilization; he is using sched_yield() to get better interactivity -- each thread terminates its timeslice as soon as one unit of work has been done.
The obvious problem with this approach is that any non-yielding application will immediately dominate the CPU time, starving the AI threads of interactivity and CPU time. But it probably does work if the threads are running exclusively.
As far as I can tell, this could probably be written as a single-threaded application.
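A minimal sketch of what such a single-threaded version might look like, assuming each worker can be written as a loop that hands back control after one unit of work. The names `agent` and `run_round_robin` are hypothetical, and the `log.append` call is only a stand-in for real computation.

```python
from collections import deque


def agent(name, steps, log):
    """Hypothetical CPU-bound worker: one loop iteration per resume."""
    for i in range(steps):
        log.append((name, i))   # stand-in for one unit of real work
        yield                   # hand control back after each iteration


def run_round_robin(agents):
    """Resume each agent in turn until all of them have finished."""
    ready = deque(agents)
    while ready:
        task = ready.popleft()
        try:
            next(task)
            ready.append(task)  # back of the line
        except StopIteration:
            pass                # this agent is done; drop it
```

Because the loop itself decides who runs next, the ordering no longer depends on how the kernel interprets a yield. The trade-off is that this uses only one CPU.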
Could? Should!
Could? Should!
... or should not
Only if you only have one CPU.
If you have multiple, then you need to have as many threads as you have CPUs. So you need to figure out how many CPUs you have; good luck figuring that out with pthreads or in Java.
And if there is a higher-priority task that wakes up occasionally, hogging one of your CPUs completely, then one of your single-threaded tasks, bound to a CPU, may not get any action any more. If it's not bound to a CPU, then your threads are all getting penalised in round-robin, possibly breaking your per-CPU cache all the time because the threads are migrating around.
That case aside, I agree though; the use of sched_yield is a result of programmer laziness and/or ineptitude. Threads are treated by most programmers as a way of organising trains of thought, not as a way of taking advantage of processor concurrency. Slicing up your tasks into short subtasks and deciding which task queue should get more attention is hard compared to just firing off threads, and shoving in sched_yields whenever one thread's constant-readiness is causing unacceptable reaction times in other threads. It is just the easy thing to do if you're in a mode of just hacking at it until it satisfies the requirements.
An interactive app blocked
An interactive app blocked on user input will wake up with a full time-slice at the head of the ready queue.
If you have N CPU hogs, one of which is polling for user events, then a sufficiently small time slice with a notion of fairness is really what you want. The thread polling input should perhaps be at higher priority than the others, too.
--
Program Intellivision and play Space Patrol!
..using sched_yield, at least under Linux, is always a good idea
Quick googling came up with:
http://www-unix.mcs.anl.gov/mpi/mpich1/micronotes/yield/
Relevant quotes:
"Even when only two processes are used on a (shared) dual processor, using sched_yield is advantageous"
"These results suggest (but don't prove, since this is on a shared system) that using sched_yield, at least under Linux, is always a good idea."
Someone volunteering to mail it to mingo? I don't want to get into the flames.
He is not benchmarking the
He is not benchmarking the sched_yield() call vs the "correct approach". He is benchmarking the use of sched_yield() against the following select() hack:
(see mpid/ch_shmem/p2p.c line 555)
Naturally, sched_yield() will do the job of yielding better than select(). However, note that benchmarks are normally run on an otherwise idle system. If there were other processes contending for the CPU, they would get an unfairly large share of the CPU, because sched_yield() inherently penalizes the calling thread more than a normal blocking call would (regardless of the scheduler).
So: yielding is the wrong approach in the first place. Instead, threads should be waiting on a set of semaphores, and the code should be signaling a particular semaphore to tell the scheduler exactly which thread(s) need to be scheduled next; he is using shared memory after all.
I would bet that this code would get a speedup similar to that of the improved Iperf, and likewise greater fairness, without any drawbacks. Another benefit is that the semantics of semaphores are well-defined and portable, in contrast to sched_yield(), for which the closest thing on platforms like Windows is Sleep(0), essentially equivalent to select(0,0,0,0,{0,0}) on Unix.
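A small sketch of the semaphore approach described above, under the assumption of just two cooperating threads that must strictly alternate. The function `ping_pong` and its internals are hypothetical and unrelated to the MPICH code; the point is only that each thread sleeps until the other signals it, instead of yielding and re-checking.

```python
import threading


def ping_pong(rounds):
    """Two threads alternate strictly; each blocks until the other signals it."""
    ping_turn = threading.Semaphore(1)   # "ping" is allowed to go first
    pong_turn = threading.Semaphore(0)
    order = []

    def player(name, mine, other):
        for _ in range(rounds):
            mine.acquire()               # sleep until it is our turn (no polling)
            order.append(name)
            other.release()              # wake exactly the thread that should run next

    threads = [
        threading.Thread(target=player, args=("ping", ping_turn, pong_turn)),
        threading.Thread(target=player, args=("pong", pong_turn, ping_turn)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

The scheduler is told precisely who is runnable at each step, so the result does not depend on any particular interpretation of yield.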
Does anyone know if Java's
Does anyone know if Java's Thread.yield() is implemented using sched_yield(), or does it only yield to threads running within the same VM? Although most uses of this are probably just as broken as sched_yield().
Although the description, "temporarily pause and allow other threads to execute", could be interpreted simply as "give up the current timeslice". As far as I can tell, Windows does not have a yield-equivalent function, so any cross-platform code would not depend on those semantics anyway. Sleep(0) will simply give up the current timeslice, but the running thread will still get re-scheduled without any prejudice.
Question
I've been reading this CFS-soap for a while now, but one thing that hasn't been clear to me is whether this yield business affects just the program that uses yield (which is perfectly acceptable in my book) or whether it affects the performance of the whole system. The latter would be bad: I don't believe it is acceptable for an application to be able to mess up a whole system, even if the application shouldn't have been written that way.
Mostly just the caller.
The sched_yield() stuff mostly just affects the caller. Now, it can affect the rest of the system to the extent that the caller uses more or fewer resources, and the extent that other tasks block waiting for that task to free them up.
For instance, if a CPU hog doesn't get throttled as much by sched_yield() as it did previously, then other tasks will run more slowly. If a CPU hog gets throttled more heavily, then other tasks will run faster. If the sched_yield() guy holds locks (futexes, for instance, or flock() locks on files), it could block the other tasks it's coordinating with for more or less time accordingly.
--
Program Intellivision and play Space Patrol!
Punish apps that can't rebuild?
> there's plenty of recourse possible to all possible kinds of apps. Tune
> the sysctl flag in one direction or another, depending on which behavior
> the app is expecting.
So the solution for closed-source / frozen-source apps is to re-tune the ENTIRE KERNEL so that ONE APP performs better? And when somebody depends upon two such apps, they are just out of luck?
(They are missing another category, too: cross-platform apps, which either target older Linux systems that don't have futexes and all the necessary events, or target POSIX-standard systems and thus can't use the futexes and such.)
This "I am smarter than you, fix it yourself" attitude from a very influential kernel developer is scary. Very scary. Ingo is not right because he is right about locking; he is "right" because he has driven off or shouted louder than anyone who disagrees with him.
So the solution for
Name a case of someone relying on two applications that depended on the previous sched_yield() behaviour to perform correctly, and they can't be rebuilt. If you can't find it, you're inventing the problem.
Note that POSIX has pthread mutexes and semaphores, and they can be implemented using futexes. Again, point to a specific application that has problems and can't have the sched_yield() part rewritten. Else, this means nothing.
A good inflammatory way of finishing a post full of supposed problems, isn't it?
Thread time slice
Why not just reschedule it as if it had used all of its time slice, based on its nice level? I think that is fair and would penalize neither the thread that yields nor the others.
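A toy illustration of that idea, not of anything CFS actually does: on yield, the task is charged as though it had run for its whole slice, scaled by its nice level, and is then sorted back in among the other tasks by the usual rule. The function names, the `slice_ns` parameter, and the 1.25x-per-nice-step weighting are assumptions made for this sketch.

```python
NICE_0_WEIGHT = 1024  # assumed reference weight for nice 0


def weight(nice):
    """Assumption for this sketch: each nice step scales the weight by about 1.25x."""
    return NICE_0_WEIGHT / (1.25 ** nice)


def charge_full_slice(vruntime, slice_ns, nice):
    """Return the task's vruntime after pretending it consumed the whole slice.

    A higher nice value (lower priority) is charged more virtual time
    for the same slice.
    """
    return vruntime + slice_ns * NICE_0_WEIGHT / weight(nice)
```

Under these assumptions a nice-0 task that yields is pushed back by exactly one slice, which lands somewhere between the "opportunistic" and "aggressive" behaviors discussed in the thread.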