"Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice," stated Ingo Molnar during a continuing discussion about the Completely Fair Scheduler. He went on to ask if anyone could point to specific code that illustrates the proper usage of
sched_yield(). In response to a theory of how it could potentially optimize userland locking, Ingo challenged, "these are generic statements, but I'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in excess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and 'right' use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)" Ingo went on to explain:
"sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
"For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be 'difficult' because it's hard to determine 'smart' use is not tenable i believe and does not adequately refute my pretty plain-meaning 'it does not exist' assertion."
A potential bug reported against the Completely Fair Scheduler suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed changes in how CFS handles sched_yield(). When it was suggested that this was a bug in the new process scheduler, Ingo explained:
"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."
He then submitted a small patch to fix the bug and remove the call to sched_yield(), reporting, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had already been submitted to the Iperf project the month before, because the same yield-based locking caused an identical problem with FreeBSD's scheduler.
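The shape of the problem and of the fix can be sketched as follows; this is an illustration of the two waiting strategies, not iperf's actual code or Ingo's actual patch:

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static bool data_ready;

    /* Broken workaround: poll an unsynchronized flag and yield.  The waiter
     * never blocks, so it competes for CPU with the thread it waits on. */
    static void wait_busy(void)
    {
            while (!data_ready)
                    sched_yield();
    }

    /* Proper fix: sleep on a condition variable until the producer signals. */
    static void wait_blocking(void)
    {
            pthread_mutex_lock(&lock);
            while (!data_ready)
                    pthread_cond_wait(&ready, &lock);
            pthread_mutex_unlock(&lock);
    }

With the blocking version the waiting thread consumes no CPU at all until there is work to do, which is consistent with the drop from 100% to 9% CPU time that Ingo measured once the yield loop was gone.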
"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler, "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we dont have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), warning, "there will be lots of regression reports about lost interactivity during load."
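In rough terms, CFS keeps every runnable task in a single rbtree keyed by virtual runtime, so "moving a task dead last in the whole tree" amounts to pushing its key past the current rightmost entry. A simplified sketch of the idea, not the actual kernel code:

    /* Simplified model of a CFS scheduling entity: the rbtree is ordered by
     * vruntime, smallest first. */
    struct sched_entity {
            unsigned long long vruntime;    /* rbtree key */
    };

    /* Yielding as "move dead last": jump the task's key past the rightmost
     * queued entity before re-inserting it into the tree.  This puts it
     * behind everything, including nice +19 tasks, which is exactly the
     * side effect Ingo warns desktop applications might notice. */
    static void yield_move_last(struct sched_entity *se,
                                struct sched_entity *rightmost)
    {
            if (rightmost && rightmost->vruntime > se->vruntime)
                    se->vruntime = rightmost->vruntime + 1;
    }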