Ingo Molnar pushed a series of patches to his Completely Fair Scheduler code upstream that were merged into the mainline kernel. He explained the reason for so many small patches, "the main reason is the safe and gradual elimination of a widely used 64-bit function argument: the 64-bit 'now' timestamp." In addition to the many small patches removing the 64-bit "now" timestamp, he noted:
"These changes are necessary for 3 reasons: firstly they address the 'there's too much 64-bit stuff in the scheduler' observation. Secondly, it's not directly visible but these changes also act as a correctness fix for an obscure (and minor) but not-too-pretty-to-fix accounting bug: idle_balance() had its own internal notion of 'now', separate from that of schedule(). Thirdly, this debloats sched.o quite significantly."
From: Ingo Molnar [email blocked] To: Linus Torvalds [email blocked] Cc: Andrew Morton [email blocked], linux-kernel@vger.kernel.org Subject: [git pull request] scheduler updates Date: Wed, 8 Aug 2007 22:30:08 +0200 Linus, please pull the latest scheduler git tree from: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git the high commit count is scary, but it's all low-risk items: the main reason is the safe and gradual elimination of a widely used 64-bit function argument: the 64-bit "now" timestamp. About 40 of those commits are identity transformations that prepare the real change in a safe way, and the rest is obvious and safe as well. Besides the obvious and nice cleanup factor, these changes are necessary for 3 reasons: firstly they address the "there's too much 64-bit stuff in the scheduler" observation. Secondly, it's not directly visible but these changes also act as a correctness fix for an obscure (and minor) but not-too-pretty-to-fix accounting bug: idle_balance() had its own internal notion of 'now', separate from that of schedule(). Thirdly, this debloats sched.o quite significantly: on 32-bit (smp, nondebug), it's almost 1k less code: text data bss dec hex filename 34869 3066 20 37955 9443 sched.o.before 33972 3066 24 37062 90c6 sched.o.after but even on 64-bit platforms it's noticeable: text data bss dec hex filename 28652 4162 24 32838 8046 sched.o.before 28064 4162 24 32250 7dfa sched.o.after and that's a speedup as well, because these parameters were passed all around the fastpath. It was the safest to do it this way (considering that we are post -rc2 already), together in one commit these changes would have been much less obvious to validate and apply. (It's of course all fully bisectable and every step builds and boots fine.) besides this elimination of the 64-bit timestamp parameter passing between (almost all) scheduler functions, there are 8 other fixes that are not identity transformations: - Peter Williams reviewed the smpnice load-balancer and noticed a few leftover items that are unnecessary now (i have re-tested load-balancing behavior and it's all still fine) - binary sysctl cleanup from Alexey Dobriyan - two small accounting fixes - reniced tasks fixes: a key calculation fix (i re-checked key nice workloads and this has no real impact [other than improving them slightly] - the other side of the branch fixed up the effects of this - otherwise we'd have noticed this sooner), and two rounding precision improvements that act against error accumulation. - sleeper_bonus should be batched by sched_granularity and not by stat_granularity. (this has almost no effect in practice, but a speedup that pushes the only 64-bit division in CFS into a slowpath.) then are are also two non-code documentation updates and minor cleanups and uninlining. Nevertheless, to be safe i have also done over 200 'make randconfig; make -j bzImage' build tests: #define UTS_VERSION "#231 SMP Wed Aug 8 21:34:24 CEST 2007" all of which passed fine. Booted (and extensively tested) on x86-32 and x64-32 as well, both UP and SMP - UP, 2-way to 8-way systems. Ingo ------------------> Alexey Dobriyan (1): sched: remove binary sysctls from kernel.sched_domain Josh Triplett (1): sched: mark print_cfs_stats static Peter Williams (2): sched: simplify move_tasks() sched: fix bug in balance_tasks() Thomas Voegtle (1): sched: mention CONFIG_SCHED_DEBUG in documentation Ulrich Drepper (1): sched: clean up sched_getaffinity() Ingo Molnar (55): sched: batch sleeper bonus sched: reorder update_cpu_load(rq) with the ->task_tick() call sched: uninline rq_clock() sched: schedule() speedup sched: clean up delta_mine sched: delta_exec accounting fix sched: document nice levels sched: add [__]update_rq_clock(rq) sched: eliminate rq_clock() use sched: remove rq_clock() sched: eliminate __rq_clock() use sched: remove __rq_clock() sched: remove 'now' use from assignments sched: remove the 'u64 now' parameter from print_cfs_rq() sched: remove the 'u64 now' parameter from update_curr() sched: remove the 'u64 now' parameter from update_stats_wait_start() sched: remove the 'u64 now' parameter from update_stats_enqueue() sched: remove the 'u64 now' parameter from __update_stats_wait_end() sched: remove the 'u64 now' parameter from update_stats_wait_end() sched: remove the 'u64 now' parameter from update_stats_curr_start() sched: remove the 'u64 now' parameter from update_stats_dequeue() sched: remove the 'u64 now' parameter from update_stats_curr_end() sched: remove the 'u64 now' parameter from __enqueue_sleeper() sched: remove the 'u64 now' parameter from enqueue_sleeper() sched: remove the 'u64 now' parameter from enqueue_entity() sched: remove the 'u64 now' parameter from dequeue_entity() sched: remove the 'u64 now' parameter from set_next_entity() sched: remove the 'u64 now' parameter from pick_next_entity() sched: remove the 'u64 now' parameter from put_prev_entity() sched: remove the 'u64 now' parameter from update_curr_rt() sched: remove the 'u64 now' parameter from ->enqueue_task() sched: remove the 'u64 now' parameter from ->dequeue_task() sched: remove the 'u64 now' parameter from ->pick_next_task() sched: remove the 'u64 now' parameter from pick_next_task() sched: remove the 'u64 now' parameter from ->put_prev_task() sched: remove the 'u64 now' parameter from ->task_new() sched: remove the 'u64 now' parameter from update_curr_load() sched: remove the 'u64 now' parameter from inc_load() sched: remove the 'u64 now' parameter from dec_load() sched: remove the 'u64 now' parameter from inc_nr_running() sched: remove the 'u64 now' parameter from dec_nr_running() sched: remove the 'u64 now' parameter from enqueue_task() sched: remove the 'u64 now' parameter from dequeue_task() sched: remove the 'u64 now' parameter from deactivate_task() sched: remove the 'u64 now' local variables sched debug: remove the 'u64 now' parameter from print_task()/_rq() sched: move the __update_rq_clock() call to scheduler_tick() sched: remove __update_rq_clock() call from entity_tick() sched: clean up set_curr_task_fair() sched: optimize activate_task() sched: optimize update_rq_clock() calls in the load-balancer sched: make the multiplication table more accurate sched: round a bit better sched: fix update_stats_enqueue() reniced codepath sched: refine negative nice level granularity Documentation/sched-design-CFS.txt | 2 Documentation/sched-nice-design.txt | 108 +++++++++++ include/linux/sched.h | 20 -- kernel/sched.c | 339 ++++++++++++++++++------------------ kernel/sched_debug.c | 16 - kernel/sched_fair.c | 212 ++++++++++------------ kernel/sched_idletask.c | 10 - kernel/sched_rt.c | 48 +---- 8 files changed, 421 insertions(+), 334 deletions(-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [email blocked] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Don't use unsigned type, please.
Don't use unsigned type, please.
The sign's extension is important to good maintainance of the same source code in both 32-bit and 64-bit platforms.
To avoid many copies of variants of functions, i recommend to understand it:
Re: Don't use unsigned type, please.
The B) option is the most appropiate for maximum performance.
Its main reason is that with the B) option, the C/C++ compiler generates the minimum number of code's instructions of the corresponding architecture.
Smaller code => Less instructions to execute => Faster code.
Lucky! ;)
The CFS changes described
The CFS changes described in the article were not about the width of a variable, they were about the elimination of a 64-bit parameter from lots of function calls. That helps both 32-bit and 64-bit architectures, regardless of the width of the variable.
Why not?
If you know that you have it under control, then why not use unsigned?
A huge benefit with unsigned is that the compiler will do alot more to speed up the maths.
Re: why not use unsigned?
why not use unsigned?
Unsigned datatype complicates much in the castings of uint32 -> ulong64 that adds more code to control this transition depending in the functionality of the code.
The castings of int32 -> long64 are simple that the microprocessor realizes it in one instruction using its own extension of sign.
See this example of casting that uses unsigned:
and this example of casting that uses signed:
Lucky! ;)
Signed arithmetics is
Signed arithmetics is inherently more complex if there is division (or multiplication) in the picture. Look at gcc assembly output.
If used judiciously, unsigned can be a very rewarding data type, especially for something as performance-critical as a scheduler - but signed data types indeed have their advantages too. CFS uses both signed and unsigned, 32-bit and 64-bit variables, depending on which is the most appropriate for the given task.
No problem!
Imagine that in the source code there are:
0.01% of slowest divisions.
0.05% of little slow multiplications.
99.94% of fast adds, subtracts, shifts, rotations, etc.
The divisions and multiplications are not used much in the kernel's source code.
Take a look at the CFS code
Take a look at the CFS code again. Isnt that supposed to be the topic here? :-)
Shifts?
You only do shifts on unsigned datatypes.
Since if you don't then you have to manually handle the sign bit.
Use unsigned and gcc will automagically do shifts where it can.
You're missing something.
You can replace both signed and unsigned divides by powers of 2 with shifts. (It's a little trickier with signed types, but not dramatically so.) For both signed and unsigned types, other divides need to be implemented either with a divide instruction, multiply by the reciprocal, or a call out to a support function. The book "Hacker's Delight" (which I heartily recommend) actually covers a lot of this and more. Computer arithmetic is full of trickery under the hood. :-)
Anyway to divide by a power of two with a signed type, you need multiple instructions. This can still be a win on architectures that have a barrel shifter, but no divide instruction. The key is to handle the "round towards zero" behavior correctly. For example (-1)/2 == 0, but (-1) >> 1 == -1. The division gets broken up into multiple steps:
tmp1 = (signed) dividend >> (log2(divisor) - 1); /* Create rounding value from sign bits via sign extension */ tmp2 = (unsigned) tmp1 >> (32 - log2(divisor)); /* Move these sign bits into portion that will shift away */ tmp3 = dividend + tmp2; /* Round towards zero */ quotient = tmp3 >> log2(divisor); /* Final result */This presumes "divisor" is known at compile time, so log2(divisor) is a known constant.
--
Program Intellivision and play Space Patrol!
I know =)
Yeah i know, but when it comes to gcc it won't auto expand signed divisions (yet) afair. As to in the kernel, i'd think they'd avoid it.
But yes, i know =), and thanks for the book recomondation, ordering it now actually =)
You remember wrong. On my
You remember wrong. On my gcc (gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)) this program:
int div2(int i) { return i / 2; }
int div4(int i) { return i / 4; }
int div8(int i) { return i / 8; }
int div16(int i) { return i / 16; }
int div32(int i) { return i / 32; }
unsigned int div2u(unsigned int i) { return i / 2; }
unsigned int div4u(unsigned int i) { return i / 4; }
unsigned int div8u(unsigned int i) { return i / 8; }
unsigned int div16u(unsigned int i) { return i / 16; }
unsigned int div32u(unsigned int i) { return i / 32; }
compiled with "gcc -S -O3" gives (relevant parts):
div2:
srawi 3,3,1
addze 3,3
blr
div4:
srawi 3,3,2
addze 3,3
blr
div8:
srawi 3,3,3
addze 3,3
blr
div16:
srawi 3,3,4
addze 3,3
blr
div2u:
srwi 3,3,1
blr
div4u:
srwi 3,3,2
blr
div8u:
srwi 3,3,3
blr
div16u:
srwi 3,3,4
blr
div32u:
srwi 3,3,5
blr
Morale: always check your facts before posting (especially when it only takes 5 minutes to do so).
You're lucky srawi is
You're lucky srawi is available on the PowerPC because normally, gcc certainly *cannot* replace a (power-of-two) div by a shift in the general case where the numerator is signed. Unless it can prove that the input is positive, it has to do the div. Why? Think what happens when you do -1/2 (result is zero) versus (-1)>> 1 (oops, result still -1). I tend to explicitly use shifts whenever possible (i.e. when I don't care about the rounding) instead of relying on the compiler to substitute a shift.
You missed my comment above.
You can do a div-by-power-of-2 on signed numbers with proper rounding on negative values with just ordinary shifts. On modern CPUs with a DIV instruction, though, it can be cheaper to just do the DIV. If the CPU doesn't have a DIV instruction, though, you're better off with the shifts than a call to the "div" support routine in libgcc.
--
Program Intellivision and play Space Patrol!
Of course Power(PC) is a
Of course Power(PC) is a saner platform than x86. ;-)
But as opposed to what GP said, gcc also emits shifts for signed power-of-2 divides on x86, very similar to what venerable Mr_Z posted above:
div2:
movl %edi, %eax
shrl $31, %eax
addl %edi, %eax
sarl %eax
ret
div4:
leal 3(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $2, %edi
movl %edi, %eax
ret
div8:
leal 7(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $3, %edi
movl %edi, %eax
ret
div16:
leal 15(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $4, %edi
movl %edi, %eax
ret
div32:
leal 31(%rdi), %eax
testl %edi, %edi
cmovs %eax, %edi
sarl $5, %edi
movl %edi, %eax
ret
shift in x86
Hello.
I don't know if this is related to your comment but x86 instruction set has the following shift instructionz :
1- SAR : shift arithmetic right. The vacated upper bits are filled with the sign(most significat bit). So this is a shift suitable for signed integers.
2- SHR : shift logical right. The vacated upper bits are filled with zeros.
3- SAR and SHL : Shift arithmetic left and Shift logicla left. The two instructions are identical for signed and unsigned instrction : lower bits are filled with zeros.
Wrong.
Unsigned types are zero extended, so 0xFFFFFFFF becomes 0x00000000FFFFFFFF and that is as easy as sign extending for signed types.
Huh? You make no sense.
From unsigned int to unsigned long long, you zero extend because the number is unsigned. I can make the same argument you're making, in reverse, to show how silly it is: If I put an unsigned number in a signed type, I have to do "foo & 0x00000000FFFFFFFFULL" to mask away those pesky unwanted sign bits after I copy the int to a long long. That means the signed type is inferior, right? No. It just means I'm using the wrong type for the job.
Unsigned types are important for certain jobs. For one thing, signed integer overflow is implementation defined in C, whereas unsigned integer overflow is well defined. On some CPUs, 0x7FFFFFFF + 1 = 0x7FFFFFFF for a signed 32-bit type, and it could result in a trap on others. (Granted, this isn't true for most mainstream CPUs, but it's worth noting.) 0xFFFFFFFF + 1 is always 0 for an unsigned 32 bit type though. This becomes very important, especially for measuring time intervals when looking only at the lower 32 bits of the current time. (Gee, why would a scheduler want to do that?)
Oh, I should add, on x86-64 CPUs, 32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register. i.e. a 32-bit result written to EAX results in a zero written to the upper 32-bits of RAX.
--
Program Intellivision and play Space Patrol!
Re: Huh? You make no sense.
Oh, I should add, on x86-64 CPUs, 32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register. i.e. a 32-bit result written to EAX results in a zero written to the upper 32-bits of RAX.
You're WRONG!
http://www.x86-64.org/documentation/assembly.html says:
Implicit zero extend
Results of 32-bit operations are implicitly zero extended to 64-bit values.
The 32-bit operations use 32-bit registers, not 64-bit registers, but the upper half of 64-bit registers is filled with zeros in 32-bit operations.
The 64-bit operations use 64-bit registers and can use both operations: sign extended or zero extended.
http://www.x86-64.org/documentation/assembly.html says:
Immediates
Immediate values inside instructions remain 32 bits and their value is sign extended to 64 bits before calculation.
The inmediate values are of 32 bits because smaller values for instructions are better and ALWAYS use sign extended operation :)
Huh?
I said: "32-bit arithmetic results are implicitly zero extended to 64 bits in the corresponding 64-bit register."
You said I was wrong, quoting this section of the x86-64 documentation: "Implicit zero extend: Results of 32-bit operations are implicitly zero extended to 64-bit values."
Can you tell me how those two statements differ? One says 32-bit results are implicitly zero extended, and the other says 32-bit results are implicitly zero extended. Maybe there's some difference between implicitly zero extending and zero extending implicitly that I'm unaware of, but as far as I can tell, they say the same thing.
--
Program Intellivision and play Space Patrol!
It depends
It depends -- addresses are sign extended (this makes the kernel model possible which uses negative
32bit addresses to run in the -2GB space area of the 48bit address room)
And there are explicit instructions to do sign extension from 32bit to 64bit.
But in general you're both right.
You're conflating two ideas.
My understanding is that address arithmetic is always 64 bit, and so sign extension isn't a worry there. There is sign extension on 32-bit displacements though before the computation. But, since the pointers are 64-bit to begin with, it's meaningless to talk about sign extending the result, since the result is 64 bit also. All that is irrelevant to the memory map issue you raised.
What I think you're thinking of is on current implementations, the architecture requires bits 47 through 63 of the address to be the same. If a computed address differs in those bits, it triggers a fault. The ALUs do not do this sign extension for you. This detail is left to your memory manager to get right.
--
Program Intellivision and play Space Patrol!
resoponse
yes I totally agree. I also tried it and it worked right away. Thanks for the good piece of information
You are wrong on x86.
I compared gcc's assembler output for 32-bit to 64-bit conversion for both unsigned and signed.
Both needed exactly one instruction. Signed used 'ctld'. Unsigned used 'xorl'.
Not universally true
Your statements on 64-bit platforms are true on LP64 platforms. There are ILP64 platforms for which int is 64 bit.
LP64 means "long" and "pointer" are 64 bit. ILP64 means "int", "long" and "pointer" are 64 bit. This site goes into greater detail.
--
Program Intellivision and play Space Patrol!
Wow, what utter gibberish.
Wow, what utter gibberish.