William Lee Irwin III [interview], from here on referred to simply as 'wli', has been maintaining a patchset against the 2.5 development kernel for some time. His announcement for 2.6.0-test11-wli-2 [story] caught my attention, so I decided to give it a try. Scroll down to the end of this article for a step-by-step guide walking you through how to apply the -wli patchset and compile your new kernel.
Curious to know more about wli's efforts, I dropped him an email with a few questions. His in-depth replies, included within, are quite insightful and informative. He explains the history behind this patchset, provides an overview of some of the improvements it contains, evaluates its stability, and talks a little about where he's going with it. Regarding the patchset, he explains, "one of the primary goals is to improve performance", adding, "there is a secondary goal of improving resource scalability and another of improving resource accounting."
On a cautionary note, some drivers and possibly some filesystems may have problems with a reduced kernel stack, so the 4K_STACK configuration option may be best left disabled, though read wli's comments within to determine if this affects you. Additionally, wli explains that the -wli patchset is incompatible with smbfs and ncpfs due to the removal of d_validate(), another change explained within. Finally, wli warns against using his patchset with binary-only graphics drivers, commenting that they seem "utterly unable to cope with the changes I've made". None of these warnings applied to my personal desktop server, which booted the -wli-2 kernel without problems. I'm happily testing it now as I write this article.
Jeremy Andrews: Is the focus of your -wli patchset to improve overall performance?
William Lee Irwin III: One of the primary goals is to improve performance, yes. I would say there is a secondary goal of improving resource scalability and another of improving resource accounting.
Jeremy Andrews: Are the changes best felt on big NUMA systems, or on smaller desktop boxes?
wli: It was originally developed as a set of patches to improve the performance on SDET (http://spec.org/sdm91/) on i386 NUMA systems.
At any rate, SDET is a multiuser simulation implemented as a set of shell scripts, so it stands to reason that it should improve the performance of shell scripts and the like on small systems as well as large. A number of the "design decisions" I made, if they can be called that, centered on the patches being useful on more systems than the ones I carried out the benchmark on. For instance, highpmd, which was done in order to allow middle-level pagetable nodes to reside in node-local memory, also has applications to 32-bit RISC machines, which have much stricter ZONE_NORMAL limits than i386, causing much more serious resource scalability issues with middle-level pagetables than i386 has.
Since that work was completed or at least halted, it's also served as something resembling a showcase for my work. I've added some low-impact things unrelated to that original effort like major/minor fault count accounting, wchan reporting improvements, and the like that are generally uninteresting with respect to performance.
JA: What are some of the more significant patches, especially for desktop users?
wli: One of the unfortunate aspects of the SDET benchmark is that it makes an assumption about ps(1) not having significant amounts of kernel involvement as this is the case on many other operating systems. My response to procps' rather heavy stressing of the kernel on Linux was to make adjustments to the kernel's mechanisms for retrieving /proc/ information. To do this, I forward ported Ben LaHaise's O(1) proc_pid_statm() from RHAS to 2.6, as well as a patch from Manfred Spraul for a faster proc_pid_readdir(), which I later replaced because it was more expensive in the case of smaller numbers of tasks than mainline. Others have done things like modifying the benchmark to remove the ps(1) component or replacing procps with libraries that parse /dev/kmem. So basically, top(1) and ps(1) should be much faster:
top - 18:20:07 up  1:17,  9 users,  load average: 1.37, 1.07, 1.96
Tasks: 20611 total,   1 running, 20609 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.7% user,  0.7% system,  0.0% nice, 98.7% idle,  0.0% IO-wait
Mem:  32655240k total,  2658452k used, 29996788k free,      524k buffers
Swap:        0k total,        0k used,        0k free,     9096k cached

  PID USER      PR  NI  VIRT  RES S %CPU %MEM   TIME+  #C nFLT Command
13735 wli       25   0 12312  11m R 17.6  0.0  1:27.75 15    0 top
10596 wli       16   0  3928 2816 S  1.0  0.0  0:29.99 14    0 slabtop
20969 wli       16   0  3928 2808 S  0.5  0.0  0:01.06  3    0 slabtop
  498 wli       17   0  2052  992 S  0.1  0.0  0:02.78  4    0 profloop
10584 wli       16   0  6556 2128 S  0.1  0.0  0:02.80  2    0 sshd
  470 wli       16   0  6556 2128 S  0.1  0.0  0:00.19  9    0 sshd
  484 wli       15   0  6556 2128 S  0.1  0.0  0:00.89  5    0 sshd
13744 wli       16   0  6556 2128 S  0.1  0.0  0:00.17  6    0 sshd
13779 wli       17   0  2732 1700 S  0.1  0.0  0:00.16  4    0 zsh
and analogous cpu cost reductions hold for smaller machines, though some of the locking advantages of always-ready statistics might not be observable on UP.
Another very significant overhead as observed in the benchmarking was pte_chain manipulation. It's really a consequence of i386's poor MMU architecture and Linux' adoption of the data structures the i386 MMU uses as translation tables as a standardized data structure instead of a procedural interface sane architectures can use to avoid the space, time, TLB, and cache overhead of the structures, and that i386 itself could use to insulate other architectures from the complexity of how these overheads need to be mitigated via sharing and so on. At any rate, the pagetable structures themselves have extremely poor internal fragmentation properties and very high cache footprints, and this was aggravated a great deal on i386 NUMA machines by manipulating lowmem-allocated data structures (i.e. those stuck on node 0) with similar cache and fragmentation properties themselves during the repetitive pagetable setup and teardown in SDET. This kind of overhead can be triggered by stressing pagetable setup and teardown more heavily on smaller systems, for instance, by compiling programs with many different source files, or making numerous connections to forking servers.
So there were two steps to addressing this. The first was to cache prezeroed pagetable pages, which took some effort, since they could have come from highmem. 2.4 did this, though it didn't have to deal with non-addressible pagetable memory, and so linked the nodes through their own memory instead of through other preexisting accounting structures as I've done. The second was to forward port a patch of Hugh Dickins' called "anobjrmap", which sets up data structures requiring much fewer updates than pte_chains, and that have much smaller memory footprints as well. I should probably mention that anobjrmap itself was done as an extension to and correction of the "partial objrmap" patch, which used a hybrid scheme instead of establishing structures for anonymous mappings analogous to those for file-backed mappings, but had serious issues handling memory allocation failures, and also had the ugliness of handling anonymous and file-backed memory differently.
JA: How stable should it be? ie, is there any potential for data-corrupting type bugs? Also, are there any known incompatibilities?
wli: -wli has been largely in maintenance and bugfix mode since 2.5.74, and I've dropped a number of riskier patches, so it should be relatively stable. Not very many new things have been added; only the major/minor fault accounting and wchan accounting are truly new. The O(lg(n)) proc_pid_readdir() _code_ is also new, but it just replaces another similar patch by Manfred Spraul that does the same thing another way.
The two largest incompatibilities are smbfs and ncpfs. I removed d_validate() after hearing from some crash dump hackers about how kern_addr_valid() is utter nonsense on most architectures and noticing that d_validate() used it. I figured out that d_validate() takes some arbitrary address, checks kern_addr_valid() (which is total garbage), and then treats it as a dentry. I was too disgusted to let it stand, and so neither smbfs nor ncpfs will compile in -wli and depend on CONFIG_BROKEN. ncpfs should be rather rare, and smbfs has a replacement, cifs, which should do as well as mainline in -wli. It's something of a nonessential change, but in all truth, d_validate() scared me enough I wanted it gone.
Another thing to notice is that optional 4KB stacks may be problematic with some drivers or fs's that perform large stack allocations. This is inherent, so the safest option is to leave stacks at 8KB. I personally don't need any of the problematic code and just use 4KB stacks. There are reports that some PCMCIA drivers trigger problems with 4KB stacks.
JA: Do you know of any specific drivers or file systems that perform large stack allocations? Conversely, if not using PCMCIA drivers, with what filesystems should 4KB stacks be safe?
wli: I don't remember which driver it was, but there was a PCMCIA wireless card reported to get stack overflows in -wli, probably at some point early during the -test cycle. The PCMCIA stack is very fragile, so I didn't dare carry out the needed rearrangements to reduce stack usage for that case, and would have had a hard time doing it anyway.
I also just happen to know that filesystem codepaths can be involved in deep stack usage, especially when called indirectly from a normal context through the VM for allocations, so even though I've never heard a bugreport of that kind, I'd still say it's a risk.
I also need to add a very strong warning against using binary-only graphics drivers in combination with -wli. I've had numerous reports of nvidia's binary-only drivers being utterly unable to cope with the changes I've made regardless of attempts to update to the glue layer.
JA: Are you hoping to merge some/all of these patches into 2.6?
wli: I'm particularly interested in merging the pte caching bits into 2.6, since those have low core impact and address a clear regression vs. 2.4. Many of the other patches are less compelling as far as risk/benefit due to high core impacts, though in general, I'm perfectly willing and ready to send in things deemed mergeable. It seems that the major/minor fault accounting may go somewhere, as akpm has expressed interest in it. Anobjrmap appears to be accumulating some popularity, so that may be considered later if there's enough demand for it and general core team consensus, though I don't feel comfortable pushing it very hard during a stable release since it carries out some sweeping core VM changes.
JA: How long do you intend to keep updating -wli?
wli: Essentially indefinitely. Among other things, it also serves as a showcase for my work, whether it be original work or forward porting relatively complex patches, so I'll continue adding things that are easy enough to keep around to it.
Step 1: Upgrade to the latest 2.6 kernel
wli's patches apply against the latest 2.6 kernel source, so you'll need to download the latest 2.6 kernel source code first. For help on this, please refer to my earlier story about upgrading from 2.4 [story], and on using patches to upgrade 2.6 [story].
I was personally running 2.6.0-test10-mm1, so to save on time and bandwidth I copied this source tree and then upgraded it to 2.6.0-test11 with patches. First I copied the source tree (using hard links), then I removed the -mm1 patch, and finally I applied the -test11 patch:
# pwd
/usr/src
# cp -rl linux-2.6.0-test10-mm1 linux-2.6.0-test11-wli-2
# cd linux-2.6.0-test11-wli-2
# bunzip2 -dc ../2.6.0-test10-mm1.bz2 | patch -R -p1
# bunzip2 -dc ../patch-2.6.0-test11.bz2 | patch -p1
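The 'cp -rl' trick above is worth a note: rather than copying file data, it creates hard links, so duplicating an entire kernel tree takes seconds and almost no disk space. GNU patch replaces modified files with new ones rather than rewriting them in place, which (as I understand it) is why patching the linked copy doesn't corrupt the original tree. A toy sketch of the linking behavior, with illustrative paths:

```shell
# Demonstrate that 'cp -rl' hard-links files instead of copying their data.
mkdir -p /tmp/cprl-demo/src
echo 'int main(void) { return 0; }' > /tmp/cprl-demo/src/file.c
cd /tmp/cprl-demo
cp -rl src dst
# Both directory entries point at the same inode, so the link count is 2:
stat -c '%h' dst/file.c
```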
Step 2: Obtain wli's latest patch.
At the time of this writing, wli's latest patch is 2.6.0-test11-wli-2. This can be found from your nearest kernel.org mirror by navigating to "/pub/linux/kernel/people/wli/kernels/2.6.0-test11/".
You can find your nearest mirror at this link: http://kernel.org/mirrors/.
It's recommended that you also download the signature file to verify the patch's validity. Find full details on how this is done here.
Step 3: Apply wli's patch.

Here's what I did to patch my kernel:
# pwd
/usr/src
# cd linux-2.6.0-test11-wli-2
# bzip2 -dc ../patch-2.6.0-test11-wli-2.bz2 | patch -p1
It's the final command that does the actual patching, taken straight out of the README that's in the top level of your Linux kernel source tree. If you're using a *.gz version of the patch, simply replace 'bzip2' with 'gzip' in that command. Alternatively, if you've already decompressed the patch, you can feed it to patch(1) directly:

# pwd
/usr/src/linux-2.6.0-test11-wli-2
# cat ../wli-2.patch | patch -p1
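Before letting patch touch thousands of files, it can be reassuring to add '--dry-run', which reports whether the patch applies cleanly without modifying anything. A self-contained toy example of the workflow (not the actual wli patch):

```shell
# Build a tiny 'before' and 'after' tree and generate a patch between them.
mkdir -p /tmp/patch-demo/a /tmp/patch-demo/b
echo 'hello' > /tmp/patch-demo/a/README
echo 'patched' > /tmp/patch-demo/b/README
cd /tmp/patch-demo
diff -ruN a b > toy.patch || true   # diff exits 1 when files differ
cd a
# First check the patch applies cleanly, then apply it for real:
patch -p1 --dry-run < ../toy.patch
patch -p1 < ../toy.patch
cat README
```

If the dry run reports failed hunks, you've likely downloaded the wrong patch version for your tree, and nothing has been damaged yet.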
Step 4: Clean up stale .o files and dependencies.

Now that your kernel source tree is patched with the latest -wli code, be sure to remove any stale object files and dependencies. This is done with 'make mrproper', as follows:
# pwd
/usr/src/linux-2.6.0-test11-wli-2
# make mrproper
Note: If you didn't save your old source tree, be sure to save a copy of your '.config' file before running 'make mrproper'!
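The save-and-restore dance looks like this in miniature (a toy stand-in, since 'make mrproper' itself deletes .config among other generated files; paths are illustrative):

```shell
# Toy sketch: preserve .config across a cleanup step that would delete it.
mkdir -p /tmp/config-demo && cd /tmp/config-demo
echo 'CONFIG_4K_STACK=n' > .config
cp .config /tmp/config-demo.backup   # save before cleaning
rm -f .config                        # stand-in for what 'make mrproper' does
cp /tmp/config-demo.backup .config   # restore afterwards
grep CONFIG_4K_STACK .config
```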
Some readers have pointed out that this step should no longer be required thanks to the new build system found in the 2.6 kernel. My reply is two-fold. First, it's not going to hurt anything. And second, the README included with the 2.6 kernel (linked above) still recommends this step and thus so do I.
Step 5: Configure your new kernel.
This step is made much simpler if you have an already compiled 2.6.0-test kernel. I used my old '.config' configuration file and the text-based 'make oldconfig' method as follows:
# pwd
/usr/src/linux-2.6.0-test11-wli-2
# cp ../linux-2.6.0-test10-mm1/.config .
# make oldconfig
Almost all of the options will zoom by, automatically answered based on your existing .config file. You'll only be asked about new options. For example, when I upgraded from 2.6.0-test10-mm1, I saw the following three new options:
Use smaller 4k per-task stacks (4K_STACK) [N/y/?] (NEW) ?

  This option will shrink the kernel's per-task stack from 8k to 4k. This
  will greatly increase your chance of overflowing it. But, if you use the
  per-cpu interrupt stacks as well, your chances go way down. Also try the
  CONFIG_X86_STACK_CHECK overflow detection. It is much more reliable than
  the currently in-kernel version.

Detect stack overflows (X86_STACK_CHECK) [N/y/?] (NEW) ?

  Say Y here to have the kernel attempt to detect when the per-task kernel
  stack overflows. This is much more robust checking than the above overflow
  check, which will only occasionally detect an overflow. The level of
  guarantee here is much greater.

  Some older versions of gcc don't handle the -p option correctly. Kernprof
  is affected by the same problem, which is described here:
  http://oss.sgi.com/projects/kernprof/faq.html#Q9

  Basically, if you get oopses in __free_pages_ok during boot when you have
  this turned on, you need to fix gcc. The Redhat 2.96 version and gcc-3.x
  seem to work. If not debugging a stack overflow problem, say N.

  Say Y here if you are hacking the kernel to trim stack usage on 4KB stacks
  and are unafraid of frequent panics. If you're using 8KB stacks, this is
  less interesting, but could point out unusual broken codepaths.

Top-down vma allocation (MMAP_TOPDOWN) [N/y/?] (NEW) ?

  Say Y here to have the kernel change its vma allocation policy to allocate
  vma's from the top of the address space down, and to shove the stack low
  so as to conserve virtualspace. This is risky because various apps,
  including a number of versions of ld.so, depend on the kernel's bottom-up
  behavior.
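If you'd rather accept the default answer for every new option, a common trick is to pipe empty lines into 'make oldconfig'; the 'yes' utility with an empty argument supplies an endless stream of "just press Enter" answers:

```shell
# 'yes ""' prints empty lines forever; piped into 'make oldconfig' it accepts
# the default for every NEW option:
#   yes '' | make oldconfig
# A small demonstration of the mechanism itself:
yes '' | head -n 3 | wc -l
```

Only do this if you're happy with the defaults shown above; for the -wli options, the defaults (all N) are the safe choices.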
Step 6: Build your new kernel.
To build a new kernel on x86, all you need to type is 'make'. If you've chosen to compile any modules, you'll also need to install them by typing 'make modules_install'. Or, you can string these two commands together: 'make && make modules_install'.
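The '&&' matters: the second command runs only if the first succeeds, so a failed build never installs half-built modules. A toy Makefile illustrates the chaining (a stand-in, not the real kernel build):

```shell
# Minimal stand-in Makefile with 'all' and 'modules_install' targets.
mkdir -p /tmp/make-demo && cd /tmp/make-demo
printf 'all:\n\t@echo built\nmodules_install:\n\t@echo installed\n' > Makefile
make && make modules_install
```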
If you're curious about what other 'make' options there are when building your kernel, type 'make help'.
Step 7: Install your new kernel.
Now that you've built your kernel, you need to copy it into place. You'll want to copy this file and your new System.map into /boot. Some prefer to use 'make install' for this, but I prefer to do it manually so I have complete control over what happens. For example:
# pwd
/usr/src/linux-2.6.0-test11-wli-2
# mv arch/i386/boot/bzImage /boot/bzImage-2.6.0-test11-wli-2
# mv System.map /boot/System.map-2.6.0-test11-wli-2
# cd /boot
# rm System.map
# ln -s System.map-2.6.0-test11-wli-2 System.map
Note that when typing 'rm System.map', I'm only removing a symbolic link, not an actual file.
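In script form, the symlink dance can be made a little safer by checking that System.map really is a link before removing it. A self-contained sketch using toy paths (not your real /boot):

```shell
# Toy sketch: re-point a System.map symlink at a new kernel's map file.
mkdir -p /tmp/boot-demo && cd /tmp/boot-demo
touch System.map-2.6.0-test10-mm1 System.map-2.6.0-test11-wli-2
ln -s System.map-2.6.0-test10-mm1 System.map
# Only remove it if it is a symlink, never a regular file:
[ -L System.map ] && rm System.map
ln -s System.map-2.6.0-test11-wli-2 System.map
readlink System.map
```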
Having copied your new kernel into place, you now need to configure your boot loader. You're probably using grub [manual] or lilo [howto]; refer to the appropriate documentation if you're unsure how your boot loader works. My new grub entry looks like:
title 2.6.0-test11-wli-2
    root (hd0,0)
    kernel /boot/bzImage-2.6.0-test11-wli-2 ro root=/dev/hda1
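If you're using lilo instead, the equivalent entry might look something like this (illustrative only; adjust the root device to match your system, and remember to re-run /sbin/lilo after editing /etc/lilo.conf):

```
image=/boot/bzImage-2.6.0-test11-wli-2
    label=wli-2
    root=/dev/hda1
    read-only
```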