Following a review of Ingo Molnar [interview]'s Completely Fair Scheduler [story], Srivatsa Vaddagiri posted a patch allowing the new scheduler to provide fairness at a per-group level rather than at a per-process level. He described the changes that he made and noted, "I have used 'uid' as the basis of grouping for timebeing (since that grouping concept is already in mainline today). The patch can be adapted to a more generic process grouping mechanism later."
Ingo reacted to the patch favorably: "yeah, i like this alot." He went on to comment, "the 'struct sched_entity' abstraction looks very clean, and that's the main thing that matters: it allows for a design that will only cost us performance if group scheduling is desired." He then asked, "if you could do a -v14 port and at least add minimal SMP support: i.e. it shouldnt crash on SMP, but otherwise no extra load-balancing logic is needed for the first cut - then i could try to pick all these core changes up for -v15. (I'll let you know about any other thoughts/details when i do the integration.)"
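The difference between per-process and per-group fairness can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not code from the patch: the task names, the uid values and the helper functions are all hypothetical, and it assumes every task is always runnable and that shares are split evenly.

```python
from collections import defaultdict
from fractions import Fraction


def per_process_shares(tasks):
    """Each runnable task gets an equal slice, regardless of its owner."""
    share = Fraction(1, len(tasks))
    return {name: share for name, _uid in tasks}


def per_group_shares(tasks):
    """Each uid gets an equal slice, split evenly among that uid's tasks."""
    groups = defaultdict(list)
    for name, uid in tasks:
        groups[uid].append(name)
    group_share = Fraction(1, len(groups))
    return {
        name: group_share / len(names)
        for names in groups.values()
        for name in names
    }


def total_by_uid(tasks, shares):
    """Sum the per-task shares for each uid."""
    totals = defaultdict(Fraction)
    for name, uid in tasks:
        totals[uid] += shares[name]
    return dict(totals)


# Hypothetical workload: uid 1000 runs one task, uid 1001 runs three.
tasks = [("editor", 1000), ("make-1", 1001), ("make-2", 1001), ("make-3", 1001)]

flat = total_by_uid(tasks, per_process_shares(tasks))
grouped = total_by_uid(tasks, per_group_shares(tasks))
print("per-process:", flat)
print("per-group:  ", grouped)
```

With per-process fairness the user running three tasks ends up with three quarters of the CPU. With uid-based grouping each user gets half, and the three tasks of uid 1001 share that half among themselves.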
"In no case is it ok to just 'shut up the warning'," Linus Torvalds exclaimed in response to a patch that stifled a compiler warning. Reminiscent of a thread on the lkml last year [story], Linus pointed out that it is very important to understand and properly fix compiler warnings [story]:
"Please, we do NOT fix compiler warnings without understanding the code! That's a sure way to just introduce _new_ bugs, rather than fix old ones. So please, please, please, realize that the compiler is _stupid_, and fixing warnings without understanding the code is bad.
"In this case, anybody who actually spends 5 seconds looking at the code should have realized that the warning is just another way of saying that the author of the code was on some bad drugs, and the warnings WERE BETTER OFF REMAINING! Because that code _should_ have warnings. Big fat warnings about incompetence!?"
Miklos Szeredi posted a patch to allow files to be accessed as directories, offering the example of accessing the contents of a compressed tarball as you would any other directory. He noted that this is not the only application of the patch, "others might suggest accessing streams, resource forks or extended attributes through such an interface. However this patch only deals with the non-directory case, so directories would be excluded from that interface. But otherwise this patch doesn't limit the uses of the 'file as directory' concept in any way. It just adds the infrastructure to support these whacky beasts." Al Viro took an interest in the patch noting, "I'll look through the patch tonight; it sounds interesting, assuming that we don't run into serious crap with locking and revalidation logics." This was followed by an interesting discussion between Miklos and Al regarding the implementation of the patch.
Miklos went on to explain how the functionality works using mounts with special properties, "if a non-directory object is accessed with a trailing slash, then the filesystem may opt to let the file be accessed as a directory. In this case 'something' (as supplied by the filesystem) is mounted on top of the non-directory object." He then explained the following special properties of these mounts: "If there's no trailing slash after the file name, the mount won't be followed, even if the path resolution would otherwise follow mounts; The mount only stays there while it is referenced by some external object, like a pwd or an open file. When it is no longer referenced, it is automatically unmounted; Unlike 'real' mounts, this won't block unlink(2) or rename(2) on the underlying object."
Jesse Barnes posted a summary of recent efforts to improve the Linux kernel's support for graphics, "in collaboration with the [framebuffer] guys, we've been working on enhancing the kernel's graphics subsystem in an attempt to bring some sanity to the Linux graphics world and avoid the situation we have now where several kernel and userspace drivers compete for control of graphics devices." He then explained, "there are several reasons to pull modesetting and proper multihead support into the kernel: suspend/resume, debugging (e.g. panic), non-X uses, and more reliable VT switch," going on to offer detail on each of these listed reasons. Jesse followed these explanations with an overview of the current status of the code:
"The current codebase is still incomplete in many ways: locking needs to be (re-)added around our various list manipulation paths, we need better initial configuration logic, only the Intel driver has any support (and it's still missing suspend/resume and accelerated FB functions), we need to check modes against monitor limitations (which come from EDID or the user), CVT and GTF based mode generation still isn't used by the DRM modesetting code, and much more. I'm hoping that by posting this now, we can get some ideas about what requirements other people have for graphics on Linux so we can prioritize our work."
"As I understand, fair_clock is a monotonously increasing clock which advances at a pace inversely proportional to the load on the runqueue," Srivatsa Vaddagiri explained in a review of Ingo Molnar [interview]'s CFS CPU scheduler [story], "if load = 1 (task), it will advance at same pace as wall clock, as load increases it advances slower than wall clock." He went on to ask some questions about the choices made in CFS as compared to the EEVDF CPU scheduler [story]. In the resulting discussion, Ingo offered some insight into the design of CFS. He began:
"80% of CFS's design can be summed up in a single sentence: CFS basically models an 'ideal, precise multi-tasking CPU' on real hardware. 'Ideal multi-tasking CPU' is a (non-existent :-) CPU that has 100% physical power and which can run each task at precise equal speed, in parallel, each at 1/nr_running speed. For example: if there are 2 tasks running then it runs each at 50% physical power - totally in parallel.
"On real hardware, we can run only a single task at once, so while that one task runs the other tasks that are waiting for the CPU are at a disadvantage - the current task gets an unfair amount of CPU time. In CFS this fairness imbalance is expressed and tracked via the per-task p->wait_runtime (nanosec-unit) value. 'wait_runtime' is the amount of time the task should now run on the CPU for it become completely fair and balanced."
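The bookkeeping described in these quotes can be sketched as a toy model. This is not the kernel's implementation: the class names, the equal-weight assumption (no nice levels) and the "pick the task with the largest wait_runtime" rule are simplifications made here for illustration. Only the names fair_clock and wait_runtime come from the discussion above.

```python
class Task:
    def __init__(self, name):
        self.name = name
        # Nanoseconds of CPU time the task is owed (positive) or has
        # received beyond its fair share (negative).
        self.wait_runtime = 0.0


class ToyRunqueue:
    """Toy model of the 'ideal multi-tasking CPU' accounting (equal weights)."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.fair_clock = 0.0

    def run(self, current, delta_ns):
        """Account for `current` having run alone for delta_ns nanoseconds."""
        nr_running = len(self.tasks)
        # On the ideal CPU every task would have progressed by this much.
        fair_slice = delta_ns / nr_running
        # The fair clock advances at 1/nr_running of wall-clock pace.
        self.fair_clock += fair_slice
        for task in self.tasks:
            if task is current:
                # Ran delta_ns but was only entitled to fair_slice.
                task.wait_runtime -= delta_ns - fair_slice
            else:
                # Was entitled to fair_slice but got nothing.
                task.wait_runtime += fair_slice

    def pick_next(self):
        """Choose the task that is owed the most CPU time."""
        return max(self.tasks, key=lambda t: t.wait_runtime)


a, b = Task("a"), Task("b")
rq = ToyRunqueue([a, b])

rq.run(a, 10_000_000)          # a runs for 10 ms of wall-clock time
after_first = (rq.fair_clock, a.wait_runtime, b.wait_runtime)
next_task = rq.pick_next()     # b is now owed time
rq.run(next_task, 10_000_000)  # b runs for 10 ms; the imbalance cancels out
print(after_first, next_task.name, a.wait_runtime, b.wait_runtime)
```

In this model the wait_runtime values always sum to zero: whatever the running task receives beyond its fair share is exactly what the waiting tasks are owed.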
The question was asked on the lkml whether or not memory allocated by kmalloc and vmalloc is swappable. Rik van Riel offered a clear explanation as to why it is not, "unswappable kernel memory is simpler and faster," adding, "there really is no good reason for swapping kernel memory nowadays." He went on to explain:
"Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.
"The data structures that grow with memory (mostly the mem_map array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64."
"Ok, the merge window has closed, and 2.6.22-rc1 is out there," Linus Torvalds announced on the Linux Kernel Mailing List. He noted that there were a large number of changes, "almost seven thousand files changed, and that's not double-counting the files that got moved around." As to what was changed, Linus summarized, "architecture updates, drivers, filesystems, networking, security, build scripts, reorganizations, cleanups.. You name it, it's there." He went on to add:
"You want a new firewire stack? We've got it. New wireless networking infrastructure? Check. New infiniband drivers? Digital video drivers? A totally new CPU architecture (blackfin)? Check, check, check.
"That said, I think (and certainly hope) that this will not be nearly as painful as the big fundamental timer changes for 2.6.21, and while there are some pretty core changes there (like the new SLUB allocator, which hopefully will end up replacing both SLAB and SLOB), it feels pretty solid, and not as scary as ripping the carpet from under the timer infrastructure."
In a series of ten patches, Mathieu Desnoyers posted an updated version of the Linux Kernel Markers. In the first patch he explains the need for markers:
"With the increasing complexity of today's user-space application and the wide deployment of SMP systems, the users need an increasing understanding of the behavior and performance of a system across multiple processes/different execution contexts/multiple CPUs. In applications such as large clusters (Google, IBM), video acquisition (Autodesk), embedded real-time systems (Wind River, Monta Vista, Sony) or sysadmin/programmer-type tasks (SystemTAP from Redhat), a tool that permits tracing of kernel-user space interaction becomes necessary."
Mathieu goes on to explain that in complex systems, finding bugs can be even more difficult due to the rarity of their occurrence, "one can therefore only hope at having the best conditions to statistically reproduce the bug while extracting information from the system. Some bugs have been successfully found at Google using their ktrace tracer only because they could enable it on production machines and therefore recreate the same context where the bug happened." He then added, "therefore, it makes sense to offer an instrumentation set of the most relevant events occurring in the Linux that can have the smallest performance cost possible when not active while not requiring a reboot of a production system to activate. This is essentially what the markers are providing."
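The general idea of an instrumentation point that is nearly free while dormant and can be switched on at runtime can be sketched in a few lines. This is a user-space toy and does not reflect the actual Linux Kernel Markers interface: the Marker and MarkerRegistry classes, their methods and the marker name are all hypothetical.

```python
class Marker:
    """A named instrumentation point that does nothing until a probe is attached."""

    def __init__(self, name):
        self.name = name
        self.probe = None  # no probe attached: the marker is dormant

    def __call__(self, **fields):
        probe = self.probe
        if probe is not None:  # the only cost while the marker is inactive
            probe(self.name, fields)


class MarkerRegistry:
    """Lets probes be attached and detached while the program keeps running."""

    def __init__(self):
        self._markers = {}

    def define(self, name):
        return self._markers.setdefault(name, Marker(name))

    def arm(self, name, probe):
        self._markers[name].probe = probe

    def disarm(self, name):
        self._markers[name].probe = None


registry = MarkerRegistry()
io_submit = registry.define("io_submit")  # hypothetical marker name
events = []

io_submit(sector=1)                       # dormant: nothing is recorded
registry.arm("io_submit", lambda name, fields: events.append((name, fields)))
io_submit(sector=2)                       # active: the probe records the event
registry.disarm("io_submit")
io_submit(sector=3)                       # dormant again
print(events)
```

The instrumented code path never changes; only the probe attached to the marker does, which is why no restart is needed to start or stop tracing in this sketch.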
Jens Axboe [interview] posted a series of ten patches that add support for large IO commands. He began by defining the problem:
"Some people complain that Linux doesn't support really large IO commands. The main reason why we do not support infinitely sized IO is that we need to allocate a scatterlist to fill these elements into for dma mapping. The Linux scatterlist is an array of scatterlist elements, so we need to allocate a contiguous piece of memory to hold them all. On i386, we can at most fit 256 scatterlist elements into a page, and on x86-64 we are stuck with 128. So that puts us somewhere between 512kb and 1024kb for a single IO."
Jens went on to explain his solution, "to get around that limitation, this patchset introduces an sg chaining concept. The way it works is that the last element of an sg table can point to a new sgtable, thus extending the size of the total IO scatterlist greatly." Regarding the current status he noted, "it works for me, but you can't enable large commands on anything but i386 right now. I still need to go over the x86-64 iommu bits to enable it there as well."
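The size limits Jens quotes, and the chaining idea, can be worked through in a short sketch. It rests on assumptions not stated in the announcement: a 4 KiB page size, and each scatterlist element describing at most one page of data. The SgTable class and helper functions are hypothetical illustrations, not the kernel's data structures.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def max_io_kib(elements_per_page, bytes_per_element=PAGE_SIZE):
    """Largest single IO, in KiB, if one page of elements each maps one page."""
    return elements_per_page * bytes_per_element // 1024


print("i386:  ", max_io_kib(256), "KiB")
print("x86-64:", max_io_kib(128), "KiB")


class SgTable:
    """Toy scatterlist table: a fixed number of slots plus an optional chain link."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # data segments held by this table
        self.next = None   # link to the next table; uses up the last slot


def build_chain(segments, capacity):
    """Spread segments over chained tables of `capacity` slots each.

    Every table except the last gives up its final slot to the chain link.
    """
    if capacity < 2:
        raise ValueError("a chained table needs at least two slots")
    head = tail = SgTable(capacity)
    remaining = list(segments)
    while len(remaining) > capacity:
        tail.entries = remaining[:capacity - 1]
        remaining = remaining[capacity - 1:]
        tail.next = SgTable(capacity)
        tail = tail.next
    tail.entries = remaining
    return head


def walk(head):
    """Yield every segment in order, following chain links between tables."""
    table = head
    while table is not None:
        yield from table.entries
        table = table.next


def count_tables(head):
    count = 0
    while head is not None:
        count += 1
        head = head.next
    return count


chain = build_chain(range(10), capacity=4)
print(count_tables(chain), list(walk(chain)))
```

Under these assumptions the two limits work out to 1024 KiB and 512 KiB, matching the range given in the quote. With chaining, the total is bounded by the number of tables that can be linked together rather than by how many elements fit in one contiguous allocation.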
Andrew Morton [interview] sent out the latest lguest patches for review, noting that he intends to merge the code into the mainline kernel, "some concern was expressed over the lguest review status, so I shall send the patches out again for people to review, to test, to make observations about the author's personal appearance, etc. I'll plan on sending these patches off to Linus in a week's time, assuming all goes well." The project's FAQ notes, "lguest is designed to be simple to use and modify, with the aim of keeping the codebase small. Currently it's around 5000 lines including userspace utility, whereas kvm is over 10 times that size, and Xen is around 10 times bigger again (of course, both have far more features)."
The lguest patches are written and maintained by Rusty Russell [interview] who also authored Rusty's Remarkably Unreliable Guide to Lguest, the project's documentation. The guide explains, "lguest is designed to be a minimal hypervisor for the Linux kernel, for Linux developers and users to experiment with virtualization with the minimum of complexity. Nonetheless, it should have sufficient features to make it useful for specific tasks, and, of course, you are encouraged to fork and enhance it." In the FAQ, lguest is compared to kvm [story], "kvm requires hardware virtualization support (most recent Intel and AMD chips have it), but it can run almost any Operating System (since it does full virtualization). It also has 64-bit support. Lguest doesn't do full virtualization: it only runs a Linux kernel with lguest support." The FAQ also compares lguest to Xen, "Xen is similar, in that it doesn't need hardware virtualization support (although it can use it), but Xen supports an extensive range of features such as PAE (ie. lots of memory), SMP guests, 64-bit. You have to boot your kernel under the Xen hypervisor; you can't simply modprobe when you want to create a guest."
Jörn Engel announced LogFS, "a scalable flash filesystem." The project's home page notes that LogFS aims to be the successor of JFFS2, "the two main problems of JFFS2 are memory consumption and mount time. Unlike most filesystems, there is no tree structure of any sorts on the medium, so the complete medium needs to be scanned at mount time and a tree structure kept in-memory while the filesystem is mounted. With bigger devices, both mount time and memory consumption increase linearly. JFFS2 has recently gained summary support, which helps reduce mount time by a constant factor. Linear scalability remains. YAFFS also appears to be better by a constant factor, yet still scales linearly."
In contrast, Jörn said of his new filesystem, "LogFS has an on-medium tree, fairly similar to Ext2 in structure, so mount times are O(1). In absolute terms, the OLPC system has mount times of ~3.3s for JFFS2 and ~60ms for LogFS." Regarding its stability, he noted, "LogFS works and survives my testcases. It has fairly good chances of not eating your data during regular operation. There are still two known bugs that will eat data if the filesystem is uncleanly unmounted. Also still missing is wear leveling." Thomas Gleixner reviewed the code and offered the following summary, suggesting the code has a ways to go before it replaces JFFS2, "the code is far from being useful on real world hardware. The error handling via BUG() is just making it useless. Also please fix the coding style and other issues from the seperate review. Some useful comments would make a functional review way easier."
Avi Kivity [interview] announced significant performance improvements and support for running 32-bit Windows Vista as a guest within the latest release of KVM. Originally merged into the 2.6.20 mainline Linux kernel [story], KVM stands for Kernel-based Virtual Machine, "a full virtualization solution for Linux on x86 hardware containing virtualization extensions". Regarding the new release, Avi announced:
"The happy theme of today's kvm is the significant performance improvements, brought to you by a growing team of developers. I've clocked kbuild at within 25% of native. This release also introduces support for 32-bit Windows Vista."
Con Kolivas [interview] continues to maintain the performance oriented -ck patchset that he started in early 2004 [story], "this patchset is designed to improve system responsiveness and interactivity. It is configurable to any workload but the default -ck patch is aimed at the desktop and -cks is available with more emphasis on serverspace." In Con's latest release, 2.6.21-ck1, he notes that he has updated the patchset to include his improved SD cpu scheduler [story], "the staircase-deadline cpu scheduler has replaced the old staircase design in this version."
Con goes on to explain, "the staircase-deadline cpu scheduler can be set in either purely forward-looking mode for absolutely rigid fairness and cpu distribution according to nice level, or it can allow a small per-process history to smooth out cpu usage perturbations common in interactive tasks by enabling this sysctl. While small fairness issues can arise with this enabled, overall fairness is usually still strongly maintained and starvation is never possible. Enabling this can significantly smooth out 3d graphics and games." Swap prefetch [story] is also among the patches included in the -ck patchset.
Kristian Høgsberg posted an update on the effort to rewrite the Linux kernel FireWire stack [story] explaining, "as you may know, we've been working on a new FireWire stack over on linux1394-devel. The main driver behind this work is to get a small, maintainable and supportable FireWire stack, with an acceptable backwards compatibility story." He went on to request the stack's inclusion in the mainline kernel, listing the following highlights: the new FireWire stack "has been in Fedora rawhide (development branch) and -mm for 3 months, will be shipping in Fedora 7; backwards compatible at the library level, existing user space libraries have been ported to use the new user space interface; less than 8k lines of code compared to 30k lines of code in the old stack, and a similar size reduction in the sizes of the .ko's; no kernel threads, compared to one subsystem thread and one thread per FireWire controller in the old stack; one user space interface to support zero-copy scatter-gather streaming, as opposed to the old stacks 4 (was 5) different streaming interfaces; per-device device files, letting userspace set up more finegrained access control, such as preventing direct access to FireWire storage devices."
Kristian went on to note the following regressions when comparing the new stack to the old: "eth1394 not ported over, there is nothing preventing this from being done, though, but there's a couple of infrastructure bits that aren't done yet; no support for the PCILynx chipset, nobody has this chipset anymore, and the pcilynx driver in the old stack is bit-rotting anyway; some SBP-2 (storage) devices fail after significant amounts of IO, not clear what the problem is, but I can reproduce it here and am working on fixing it." Regarding his plans going forward, he wrote, "what I'd like to propose is that we carry both the new and the old stack in mainline for a few releases. Once we've reached a satisfactory level of stability and worked through what regressions there may be, we can consider deprecating the old stack."