Linux news

2.6.26-rc4, "Things Are Calming Down"

Submitted by Jeremy
on May 27, 2008 - 9:10am
Linux news

"You know the drill by now: another week, another -rc," began Linux creator, Linus Torvalds, announcing the 2.6.26-rc4 kernel. "There's a lot of small stuff in here", he continued, "most people won't even notice. The most noticeable thing is for all you 32-bit x86 people who use PAE (enabled by the HIGHMEM64G config option) due to having too much memory in your machine - mprotect() was broken due to some of the PAT fix/cleanup patches, causing the NX bit to be not set correctly." Linus described the fixed bug:

"If you had PAE enabled _and_ a recent enough CPU to have NX, but not recent enough to be 64-bit (or you were just perverse and wanted to run a 32-bit kernel despite having a chip that could do 64-bit and enough memory that you _really_ should have used a 64-bit kernel), you'd get various random program failures with SIGSEGV. It ranged from X not starting up to apparently OpenOffice not working if it did."

He went on to note, "most of the changes, as usual, are in drivers, at 60%, with some DRI changes leading the way (fixing a number of other regressions, mainly by reverting the under-cooked vblank update). Network, MMC, USB, watchdog and IDE drivers also got updates. We had CIFS and NFS updates, and some arch updates as usual." Linus concluded, "nothing really hugely exciting, I think we're doing pretty ok in the release cycle, and I'm getting the feeling that things are calming down."

Git Management

Submitted by Jeremy
on May 20, 2008 - 3:47pm
Linux news

"Is there a write up of what you consider the 'proper' git workflow?" Theodore Ts'o asked Linux creator Linus Torvalds, "why do you consider rebasing topic branches a bad thing?" Linus replied, "rebasing branches is absolutely not a bad thing for individual developers. But it *is* a bad thing for a subsystem maintainer." He went on to differentiate between 'grunts' who write the code and 'managers' who primarily collect other people's code, "a grunt should use 'git rebase' to keep his own work in line. A technical manager, while he hopefully does some useful work on his own, should strive to make _others_ do as much work as possible, and then 'git rebase' is the wrong thing, because it will always make it harder for the people around you to track your tree and to help you update your tree." Linus compared his own patch management style and productivity from over six years ago before he started using BK and git, to his current style using git:

"You can either try to drink from the firehose and inevitably be bitched about because you're holding something up or not giving something the attention it deserves, or you can try to make sure that you can let others help you. And you'd better select the 'let other people help you', because otherwise you _will_ burn out. It's not a matter of 'if', but of 'when'. [...] And when you're in that kind of ballpark, you should at least think of yourself as being where I was six+ years ago before BK. You should really seriously try to make sure that you are *not* the single point of failure, and you should plan on doing git merges. [...] I think a lot of people are a lot happier with how I can take their work these days than they were six+ years ago."

2.6.26-rc3, "Another Week, Another -rc Release"

Submitted by Jeremy
on May 19, 2008 - 7:57am
Linux news

"This time around, we have 60+% of the changes in drivers, notably drives/video and drivers/media, with some infiniband, networking and usb lovin' to fill things out," began Linux creator Linus Torvalds, announcing the 2.6.26-rc3 kernel. "The rest is (as usual) mostly arch updates," he continued, "this time mostly mips, m68k and uml." Linus noticed that Linux kernel development has been managed with git now as long as it was managed with BitKeeper, a little over three years for both tools. He explained, "the most striking difference has nothing to do with git or BK (the switch-over timing was just the reason I decided to take a look), but with the fact that we're not just continuing to develop, but we're developing faster and with more people," adding:

"So during the three years 2002->2005, we had 63,428 commits, attributed to 1,560 different authors (caveat: misspellings etc will mean that some people get counted more than once). During the last three years, we've had 96,885 attributed to 4,068 distinct authors (with the same caveat, obviously).

"I didn't do a lot of per-commit statistics yet, but from the little I've done it also seems like we've gotten increasingly better at doing small commits (which is probably one of the reasons we have a larger number of them, but also why we have more authors - small commits is how people get into doing kernel development)."

Removing the Big Kernel Lock

Submitted by Jeremy
on May 15, 2008 - 8:52am
Linux news

"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:

"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."

Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."

Parallel Optimized Host Message Exchange Layered File System

Submitted by Jeremy
on May 14, 2008 - 9:45am
Linux news

"I'm please to announce [the] POHMEL high performance network filesystem. POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System," began Evgeniy Polyakov, explaining:

"This is a high performance network filesystem with local coherent cache of data and metadata. Its main goal is distributed parallel processing of data. Network filesystem is a client transport. POHMELFS protocol was proven to be superior to NFS in lots (if not all, then it is in a roadmap) operations."

This latest release prompted Jeff Garzik to reply, "this continues to be a neat and interesting project :)" New features include fast transactions, round-robin failover, and near-wire limit performance. This adds to existing features which include a local coherent data and metadata cache, async processing of most events, and a fast and scalable multi threaded user space server. Planned features include a server extension to allow mirroring data across multiple devices, strong authentication, and possible data encryption when transferring data over the network. Evgeniy linked to several benchmarks in his blog.

2.6.26-rc2, "Little Exciting Here"

Submitted by Jeremy
on May 13, 2008 - 8:15am
Linux news

"About 45% architecture updates (counting the include files too), about 30% drivers, and about 25% odds-and-ends. The odds-and-ends are mainly Documentation, filesystems (mostly cifs) and core kernel (scheduler updates etc)," said Linux creator Linus Torvalds, announcing the 2.6.26-rc2 kernel. He added, "if you read the shortlog and get the feeling that most of it is pretty boring small details, you'd be right. There is little exciting there." He continued:

"A fairly small part of it, but quite possibly the most noticeable one, is how the semaphore changes impacted the BKL (the old 'big kernel lock' that is still used for some legacy code, for you non-core people out there), which in the past had different versions ('regular', 'preemptable'). A few months ago we dropped the regular BKL version, but in 2.6.25-rc1 we then had performance (and then correctness) issues with the interaction between the semaphore implementation and the preemptable BKL, so we're back to the old regular version for now."

2.6.26-rc1, "Less Scary Stuff Going On"

Submitted by Jeremy
on May 4, 2008 - 6:47am
Linux news

"So this merge window was somewhat rocky in the sense that there was a lot of arguments about it, but at the same time I at least personally think that from a technical angle, we had somewhat less scary stuff going on than has been almost the rule lately," noted Linux creator Linus Torvalds, announcing the 2.6.26-rc1 kernel. He continued:

"Lots of changes, but nothing that really feels all that fragile to me. Famous last words. I expect that the x86 PAT support (which has been long in the making) has the potential to have some issues, but the obvious problems were hashed out long ago, and while the merge window already showed one bug, that one was fairly benign and quickly fixed."

Linus highlighted, "another feature that is notable not for its size, but because people have tried to get me to merge it for some long is kgdb support. Which really turned out pretty small and clean, once people started putting their effort into making it so." He concluded, "so go out and test it. The diffstat and shortlogs are too big to post here (7500+ commits and the compressed full patch is 8.5MB in size), but one interesting tidbit I found was that during this *one* merge window, we had almost 800 different authors."

Active Merge Windows

Submitted by Jeremy
on May 2, 2008 - 1:41pm
Linux news

"This is starting to get beyond frustrating for me," complained David Miller of the latest merge window, launching what turned into a very lengthy and ongoing discussion about the Linux kernel development process. The concept of a regular "merge window" was first discussed in July of 2005 with the release of the 2.6.14-rc4 kernel, following the 2005 Developers' Summit. From 2.6.14 on, the release of each official 2.6.y kernel has been followed by a two week period during which major changes are merged into the kernel, followed by a 2.6.y-rc1 release. David complained that this particular merge window has been more painful than others, "the tree breaks every day, and it's becoming an extremely non-fun environment to work in. We need to slow down the merging, we need to review things more, we need people to test their [...] changes!"

During the lengthy discussion, Linux creator Linus Torvalds explained:

"The notion that we should even _try_ to aim to slow things down, that one I find unlikely to be true, and I don't even understand why anybody would find it a logical goal? Of course, you will have fewer new bugs if you have fewer changes. But that's not a goal, that's a tautology and totally uninteresting. A small program is likely to have fewer bugs, but that doesn't make something small 'better' than something large that does more. Similarly, a stagnant development community will introduce new bugs more seldom. But does that make a stagnant one better than a vibrant one? Hell no. So what I'm arguing against here is not that we should aim for worse quality, but I'm arguing against the false dichotomy of believing that quality is incompatible with lots of change."

Btrfs 0.14, Managing Multiple Devices

Submitted by Jeremy
on April 30, 2008 - 8:36am
Linux news

"Btrfs v0.14 is now available for download," Chris Mason announced, adding, "please note the disk format has changed, and it is not compatible with older versions of Btrfs." The project has gained a new wiki home page on the kernel.org domain, where it is explained, "Btrfs is a new copy on write filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Initially developed by Oracle, Btrfs is licensed under the GPL and open for contribution from anyone." Regarding the latest release, Chris explained:

"v0.14 has a few performance fixes and closes some races that could have allowed corrupted metadata in v0.13. The major new feature is the ability to manage multiple devices under a single Btrfs mount. Raid0, raid1 and raid10 are supported. Even for single device filesystems, metadata is now duplicated by default. Checksums are verified after reads finish and duplicate copies are used if the checksums don't match."

Chris offered links to multi-device benchmarks summarizing, "in general these numbers show that Btrfs does a good job at scaling to this storage configuration, and that is it on par with both HW raid and MD." Looking forward, he concluded, "next up on the Btrfs todo list is finishing off the device removal and IO error handling code. After that I'll add more fine grained locking to the btrees."

Ksplice, Rebootless Linux Kernel Security Updates

Submitted by Jeremy
on April 25, 2008 - 1:20pm
Linux news

"I've put together an automatic system for applying kernel security patches to the Linux kernel without rebooting it, and I wanted to share this system with the community in case others find it useful or interesting," said Jeff Arnold, announcing ksplice. He explained, "the system takes as input a kernel security patch (which can be a unified diff taken directly from Linus' GIT tree) and the source code corresponding to the running kernel, and it automatically creates a set of kernel modules to perform the update. The running kernel does not need to have been customized in advance in any way." The project's website notes, "ksplice cannot handle semantic changes to data structures—that is, changes that would require existing instances of kernel data structures to be transformed." With this limitation, Jeff suggested ksplice is still able to automatically apply 84% of the kernel security patches released between May 2005 and December 2007. He continued:

"I've been pursuing this project because I don't like dealing with reboots whenever a new local kernel security vulnerability is discovered. The rebootless update practices/systems that are already out there require manually constructing an update (through a process that can be tricky and error-prone), and they tend to have other disadvantages as well (such as requiring a custom kernel, not handling inline functions properly, etc). This new system works on existing kernels, and it simply takes a unified diff as input and does the rest on its own."

The Usefulness Of Linux-Next

Submitted by Jeremy
on April 23, 2008 - 6:52pm
Linux news

Discussing the latest breakage of the linux-next tree, Stephen Rothwell noted that the problem went unnoticed due to the arm tree not currently being included, "this is why I would have liked you to participate in the linux-next tree ...". Arm maintainer Russell King questioned the usefulness, saying, "linux-next will not give me anything which -mm isn't giving me. As I said in the discussion, linux-next value is _very_ small for me. Sorry but true." Several stepped in to offer some reasons that the linux-next tree is useful.

Andrew Morton noted, "putting arm into linux-next means that Stephen (and git) handle the merges rather than having me (and not-git) do it. Which helps me. I expect that linux-next will get a lot more cross-compilation testing than -mm. Which helps you." Greg KH added, "getting your stuff into linux-next would provide a public place for others to base off of, making it easier for them to send patches to you ensuring that they apply properly. Which in the end, will help others be able to contribute easier, and help you by getting patches you do not need to rebase yourself." Stephen Rothwell summarized the advantages for a maintainer:

"5 times a week your tree gets merged with lots of other code destined for Linus' next release. From this you get to find out about things in other trees that clash with yours. This tree gets built on several architectures for several configs (including arm). So you find out if other trees will break yours. I am happy to build more (basically all) the arm configs as I have offered before."

Defaulting To 4K Stacks

Submitted by Jeremy
on April 22, 2008 - 6:42pm
Linux news

Andrew Morton replied to a commit message making 4k stacks the default, saying, "this patch will cause kernels to crash." Ingo Molnar replied, "what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years." He added, "we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks." During the lengthy discussion it was suggested that nfs+xfs+raid kernel configurations, and using ndiswrapper are the most common reasons for overflowing a 4K stack size.

Andi Kleen questioned the usefulness of 4k stacks, "as far as I can figure out they are not [a worthy goal]. They might have been a worthy goal on crappy 2.4 VMs, but these times are long gone." Arjan van de Ven suggested that though the 2.6 VM is much improved over the 2.4 VM, fragmentation with 8K stacks remains an unsolvable problem, "it's basic math; the Linux VM gets to deal with both short and long lasting allocations; no matter how hard you try to get some degree of fragmentation; especially due to the 15:1 acceleration you get due to the lowmem issue. And before you say 'you should use 64 bit on such machines'; I would love it if more people used 64 bit linux. Sadly the adoption rate of that is not very good still.... by far ;(" In another email, Arjan listed two advantages to 4K stacks, "1) less memory consumption in the lowmem zone (critical for enterprise use, also good for general performance), and 2) kernel stacks at 8K are one of the most prominent order-1 allocations in the kernel; again with big-memory systems the fragmentation of the lowmem zone is a problem (and the distros that ship 4K stacks went there because of customer complaints)".

2.6.25, "Long Promised"

Submitted by Jeremy
on April 17, 2008 - 3:48am
Linux news

"It's been long promised, but there it is now," began Linux creator Linus Torvalds, announcing the 2.6.25 Linux kernel. He continued, "special thanks to Ingo who found and fixed a nasty-looking regression that turned out to not be a regression at all, but an old bug that just had not been triggering as reliably before. That said, that was just the last particular regression fix I was holding things up for, and it's not like there weren't a lot of other fixes too, they just didn't end up being the final things that triggered my particular worries." Linus added:

"The full changelog from 2.6.24 is 7.5M, with a 12MB compressed patch. Tons and tons has changed, but if you've been following the -rc releases, you'll already know about the big things. The changes from the last rc (-rc9) are fairly small and mostly pretty trivial, and the shortlog is appended. So it's mostly one-liners, with some updates to drivers (net and usb) and to networking that are a bit larger (although a number of the driver updates are things like just new ID's etc)."

More information about the latest release can be found on the KernelNewbies Linux 2.6.25 wiki page.

Memory Corruption Bug Solved, 2.6.25 Expected Today

Submitted by Jeremy
on April 16, 2008 - 5:55am
Linux news

"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:

"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."

Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."

Budget Fair Queuing IO Scheduler

Submitted by Jeremy
on April 15, 2008 - 11:02am
Linux news

"We are working [on] a new I/O scheduler based on CFQ, aiming at improved predictability and fairness of the service, while maintaining the high throughput it already provides," began Fabio Checconi, announcing the BFQ I/O scheduler. "The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin scheduling policy of time slices into a fair queuing scheduling of sector budgets," he continued, "more precisely, each task is assigned a budget measured in number of sectors instead of amount of time, and budgets are scheduled using a slightly modified version of WF2Q+. The budget assigned to each task varies over time as a function of its behaviour. However, one can set the maximum value of the budget that BFQ can assign to any task." Fabio went on to explain:

"The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, suffers from unfairness problems also towards processes making the best possible use of the disk bandwidth. In fact, even if the same time slice is assigned to two processes, they may get a different throughput each, as a function of the positions on the disk of their requests. On the contrary, BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in number of sectors. Moreover, due to its Round Robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk. On the contrary, given the accurate service distribution of the internal WF2Q+ scheduler, BFQ exhibits O(1) delay."

Jens Axboe reacted favorably, "Fabio, I've merged the scheduler for some testing. Overall the code looks great, you've done a good job!" He noted that the scheduler should soon appear in the -mm tree, and that it was worth considering merging the two I/O schedulers together.