"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:
"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."
Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."
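The out-of-range memory_present() call Ingo describes can be illustrated with a small user-space sketch. This is not the patch itself: every `toy_`/`TOY_` name and every constant below is a hypothetical stand-in chosen for illustration. The sketch assumes a fixed-size per-section table and shows how clamping the page frame range to the architecture limit keeps the marking loop from indexing past the end of that table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy values, not the kernel's real configuration. */
#define TOY_PAGE_SHIFT        12  /* 4 KiB pages */
#define TOY_SECTION_SHIFT     26  /* 64 MiB per section */
#define TOY_MAX_PHYSMEM_BITS  32  /* 4 GiB addressable without PAE */

#define TOY_PFN_SECTION_SHIFT (TOY_SECTION_SHIFT - TOY_PAGE_SHIFT)
#define TOY_PAGES_PER_SECTION (1UL << TOY_PFN_SECTION_SHIFT)
#define TOY_NR_SECTIONS       (1UL << (TOY_MAX_PHYSMEM_BITS - TOY_SECTION_SHIFT))
#define TOY_MAX_ARCH_PFN      (1UL << (TOY_MAX_PHYSMEM_BITS - TOY_PAGE_SHIFT))

/* One flag per section; indexing past the end would corrupt other memory. */
static unsigned char toy_section_present[TOY_NR_SECTIONS];

/*
 * Mark every section overlapping the page frame range [start, end) as
 * present and return how many sections were newly marked. The two range
 * checks are the point of the sketch: without them, a caller passing
 * page frames above the limit would write outside the table.
 */
static unsigned long toy_memory_present(unsigned long start, unsigned long end)
{
    unsigned long pfn, marked = 0;

    assert(start <= end);

    if (start >= TOY_MAX_ARCH_PFN)
        return 0;                    /* range lies entirely out of scope */
    if (end > TOY_MAX_ARCH_PFN)
        end = TOY_MAX_ARCH_PFN;      /* trim the part that is out of scope */

    start &= ~(TOY_PAGES_PER_SECTION - 1);   /* align down to a section */
    for (pfn = start; pfn < end; pfn += TOY_PAGES_PER_SECTION) {
        unsigned long section = pfn >> TOY_PFN_SECTION_SHIFT;

        if (!toy_section_present[section]) {
            toy_section_present[section] = 1;
            marked++;
        }
    }
    return marked;
}
```

With these toy constants the table has 64 entries, so a call covering 6 GiB of page frames marks only the 64 in-range sections, and a call that starts above 4 GiB marks nothing.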
"This NTFS update fixes the deadlock at mount time reported by several people over the years but it was only recently that someone who reported it actually replied to my response and helped me track it down (I have never been able to reproduce the deadlock)," Anton Altaparmakov explained about a patch against the NTFS filesystem. He summarized the changes:
"The fix was to stop calling ntfs_attr_set() at mount time as that causes balance_dirty_pages_ratelimited() to be called which on systems with little memory actually tries to go and balance the dirty pages which tries to take the s_umount semaphore but because we are still in fill_super() across which the VFS holds s_umount for writing this results in a deadlock.
"We now do the dirty work by hand by submitting individual buffers. This has the annoying 'feature' that mounting can take a few seconds if the journal is large as we have clear it all. One day someone should improve on this by deferring the journal clearing to a helper kernel thread so it can be done in the background but I don't have time for this at the moment and the current solution works fine so I am leaving it like this for now."
While the data corruption bug fixed in 2.6.20-rc3 was still being tracked down, it was thought that the bug, a race in shared mmap'ed page writeback, might have been in the 2.6 kernel for a very long time. It has since been determined that the bug was introduced much more recently. Nick Piggin explains, "this bug was only introduced in 2.6.19, due to a change that caused pte dirty bits to be discarded without a subsequent set_page_dirty() (nowhere else in the kernel should have done this)." Linus Torvalds noted that earlier kernels could have been affected by a less serious version of the bug:
"Actually, I think 2.6.18 may have a subtle variation on it. But that much older race would only trigger on SMP (or possibly UP with preempt). And I haven't actually thought about it that much, so I could be full of crap. But I don't see anything that protects against it: we may hold the page lock, but since the code that marks things _dirty_ doesn't necessarily always hold it, that doesn't help us. And we may hold the 'private_lock', but we drop it before we do the dirty bit clearing, and in fact on UP+PREEMPT that very dropping could cause an active preemption to take place, so.. I dunno. For older kernels? If there is a race there, it must be pretty damn hard to hit in practice (and it must have been there for a looong time), so trying to fix it is possibly as likely to cause problems as it migh to fix them."
David Miller pointed out that some of the confusion about when the bug was actually introduced comes from the fact that the original bug report was against a 2.6.18 Debian kernel. Andrew Morton explained, "that was 2.6.18+debian-added-dirty-page-tracking-patches," then cautioned that the fix still does not address a newly reported and currently unconfirmed BerkeleyDB corruption bug: "I'll assert (and emphasise) that the cause of the alleged BerkeleyDB corruption is not known at this time. The post-2.6.19 'fix' might make it go away. But if it does, we do not know why, and it might still be there, only harder to hit."
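Nick Piggin's description of the 2.6.19 regression, a pte dirty bit discarded without a following set_page_dirty(), can be illustrated with a toy model. The sketch below is not kernel code and not the actual fix: the `toy_` names are hypothetical, and a page is reduced to two flags, one for the hardware (pte) dirty bit and one for the software dirty state that writeback consults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: two independent records that the data has been modified. */
struct toy_page {
    bool pte_dirty;    /* set by the hardware when a mapping is written */
    bool page_dirty;   /* software state that writeback looks at */
};

static void toy_set_page_dirty(struct toy_page *page)
{
    page->page_dirty = true;
}

/* Problem pattern: the pte dirty bit is thrown away and nothing records it. */
static void toy_clear_pte_dirty_buggy(struct toy_page *page)
{
    assert(page != NULL);
    page->pte_dirty = false;
}

/* Safe pattern: transfer the information before clearing the pte bit. */
static void toy_clear_pte_dirty_and_propagate(struct toy_page *page)
{
    assert(page != NULL);
    if (page->pte_dirty) {
        page->pte_dirty = false;
        toy_set_page_dirty(page);
    }
}

/* The modification is still known if either flag records it. */
static bool toy_needs_writeback(const struct toy_page *page)
{
    return page->pte_dirty || page->page_dirty;
}
```

After the buggy variant runs on a page whose only record of a write is the pte bit, `toy_needs_writeback()` returns false and the modification would never reach disk. The propagating variant keeps the page marked for writeback.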
A few hours before the new year, Linus Torvalds released the 2.6.20-rc3 Linux kernel, explaining, "in order to not get in trouble with MADR ('Mothers Against Drunk Releases') I decided to cut the 2.6.20-rc3 release early rather than wait for midnight, because it's bound to be new years _somewhere_ out there. So here's to a happy 2007 for everybody." In good humor, he noted that the new kernel would be available on all the kernel.org mirrors by the time everyone's New Year's celebrations had concluded: "it's probably going to be up-to-date by the time the hangovers are mostly gone. At which point the first thing on any self-respecting geek's mind should obviously be: 'is there a new kernel release for me to try?'" Regarding the changes in the new release candidate, which include a data corruption fix, Linus summarized:
"The big thing at least for me personally is that nasty shared mmap corruption fix, but there's a number of other changes in here, many of them just documentation (and some media and network drivers). Shortlog and diffstat appended."
Bob Beck is an OpenBSD developer from Edmonton, Canada. He's one of around 60 OpenBSD developers currently working in an undisclosed hotel somewhere in downtown Calgary at the 2005 OpenBSD hackathon. Bob was involved in setting up the infrastructure, and was responsible for the annual barbecue at OpenBSD creator Theo de Raadt's house. Following these two days of effort that helped to make the hackathon possible, he finally sat down to work on spamd and catch up on email. One of the emails in his inbox caught his attention, leading to a day's effort about which he notes, "some Days end up far far far from where they start."
In the following article, Bob provides a first-person account of tracking down what began simply as a RAID performance issue, but ultimately turned out to be a problem with the idle loop that, when fixed, resulted in an impressive performance boost. Bob noted, "the idle loop is where the kernel spins when there is no work to do in userland, because of this, it's also where we catch and service many of our interrupts from drivers that may queue work to the device and then tsleep waiting for an interrupt from the card saying the work is done." Bob went on to explain that prior to today's fix, interrupts were handled appropriately when userland work was happening, but not when userland was idle and the kernel was simply waiting for device input/output. Read on for Bob's full account of the day, leading up to the discovery of the problem and the implementation of the fix, including performance numbers.
Andrew Morton posted to the LKML, "In 2.4.20-pre5 an optimisation was made to the ext3 fsync function which can very easily cause file data corruption at unmount time." This bug only affects people using ext3 in the uncommon "data=journal" mode, or files operating under "chattr -j", and does not affect the 2.5 series of kernels.
Andrew went on to say, "The symptoms are that any file data which was written within the thirty seconds prior to the unmount may not make it to disk. A workaround is to run `sync' before unmounting." He also posted a patch to fix the problem, but soon thereafter followed up to say that "that 'fix' didn't fix it. Sorry about that." Until a proper fix can be developed, he recommends that people "please avoid ext3/data=journal." Since "data=journal" is not the default ext3 mode, most people running ext3 are unlikely to be affected. However, because this is a data corruption bug, you should double-check that you use either "data=ordered" or "data=writeback" as your ext3 mode of operation.
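One way to double-check the mode is to look for "data=journal" among a filesystem's mount options. The helper below is a minimal sketch that assumes you already have the comma-separated option string in hand, for example copied from the options column of /etc/fstab. The function name is hypothetical, and the code only inspects the string it is given; it does not query the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the comma-separated option list `opts` contains an
 * option exactly equal to `wanted`, e.g. "data=journal".
 */
static bool toy_has_mount_option(const char *opts, const char *wanted)
{
    size_t wanted_len;
    const char *p;

    assert(opts != NULL && wanted != NULL);

    wanted_len = strlen(wanted);
    p = opts;
    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        /* Compare whole options only, so "xdata=journal" does not match. */
        if (len == wanted_len && strncmp(p, wanted, len) == 0)
            return true;
        if (comma == NULL)
            break;
        p = comma + 1;
    }
    return false;
}
```

An option string with no "data=" entry at all means the filesystem uses the default mode, which, as noted above, is not "data=journal".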