Andrew Morton

Stabilizing Existing Patches

Submitted by Jeremy
on October 23, 2007 - 6:23am
Linux news

When the same patchset arrived on the Linux Kernel mailing list from multiple sources, Christoph Hellwig asked, "any reason we've got this patchset posted by three people now? :)" Andrew Morton retorted, "presumably because I haven't been merging it." He went on to explain:

"I was in bugfix-only mode from a week prior to 2.6.23 release and during the merge window. Partly caused by the already-idiotic amount of stuff we had queued for 2.6.24, partly because we needed to concentrate on stabilising the 2.6.24 patchpile rather than writing new stuff.

"And partly to send the signal that rather than beavering away on new features all the time, we should also be spending some (more) time testing, reviewing and bugfixing the current and soon-to-be-current code. Probably I should have been more explicit about it, but it wasn't really planned. Next time I'll send more 'thanks, I parked this for consideration at a more appropriate time' emails."

Caution and Latency

Submitted by Jeremy
on October 22, 2007 - 5:28am
Linux news

"With latencytop, I noticed that the (in memory) atime updates during a kernel build had latencies of 600 msec or longer; this is obviously not so nice behavior. Other EXT3 journal related operations had similar or even longer latencies," Arjan van de Ven reported, describing a "mass priority inversion" caused by, "an interaction between EXT3 and CFQ in that CFQ tries to be fair to everyone, including kjournald. However, in reality, kjournald is 'special' in that it does a lot of journal work". Finally, he offered a tiny patch to resolve the issue, "the patch below makes kjournald of the IOPRIO_CLASS_RT priority to break this priority inversion behavior. With this patch, the latencies for atime updates (and similar operation) go down by a factor of 3x to 4x !"

Andrew Morton took a cautious stance, "seems a pretty fundamental change which could do with some careful benchmarking, methinks. See, your patch amounts to 'do more seeks to improve one test case'. Surely other testcases will worsen. What are they?" CFQ author Jens Axboe agreed, "It should not be merged as-is, instead I'll provide a function to do this." Ingo Molnar wasn't convinced, "atime update latencies went down by a factor of 3x-4x ... but what bothers me even more is the large picture. Linux's development is still fundamentally skewed towards bandwidth (which goes up with hardware advances anyway), while the focus on latencies is very lacking (which users do care about much more and which usually does _not_ improve with improved hardware), so i cannot see why we shouldnt apply this." He added, "if bandwidth hurts anywhere, it will be pointed out and fixed, we've got like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth problems. But _finally_ we now have desktop latency tools, hard numbers and patches that fix them, but what do we do ... we put up extra roadblocks??" Andrew calmly replied, "I think the situation is that we've asked for some additional what-can-be-hurt-by-this testing. Yes, we could sling it out there and wait for the reports. But often that's a pretty painful process and regressions can be discovered too late for us to do anything about them."

Quote: Design First

Submitted by Jeremy
on October 22, 2007 - 2:07am

"It wouldn't be efficient for you to implement something new, only to have it criticized again. I'd suggest that you come up with a concrete design, describe to us what you propose to do and let's take it from there."

KGDB Merge Postponed Until 2.6.25

Submitted by Jeremy
on October 18, 2007 - 7:37am
Linux news

"This is a request to merge KGDB into the mainline kernel," Jason Wessel announced, posting a series of patches aiming toward that goal. He continued, "as of right now KGDB is comprised of 21 different patches adding in the core api and docs first and then working up to add drivers and arch specific support to KGDB. The patches were broken down into logical pieces for review and comments." He went on to explain:

"The intent of the KGDB patches is to unify the KGDB support across all the architectures that elect to implement the KGDB functionality by providing a common core and an arch specific stub. For quite some time there has been different features and uses of KGDB across the most popular architectures. Having a common core that takes care of protocol parsing and the typical use case of software breakpoints should eliminate the inconsistencies across the archs as well as making it easier to add KGDB support to a new arch."

Andrew Morton, who has been supportive of getting a kernel debugger into the mainline kernel, explained that it was too late in the 2.6.24 review cycle to merge KGDB, meaning it would have to wait for 2.6.25 at the earliest, "this won't work very well. There's a lot of review work to be done here, and a lot of it by busy architecture maintainers. Expecting people to do all this review and test work late in the merge window when they're all madly scrambling to get their bugs^Wpatches into mainline is not reasonable. This should all have started a month ago. So we're looking at a 2.6.25 merge for this work."

2.6.23-mm1, "Working a Bit Better"

Submitted by Jeremy
on October 13, 2007 - 2:07am
Linux news

Andrew Morton posted his first -mm patchset against the recently released 2.6.23 kernel, preparing for a big merge of patches bound for inclusion in the upcoming 2.6.24 kernel. He noted:

"I've been largely avoiding applying anything since rc8-mm2 in an attempt to stabilise things for the 2.6.23 merge.

"But that didn't stop all the subsystem maintainers from going nuts, with the usual accuracy. We're up to a 37MB diff now, but it seems to be working a bit better."

Avoiding Blobs

Submitted by Jeremy
on October 11, 2007 - 1:51am
Linux news

A recent attempt to push some V4L/DVB updates for inclusion in the 2.6.24 Linux kernel met with some resistance. Linus Torvalds summarized the problems affecting the em28xx video driver:

"I've talked to various people, and none of the main kernel people end up being at all interested in a kernel that has external dependencies on binary blobs for tuners.

"So right now it seems like while I would personally want to have more vendors support their own drivers, if that in this case means that we'd have to have user-space and unmaintainable binaries to tune the cards, everybody seems to hate that idea."

High Idle Load Average

Submitted by Jeremy
on October 6, 2007 - 11:59am
Linux news

When a Linux user reported a repeatedly high load average on an idle server, tracking the problem to a specific patch labeled, "user of the jiffies rounding code", Andrew Morton replied, "this is unexpected. High load average is due to either a task chewing a lot of CPU time or a task stuck in uninterruptible sleep." Linus Torvalds disagreed, explaining:

"We saw high loadaverages with the timer bogosity with 'gettimeofday()' and 'select()' not agreeing, so they would do things like 'date = time(..); select(.. , timeout = );' and when 'date' wasn't taking the jiffies offset into account, and thus mixing these kinds of different time sources, the select ended up returning immediately because they effectively used different clocks, and suddenly we had some applications chewing up 30% CPU time, because they were in a loop that *tried* to sleep."

Linus offered what he described as an "idiotic patch" to cause the load average to not be calculated exactly once every 5 seconds to prevent it from being in sync with something else waking up every 5 seconds, noting, "the load average is not calculated every tick, because that's not just expensive, but we also want to have some time-based decay." Arjan van de Ven pointed out that this shouldn't help, "I mean, the load gets only updated in actual timer interrupts... and on a tickless system there's very few of those around..... and usually at places round_jiffies() already put a timer on." Linus agreed with this reasoning, suggesting, "maybe Anders' problem stems partly from the fact that he really is using the tweaks to make that tickless theory more true than it tends to be on most systems?" Arjan pointed out that a lot of work has been successful in making tickless kernels wake up less, "we fixed a TON of stuff over the last months.. standard desktops (F8 / next Ubuntu) will be around 10 wakeups/sec, in a lab environment you can get below 2 ;)"
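The sampling effect Linus was worried about can be reproduced with a small simulation. The sketch below uses the kernel's fixed-point constants for the one-minute load average (FSHIFT of 11, EXP_1 of 1884) and the same update formula; the task model (a task that is runnable for 10 ms every 5 seconds), the simulate() function and the 5004 ms alternative sampling period are assumptions made for illustration and are not taken from Linus' patch.

```python
# Fixed-point constants as used for the 1-minute load average.
FSHIFT = 11
FIXED_1 = 1 << FSHIFT          # 1.0 in fixed point
EXP_1 = 1884                   # 1/exp(5s/1min) in fixed point


def calc_load(load, exp, active):
    """One sampling step of the exponentially decayed load average."""
    return (load * exp + active * (FIXED_1 - exp)) >> FSHIFT


def simulate(task_phase_ms, sample_period_ms=5000, task_period_ms=5000,
             busy_ms=10, samples=120):
    """Return the 1-minute load after sampling a mostly idle task.

    The task wakes every task_period_ms (offset by task_phase_ms) and is
    runnable for busy_ms. The load is sampled every sample_period_ms.
    """
    load = 0
    for i in range(samples):
        t = i * sample_period_ms
        runnable = ((t - task_phase_ms) % task_period_ms) < busy_ms
        load = calc_load(load, EXP_1, FIXED_1 if runnable else 0)
    return load / FIXED_1


print(simulate(0))                          # task wakes in step with the sampler
print(simulate(2500))                       # same task, half a period out of phase
print(simulate(0, sample_period_ms=5004))   # sampler nudged off the 5 s grid
```

A task that uses 0.2% of the CPU but always wakes at the sampling instant looks like a permanently runnable task, while the same task out of phase is invisible. Moving the sampler slightly off the five-second grid breaks the lockstep, which is the idea behind the patch, and rounding timers to whole seconds makes such lockstep more likely in the first place.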

Merging From -mm in 2.6.24

Submitted by Jeremy
on October 1, 2007 - 12:56pm
Linux news

With the official release of the 2.6.23 kernel expected any day now, Andrew Morton posted his -mm merge plans for the 2.6.24 kernel. The current Linux kernel development model is to open up the mainline kernel for significant merges during the two weeks following a major kernel release. Thus, during the two weeks following the imminent release of the 2.6.23 kernel, subsystem maintainers will push their latest trees to Linus' mainline tree. Andrew Morton will also push many of the patches he collects in his -mm tree to Linus' mainline tree during these two weeks, as detailed in his email. At the end of the merge window, 2.6.24-rc1 will be released and the stabilization process begins, though in reality significant merges also often slip in between -rc1 and -rc2. A series of -rc kernels will be released, eventually leading to a stable 2.6.24 kernel two or three months after the process started, and it all starts again.

Simplified Mandatory Access Control Kernel

Submitted by Jeremy
on September 30, 2007 - 5:20pm
Linux news

"Smack is the Simplified Mandatory Access Control Kernel," Casey Schaufler said, posting the third version of his patchset. He explained, "Smack implements mandatory access control (MAC) using labels attached to tasks and data containers, including files, SVIPC, and other tasks. Smack is a kernel based scheme that requires an absolute minimum of application support and a very small amount of configuration data." Casey continued:

"Smack is implemented as a clean LSM. It requires no external code changes and the patch modifies only the Kconfig and Makefile in the security directory. Smack uses extended attributes and provides a set of general mount options, borrowing technics used elsewhere. Smack uses netlabel for CIPSO labeling. Smack provides a pseudo-filesystem smackfs that is used for manipulation of system Smack attributes."

Andrew Morton replied to Casey's lengthy description, "I don't know enough about security even to be dangerous. I went back and reviewed the August thread from your version 1 submission and the message I take away is that the code has been well-received and looks good when considered on its own merits, but selinux could probably be configured to do something sufficiently similar." He added, "so with the information which I presently have available to me, I'm thinking that this should go into 2.6.24."
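The general idea of label-based mandatory access control can be illustrated in a few lines. The sketch below is a deliberately simplified model and is not Smack's actual rule set: it assumes that a subject may always access an object carrying the same label, and that anything else must be granted by an explicit (subject label, object label) rule. The labels, the rule table and the may_access() function are hypothetical.

```python
def may_access(subject_label, object_label, requested, rules):
    """Decide access from labels alone (simplified, hypothetical model).

    requested is a string of access letters such as "r" or "rw".
    rules maps (subject_label, object_label) to the granted letters.
    """
    if subject_label == object_label:
        return True
    granted = rules.get((subject_label, object_label), "")
    return all(mode in granted for mode in requested)


# A very small amount of configuration data: one rule.
rules = {("Guest", "Public"): "r"}

print(may_access("Guest", "Guest", "rw", rules))   # same label
print(may_access("Guest", "Public", "r", rules))   # allowed by the rule
print(may_access("Guest", "Public", "w", rules))   # not granted
print(may_access("Guest", "Secret", "r", rules))   # no rule at all
```

The point of the model is that the decision depends only on the two labels and the rule table, never on the owner or mode bits of the file, so applications need no changes to be confined.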

Improving checkpatch

Submitted by Jeremy
on September 30, 2007 - 3:27am
Linux news

"This version brings a number of new checks, and a number of bug fixes," Andy Whitcroft noted in his announcement for version 0.10 of checkpatch.pl, used by Linux kernel developers to scan their code for common mistakes. Ingo Molnar expressed concern, "your checkpatch patch itself produces 22 warnings." He pointed out that there were numerous bogus warnings generated by the script, "ever since v8 the quality of checkpatch.pl has been getting worse and worse as there are way too many false positives. I'm still stuck on v8 for my own use, v9 and v10 is unusable." Ingo continued, "what matters is that only items should be displayed that i _can_ fix. With v8 i was able to make kernel/sched*.c almost noise-free, but with v9 and v10 that's not possible anymore." He noted that he was fine with there being a flag that would cause the script to generate additional questionable warnings, "but these default false positives are _lethal_ for a tool like this. (and i made this point before.) This is a _fundamental_ thing".

Andy added a new option to make it possible to disable some of the more subjective tests, noting that he preferred the tests to be enabled by default, "fundamentally I am not trying to help the people who are careful but those who do not know better. As for the false positives, those I am always interested in and always striving to remove, as they annoy me as much as the next man." Andrew Morton disagreed with the option being enabled by default, suggesting, "off, I'd say. That way people are more likely to use it. Or, more accurately, will have less excuses to not use it." Andy acquiesced, "off it is." He added, "I will also review the tests which are warnings and checks (subjective) and see if any are now miss-categorised." He pointed out that as the script is not a C language parser, instead detecting C language style violations using regular expressions, it won't ever be 100% accurate and is instead only intended as a useful guide.
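Why a regular-expression checker can never be fully accurate is easy to show. The sketch below is not taken from checkpatch.pl; the three rules, their patterns and the check_line() function are hypothetical stand-ins written for illustration. The comma rule flags real style problems, but it also fires on a comma inside a string literal, because a regular expression has no idea what a string literal is.

```python
import re

# Hypothetical style rules, loosely in the spirit of a regex-based checker.
COMMA_NO_SPACE = re.compile(r",(?=\S)")   # a comma followed by a non-space
TRAILING_WS = re.compile(r"[ \t]+$")      # whitespace at the end of the line


def check_line(line, max_len=80):
    """Return a list of style warnings for one line of C source."""
    warnings = []
    if len(line) > max_len:
        warnings.append("line over %d characters" % max_len)
    if TRAILING_WS.search(line):
        warnings.append("trailing whitespace")
    if COMMA_NO_SPACE.search(line):
        warnings.append("space required after ','")
    return warnings


print(check_line("foo(a,b);"))          # genuine problem
print(check_line("foo(a, b);"))         # clean
print(check_line('printk("a,b");'))     # false positive: comma is in a string
```

Fixing the false positive properly means tracking strings, comments and macros, which is to say writing a parser. That is the trade-off behind keeping the more subjective checks off by default.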

Sysfs Stability

Submitted by Jeremy
on September 29, 2007 - 6:02pm
Linux news

"The fact that we continue to expose internal data structures via sysfs is a gaping open pit [and] is far more likely to cause any kind of problems than changing an error return," Theodore Ts'o noted, responding to a thread discussing a patch to fix an error return code. Andrew Morton agreed, "I was staring in astonishment at the pending sysfs patch pile last night. Forty sysfs patches and twenty-odd patches against driver core and the kobject layer." He continued, "that's a huge amount of churn for a core piece of kernel infrastructure which has been there for four or five years. Not a good sign." Andrew then added a humorous quip, "I mean, it's not as if, say, the CPU scheduler guys keep on rewriting all their junk. oh, wait.."

Sysfs maintainer Greg KH replied, "I'm sorry, have I missed a breakage lately? I don't know of one in over a year that has not been fixed. Do you?" He noted that when sysfs is used properly from user space no breakage occurs, "if you want to propose some other kind of alternative to exporting this kind of _needed_ information to userspace, in a simple and easy-to-use manner, please do so. Until then, stop complaining unnecessarily." He went on to explain that most sysfs changes are to support things like containers, requiring per-user/per-container views, something sysfs wasn't originally designed for. "These aren't being done just because we like to break things, we are trying to make things better, and fix real bugs here."

UBIFS Writeback

Submitted by Jeremy
on September 29, 2007 - 4:23am
Linux news

UBIFS is described as, "a new flash file system which is designed to work on top of UBI." It has replaced the JFFS3 project, a choice explained on the project webpage, "we have realized that creating a scalable flash file system on top of bare flash is a difficult task, just because the flash media is so problematic (wear-leveling, bad eraseblocks). We have tried this way, and it turned out to be that we solved media problems, instead of concentrating on file system issues. So we decided to split one big and complex tasks into 2 sub-tasks: UBI solves the media problems, like bad eraseblocks and wear-leveling, and UBIFS implements the file system on top. And now finally, we may concentrate on file-system issues: implementing write-back caching, multi-headed journal, garbage collector, indexing information management and so on. There are a lot of FS problems to solve - orphaned files, deletions, recoverability after unclean reboots and so on."

In a recent posting to the lkml, Artem Bityutskiy noted that UBIFS has to take into account that there is a small amount of unused block space at the ends of eraseblocks, and that pages written to flash are smaller than they are in memory because the filesystem performs compression. "So, if our current liability is X, we do not know exactly how much flash space (Y) it will take. All we can do is to introduce some pessimistic, worst-case function Y = F(X). This pessimistic function assumes that pages won't be compressible, and it assumes worst-case wastage." The calculation is necessary as even though data is not written immediately to the flash device, it's important to be able to inform the application writing data if there's not enough space left. "So my question is: how can we flush _few_ oldest dirty pages/inodes while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), ->link(), etc)?"
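A pessimistic function of the kind Artem describes might look like the sketch below. It is only an illustration of the reasoning, not UBIFS code: the function names, the 48-byte per-node overhead, the 128 KiB eraseblock size and the assumption that a node cannot span two eraseblocks are all hypothetical.

```python
def worst_case_flash_bytes(dirty_pages, page_size=4096, node_overhead=48,
                           leb_size=128 * 1024):
    """Pessimistic Y = F(X): flash bytes needed for X dirty pages.

    Assumes no page compresses at all and that every eraseblock touched
    loses the unusable tail left after its last whole node.
    """
    node_size = page_size + node_overhead
    nodes_per_leb = leb_size // node_size
    tail_waste = leb_size - nodes_per_leb * node_size
    lebs_touched = -(-dirty_pages // nodes_per_leb)   # ceiling division
    return dirty_pages * node_size + lebs_touched * tail_waste


def must_flush_first(dirty_pages, free_bytes, **geometry):
    """True if the worst case no longer fits in the free flash space."""
    return worst_case_flash_bytes(dirty_pages, **geometry) > free_bytes


print(worst_case_flash_bytes(1))
print(worst_case_flash_bytes(31))
print(must_flush_first(32, 128 * 1024))
```

Because the estimate assumes the worst, it will often report that space has run out when compression would in fact have made the data fit. Writing back a few of the oldest dirty pages replaces part of the estimate with the real on-flash size, which is why Artem asks how to trigger such a partial flush from inside the filesystem.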

Avoiding Unnecessary Delays

Submitted by Jeremy
on September 27, 2007 - 5:21pm
Linux news

"We don't want to introduce pointless delays in throttle_vm_writeout() when the writeback limits are not yet exceeded, do we?" asked Fengguang Wu as the description of his patch to mm/page-writeback.c. Andrew Morton replied, "this is a pretty major bugfix," explaining, "this patch has the potential to significantly alter the dynamics of the VM behaviour under particular workloads. It might turn up other stuff..." He continued:

"I wonder why nobody noticed this happening. Either a) it turns out that kswapd is doing a good job and such callers don't do direct reclaim much or b) nobody is doing any in-depth kernel instrumentation.

"Now, how _would_ one notice this problem? We don't have very good tools, really. Booting with "profile=sleep" and looking at the profile data would be one way. Repeatedly doing sysrq-T is another. Perhaps the new lockstat-via-lockdep code would allow this to be observed in some fashion, dunno."

Read-only Bind Mounts

Submitted by Jeremy
on September 24, 2007 - 1:59am
Linux news

"This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount," Dave Hansen began, describing his "read-only bind mounts" patches. He continued, "this has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to some filesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancements by making sure that parts of your filesystem [are] read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated."

Christoph Hellwig was interested in seeing the patches get some more testing, "I still think we really want this in -mm. As we've seen at the kernel summit there's a pretty desperate need for it." Andrew Morton noted that the "unprivileged mounts" code was working in the same area, but described that work as "a bit stuck." He suggested, "it sounds like a better approach would be for me to merge the r/o bind mounts code and to drop (or maybe rework) the unprivileged mounts patches." Dave explained that they don't collide much, to which Andrew's reply suggested that the read-only mount patches would be merged into the -mm kernel soon.
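The writer-tracking infrastructure Dave mentions can be pictured as a per-mount counter. The sketch below is a userspace toy, not the kernel implementation: the Mount class and its method names are hypothetical, and refusing a switch to read-only while writers are active is an assumption made for the example.

```python
import errno


class Mount:
    """Toy model of one mount (one view into a filesystem)."""

    def __init__(self, readonly=False):
        self.readonly = readonly
        self.writers = 0

    def want_write(self):
        """Called before any operation that modifies the filesystem."""
        if self.readonly:
            raise OSError(errno.EROFS, "read-only mount")
        self.writers += 1

    def drop_write(self):
        """Called when the modifying operation has finished."""
        assert self.writers > 0, "unbalanced drop_write()"
        self.writers -= 1

    def make_readonly(self):
        """Switch this view to read-only, but only if nobody is writing."""
        if self.writers:
            raise OSError(errno.EBUSY, "mount has active writers")
        self.readonly = True


# Two views of the same filesystem: one writable, one read-only.
rw_view, ro_view = Mount(), Mount(readonly=True)
rw_view.want_write()
rw_view.drop_write()
try:
    ro_view.want_write()
except OSError as err:
    print(errno.errorcode[err.errno])
```

Keeping the read-only flag and the counter on the mount rather than on the filesystem is what lets one view reject writes while another view of the same data stays writable.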

Suspend and Resume with ACPI

Submitted by Jeremy
on September 23, 2007 - 6:35pm
Linux news

"It took me quite a while to realize the real root cause of the VAIO - and probably many other machines - suspend/resume regressions, which were unearthed by the dyntick / clockevents patches," Thomas Gleixner explained regarding two patches for fixing suspend issues that Andrew Morton experienced with his VAIO laptop. He continued, "we disable a lot of ACPI/BIOS functionality during suspend, but we keep the lower idle C-states functionality active across suspend/resume. It seems that this causes trouble with certain BIOSes, but I assume that the problem is more wide spread and just not surfacing due to the various scenarios in which a machine goes into suspend/resume." Thomas concluded, "I really hope that this two patches finally set an end to the 'jinxed VAIO heisenbug series', which started when we removed the periodic tick with the clockevents/dyntick patches."

Linus Torvalds expressed some concerns, "the patches look fine, but I somehow have this slight feeling that you gave up a bit too soon on the '*why* does this happen?' question." He agreed that at that point there was a problem with ACPI, but cautioned that this could be triggered by another bug, "in particular, I also suspect that this may not really fix the problem - maybe it just makes the window sufficiently small that it no longer triggers. Because we don't necessarily understand what the real background for the problem is, I'm not sure we can say that it is solved." Linus concluded, "but hey, I think I'll apply the patches as-is. I'd just feel even better if we actually understood *why* doing the CPU Cx states is not something we can do around the suspend code!"