"This NTFS update fixes the deadlock at mount time reported by several people over the years but it was only recently that someone who reported it actually replied to my response and helped me track it down (I have never been able to reproduce the deadlock)," Anton Altaparmakov explained about a patch against the NTFS filesystem. He summarized the changes:
"The fix was to stop calling ntfs_attr_set() at mount time as that causes balance_dirty_pages_ratelimited() to be called which on systems with little memory actually tries to go and balance the dirty pages which tries to take the s_umount semaphore but because we are still in fill_super() across which the VFS holds s_umount for writing this results in a deadlock.
"We now do the dirty work by hand by submitting individual buffers. This has the annoying 'feature' that mounting can take a few seconds if the journal is large as we have clear it all. One day someone should improve on this by deferring the journal clearing to a helper kernel thread so it can be done in the background but I don't have time for this at the moment and the current solution works fine so I am leaving it like this for now."
Andrew Morton posted his first -mm patchset against the recently released 2.6.23 kernel, preparing for a big merge of patches bound for inclusion in the upcoming 2.6.24 kernel. He noted:
"I've been largely avoiding applying anything since rc8-mm2 in an attempt to stabilise things for the 2.6.23 merge.
"But that didn't stop all the subsystem maintainers from going nuts, with the usual accuracy. We're up to a 37MB diff now, but it seems to be working a bit better."
"There are no 'persons responsible for defending the kernel GPL', there are just a few hundreds or thousands copyright holders of the kernel, and each of them has the right to sue you if he thinks you distribute something that violates his copyright," Adrian Bunk responded in a recent discussion about the legality of linking to GPL'd code in embedded applications. He added, "jurisdiction and applicable copyright law depends on things like where the copyright holder lives and where you distribute it." When it was asked how the constraints of a given piece of hardware might affect the interpretation of the GPL, Theodore T'so explained:
"At the end of the day it all boils down to what is a derived work. If an object file which is designed to link into a kernel is a derived work, then the GPL claims that it will infect across to that derived work. Whether or not it this is a case is a matter of much debate, and as far as I know, no court has ever ruled on point regarding the question of object files, dynamical linking, and whether or not that would be a derived work or not. It seems likely that the answer may vary from one legal jurisdiction to another. Hence, the only answer that we can give which is useful is, 'Take this off of LKML, and go ask a lawyer.'"
"We have seen ramdisk based install systems, where some pages of mapped libraries and programs were suddendly zeroed under memory pressure. This should not happen, as the ramdisk avoids freeing its pages by keeping them dirty all the time," Christian Borntraeger began, explaining the need for his small patch to the ramdisk driver. He continued, "it turns out that there is a case, where the VM makes a ramdisk page clean, without telling the ramdisk driver. On memory pressure shrink_zone runs and it starts to run shrink_active_list. There is a check for buffer_heads_over_limit, and if true, pagevec_strip is called. pagevec_strip calls try_to_release_page. If the mapping has no releasepage callback, try_to_free_buffers is called. try_to_free_buffers has now a special logic for some file systems to make a dirty page clean, if all buffers are clean. Thats what happened in our test case."
He provided two methods for duplicating the reported problem, "you have to make buffer_heads_over_limit true." This is done by either lowering max_buffer_heads or having a system with lots of high memory. He concluded, "The solution is to provide a noop-releasepage callback for the ramdisk driver. This avoids try_to_free_buffers for ramdisk pages."
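The dispatch Christian describes can be modelled in a few lines. This is a simplified sketch, not the kernel implementation: the Page class and the function bodies are assumptions for illustration, and only the function names come from his description.

```python
class Page:
    def __init__(self, buffers_dirty):
        self.dirty = True                  # the ramdisk keeps its pages dirty on purpose
        self.buffers_dirty = buffers_dirty  # one flag per buffer on the page


def try_to_free_buffers(page):
    # Simplified: if every buffer is clean, the page is marked clean as well.
    if not any(page.buffers_dirty):
        page.dirty = False
        return True
    return False


def noop_releasepage(page):
    # Refuse to release and leave the page exactly as it is.
    return False


def try_to_release_page(page, releasepage=None):
    # Use the mapping's callback when one exists, else the generic path.
    if releasepage is not None:
        return releasepage(page)
    return try_to_free_buffers(page)


unprotected = Page(buffers_dirty=[False, False])
try_to_release_page(unprotected)

protected = Page(buffers_dirty=[False, False])
try_to_release_page(protected, releasepage=noop_releasepage)

print(unprotected.dirty, protected.dirty)
```

Without a callback the page silently loses its dirty flag and becomes eligible for reclaim; with the no-op callback the generic path is never reached and the flag survives.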
A recent attempt to push some V4L/DVB updates for inclusion in the 2.6.24 Linux kernel met with some resistance. Linus Torvalds summarized the problems affecting the em28xx video driver:
"I've talked to various people, and none of the main kernel people end up being at all interested in a kernel that has external dependencies on binary blobs for tuners.
"So right now it seems like while I would personally want to have more vendors supprt their own drivers, if that in this case means that we'd have to have user-space and unmaintainable binaries to tune the cards, everybody seems to hate that idea."
Douglas Gilbert announced the 1.02 release of the sdparm utility. Originally written for Linux, it has also been ported to FreeBSD, Solaris, Tru64 and Windows. Douglas described the program:
"sdparm is a command line utility designed to get and set SCSI device parameters (cf hdparm for ATA disks). The parameters are held in mode pages. Apart from SCSI devices (e.g. disks, tapes and enclosures) sdparm can be used on any device that uses a SCSI command set. Almost all CD/DVD drives use the SCSI MMC set irrespective of the transport. sdparm also can decode VPD pages including the device identification page. Commands to start and stop the media; load and unload removable media and some other housekeeping functions are supported."
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant)."
Ingo noted that he was unable to find information as to how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
"Finally. Yeah, it got delayed, not because of any huge issues, but because of various bugfixes trickling in and causing me to reset my 'release clock' all the time. But it's out there now, and hopefully better for the wait," Linus Torvalds said announcing the 2.6.23 kernel. He noted few changes since the last release candidate, "not a whole lot of changes since -rc9, although there's a few updates to mips, sparc64 and blackfin in there. Ignoring those arch updates, there's basically a number of mostly one-liners (mostly in drivers, but there's some networking fixes and soem VFS/VM fixes there too)." Source level changes can be viewed via the gitweb interface. A nice overview of all changes can be found at Kernel Newbies. Linus went on to describe his plan going forward:
"I want this to be what people look at for a few days, but expect the x86 merge to go ahead after that. So far, all indications are still that it's going to be all smooth sailing, but hey, those indicators seem to always say that, and only after the fact do people notice any problems ;)"
Ralf Baechle posted the Linux/MIPS architecture merge plans for the upcoming 2.6.24 kernel. The diffstat for all changes showed, "435 files changed, 14274 insertions(+), 10196 deletions(-)", about which Ralf noted, "the number of patch lines and files is inflated by two large whitespace cleanup patches." He continued:
"The biggest actual changes are the support for tickless kernels on MIPS and the rewrite for many of the timer devices previously used as clocksources and clockevents. Various cleanups, including some moving of code and support for 32-bit Broadcom BCM47XX processors, the return of support for LASAT which isn't quite as unused as previously thought."
"Last month, at the kernel summit, there was discussion of putting a Reviewed-by: tag onto patches to document the oversight they had received on their way into the mainline," began Jonathan Corbet in an effort to define the meaning of the recently introduced
reviewed-by tag. He continued, "that tag has made an occasional appearance since then, but there has not yet been a discussion of what it really means. So it has not yet brought a whole lot of value to the process."
In the continued discussion, it was requested that all commit tags be defined, prompting Jonathan to update his documentation to include Signed-off-by, Acked-by, Cc, and Tested-by along with his documentation for Reviewed-by. He offered the following definition for the new Reviewed-by tag:
"The patch has been reviewed and found acceptible according to the Reviewer's Statement as found at the bottom of this file. A Reviewed-by tag is a statement of opinion that the patch is an appropriate modification of the kernel without any remaining serious technical issues. Any interested reviewer (who has done the work) can offer a Reviewed-by tag for a patch."
"15 partitions (at least for sd_mod devices) are too few," Jan Engelhardt suggested along with a patch to try and make the mounting of an unlimited number of partitions possible. H. Peter Anvin proposed as an alternative, "now when we have 20-bit minors, can't we simply recycle some of the higher bits for additional partitions, across the board? 63 partitions seem to have been sufficient; at least I haven't heard anyone complain about that for 15 years."
Alan Cox explained, "this was proposed ages ago. Al Viro vetoed sparse minors and it has been stuck this way ever since. If you have > 15 partitions use device mapper for it. I'd prefer it fixed but it's arguable that device mapper is the right way to punt all our partitioning to userspace."
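The arithmetic behind the 15 and 63 partition figures can be sketched as follows. The sketch assumes a simple contiguous layout in which the low bits of the minor number select the partition and the remaining bits select the disk; the real drivers are more involved, and the helper names here are hypothetical.

```python
MINORBITS = 20  # width of the minor number, per the "20-bit minors" remark


def layout(partition_bits, minor_bits=MINORBITS):
    """Return (max partitions per disk, disks addressable per major)."""
    slots_per_disk = 1 << partition_bits   # slot 0 stands for the whole disk
    max_partitions = slots_per_disk - 1
    disks = 1 << (minor_bits - partition_bits)
    return max_partitions, disks


def minor_for(disk_index, partition, partition_bits):
    """Pack a disk index and a partition number into one minor number."""
    if not 0 <= partition < (1 << partition_bits):
        raise ValueError("partition does not fit in the reserved bits")
    return (disk_index << partition_bits) | partition


print(layout(4))            # 4 partition bits: 15 partitions per disk
print(layout(6))            # 6 partition bits: 63 partitions per disk
print(minor_for(1, 3, 4))   # second disk, third partition, 4 partition bits
```

Under this assumed layout, moving from 4 to 6 partition bits raises the per-disk limit from 15 to 63 while still leaving 14 bits of a 20-bit minor for the disk index.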
Paul Jackson described a new per-cpuset flag called 'sched_load_balance', "when enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write '0' to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset." Paul went on to explain why the feature is useful:
"1) It provides a mechanism for real time isolation of some CPUs, and
"2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs."
"It looks to be about 2.1% increase in time to do the make/mount/unmount operations with the marker patches in place and no blktrace operations," Alan Brunelle summarized some benchmarks testing the overhead of the kernel markers patches. He continued, "with the blktrace operations in place we see about a 3.8% decrease in time to do the same ops." Block layer maintainer Jens Axboe responded favorably, "thanks for running these numbers. I don't think you have to bother with it more. My main concern was a performance regression, increasing the overhead of running blktrace." He added, "I'd say the above is Good Enough for me," acking the kernel marker patches.
Jens went on to muse, "I do wonder about that performance _increase_ with blktrace enabled. I remember that we have seen and discussed something like this before, it's still a puzzle to me..." Mathieu Desnoyers agreed, "interesting question indeed," going on to suggest possible future tests to understand the unexpected performance increase.
blktrace is a block layer IO tracing tool for providing detailed information about request queue operations, originally developed by Jens Axboe and merged into the mainline kernel in 2.6.17-rc1.
Jeff Garzik posted a series of five patches for the forcedeth driver which he described as, "several proposed updates for testing". Forcedeth is a GPL'd driver for the Ethernet interface of the NVIDIA nForce chipset, originally merged into the 2.4.26 and 2.6.5 Linux kernels. Jeff noted two main goals for the patches:
"1) move the driver towards a more sane, simple, easy to verify locking setup -- irq handler would often acquire/release the lock twice for each interrupt.
"2) to eliminate a rarely used, apparently fragile locking scheme that includes heavy use of disable_irq(). this tool is most often employed during NIC reset/reconfiguration, so satisfying this goal implies changing the way NIC reset and config are accomplished."
Jeff explained that he was looking to get the changes tested, "these are intended for feedback and testing, NOT for merging." He went on to explain that one of the changes included two independent napi_structs, one for receiving and one for transmitting, "I feel TX NAPI is a useful tool, because it provides an independent TX process control point and system load feedback point. Thus I felt this was slightly superior to tasklets. But who knows if this is a good idea? :) I am interested in feedback and criticism on this issue."