It's out there now, or at least in the process of getting mirrored out.
The most obvious change is the (temporary) change of logo to Tuz, the
Tasmanian Devil. But there's a number of driver updates and some m68k
header updates (fixing headers_install after the merge of non-MMU/MMU)
that end up being pretty noticeable in the diffs.
The shortlog (from -rc8, obviously - the full logs from 2.6.28 are too big
to even contemplate attaching here) is appended, and most of the non-logo
changes really shouldn't be all that noticeable to most people. Nothing
really exciting, although I admit to fleetingly considering another -rc
series just because the changes are bigger than I would have wished for
this late in the game. But there was little point in holding off the real
release any longer, I feel.
This obviously starts the merge window for 2.6.30, although as usual, I'll
probably wait a day or two before I start actively merging. I do that in
order to hopefully result in people testing the final plain 2.6.29 a bit
more before all the crazy changes start up again.
Linus
---
Aaro Koskinen (2):
ARM: OMAP: sched_clock() corrected
ARM: OMAP: Allow I2C bus driver to be compiled as a module
Abhijeet Joglekar (2):
[SCSI] libfc: Pass lport in exch_mgr_reset
[SCSI] libfc: when rport goes away (re-plogi), clean up exchanges to/from rport
Achilleas Kotsis (1):
USB: Add device id for Option GTM380 to option driver
Al Viro (1):
net: fix sctp breakage
Alan Stern (2):
USB: usbfs: keep async URBs until the device file is closed
USB: EHCI: expedite unlinks when the root hub is suspended
Albert Pauw (1):
USB: option.c: add ZTE 622 modem device
Alexander Duyck (1):
igb: remove ASPM L0s workaround
Andrew Vasquez (4):
[SCSI] qla2xxx: Correct address range checking for option-rom updates.
[SCSI] qla2xxx: Correct truncation in return-code status checking.
[SCSI] qla2xxx: Correct overwrite ...

I know this has been discussed before:
[129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480 seconds.
[129402.084667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129402.179331] updatedb.mloc D 0000000000000000 0 31092 31091
[129402.179335] ffff8805ffa1d900 0000000000000082 ffff8803ff5688a8 0000000000001000
[129402.179338] ffffffff806cc000 ffffffff806cc000 ffffffff806d3e80 ffffffff806d3e80
[129402.179341] ffffffff806cfe40 ffffffff806d3e80 ffff8801fb9f87e0 000000000000ffff
[129402.179343] Call Trace:
[129402.179353] [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179358] [<ffffffff80493a50>] io_schedule+0x20/0x30
[129402.179360] [<ffffffff802d402b>] sync_buffer+0x3b/0x50
[129402.179362] [<ffffffff80493d2f>] __wait_on_bit+0x4f/0x80
[129402.179364] [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179366] [<ffffffff80493dda>] out_of_line_wait_on_bit+0x7a/0xa0
[129402.179369] [<ffffffff80252730>] wake_bit_function+0x0/0x30
[129402.179396] [<ffffffffa0264346>] ext3_find_entry+0xf6/0x610 [ext3]
[129402.179399] [<ffffffff802d3453>] __find_get_block+0x83/0x170
[129402.179403] [<ffffffff802c4a90>] ifind_fast+0x50/0xa0
[129402.179405] [<ffffffff802c5874>] iget_locked+0x44/0x180
[129402.179412] [<ffffffffa0266435>] ext3_lookup+0x55/0x100 [ext3]
[129402.179415] [<ffffffff802c32a7>] d_alloc+0x127/0x1c0
[129402.179417] [<ffffffff802ba2a7>] do_lookup+0x1b7/0x250
[129402.179419] [<ffffffff802bc51d>] __link_path_walk+0x76d/0xd60
[129402.179421] [<ffffffff802ba17f>] do_lookup+0x8f/0x250
[129402.179424] [<ffffffff802c8b37>] mntput_no_expire+0x27/0x150
[129402.179426] [<ffffffff802bcb64>] path_walk+0x54/0xb0
[129402.179428] [<ffffffff802bfd10>] filldir+0x0/0xf0
[129402.179430] [<ffffffff802bcc8a>] do_path_lookup+0x7a/0x150
[129402.179432] [<ffffffff802bbb55>] getname+0xe5/0x1f0
[129402.179434] [<ffffffff802bd8d4>] user_path_at+0x44/0x80
[129402.179437] [<ffffffff802b53b5>] ...
Ouch - 480 seconds, how much memory is in that machine, and how slow are the disks? What's your vm.dirty_background_ratio and All filesystems seem to suffer from this issue to some degree. I posted to the list earlier trying to see if there was anything that could be done to help my specific case. I've got a system where if someone starts writing out a large file, it kills client NFS writes. Makes the system unusable: http://marc.info/?l=linux-kernel&m=123732127919368&w=2 Only workaround I've found is to reduce dirty_background_ratio and dirty_ratio to tiny levels. Or throw good SSDs and/or a fast RAID array at it so that large writes complete faster. Have you tried the Everyone seems to agree that "autotuning" it is the way to go. But no one seems willing to step up and try to do it. Probably because it's hard to get right! -Dave --
The 480 seconds is not the "wait time" but the time that passes before the message is printed. The kernel default was 120 seconds earlier, but that was changed by Ingo Molnar back in September. I do get a lot less noise, but it really doesn't tell anything about the nature of the problem. The system spec: 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in RAID10 connected using 4Gbit Fibre Channel. I'll leave it up to you to decide if that's fast or slow? The strange thing is actually that the above process (updatedb.mlocate) is writing to / which is a device without any activity at all. All activity is on the Fibre Channel device above, but processes writing outside that seem to be affected as well. 2.6.29-rc8 defaults:
jk@hest:/proc/sys/vm$ cat dirty_background_ratio
5
jk@hest:/proc/sys/vm$ cat dirty_ratio
No.. What would you suggest to be a reasonable setting for that?
> Everyone seems to agree that "autotuning" it is the way to go. But no
> one seems willing to step up and try to do it. Probably because it's
> hard to get right!
I can test patches.. but I'm not a kernel-developer.. unfortunately. Jesper --
That's true - the detector is really simple and only tries to flag suspiciously long uninterruptible waits. It prints out the context it finds but otherwise does not try to go deep about exactly why that delay happened. Would you agree that the message is correct, and that there is some sort of "tasks wait way too long" problem on your system? i think it's fair to say that an almost 10 minutes uninterruptible sleep sucks to the user, by any reasonable standard. It is the year 2009, not 1959. The delay might be difficult to fix, but it's still reality - and that's the purpose of this particular debug helper: to rub reality under our noses, whether we like it or not. ( _My_ personal pain threshold for waiting for the computer is around 1 _second_. If any command does something that i cannot Ctrl-C or Ctrl-Z my way out of i get annoyed. So the historic limit for the hung tasks check was 10 seconds, then 60 seconds. But people argued that it's too low so it was raised to 120 then 480 seconds. If almost 10 minutes of uninterruptible wait is still acceptable then the watchdog can be turned off (because it's basically pointless to run it in that case - no amount of delay will be 'bad'). ) Ingo --
The message is absolutely correct (it was even at 120s).. that's too long That's about the same definition for me. But I can accept that if I happen to be doing something really crazy.. but this is merely about reading some files in and generating indexes out of them. None of the files are "huge".. < 15GB for the top 3, average < 100MB. -- Jesper --
The drives should be fast enough to saturate 4Gbit FC in streaming Ah. Sounds like your setup would benefit immensely from the per-bdi patches from Jens Axboe. I'm sure he would appreciate some feedback On a 32GB system that's 1.6GB of dirty data, but your array should be able to write that out fairly quickly (in a couple seconds) as long as it's not too random. If it's spread all over the disk, write throughput will drop significantly - how fast is data being written to Yeah, your disks aren't keeping up and/or data isn't being written out Me either - but luckily there have been plenty chiming in on this thread now. -Dave --
That's always a good question.. This is by far not the only user of the array at the time of testing.. (there are 4 FC-channels connected to a switch). Creating a fresh slice.. and just dd'ing onto it from /dev/zero gives:
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s
Watching using dstat while dd'ing, it peaks at 220M/s. If I watch the numbers in the "dstat" output in production, it peaks at around the same (130MB/s), but the average is in the 90-100 MB/s range. It has 2GB of battery backed cache. I'm fairly sure that when it was new That's another thing. I haven't been debugging while hitting it (yet), but if I go in and do a sync on the system manually, then it doesn't get above 50MB/s in writeout (measured using dstat). But even that doesn't sum up to 8 minutes .. 1.6GB at 50MB/s ..=> 32 s. -- Jesper --
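The back-of-the-envelope figure above (5% of 32GB is about 1.6GB, which drains in roughly half a minute at 50MB/s) can be reproduced with a short sketch. The function names are made up for illustration, and the model is a simplification that assumes a constant writeout rate, which the thread itself argues real disks do not have.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(total_ram_bytes: int, ratio_percent: float) -> float:
    """Amount of dirty data allowed when the limit is a percentage of RAM."""
    return total_ram_bytes * ratio_percent / 100.0


def drain_seconds(dirty_bytes: float, writeout_bytes_per_sec: float) -> float:
    """Time to write out the dirty data at a constant (assumed) rate."""
    return dirty_bytes / writeout_bytes_per_sec


# Figures taken from the message: 32GB of RAM, a 5% ratio,
# and about 50MB/s observed during a manual sync.
limit = dirty_limit_bytes(32 * GIB, 5)
seconds = drain_seconds(limit, 50 * MIB)
print(f"limit: {limit / GIB:.1f} GiB, drain time: {seconds:.1f} s")
```

Even under this pessimistic rate the result is well under a minute, which is the point being made: raw throughput alone does not explain an eight-minute stall.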
With 2GB of BBC, I'm surprised you are seeing as much latency as you are. It should be able to suck down writes as fast as you can throw Have you also tried increasing the IO priority of the kjournald processes as a workaround as Arjan van de Ven suggests? You must have a significant amount of activity going to that FC array from other clients - it certainly doesn't seem to be performing as well as it could/should be. -Dave --
Yes, but I triple checked.. the memory upgrade hadn't been installed, so No. I'll try to slip that one in. -- Jesper --
Agreed; we probably will need to get some blktrace outputs to see what I'm beginning to think that using a "ratio" may be the wrong way to go. We probably need to add an optional dirty_max_megabytes field where we start pushing dirty blocks out when the number of dirty blocks exceeds either the dirty_ratio or the dirty_max_megabytes, whichever comes first. The problem is that 5% might make sense for a small machine with only 1G of memory, but it might not make so much sense if you have 32G of memory. But the other problem is whether we are issuing the writes in an efficient way, and that means we need to see what is going on at the blktrace level as a starting point, and maybe we'll need some custom-designed trace outputs to see what is going on at the inode/logical block level, not just at the physical block level. - Ted --
We have that. Except it's called "dirty_bytes" and "dirty_background_bytes", and it defaults to zero (off). The problem being that unlike the ratio, there's no sane default value that you can at least argue is not _entirely_ pointless. Linus --
Well, if the maximum time that someone wants to wait for an fsync() to return is one second, and the RAID array can write 100MB/sec, then setting a value of 100MB makes a certain amount of sense. Yes, this doesn't take seek overheads into account, and it may be that we're not writing things out in an optimal order, as Alan has pointed out. But 100MB is a much lower number than 5% of 32GB (1.6GB). It would be better if these numbers were accounted on a per-filesystem basis instead of as a global threshold, but for people who are complaining about huge latencies, it's at least a partial workaround that they can use today. I agree, it's not perfect, but this is a fundamentally hard problem. We have multiple solutions, such as ext4 and XFS's delayed allocation, which some people don't like because applications aren't calling fsync(). We can boost the I/O priority of kjournald which definitely helps, as Arjan has suggested, but Andrew has vetoed that. I have a patch which hopefully is less controversial, that posts writes using WRITE_SYNC instead of WRITE, but which only will help in some circumstances, but not in the distcc/icecream/fast downloads scenarios. We can use data=writeback, but folks don't like the security implications of that. People can call file system developers idiots if it makes them feel better --- sure, OK, we all suck. If someone wants to try to create a better file system, show us how to do better, or send us some patches. But this is not a problem that's easy to solve in a way that's going to make everyone happy; else it would have been solved already. - Ted --
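The sizing argument above (acceptable wait times throughput gives a byte limit) and the earlier "whichever comes first" proposal can be written down as a small sketch. Both helper names are hypothetical, and the second function models the proposal as described in this thread; it is not a description of how the kernel actually combines dirty_ratio and dirty_bytes.

```python
def bytes_limit_for_latency(target_wait_s: float, throughput_bytes_per_s: float) -> int:
    """Dirty-data budget that can be flushed within the target wait,
    assuming the stated throughput holds (seeks are ignored)."""
    return int(target_wait_s * throughput_bytes_per_s)


def effective_limit(ratio_limit_bytes: int, bytes_limit: int) -> int:
    """'Whichever comes first' rule from the proposal: the smaller of the
    ratio-based limit and the absolute limit. Zero means 'absolute limit unset'."""
    if bytes_limit == 0:
        return ratio_limit_bytes
    return min(ratio_limit_bytes, bytes_limit)


MB = 10 ** 6  # decimal megabytes, matching the round numbers in the message

budget = bytes_limit_for_latency(1.0, 100 * MB)   # one second at 100MB/s
ratio_based = 1600 * MB                            # roughly 5% of 32GB
print(effective_limit(ratio_based, budget) // MB, "MB")
```

The sketch inherits the weakness raised in the follow-ups: the throughput figure is an input that someone has to supply, and it varies with the access pattern.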
How are you going to tell the kernel that the RAID array can write 100MB/s? The kernel has no idea. Linus --
Not at boot up, but after it's been using the RAID array for a little
while it could...
Bron (... imagining a tunable "max_fsync_wait_target_centisecs = 100"
which caused the kernel to notice how long flushes were taking
and tune its buffer sizes to be approximately right over time )
--
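The imagined tunable above amounts to a feedback loop: measure how long flushes take, then nudge the dirty limit toward a size that would have met the target. A minimal sketch of one such update step follows. The function, its parameters, and the damping factor are all assumptions made for illustration; nothing like this is claimed to exist in the kernel.

```python
def retune_limit(current_limit: int,
                 observed_flush_s: float,
                 target_s: float,
                 min_limit: int,
                 max_limit: int,
                 gain: float = 0.5) -> int:
    """Return a new dirty limit after one observed flush.

    The 'ideal' limit scales the current one by target/observed; the gain
    damps the step so a single slow flush does not cause a wild swing.
    """
    if observed_flush_s <= 0:
        return current_limit  # no usable measurement, leave the limit alone
    ideal = current_limit * target_s / observed_flush_s
    stepped = current_limit + gain * (ideal - current_limit)
    return int(min(max_limit, max(min_limit, stepped)))


# A flush that took twice the one-second target shrinks the limit,
# and one that took half the target grows it.
print(retune_limit(1000, 2.0, 1.0, 100, 10000))
print(retune_limit(1000, 0.5, 1.0, 100, 10000))
```

Because the loop only reacts to past flushes, it tracks whatever access pattern the workload recently had rather than assuming a single throughput number for the device.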
This tuning logic is the core of what Josef Bacik did for the transaction batching code for ext4.... ric --
userspace can do it quite easily. Run a self-tuning script after installation and when the disk hardware changes significantly. It is very disappointing that nobody appears to have attempted to do _any_ sensible tuning of these controls in all this time - we just keep thrashing around trying to pick better magic numbers in the base kernel. Maybe we should set the tunables to 99.9% to make it suck enough to motivate someone. --
Uhhuh. "user space can do it". That's the global cop-out. The fact is, user-space isn't doing it, and never has done anything even _remotely_ like it. In fact, I claim that it's impossible to do. If you give me a number for the throughput of your harddisk, I will laugh in your face and call you a moron. Why? Because no such number exists. It depends on the access patterns. If you write one large file, the number will be very different (and not just by a few percent) from the numbers of you writing thousands of small files, or re-writing a large database in random order. So no. User space CAN NOT DO IT, and the fact that you even claim The only times tunables have worked for us is when they auto-tune. IOW, we don't have "use 35% of memory for buffer cache" tunables, we just dynamically auto-tune memory use. And no, we don't expect user space to run some "tuning program for their load" either. Linus --
userspace can get closer. Even if it's asking the user "what sort of applications will this machine be running" and then use a set of canned tunables based on that. Better would be to observe system behaviour, perhaps in real time and This particular case is exceptional - it's just too hard for the kernel to be able to predict the future for this one. It wouldn't be terribly hard for a userspace daemon to produce better results than we can achieve in-kernel. That might of course require additional kernel work to support it well. --
Andrew, that's SIMPLY NOT TRUE. You state that without any amount of data to back it up, as if it was some Not by user space they aren't, and not dynamically. At least not as well as they are for the kernel. So when you say "user space can do it better", you base that statement on exactly what? The night-time whisperings of the small creatures living in your basement? The fact is, user space can't do better. And perhaps equally importantly, we have 16 years of history with user space tuning, and that history tells us unequivocally that user space never does anything like this. Name _one_ case where even simple tuning has happened, and where it has actually _worked_? I claim you cannot. And I have counter-examples. Just look at the utter fiasco that was user-space "tuning" of nice-levels that distros did. Ooh. Yeah, it didn't work so well, did it? Especially not when the kernel changed subtly, and the "tuning" that had been done was shown to be We've never even tried. The dirty limit was never about trying to tune things, it started out as protection against deadlocks and other catastrophic failures. We used to allow 50% dirty or something like that (which is not unlike our old buffer cache limits, btw), and then when we had a HIGHMEM lockup issue it got severely cut down. At no point was that number even _trying_ to limit latency, other than as a "hey, it's probably good to not have all memory tied up in dirty pages" kind of secondary way. I claim that the whole balancing between inodes/dentries/pagecache/swap/anonymous memory/what-not is likely a much harder problem. And no, I'm not claiming that we "solved" that problem, but we've clearly done a pretty good job over the years of getting to a reasonable end result. Sure, you can still tune "swappiness" (nobody much does), but even there you don't actually tune how much memory you use for swap cache, you do more of a "meta-tuning" where you tune how the auto-tuning works. That ...
I've seen you repeatedly fiddle the in-kernel defaults based on in-field experience. That could just as easily have been done in initscripts by distros, and much more effectively because it doesn't need a new kernel. That's data. The fact that this hasn't even been _attempted_ (afaik) is deplorable. Why does everyone just sit around waiting for the kernel to put a new value into two magic numbers which userspace scripts could have set? My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio That's different. It's inherent JBD/ext3-ordered brain damage. Unfixable without turning the fs into something which just isn't jbd/ext3 any more. data=writeback is a workaround, with the obvious integrity issues. The JBD journal is a massive designed-in contention point. It's why for several years I've been telling anyone who will listen that we need a new fs. Hopefully our response to all these problems will soon be "did you try btrfs?". --
On Thu, Mar 26, 2009 at 6:25 PM, Andrew Morton The only people who bother to tune those values are people who get annoyed enough to do the research to see if it's something that's tunable - hackers. Everyone else simply says "man, Linux *sucks*" and lives life hoping it will get better some day. From posts in this thread - even most developers just live with it, and have been doing so for *years*. Even Linux distros don't bother modifying init scripts - they patch them into kernel instead. I routinely watch Fedora kernel changelogs and found these comments in the changelog recently:
* Mon Mar 23 2009 xx <xx@xx.xx> 2.6.29-2
- Change default swappiness setting from 60 to 30.
* Thu Mar 19 2009 xx <xx@xx.xx> 2.6.29-0.66.rc8.git4
- Raise default vm dirty data limits from 5/10 to 10/20 percent.
Why are they going in the kernel package instead of /etc/sysctl.conf? Why is Fedora deviating from upstream? (probably sqlite performance) Maybe there's a good reason to put them into the kernel - for some reason the latest kernels perform better with those values where the previous ones didn't. But still - why ship those 2 bytes of configuration in a 75MB package instead of one that could be a fraction of that size? Does *any* distro fiddle those bits in userspace instead of patching the kernel? -Dave --
Given that the optimal values of these tunables often seems to vary between kernel versions, it's easier to just put them in the kernel. -- Matthew Garrett | mjg59@srcf.ucam.org --
On Thu, Mar 26, 2009 at 07:21:08PM -0700, David Rees wrote:
> * Mon Mar 23 2009 xx <xx@xx.xx> 2.6.29-2
> - Change default swappiness setting from 60 to 30.
>
> * Thu Mar 19 2009 xx <xx@xx.xx> 2.6.29-0.66.rc8.git4
> - Raise default vm dirty data limits from 5/10 to 10/20 percent.
>
> Why are they going in the kernel package instead of /etc/sysctl.conf?
At least in part, because rpm sucks. If a user has edited /etc/sysctl.conf, upgrading the initscripts package won't change that file. Dave --
If there's a sensible default then it belongs in the kernel. Forcing these decisions out to userspace just means that every distribution needs to work out what these settings are, and the evidence we've seen when they attempt to do this is that we end up with things like broken cpufreq parameters because these are difficult problems. The simple reality is that almost every single distribution lacks developers with sufficient understanding of the problem to make the correct choice. The typical distribution lifecycle is significantly longer than a kernel If the distribution can set a globally correct value then that globally And how have you got these values pushed into other distributions? Is your rc.local available anywhere? Linus is absolutely right here. Pushing these decisions out to userspace means duplicated work in the best case - in the worst case it means most users end up with the wrong value. -- Matthew Garrett | mjg59@srcf.ucam.org --
.. and as a result you're also testing something that nobody else is. Look at the complaints from people about fsync behavior that Ted says he cannot see. Let me guess: it's because Ted probably has tweaked his environment, because he is advanced. As a result, other people see problems, he does not. That's not "advanced". That's totally f*cking broken. Having different distributions tweak all those tweakables is just even _more_ so. It's the anti-thesis of "advanced". It's just stupid. We should aim to get it right. The "user space can tweak any numbers they want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but more importantly, it's a cop-out that doesn't even work, and that just results in everybody having different setups. Then nobody is happy. Linus --
In fact it results in "everybody" just having the distro defaults, which in some cases then depend on things like which particular version they initially installed things with (because some decisions end up being codified in long-term memory by that initial install - like the size of the journal when you mkfs'd your filesystem, or the alignment of your partitions, or whatever). The exception, of course, ends up being power-users that then tweak things on their own. Me, I may be a power user, but I absolutely refuse to touch default values. If they are wrong, they should be fixed. I don't want to add "relatime" to my /etc/fstab, because then the next time I install, I'll forget - and if I really need to do that, then the kernel should have already done it for me as the default choice. I also don't want to say that "Fedora should just do it right" (I'll complain about things Fedora does badly, but not setting magic values in /proc is not one of them), because then even if Fedora _were_ to get things right, others won't. Or even worse, somebody will point that SuSE or Ubuntu _did_ do it right, but the distro I happen to use is doing the wrong thing. And yes, I could do my own site-specific tweaks, but again, why should I? If the tweak really is needed, I should put it in the generic kernel. I don't do anything odd. End result: regardless of scenario, depending on user-land tweaking is always the wrong thing. It's the wrong thing for distributions (they'd all need to do the exact same thing anyway, or chaos reigns, so it might as well be a kernel default), and it's the wrong thing for individuals (because 99.9% of individuals won't know what to do, and the remaining 0.1% should be trying to improve _other_ peoples experiences, not just their own!). The only excuse _ever_ for user-land tweaking is if you do something really odd. Say that you want to get the absolutely best OLTP numbers you can possibly get - with no regards for _any_ other ...
while I agree with most of what you say, I'll point out that many enterprise servers really do care about one particular workload to the exclusion of everything else. if you can get another 10% performance by tuning your box for an OLTP workload and make your cluster 9 boxes instead of 10 it's well worth it (it ends up being better response time for users, less hardware, and avoiding software license costs most of the time). this is somewhere between benchmarking and embedded, but it is a valid case. most users (even most database users) don't need to go after that last little bit of performance, the defaults should be good enough for most users, no matter what workload they are running. David Lang --
Three reasons. Firstly, this utterly does not scale. Microsoft has built an empire on the 'power of the default settings' - why cannot Linux kernel developers finally realize the obvious: that setting defaults centrally is an incredibly efficient way of shaping the end result? The second reason is that in the past 10 years we have gone through a couple of toxic cycles of distros trying to work around kernel behavior by setting sysctls. That was done and then forgotten, and a few years down the line some kernel maintainer found [related to a bugreport] that distro X set that sysctl to value Y which now had a different behavior and immediately chastised the distro as broken and refused to touch the bugreport and refused bugreports from that distro from that point on. We've seen this again, and again, and i remember 2-3 specific examples and i know how badly this experience trickles down on the distro side. The end result: pretty much any tuning of kernel defaults is done extremely reluctantly by distros. They consider kernel behavior a domain of the kernel, and they don't generally risk going away from the default. [ In other words, distro developers understand the 'power of defaults' a lot better than kernel developers ... ] This is also true in the reverse direction: they don't actually mind the kernel doing a central change of policy, if it's a general step forward. Distro developers are very practical, and they are a lot less hardline about the sacred Unix principle of separation of kernel from policy. Thirdly: the latency of getting changes to users. A new kernel is released every 3 months. Distros are released every 6 months. A new Firefox major version is released about once a year. A new major GCC is released every three years. Given the release frequency and given our goal to minimize the latency of getting improvements to users, which of these projects is best suited to introduce a new default value? [and no, such changes are ...
Oh I look forward to the day when it will be safe to convert my mythtv box from ext3 to btrfs. Current kernels just have too much IO latency with ext3 it seems. Older kernels were more responsive, but probably had other places they were less efficient. -- Len Sorensen --
On Wed, 1 Apr 2009 17:03:38 -0400 Back in 2002ish I did a *lot* of work on IO latency, reads-vs-writes, etc, etc (but not fsync - for practical purposes it's unfixable on ext3-ordered) Performance was pretty good. From some of the descriptions I'm seeing get tossed around lately, I suspect that it has regressed. It would be useful/interesting if people were to rerun some of these tests with `echo anticipatory > /sys/block/sda/queue/scheduler'. Or with linux-2.5.60 :( --
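Before and after running the `echo anticipatory > ...` test suggested above, it helps to confirm which IO scheduler is actually active. The sketch below parses the contents of the same sysfs file, assuming the usual format in which all available schedulers are listed and the active one is wrapped in square brackets. The function names and the sample string are illustrative, and the device name `sda` is simply the one used in the message.

```python
from typing import List, Optional, Tuple


def parse_scheduler_line(line: str) -> Tuple[Optional[str], List[str]]:
    """Split a line such as 'noop anticipatory deadline [cfq]' into
    (active scheduler, list of all available schedulers)."""
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device: str = "sda") -> Tuple[Optional[str], List[str]]:
    """Read and parse the scheduler file for one block device."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler_line(f.read())


# Sample content, as it might look on a kernel of this era:
print(parse_scheduler_line("noop anticipatory deadline [cfq]\n"))
```

Checking the list first also shows whether the scheduler being tested was built into the running kernel at all.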
Well 2.6.18 seems to keep popping up as the last kernel with "sane" behaviour, at least in terms of not causing huge delays under many workloads. I currently run 2.6.26, although that could be updated as soon as I get around to figuring out why lirc isn't working for me when I move past 2.6.26. I could certainly try changing the scheduler on my mythtv box and seeing if that makes any difference to the behaviour. It is pretty darn obvious whether it is responsive or not when starting to play back a video. -- Len Sorensen --
.. My Myth box here was running 2.6.18 when originally set up, and even back then it still took *minutes* to delete large files. So that part hasn't really changed much in the interim. Because of the multi-minute deletes, the distro shutdown scripts would fail, and power off the box while it was still writing to the drives. Ouch. That system has had XFS on it for the past year and a half now, and for Myth, there's no reason not to use XFS. It's great! Cheers --
Mythtv has a 'slow delete' option that I believe works by slowly truncating the file. Seems they believe that ext3 is bad at handling large file deletes, so they try to spread out the pain. I don't remember if that option is on by default or not. I turned it off. -- Len Sorensen --
.. That option doesn't make much difference for the shutdown failure. And with XFS there's no need for it, so I now have it "off". Cheers --
yeah. There's a dirty hack you can do where you append one byte to the file every 4MB, across 1GB (say). That will then lay the file out on-disk as
one bitmap block
one data block
one bitmap block
one data block
one bitmap block
one data block
one bitmap block
one data block
<etc>
lots-of-data-blocks
So when the time comes to delete that gigabyte, the bitmap blocks are only one block apart, and reading them is much faster. That was one of the gruesome hacks I did way back when I was in the streaming video recording game. Another was the slow-delete thing.
- open the file
- unlink the file
- now sit in a loop, slowly nibbling away at the tail with ftruncate() until the file is gone.
The open/unlink was there so that if the system were to crash midway, ext3 orphan recovery at reboot time would fully delete the remainder of the file. Another was to add an ioctl to ext3 to extend the file outside EOF, but only metadata - the corresponding data blocks are left uninitialised. That permitted large amounts of data blocks to be allocated to the file with high contiguity, fixing the block-intermingling problems when ext3 is writing multiple files (which reservations later addressed). This is of course insecure, but that isn't a problem on an embedded/consumer black box device. ext3 sucks less nowadays, but it's still a hard vacuum. --
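The slow-delete steps described above (open, unlink, then nibble at the tail with ftruncate()) map directly onto standard POSIX calls. The following is a minimal sketch of that idea only; the function name, chunk size, and pause are arbitrary choices for illustration, and it is not the code that was actually used.

```python
import os
import time


def slow_delete(path: str, chunk_bytes: int = 64 * 1024 * 1024,
                pause_s: float = 0.1) -> int:
    """Delete a large file gradually and return the number of truncate steps.

    The file is unlinked while still held open, so its blocks are only
    released as the loop shrinks it, and a crash midway leaves an orphan
    for the filesystem to clean up rather than a half-deleted visible file.
    """
    fd = os.open(path, os.O_WRONLY)
    steps = 0
    try:
        os.unlink(path)                       # name disappears immediately
        size = os.fstat(fd).st_size
        while size > 0:
            size = max(0, size - chunk_bytes)
            os.ftruncate(fd, size)            # release one chunk from the tail
            steps += 1
            time.sleep(pause_s)               # spread the IO out over time
    finally:
        os.close(fd)
    return steps
```

A caller would pick `chunk_bytes` and `pause_s` to trade total deletion time against how much other IO gets starved; the later messages describe what happens when the chunks are too small.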
.. That's similar to what Mythtv currently does. Except it nibbles away in painfully tiny chunks, so deleting takes hours that way. Which means it's still in progress when the system auto-shutdowns between uses. So the delete process gets killed, and the subsequent remount,ro and umount calls simply fail (fs is still busy), and it then powers off while the drive light is still solidly busy. That's where I modified the shutdown script to check the result code, sleep, and loop again, for up to five minutes before pulling the plug. But switching to xfs cured all of that. :) --
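The shutdown-script change described above (check the result code, sleep, and try again for up to five minutes) is a plain bounded retry loop. A rough sketch of that pattern follows; the helper name and its parameters are invented for illustration, and the original was presumably a shell script rather than Python.

```python
import time
from typing import Callable


def retry_until_success(action: Callable[[], int],
                        timeout_s: float = 300.0,
                        interval_s: float = 5.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> bool:
    """Call action() until it returns exit status 0 or the timeout expires.

    Returns True on success and False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if action() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a shutdown path the action would be something like `lambda: subprocess.call(["umount", "/some/mountpoint"])`, where the mount point is a placeholder, and power-off would proceed only after the function returns.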
.. I think it does the equivalent of that today. Problem is, if you do the unlink without the nibbling, then the disk locks up the system cold for 2-3 minutes until the disk delete actually completes. -ml --
I'll test this (and the other suggestions) once i'm out of the merge I probably won't test that though ;-) Going back to v2.6.14 to do pre-mutex-merge performance tests was already quite a challenge on modern hardware. Ingo --
Well after a day of running my mythtv box with anticipatory rather than the default cfq scheduler, it certainly looks a lot better. I haven't seen any slowdowns, the disk activity light isn't on solidly (it just flashes every couple of seconds instead), and it doesn't even mind me launching bittornado on multiple torrents at the same time as two recordings are taking place and some commercial flagging is taking place. With cfq this would usually make the system unusable (and a Q6600 with 6GB ram should never be unresponsive in my opinion). So so far I would rank anticipatory at about 1000x better than cfq for my work load. It sure acts a lot more like it used to back in 2.6.18 times. -- Len Sorensen --
Jens - remind us what the problem with AS was wrt CFQ? There's some write throttling in CFQ, maybe it has some really broken case? Linus --
CFQ was just faster, plus it supported things like io priorities that AS Who knows, it's definitely interesting and something to look into why AS performs that differently to CFQ on his box. Lennart, can you give some information on what file system + mount options, disk drive(s), etc? A -- Jens Axboe --
btw., while pluggable IO schedulers have their upsides:
- They are easier to test during development and deployment.
- The uptake of a new, experimental IO scheduler is faster due to
easier availability.
- Regressions in the primary IO scheduler are easier to prove.
And the technical case for pluggable IO schedulers is much stronger
than the case for pluggable process schedulers:
- Persistent media has persistent workloads - and each workload has
different access patterns.
- The inefficiencies of mixed workloads on the same rotating media
have forced a clear separation of the 'one disk, one workload'
usage model, and has hammered this down people's minds. (Nobody
in their right mind is going to put a big Oracle and SAP
installation on the same [rotating] disk.)
- the 'NOP' scheduler makes sense on media with RAM-like
properties. 90% of CFQ's overhead is useless fluff on such media.
- [ These properties are not there for CPU schedulers: CPUs are
data processors not persistent data storage so they are
fundamentally shared by all workloads and have a lot less
persistent state - so mixing workloads on CPUs is common and
having one good scheduler is paramount. ]
At the risk of restarting the "to plug or not to plug" scheduler
flamewars ;-), the pluggable IO scheduler design has its very clear
downsides as well:
- 99% of users use CFQ, so any bugs in it will hit 99% of the Linux
community and we have not actually won much in terms of helping
real people out in the field.
- We are many years down the road of having replaced AS with the
supposedly better CFQ - and AS is still (or again?) markedly
better for some common tests.
- The 1% of testers/users who find that CFQ sucks and track it down
to CFQ can easily switch back to another IO scheduler: NOP or AS.
This dilutes the quality of _CFQ_, our crown jewel IO scheduler:
as it removes critical participants from the pool ...
I rarely disagree with you, and more rarely feel like arguing a point in public, but you are basing your whole opinion on the premise that it is possible to have one io scheduler which handles all cases. And that seems obviously wrong, because you address different types of activity with tuning or adapting, in some cases you need a whole different approach, and you need to lock in that approach even if some metric says something else would be better for the "better" seen by I think that by trying to create "one size fits all" you will hit a significant number of cases where it really doesn't fit well and you have so many tuning features both automatic and manual that you wind up with code which is big, inefficient, confusing to tune, hard to maintain, and generally not optimal for any one thing. What we have is easy to test and the behavior is different enough in most cases that you can tell which is best, or at least that a change didn't help. I have watched long threads and chats about tuning VM (dirty_*, swappiness, etc) to be aware that in most cases either faster disk or more memory is the answer, not tuning to be "less unsatisfactory." Several distinct io schedulers is good, one complex bland one would not be. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
Faster at what? I am now wondering if switching the servers at work to anticipatory will make them be more responsive when an rsnapshot run is done (which it is every 3 hours). That would provide another data point. It is currently very easy to tell when 10:00, 13:00, 16:00 and 19:00
Well the system is setup like this:
Core 2 Quad Q6600 CPU (2.4GHz quad core).
Asus P5K mainboard (Intel P35 chipset)
6GB of ram
PVR500 dual NTSC tuner pci card
4 x 500GB WD5000AAKS SATA drives
25GB sda1 + sdb1 raid1 for /
25GB sdc1 + sdd1 raid1 for /home
remaining as sda2 + sdb2 + sdc2 + sdd2 raid5 for LVM.
1.2TB /var uses most of the LVM for mythtv storage and other data
6GB swap on LVM
94GB test volume on LVM (this uses ext4 but is hardly ever used)
all filesystems other than the test one are ext3
I run the ICH9 in AHCI mode since in IDE mode it doesn't do 64bit DMA and the bounce buffers seemed to be having issues keeping up.
So normal use of the machine is:
mythtv-backend + mysql takes care of the mythtv recording work.
mythtv-frontend with output on an nvidia 8600GT (with proprietary drivers in use)
commercial flagging and some transcode to mpeg4 for shows I keep for a while run in parallel, since after all there are 4 cores to use.
folding@home running smp (using all 4 cores) at idle priority
bittornado running with many torrents running slowly seeding (I limit it to 5kb/s up at all times due to monthly caps on transfers from my ISP, so this way it can do something consistently without going over). It probably has 300GB worth of files being seeded at the moment.
So when I first built the machine it ran really nicely, but I think that was with 2.6.16 or 2.6.18 or so. It was a while ago. It worked quite well, responsiveness was good, etc. 2.6.24 - 2.6.26 has been not so great. Well until I switched the ioscheduler a couple of days ago.
So the behaviour with cfq is: Disk light seems to be constantly on if there is any disk activity. iotop can show a total ...
.. Lennart, I wonder if the problem with your system is really a Myth/driver issue? Curiously, I have a HVR-1600 card here, and when recording analog TV with it the disk lights are on constantly. The problem with it turns out to be mythbackend doing fsync() calls ten times a second. My other tuner cards don't have this problem. So perhaps the PVR-500 triggers the same buggy behaviour as the HVR-1600? To work around it here, I decided to use a preload library that replaces the frequent fsync() calls with a more moderated behaviour: http://rtr.ca/hvr1600/libfsync.tar.gz Grab that file and try it out. Instructions are included within. Report back again and let us know if it makes any difference. Someday I may try and chase down the exact bug that causes mythbackend to go fsyncing berserk like that, but for now this workaround is fine. Cheers --
Well if it is the real cause of the bad behaviour then it would certainly be good to track down. -- Len Sorensen --
mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start (Sync method... uses fdatasync if available, fsync if not). mythtv is definitely a candidate for sync_file_range() style output, IMO. Jeff --
Just curious, does MythTV need fsync(), or merely to tell the kernel to begin asynchronously writing data to storage? sync_file_range(..., SYNC_FILE_RANGE_WRITE) might be enough, if you do not need to actually wait for completion. This may be the case, if the idea behind MythTV's fsync(2) is simply to prevent the kernel from building up a huge amount of dirty pages in the pagecache [which, in turn, produces bursty write-out behavior]. Jeff --
quoting the ThreadedFileWriter comments
/*
* NOTE: This doesn't even try flush our queue of data.
* This only ensures that data which has already been sent
* to the kernel for this file is written to disk. This
* means that if this backend is writing the data over a
* network filesystem like NFS, then the data will be visible
* to the NFS server after this is called. It is also useful
* in preventing the kernel from buffering up so many writes
* that they steal the CPU for a long time when the write
* to disk actually occurs.
see above, we care only about the write-out. The f{data}*sync calls are
already in a separate thread doing nothing else.
Janne
--
There is no need to fsync data on a NFS mount in Linux anymore. All NFS mounts are mounted sync by default now unless you explicitly specify otherwise (and then you should then know what you're getting in to). -Dave --
If all you want to do is _start_ the write-out from kernel to disk, and let the kernel handle it asynchronously, SYNC_FILE_RANGE_WRITE will do that for you, eliminating the need for a separate thread. If you need to wait for the data to hit disk, you will need the other SYNC_FILE_RANGE_xxx bits. On a related subject, reads: consider posix_fadvise(POSIX_FADV_SEQUENTIAL) and/or readahead(2) for optimizing the reading side of things. Jeff --
It may not eliminate the need for a separate thread. SYNC_FILE_RANGE_WRITE will still block on things. It just will block on _much_ less than fsync. In particular, it will block on: - actually queuing up the IO (ie we need to get the bio, request etc all allocated and queued up) - if a page is under writeback, and has been marked dirty since that writeback started, we'll wait for that IO to finish in order to start a new one. and depending on load, both of these things _can_ be issues and you might still want to do the SYNC_FILE_RANGE_WRITE as a async thread separate from the main loop so that the latency of the main loop is not affected by that. But the latencies will be _much_ smaller issues than with f[data]sync(), though, especially if you're not ever really hitting the limits on the disk subsystem. Because those will additionally - wait for all old writeback to complete (whether the page was dirtied after the writeback started or not) - additionally, wait for all the new writeback it started. - wait for the metadata too (fsync()). so they are pretty much _guaranteed_ to sleep for actual IO to complete I doubt POSIX_FADV_SEQUENTIAL will do very much. The kernel tends to figure out the read patterns on its own pretty well. Of course, explicit readahead() can be noticeable for the right patterns. Linus --
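The read-side hints mentioned in the two messages above could be sketched roughly as follows. This is only an illustration: the helper names hint_sequential() and prefetch() are made up here, and whether either call helps depends on the access pattern (the reply above expects little from POSIX_FADV_SEQUENTIAL and more from explicit readahead).

```c
#define _GNU_SOURCE             /* needed for readahead() */
#include <fcntl.h>

/*
 * Hypothetical helper: tell the kernel the whole file will be read
 * sequentially. posix_fadvise() returns 0 or an error number; it does
 * not set errno.
 */
int hint_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/*
 * Hypothetical helper: ask the kernel to pull [offset, offset + len)
 * into the page cache ahead of the reader. Returns 0 on success and
 * -1 with errno set on failure.
 */
int prefetch(int fd, off_t offset, size_t len)
{
    return (int)readahead(fd, offset, len);
}
```

A player would typically call hint_sequential() once after open() and prefetch() for the next window it expects to need; both are hints, so a failure can usually be ignored.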
The *only* reason MythTV fsyncs (or fdatasyncs) the data to disk all the time is to keep a large amount of dirty pages from building up and then causing horrible latencies when that data starts getting flushed to disk. A typical example of this would be that MythTV is recording a show in the background while playing back another show. When the dirty limit is hit and data gets flushed to disk, this would keep the read buffer on the player from happening fast enough and then playback would stutter. Instead of telling people ext3 sucks - mount it in writeback or use xfs or tweak your vm knobs, they simply put a hack in there instead which largely eliminates the effect. I don't think many people would care too much if they lost 30-60 seconds of their recorded TV show if the system crashes for whatever reason. -Dave --
Jeff, could you please try following patch for 0.21 or update to the
latest trunk revision. I don't have a way to reproduce the high
latencies with fdatasync on ext3, data=ordered. Doing a parallel
"dd if=/dev/zero of=file" on the same partition introduces even with
sync_file_range latencies over 1 second.
Janne
---
Index: configure
===================================================================
--- configure (revision 20302)
+++ configure (working copy)
@@ -873,6 +873,7 @@
sdl_video_size
soundcard_h
stdint_h
+ sync_file_range
sys_poll_h
sys_soundcard_h
termios_h
@@ -2413,6 +2414,17 @@
int main( void ) { return (round(3.999f) > 0)?0:1; }
EOF
+# test for sync_file_range (linux only system call since 2.6.17)
+check_ld <<EOF && enable sync_file_range
+#define _GNU_SOURCE
+#include <fcntl.h>
+
+int main(int argc, char **argv){
+ sync_file_range(0,0,0,0);
+ return 0;
+}
+EOF
+
# test for sizeof(int)
for sizeof in 1 2 4 8 16; do
check_cc <<EOF && _sizeof_int=$sizeof && break
Index: libs/libmythtv/ThreadedFileWriter.cpp
===================================================================
--- libs/libmythtv/ThreadedFileWriter.cpp (revision 20302)
+++ libs/libmythtv/ThreadedFileWriter.cpp (working copy)
@@ -18,6 +18,7 @@
#include "ThreadedFileWriter.h"
#include "mythcontext.h"
#include "compat.h"
+#include "mythconfig.h"
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
#define HAVE_FDATASYNC
@@ -122,6 +123,7 @@
// file stuff
filename(QDeepCopy<QString>(fname)), flags(pflags),
mode(pmode), fd(-1),
+ m_file_sync(0), m_file_wpos(0),
// state
no_writes(false), flush(false),
in_dtor(false), ignore_writes(false),
@@ -154,6 +156,8 @@
buf = new char[TFW_DEF_BUF_SIZE + 1024];
bzero(buf, TFW_DEF_BUF_SIZE + 64);
+ m_file_sync = m_file_wpos = ...
Is dd + sync_file_range really a realistic comparison? dd is streaming as fast as the disk can output data, whereas MythTV is streaming as fast as video is being recorded. If you are maxing out your disk throughput, there will be obvious impact no matter what. I would think a more accurate comparison would be recording multiple video streams in parallel, comparing fsync/fdatasync/sync_file_range? IOW, what is an average MythTV setup -- what processes are actively reading/writing storage? Where are you noticing latencies, and does sync_file_range decrease those areas of high latency? Jeff --
sure, I tried simulating a case where the fsync/fdatasync from mythtv
impose high latencies on other processes due to syncing other big writes
I tested 3 simultaneous recordings and haven't noticed a difference. I'm
not even sure if I should. With multiple recordings at the same time mythtv
would also call fdatasync multiple times per second.
I guess I could compare how long fdatasync and sync_file_range with
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER are blocking. Not that mythtv would care
writing:
mythbackend - recordings and preview images (2-20mbps)
reading:
mythfrontend - viewing (2-20mbps)
mythcommflag - faster than viewing, maybe up to 50mbps (depending on cpu)
writing+reading:
mythtranscode - combined rate less than 50mbps, usually more reads than
writes (depending on cpu)
I don't notice latencies in mythtv, at least none for which file systems
or the block layer can be blamed. But my setup is built to avoid
these. Mythtv records to its own disks formatted with xfs. Mythtv
generally tries to spread simultaneous recordings over different file
systems. The tests were on a different system though.
Janne
--
I am going to give the patch a shot. I run dual tuners after all, so I do get multiple streams recording while doing playback at the same time. -- Len Sorensen --
Hi, Lennart, Could you try one more test, please? Switch back to CFQ and set /sys/block/sdX/queue/iosched/slice_idle to 0? I'm not sure how the applications you are running write to disk, but if they interleave I/O between processes, this could help. I'm not too confident that this will make a difference, though, since CFQ changed to time-slice based instead of quantum based before 2.6.18. Still, it would be another data point if you have the time. Thanks in advance! -Jeff --
I actually am running cfq at the moment, but with Mark's (I think it was) preload library to change fsync calls to at most one per 5 seconds instead of 10 per second. So far that has certainly made things a lot better as far as I can tell. Maybe not as good as anticipatory seemed to be but certainly better. I can try your suggestion too. Well when recording two shows at once, there will be two processes streaming to separate files, and usually there will be two commercial flagging processes following behind reading those files and doing mysql No problem. If it solves this bad behaviour, it will be all worth it. -- Len Sorensen --
.. Yeah, I think the sync_file_range() patch is the way to go. It seems to be smooth enough here with four or five simultaneous recordings, a couple of commflaggers, and an HD playback all happening at once. Cheers --
Well would be worth a try. So far I am not sure if the slice_idle works or not. I will have to try playback when I get home and see how it feels. -- Len Sorensen --
You could convert it to xfs now. xfs is probably the file system with the lowest complaints-to-usage ratio within the MythTV community. MythTV calls fsync every few seconds on ongoing recordings to prevent stalls due to large cache writebacks on ext3. cheers Janne (MythTV developer) --
It should use sync_file_range(SYNC_FILE_RANGE_WRITE). That will - have minimum latency. It tries to avoid blocking at all. - avoid writing metadata - avoid syncing other unrelated files within ext3 - avoid waiting for the ext3 commit to complete. --
MythTV actually uses fdatasync, not fsync (or at least that's what it did last time I looked at the source). Not sure how the behavior of fdatasync compares to sync_file_range. Either way - forcing the data to be synced to disk a couple times every second is a hack and causes fragmentation in filesystems without delayed allocation. Fragmentation really goes up if you are recording multiple shows at once. -Dave --
fdatasync() _waits_ for the data to hit the disk. sync_file_range() just starts writeout. It _can_ do more - you can also ask for it to wait for previous write-out in order to start _new_ writeout, or wait for the result, but you wouldn't want to, not for something like this. sync_file_range() is really a much nicer interface, and is a more extended fdatasync() that actually matches what the kernel does internally. You can think of fdatasync(fd) as a sync_file_range(fd, 0, ~0ull, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER); and then you see why fdatasync is such a horrible interface. Linus --
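The difference between the two flag combinations described above can be put side by side. This is a sketch rather than anyone's actual patch; start_writeout() and writeout_and_wait() are names invented for illustration. It passes nbytes = 0, which the sync_file_range(2) man page documents as "through the end of the file", instead of the ~0ull used in the message.

```c
#define _GNU_SOURCE             /* needed for sync_file_range() */
#include <fcntl.h>

/*
 * Hypothetical helper: start write-out of the file's dirty pages and
 * return without waiting for the IO to complete.
 */
int start_writeout(int fd)
{
    /* offset 0, nbytes 0: the whole file */
    return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical helper: the fdatasync()-like combination. Wait for any
 * write-out already in flight, start new write-out, then wait for that
 * too. Unlike fdatasync() this never writes metadata.
 */
int writeout_and_wait(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Neither variant is a durability guarantee, since no metadata is written; for a recorder that only wants to keep dirty pages from piling up, that is the point.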
The file layout issue is unrelated to the frequency of fdatasync() - the block allocation is done at the time of write(). ext3 _should_ handle this case fairly well nowadays - I thought we fixed that. However it would probably benefit from having the size of the block reservation window increased - use ioctl(EXT3_IOC_SETRSVSZ). That way, each file gets a decent-sized hunk of disk "reserved" for its ongoing appending. Other files won't come in and intermingle their blocks with it. --
How big of a chore would it be, to use this code to implement i_op->fallocate() for ext3, I wonder? Jeff --
Check out posix_fallocate(3). Not appropriate for every situation, might eat additional disk bandwidth... But if you are looking to combat fragmentation, pre-allocation (manual or kernel-assisted) is a relevant technique. Plus, overwriting existing data blocks is a LOT cheaper than appending to a file. fsync's more quickly to disk, too. Jeff --
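A minimal sketch of the pre-allocation idea, assuming the final size of the recording can be estimated up front. open_preallocated() is a hypothetical helper, not a MythTV function. On a filesystem without native support, glibc emulates posix_fallocate() by writing into every block, which is where the extra disk bandwidth mentioned above can come from.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) path and reserve `bytes` of
 * disk space for it before any data is written.
 * Returns an open file descriptor, or -1 on failure.
 */
int open_preallocated(const char *path, off_t bytes)
{
    int fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number, it does not set errno */
    if (posix_fallocate(fd, 0, bytes) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After this call the file already has its full length, so the writer overwrites existing blocks instead of appending, and has to truncate the file if the recording ends up shorter.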
Personally that is also one of my MythTV pet peeves. A hack added to MythTV to work around a crappy ext3 latency bug that also causes these large files to get heavily fragmented. That and the fact that you have to patch MythTV to eliminate those forced fdatasyncs - there is no knob to turn it off if you're running MythTV on a filesystem which doesn't suffer from ext3's data=ordered fsync stalls. -Dave --
For any filesystem it is quite sensible for an application to manage the amount of dirty memory which the kernel is holding on its behalf, and based upon the application's knowledge of its future access patterns. But MythTV did it the wrong way. A suitable design for the streaming might be, every 4MB: - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway to the disk - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to discard it from pagecache. --
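The streaming design suggested above might look roughly like this in userspace. It is a sketch under assumptions: stream_chunk() and CHUNK are illustrative names, the caller is assumed to write in 4MB units at chunk-aligned offsets, and error handling is reduced to a single return code.

```c
#define _GNU_SOURCE             /* needed for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024) /* the 4MB unit from the message above */

/*
 * Hypothetical helper: write one CHUNK-sized buffer at offset pos,
 * start write-out for it, and ask the kernel to drop the previous
 * chunk from the page cache.
 * Returns 0 on success, -1 on failure.
 */
int stream_chunk(int fd, const char *buf, off_t pos)
{
    if (pwrite(fd, buf, CHUNK, pos) != CHUNK)
        return -1;

    /* get the new chunk underway to the disk without waiting for it */
    if (sync_file_range(fd, pos, CHUNK, SYNC_FILE_RANGE_WRITE) != 0)
        return -1;

    /*
     * Advisory only: pages of the previous chunk that are still dirty
     * or under write-out are not freed, so the result is ignored.
     */
    if (pos >= CHUNK)
        posix_fadvise(fd, pos - CHUNK, CHUNK, POSIX_FADV_DONTNEED);

    return 0;
}
```

A recorder would call this once per filled 4MB buffer with pos advancing by CHUNK each time, so at most about two chunks of its data stay in the page cache.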
Here's an example. I call it "overwrite.c" for obvious reasons.
Except I used 8MB ranges, and I "stream" random data. Very useful for
"secure delete" of harddisks. It gives pretty optimal speed, while not
destroying your system experience.
Of course, I do think the kernel could/should do this kind of thing
automatically. We really could do something like this with a "dirty LRU"
queue. Make the logic be:
- if you have more than "2*limit" pages in your dirty LRU queue, start
writeout on "limit" pages (default value: 8MB, tunable in /proc).
Remove from LRU queues.
- On writeback IO completion, if it's not on any LRU list, insert page
into "done_write" LRU list.
- if you have more than "2*limit" pages on the done_write LRU queue,
try to just get rid of the first "limit" pages.
It would probably work fine in general. Temp-files (smaller than 8MB
total) would go into the dirty LRU queue, but wouldn't be written out to
disk if they get deleted before you've generated 8MB of dirty data.
But this does the queue-handling by hand, and gives you a throughput
indicator. It should get fairly close to disk speeds.
Linus
---
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>
#define BUFSIZE (8*1024*1024ul)
int main(int argc, char **argv)
{
static char buffer[BUFSIZE];
struct timeval start, now;
unsigned int index;
int fd;
mlockall(MCL_CURRENT | MCL_FUTURE);
fd = open("/dev/urandom", O_RDONLY);
if (read(fd, buffer, BUFSIZE) != BUFSIZE) {
perror("/dev/urandom");
exit(1);
}
close(fd);
fd = open(argv[1], O_RDWR | O_CREAT, 0666);
if (fd < 0) {
perror(argv[1]);
exit(1);
}
gettimeofday(&start, NULL);
for (index = 0; ;index++) {
double s;
unsigned long MBps;
unsigned long MB;
if (write(fd, buffer, BUFSIZE) != BUFSIZE)
break;
sync_file_range(fd, index*BUFSIZE, BUFSIZE, ...
Oh, except my example doesn't do the fadvise. Instead, I make sure to throttle the writes and the old range with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER which makes sure that the old pages are easily dropped by the VM - and they will be, since they end up always being on the cold list. I _wanted_ to add a SYNC_FILE_RANGE_DROP but I never bothered because for this particular load it didn't matter. The system was perfectly usable while overwriting even huge disks because there was never more than 8MB of dirty data in flight in the IO queues at any time. Linus --
Dumb VM question, then: I understand the logic behind the write-throttling part (some of my own userland code does something similar), but, Does this imply adding fadvise to your overwrite.c example is (a) not noticeable, (b) potentially less efficient, (c) potentially more efficient? Or IOW, does fadvise purely put pages on the cold list as your sync_file_range incantation does, or something different? Thanks, Jeff, who is already using sync_file_range in some server-esque userland projects --
For _that_ particular load it was more of a "it wasn't the issue". I
wanted to get timely writeouts, because otherwise they bunch up and become
unmanageable (with even the people who are not actually writing end up
waiting for the writeouts).
Once the pages are clean, it just didn't matter. The VM did the balancing
right enough that I stopped caring. With other access patterns (ie if the
pages ended up on the active list) the situation might have been
sync_file_range() doesn't actually put the pages on the inactive list, but
since the program was just a streaming one, they never even left it.
But no, fadvise actually tries to actually invalidate the pages (ie gets
rid of them, as opposed to moving them to the inactive list).
Another note: I literally used that program just for whole-disk testing,
so the behavior on an actual filesystem may or may not match. But I just
tested on ext3 on my desktop, and got
1.734 GB written in 30.38 (58 MB/s)
until I ^C'd it, and I didn't have any sound skipping or anything like
that. Of course, that's with those nice Intel SSD's, so that doesn't
really say anything.
Feel free to give it a try. It _should_ maintain good write speed while
not disturbing the system much. But I bet if you added the "fadvise()" it
would disturb things even _less_.
My only point is really that you _can_ do streaming writes well, but at
the same time I do think the kernel makes it too hard to do it with
"simple" applications. I'd love to get the same kind of high-speed
streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"
And I really think we should be able to.
And no, we clearly are _not_ able to do that now. I just tried with "dd",
and created a 1.7G file that way, and it was stuttering - even with my
nice SSD setup. I'm in my MUA writing this email (obviously), and in the
middle it just totally hung for about half a minute - because it was
obviously doing some fsync() for temporary ...
On Thu, 2 Apr 2009 15:42:51 -0700 (PDT)
The thing which has always worried me about trying to do smart drop-behind is the cost of getting it wrong - and sometimes it _will_ get it wrong. Someone out there will have an important application which linearly writes a 1G file and then reads it all back in again. They will get really upset when their runtime doubles. --
Yes. The good news is that it would be a pretty easy tunable to have a "how soon do we writeback and how soon would we drop". And I do suspect that _dropping_ should default to off (exactly because of the kind of situation you bring up). As mentioned, at least in my experience the VM is pretty good at dropping the right pages anyway. It's when they are dirty or locked that we end up stuttering (or when we do fsync). And "start background writeout earlier" improves that case regardless of drop-behind. But at the same time it is also unquestionably true that the current behavior tends to maximize throughput performance. Delaying the writes as long as possible is almost always the right thing for throughput. In my experience, at least on desktops, latency is a lot more important than throughput is. And I don't think anybody wants to start the writes _immediately_. Linus --
Attached is my slightly-modified version of overwrite.c, modded to bound
the file size and to use fadvise().
On a 128GB, 3.0 Gbps no-name SATA SSD, x86-64, ext3, 2.6.29 vanilla kernel:
+ ./overwrite -b 3000 /spare/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 1019.25 (23 MB/s)
real 17m0.211s
user 0m0.028s
sys 1m5.800s
+ ./overwrite -b 3000 -f /spare/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 1060.54 (22 MB/s)
real 17m41.446s
user 0m0.036s
sys 1m9.016s
The most interesting thing I found: the SSD does 80 MB/s for the first
~1 GB or so, then slows down dramatically. After ~2GB, it is down to 32
MB/s. After ~4GB, it reaches a steady speed around 23 MB/s.
--------------------------------------------------
On a 500GB, 3.0Gbps Seagate SATA drive, x86-64, ext3, 2.6.29 vanilla kernel:
+ ./overwrite -b 3000 /garz/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 539.06 (44 MB/s)
real 9m0.348s
user 0m0.064s
sys 1m2.704s
+ ./overwrite -b 3000 -f /garz/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 535.08 (44 MB/s)
real 8m55.971s
user 0m0.044s
sys 1m6.600s
There is a similar performance fall-off for the Seagate, but much less
pronounced:
After 1GB: 52 MB/s
After 2GB: 44 MB/s
After 3GB: steady state
There appears to be a small increase in system time with "-f" (use
fadvise), but I'm guessing time(1) does not really give a good picture
of overall system time used, when you include background VM activity.
Jeff
Are you sure that isn't an effect of double and triple indirect blocks etc? The metadata updates get more complex for the deeper indirect blocks. Or just our page cache lookup? Maybe our radix tree thing hits something It would also be good to just compare it to something like time sh -c "dd + sync" (Which in my experience tends to fluctuate much more than the steady state thing, so I suspect you'd need to do a few runs to make sure the numbers are stable). Linus --
Indirect block overhead increased as the file grew to 23 GB, I'm sure... I should probably re-test pre-creating the file, _then_ running overwrite.c. That would at least guarantee the filesystem isn't allocating new blocks and metadata. I was really surprised the performance was so high at first, then fell off so dramatically, on the SSD here. Unfortunately I cannot trash these blkdevs, so the raw blkdev numbers I'll add that to the next run... Jeff --
Well, one rather simple explanation is that if you hadn't been doing lots of writes, then the background garbage collection on the Intel SSD gets ahead of the game, and gives you lots of bursty nice write bandwidth due to having nicely compacted and pre-erased blocks. Then, after lots of writing, all the pre-erased blocks are gone, and you are down to a steady state where it needs to GC and erase blocks to make room for new writes. So that part doesn't surprise me per se. The Intel SSD's definitely fluctuate a bit timing-wise (but I love how they never degenerate to the "ooh, that _really_ sucks" case that the other SSD's and the rotational media I've seen do when you do random writes). The fact that it also happens for the regular disk does imply that it's Hey, understood. I don't think raw block accesses are even all that interesting. But you might try to write the file backwards, and see if you see the same pattern. Linus --
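One way the "write the file backwards" experiment could be set up is with pwrite(), starting at the last chunk. This is only a sketch: write_backwards() is a hypothetical helper and is not part of overwrite.c as posted.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nchunks buffers of `chunk` bytes each to
 * fd, starting with the highest offset and ending at offset 0.
 * Returns 0 on success, -1 on the first short or failed write.
 */
int write_backwards(int fd, const char *buf, size_t chunk, size_t nchunks)
{
    for (size_t i = nchunks; i-- > 0; ) {
        off_t off = (off_t)i * (off_t)chunk;

        if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
            return -1;
    }
    return 0;
}
```

If the slowdown tracks how much has been written so far rather than the file offset, a backwards run should show the same fall-off at the same point in time.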
23MB/s seems a bit low though, I'd try with O_DIRECT. ext3 doesn't do writepages, and the ssd may be very sensitive to smaller writes (what Jeff if you blktrace it I can make up a seekwatcher graph. My bet is that pdflush is stuck writing the indirect blocks, and doing a ton of seeks. You could change the overwrite program to also do sync_file_range on the block device ;) -chris --
I didn't realize that Jeff had a non-Intel SSD. THAT sure explains the huge drop-off. I do see Intel SSD's fluctuating Actually, that won't help. 'sync_file_range()' works only on the virtually indexed page cache, and I think ext3 uses "struct buffer_head *" for all its metadata updates (due to how JBD works). So sync_file_range() will do nothing at all to the metadata, regardless of what mapping you execute it on. Linus --
Even the intel ones have cliffs for long running random io workloads (where the bottom of the cliff is still very fast), but something like The buffer heads do end up on the block device inode's pages, and ext3 is letting pdflush do some of the writeback. It's hard to say if the sync_file_range is going to help, the IO on the metadata may be random enough for that ssd that it won't really matter who writes it or when. Spinning disks might suck, but at least they all suck in the same way...tuning for all these different ssds isn't going to be fun at all. -chris --
Yeah, it's a no-name SSD. I've attached 'hdparm -I' in case anyone is curious. It's from newegg.com, so nothing NDA'd or sekrit. Jeff
Hmm. Does it do ok on the "random write" test? There's a few non-intel controllers that are fine - apparently the newer samsung ones, and the one from Indilinx. But I _think_ G.SKILL uses those horribly broken JMicron controllers. Judging by your performance numbers, it's the slightly fancier double controller version (ie basically an internal RAID0 of two identical JMicron controllers, each handling half of the flash chips). Try a random write test. If it's the JMicron controllers, performance will plummet to a few tens of kilobytes per second. Linus --
Quoting from the review at http://www.bit-tech.net/hardware/storage/2008/12/03/g-skill-patriot-and-intel-ssd-test/2 "Cracking the drive open reveals the PCB fitted with sixteen Samsung 840, 8GB MLC NAND flash memory modules, linked to a J-Micron JMF 602 Since I am hacking on osdblk currently, I was too slack to code up a But I guess seeks are not very helpful on an SSD :) Any pre-built random write tests out there? Regards, Jeff --
Afaik, bonnie does it all in the page cache, and only tests random reads "fio" does well: http://git.kernel.dk/?p=fio.git;a=summary and I think it comes with a few example files. Here's the random write file that Jens suggested, and that works pretty well.. It first creates a 2GB file to do the IO on, then does random 4k writes to it with O_DIRECT. If your SSD does badly at it, you'll just want to kill it, but it shows you how many MB/s it's doing (or, in the sucky case, how many kB/s).
Linus
---
[global]
filename=testfile
size=2g
create_fsync=1
overwrite=1
[randwrites]
# make rw= 'randread' for random reads, 'read' for reads, etc
rw=randwrite
bs=4k
direct=1
--
heh, so far, the SSD is poking along...
Jobs: 1 (f=1): [w] [2.5% done] [ 0/ 282 kb/s] [eta 02h:24m:59s]
Compared to the same job file, started at the same time, on the Seagate 500GB SATA:
Jobs: 1 (f=1): [w] [9.9% done] [ 0/ 1204 kb/s] [eta 26m:28s]
Regards, Jeff --
On Fri, Apr 03, 2009 at 01:14:00PM -0700, Linus Torvalds wrote:
> But I _think_ G.SKILL uses those horribly broken JMicron controllers.
> Judging by your performance numbers, it's the slightly fancier double
> controller version (ie basically an internal RAID0 of two identical
> JMicron controllers, each handling half of the flash chips).
>
> Try a random write test. If it's the JMicron controllers, performance will
> plummet to a few tens of kilobytes per second.
I got the 64GB variant of Jeff's g-skill SSD. When I first got it, I ran aio-stress on it. The numbers from the smaller blocksize tests are pitiful. To the extent that after running for 24hrs, I ctrl-c'd the test. Really, really abysmal. Dave --
.. That's odd. I kind of expected to see the sector size, cache size, and perhaps media rotation rate reported there.. Can you update your hdparm (sourceforge) and repost? There might be other useful features of that drive, which some of us are quite curious to know about! :) Thanks --
Here's output of hdparm 9.12, from Fedora rawhide. I was unaware that both read-ahead and writeback caching were disabled on this drive, until that was pointed out to me in email. huh. I'll have to redo my tests... Jeff
Attached are some additional tests using sync_file_range, dd, an SSD and a normal SATA disk. The test program -- overwrite.c -- is unchanged from my last posting, basically the same as Linus's except with posix_fadvise()

Observations:

* the no-name SSD does seem to burst the first ~1GB of writes rapidly, but degrades to a much lower sustained level, as observed before. Repeated tests do not produce ~80 MB/s, only the first test, which lends credence to the theory about background activity.

* For the SSD, overwrite is noticeably faster than dd.

* For the Seagate NCQ hard drive, dd is noticeably faster than overwrite.

* fadvise() appears to help, but mostly the results are either inconclusive or lost in the noise: A slight increase in throughput, and a slight increase in system time.

The test sequence for both SATA devices was the following:

	3 x dd
	3 x overwrite
	3 x overwrite w/ fadvise(don't need)

System setup: Intel Nehalem x86-64, ICH10, Fedora 10, ext3 filesystem (mounted defaults + noatime), 2.6.29 vanilla kernel.

Regards,

	Jeff
Oh, and, as run-test.sh shows, these tests were done with the file pre-allocated and sync'd to disk. The dd and overwrite invocations that follow the first dd invocation do /not/ require the fs to allocate new blocks. Jeff --
Hmm, I don't know what you have in mind. page cache lookup should be several orders of magnitude faster than a disk can write the pages out? Dirty/writeout/clean cycle still has to lock the radix tree to change tags, but that's really not going to be significantly contended (nor does it synchronise with simple lookups). --
Isn't that the kernel IO queue, and the dd averaging of transfer speed? For example, once you hit the dirty ratio limit, that is when it starts writing to disk. So, the first bit you'll see really fast speeds, as it goes to memory, but it averages out over time to a slower speed.

As an example...

tdamac ~ # dd if=/dev/zero of=/tmp/bigfile bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00489853 s, 214 MB/s

tdamac ~ # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.242217 s, 43.3 MB/s

Those are with /proc/sys/vm/dirty_bytes set to 1M...

echo $((1024*1024*1)) > /proc/sys/vm/dirty_bytes

It's probably better to set it much higher though.
--
overwrite.c is a special program that does this, in a loop:

	write(buffer-N) data to pagecache
	start buffer-N write-out to storage
	wait for buffer-(N-1) write-out to complete

It uses the sync_file_range() system call, which is like fsync() on steroids, wearing cool sunglasses.

Regards,

	Jeff
--
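For readers without the attachment, the loop described above might look roughly like the sketch below. This is not Jeff's overwrite.c; the function name overwrite_pipelined(), the CHUNK size and the zero-filled buffer are all made up for illustration. Only sync_file_range() and its SYNC_FILE_RANGE_* flags are the real Linux interface.

```c
#define _GNU_SOURCE             /* sync_file_range() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1 << 20)         /* hypothetical buffer size: 1 MiB */

/* Flags that mean "wait until this range is really on its way out and done". */
#define WAIT_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | \
                    SYNC_FILE_RANGE_WAIT_AFTER)

/*
 * Overwrite the first chunks * CHUNK bytes of path with zeros.
 * After writing chunk N we start its write-out and then wait for
 * chunk N-1, so one chunk is always in flight while the next is copied.
 * Returns 0 on success, -1 on any error (short writes count as errors).
 */
int overwrite_pipelined(const char *path, int chunks)
{
    static char buf[CHUNK];
    int fd, n;

    assert(path != NULL && chunks >= 0);
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    memset(buf, 0, sizeof(buf));

    for (n = 0; n < chunks; n++) {
        off_t off = (off_t)n * CHUNK;

        /* write(buffer-N) data to pagecache */
        if (pwrite(fd, buf, CHUNK, off) != CHUNK)
            goto fail;
        /* start buffer-N write-out to storage (does not block on completion) */
        if (sync_file_range(fd, off, CHUNK, SYNC_FILE_RANGE_WRITE) != 0)
            goto fail;
        /* wait for buffer-(N-1) write-out to complete */
        if (n > 0 && sync_file_range(fd, off - CHUNK, CHUNK, WAIT_FLAGS) != 0)
            goto fail;
    }
    /* the last chunk has nobody behind it, so wait for it explicitly */
    if (chunks > 0 &&
        sync_file_range(fd, (off_t)(chunks - 1) * CHUNK, CHUNK, WAIT_FLAGS) != 0)
        goto fail;
    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Note that sync_file_range() only pushes data pages; it says nothing about metadata, so it is not a replacement for fsync() when the file is being extended.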
.. Note that for mythtv, this may not be the best behaviour. A common use scenario is "watching live TV", a few minutes behind real-time so that the commercial-skipping can work its magic. In that scenario, those pages are going to be needed again within a short while, and it might be useful to keep them around. But then Myth itself could probably decide whether to discard them or not, based upon that kind of knowledge. --
Well I really never watch live TV. I watch shows when I want to, not when they happen to be on the air. So I certainly couldn't care less -- Len Sorensen --
.. A *true* myth dev! (pretenders use LiveTV, *real* devs don't!) But mythcommflag also benefits from having the pages hang around for an extra short time. Cheers --
Yes. I suspect that Myth could do heuristics like "when watching live TV, do drop-behind about 30s after the currently showing stream". That still allows for replay, but older stuff you've watched really likely isn't all that interesting and might be worth dropping in order to make room for more data. And you can use posix_fadvise() for that, since it's now no longer connected with "wait for background IO to complete" at all. The reason for wanting "SYNC_FILE_RANGE_DROP" was simply that I was doing the "wait after write" anyway, and thinking I wanted to get rid of the pages while I was already handling them. But that was for an app where I _knew_ the data was uninteresting as soon as it was on disk. Doing a secure delete is different from recording video ;) Linus --
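A drop-behind heuristic of that kind could be sketched as below. This is only an illustration, not anything MythTV does: drop_behind() and its parameters are hypothetical, and a real player would have to turn "30s" into a byte count from the stream bitrate. The only real interface used is posix_fadvise() with POSIX_FADV_DONTNEED.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Tell the kernel that everything more than keep_bytes behind the current
 * play position is no longer needed in the page cache.
 *
 * play_pos   - byte offset currently being shown (hypothetical parameter)
 * keep_bytes - how much to keep cached behind play_pos for replay
 *
 * Returns 0 on success or an error number (posix_fadvise() returns the
 * error directly instead of setting errno). Pages that are still dirty
 * are not affected by the advice, so recently written data stays put
 * until it has been written out.
 */
int drop_behind(int fd, off_t play_pos, off_t keep_bytes)
{
    off_t drop_end;

    assert(fd >= 0);
    drop_end = play_pos - keep_bytes;
    if (drop_end <= 0)
        return 0;               /* nothing old enough to drop yet */

    return posix_fadvise(fd, 0, drop_end, POSIX_FADV_DONTNEED);
}
```

Called every few seconds from the playback loop, this keeps a sliding window of recently shown data cached, which would also leave room for something like commercial flagging to read it back.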
Yep, you're right. sync_file_range is perfect for what MythTV wants to do. Though there are cases where MythTV can read data it wrote out not too long ago, for example, when commercial flagging, so fadvise(POSIX_FADV_DONTNEED) may not be warranted. -Dave --
So use XFS or ext4, and use fallocate() to get the disk blocks allocated ahead of time. That completely avoids the fragmentation problem, altogether. If you are using ext3 on a dedicated MythTV box, I would certainly advise mounting with data=writeback, which will also avoid the latency bug. - Ted --
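Preallocating a recording could look something like the sketch below. It swaps in posix_fallocate() rather than the Linux-specific fallocate() call mentioned above, because it is the portable spelling; open_preallocated() and expected_size are made-up names. One caveat: on a filesystem without native preallocation support glibc emulates the call by writing to every block, which is slow and gives no fragmentation benefit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) path and reserve expected_size bytes of disk blocks
 * up front, so the filesystem can lay the file out contiguously.
 *
 * Returns an open file descriptor, or -1 on error. Unlike fallocate()
 * with FALLOC_FL_KEEP_SIZE, posix_fallocate() also grows the file size
 * to expected_size, so a recorder would want to ftruncate() the file to
 * the real length when the recording ends.
 */
int open_preallocated(const char *path, off_t expected_size)
{
    int fd;

    assert(path != NULL && expected_size > 0);
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* returns 0 on success or an error number; errno is not set */
    if (posix_fallocate(fd, 0, expected_size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```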
Yeah, but I am not ready to give xfs another chance yet. The nasty bugs back in 2.6.8ish days still hurt. Locking up the filesystem when doing Yeah. What I have been seeing since 2.6.24 or 2.6.25 or so is that it sometimes simply doesn't start playback on a file, and after 15 seconds or so times out, and then you ask it to try again and it works the next time just fine. Then at times it will stop responding to the keyboard or remote in mythtv for up to 2 minutes, and then suddenly it will respond to whatever you hit 2 minutes ago. Fortunately that doesn't seem to happen that often. I was hoping to see if 2.6.28 helped that, but lirc didn't seem to work on my remote with that version, so I went back to 2.6.26 again. I haven't tried 2.6.29 on it yet since I am currently trying to fix the debian nvidia-driver build against the new kbuild only much too clever linux-headers-2.6.29 package they have come up with. I think I have got that figured out though so I should be able to upgrade that now. -- Len Sorensen --
.. Oooh.. a myth dev! With the HVR-1600 card, myth calls fsync() *ten* times a second while recording analog TV (digital is fine). Any chance you could track down and fix that ? It might be the same thing that's biting Lennart's system with his PVR-500 card. Cheers --
Well, ext4 will be an interim solution you can convert to first. It will be best with a backup/reformat/restore pass, better if you enable extents (at least for new files, but then you won't be able to go back to ext3), but you'll get improvements even if you just mount an ext3 filesystem as ext4. - Ted --
Well I did pickup a 1TB external USB/eSATA drive for pretty much such a task. I wasn't sure if ext4 was ready or stable enough to play with yet though. -- Len Sorensen --
To play with, definitely. For production use, I'll have to let you make your own judgements. I've been using it on my laptop since July. At the moment, there's only one bug which I'm very concerned about, being worked here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824 But a number of community distro's will be supporting it within the next month or two. So it's definitely getting there. As we increase the user base, we'll turn up more of the harder-to-reproduce bugs, but hopefully we'll get them fixed quickly. - Ted --
Well I made a 75GB ext4 just to store temporary virtual machine images Well pretty soon I will probably consider switching to that. btrfs sounds neat and all, but I will wait for the disk format to get finalized first. -- Len Sorensen --
Yeah, well, it's caused by data=ordered, which is an ext3 unique thing; no other filesystem (or operating system) has such a feature. I'm beginning to wish we hadn't implemented it. Yeah, it solved a security problem (which delayed allocation also solves), but it trained application programs to be careless about fsync(), and it's caused us so many other problems, including the fsync() and unrelated commit latency problems. We are where we are, though, and people have been trained to think they don't need fsync(), so we're going to have to deal with the problem by having these implied fsync for cases like replace-via-rename, and in addition to that, some kind of heuristic to force out writes early to avoid these huge write latencies. It would be good to make it autotuning so that filesystems that don't do ext3 data=ordered don't have to pay the price of having to force out writes so aggressively early (since in some cases if the file subsequently is deleted, we might be able to optimize out the write altogether --- and that's good for SSD endurance). - Ted --
Oh, for the love of a whole range of mythological figures. ext3 didn't train application programmers that they could be careless about fsync(). It gave them functionality that they wanted, ie the ability to do things like rename a file over another one with the expectation that these operations would actually occur in the same order that they were generated. More to the point, it let them do this *without* having to call fsync(), resulting in a significant improvement in filesystem usability. I'm utterly and screamingly bored of this "Blame userspace" attitude. The simple fact of the matter is that ext4 was designed without paying any attention to how the majority of applications behave. fsync() isn't the interface people want. ext3 demonstrated that a filesystem could be written that made life easier for application authors. Why on earth would anyone think that taking a step back by requiring fsync() in a wider range of circumstances was a good idea? -- Matthew Garrett | mjg59@srcf.ucam.org --
Matthew, There were plenty of applications that were written for Unix *and* Linux systems before ext3 existed, and they worked just fine. Back then, people were drilled into the fact that they needed to use fsync(), and fsync() wasn't expensive, so there wasn't a big deal in terms of usability. The fact that fsync() was expensive was precisely because of ext3's data=ordered problem. Writing files safely meant that you had to check error returns from fsync() *and* close(). In fact, if you care about making sure that data doesn't get lost due to disk errors, you *must* call fsync(). Pavel may have complained that fsync() can sometimes drop errors if some other process also has the file open and calls fsync() --- but if you don't, and you rely on ext3 to magically write the data blocks out as a side effect of the commit in data=ordered mode, there's no way to signal the write error to the application, and you are *guaranteed* to lose the I/O error indication. I can tell you quite authoritatively that we didn't implement data=ordered to make life easier for application writers, and application writers didn't come to ext3 developers asking for this convenience. It may have **accidentally** given them convenience that I'm not blaming userspace. I'm blaming ourselves, for implementing an attractive nuisance, and not realizing that we had implemented an attractive nuisance; which years later, is also responsible for these latency problems, both with and without fsync() ---- *and* which have also trained people into believing that fsync() is always expensive, and must be avoided at all costs --- which had not previously been true! If I had to do it all over again, I would have argued with Stephen about making data=writeback the default, which would have provided behaviour on crash just like ext2, except that we wouldn't have to fsck the partition afterwards. Back then, people lived with the potential security exposure on a crash, and they lived with the fact that ...
And now life is better. UNIX's error handling has always meant that it's effectively impossible to ensure that data hits disk if you wander into a variety of error conditions, and by and large it's simply not worth worrying about them. You're generally more likely to hit a kernel bug or suffer hardware failure than find an error condition that can actually be handled in a sensible way, and the probability/effectiveness ratio is sufficiently low that there are better ways to spend your time unless you're writing absolutely mission critical code. So let's not focus on the risk of data loss from failing to check certain error conditions. It not only gave them that convenience, it *guaranteed* that convenience. And with ext3 being the standard filesystem in the Linux world, and every other POSIX system being by and large irrelevant[1], But you're still arguing that applications should start using fsync(). I'm arguing that not only is this pointless (most of this code will never be "fixed") but it's also regressive. In most cases applications don't want the guarantees that fsync() makes, and given that we're going to have people running on ext3 for years to come they also don't want the performance hit that fsync() brings. Filesystems should just do the right thing, rather than losing people's data and then claiming that Well, no. fsync() didn't appear in early Unix, so what people were actually willing to live with was restoring from backups if the system crashed. I'd argue that things are somewhat better these days, especially now that we're used to filesystems that don't require us to fsync(), close(), fsync the directory and possibly jump through even more hoops if faced with a pathological interpretation of POSIX. Progress is a good thing. The initial behaviour of ext4 in this respect wasn't progress. And, really, I'm kind of amused at someone arguing for a given behaviour on the basis of POSIX while also suggesting that sync() is in any way ...
And, hey, fsync didn't make POSIX proper until 1996. It's not like authors were able to depend on it for a significant period of time before ext3 hit the scene. (It could be argued that most relevant Unices implemented fsync() even before then, so its status in POSIX was broadly irrelevant. The obvious counterargument is that most relevant Unix filesystems ensure that data is written before a clobbering rename() is carried out, so POSIX is again not especially relevant) -- Matthew Garrett | mjg59@srcf.ucam.org --
Fsync() was in BSD 4.3 and it was in much earlier Unix specifications,
such as SVID, well before it appeared in POSIX. If an interface was
Nope, not true. Most relevant Unix file systems sync'ed data blocks
on a 30 second timer, and metadata on 5 second timers. They did *not* force
data to be written before a clobbering rename() was carried out;
you're rewriting history when you say that; it's simply not true.
Rename was atomic *only* where metadata was concerned, and all the
talk about rename being atomic was because back then we didn't have
flock() and you built locking primitives out of open(O_CREAT) and rename();
but that was only metadata, and that was only if the system didn't
crash.
When I was growing up we were trained to *always* check error returns
from *all* system calls, and to *always* fsync() if it was critical
that the data survive a crash. That was what competent Unix
programmers did. And if you are always checking error returns, the
difference in the Lines of Code between doing it right and doing it wrong
really wasn't that big --- and again, back then fsync() wasn't
expensive. Making fsync expensive was ext3's data=ordered mode's
fault.
Then again, most users or system administrators of Unix systems didn't
tolerate device drivers that would crash your system when you exited a
game, either.... and I've said that I recognize the world has changed
and that crappy application programmers outnumber kernel programmers,
which is why I coded the workaround for ext4. That still doesn't make
what they are doing correct.
- Ted
--
But there are a lot of applications for which the survival of the data is not this critical as long as the old data is still available. Data are the important stuff, metadata helps to find them. Even though there are a lot of cases, where the information is just stored in the metadata. If you write metadata for not-yet-existing data to disk, then these are inconsistent, corrupt, dirty. Why don't you just delay the writing of these dirty metadata, too, until they are clean? So nothing is written until the next sync and then 1) write the data to the nicely allocated places. 2) journal the metadata for consistency 3) write the metadata 4) cleanup the journal That way you can have sophisticated allocation and keep a consistent filesystem without data loss due to re-ordering. Clean metadata-changes which don't have delayed data might be written/journaled immediately. That raises the question, whether dirty metadata changes should be skipped or whether a dirty metadata change should block later clean metadata changes to inhibit the re-ordering of changes. This should be a mount-option IMHO. Keeping the order of fs-changes has a big advantage in many cases. Syncing data on renames would decrease your performance which you want to increase with delayed allocation. Delayed metadata would mostly keep this performance gain, right? Andreas --
And if a behaviour is in ext3, then for the vast majority of practical purposes it exists everywhere. Users of non-Linux POSIX operating systems No, you're missing my point. The other Unix file systems are irrelevant. The number of people running them and having any real risk of system When my grandmother was growing up she had to use an outside toilet. No, look, you're blaming userspace again. Stop it. -- Matthew Garrett | mjg59@srcf.ucam.org --
Not checking for errors is not "progress", it's indiscipline aided by languages and tools that permit it to occur without issuing errors. It's why software "engineering" is at best approaching early 1950's real engineering practice ("hey gee we should test this stuff") and has yet to grow up and get anywhere into the world of real engineering and quality. Alan --
No. Not *having* to check for errors in the cases that you care about is progress. How much of the core kernel actually deals with kmalloc failures sensibly? Some things just aren't worth it. -- Matthew Garrett | mjg59@srcf.ucam.org --
I'm glad to know that's how you feel about my data, it explains a good deal about the state of some of the desktop software. In kernel land we actually have tools that go looking for kmalloc errors and missing tests to try and check all the paths. We run kernels with kmalloc randomly failing to make sure the box stays up: because at the end of the day *kmalloc does fail*. The kernel also tries very hard to keep the fail rate low - but this doesn't mean you don't check for errors. Everything in other industry says not having to check for errors is missing the point. You design systems so that they do not have error cases when possible, and if they have error cases you handle them and enforce a policy that prevents them not being handled. Standard food safety rules include Labelling food with dates Having an electronic system so that any product with no label cannot escape Checking all labels to ensure nothing past the safe date is sold Having rules at all stages that any item without a label is removed and is flagged back so that it can be investigated Now you are arguing for "not having to check for errors" So I assume you wouldn't worry about food that ends up with no label on it somehow ? Or when you get a "permission denied" do you just assume it didn't happen ? If the bank says someone has removed all your money do you assume it's an error you don't need to check for ? The two are *not* the same thing. You design failure out when possible You implement systems which ensure all known failure cases must be handled You track failure rates to prove your analysis Where you don't handle a failure (because it is too hard) you have detailed statistical and other analysis based on rigorous methodologies as to whether not handling it is acceptable (eg ALARP) and unfortunately at big name universities you can still get a degree or masters even in software "engineering" without actually studying any of this stuff, which any real engineering discipline would consider ...
The context was situations like errors on close() not occurring unless you've fsync()ed first. I don't think that error case is sufficiently common to warrant the cost of an fsync() on every single close, especially since doing so would cripple any application that ever tried to run on ext3. -- Matthew Garrett | mjg59@srcf.ucam.org --
On Fri, 27 Mar 2009 16:28:41 +0000 The fsync if you need to see all errors on close case has been true since before V7 unix. It's the normal default behaviour on these systems so anyone who assumes otherwise is just broken. There is a limit to the extent the OS can clean up after completely broken user apps. Besides which a properly designed desktop clearly has a single interface of the form happened = write_file_reliably(filename|NULL, buffer, len, flags) happened = replace_file_reliably(filename|NULL, buffer, len, flags (eg KEEP_BACKUP)); which internally does all the error handling, reporting to user, offering to save elsewhere, ensuring that the user can switch app and make space and checking for media errors. It probably also has an asynchronous version you can bind event handlers to for completion, error, etc so that you can override the default handling but can't fail to provide something by default. That would be designing failure out of the system. IMHO the real solution to a lot of this actually got proposed earlier in the thread. Adding "fbarrier()" allows the expression of ordering without blocking and provides something new apps can use to get best performance. Old properly written apps continue to work and can be improved, and sloppy garbage continues to mostly work. The file system behaviour is constrained heavily by the hardware, which at this point is constrained by the laws of physics and the limits of materials. Alan --
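To make the "do it once in a library" idea concrete, here is a rough sketch of what the core of such a helper might do using only today's calls (there is no fbarrier()). It is a simplified, hypothetical take on replace_file_reliably(): the flags argument, the NULL filename case and all user interaction are left out, and the ".tmp.XXXXXX" naming is just an assumption. The sequence is the usual one from this thread: write a temporary file, fsync() it, rename() it over the target, then fsync() the directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the contents of filename with buffer[0..len) so that after a
 * crash the file holds either the complete old or the complete new data.
 * Returns 0 on success, -1 on error (the old file is left in place unless
 * the failure happens after the rename). The new file gets mkstemp()'s
 * 0600 mode; a real helper would copy the old permissions.
 */
int replace_file_reliably(const char *filename, const void *buffer, size_t len)
{
    char tmp[PATH_MAX], dirbuf[PATH_MAX];
    const char *p = buffer;
    int fd, dirfd, n;

    assert(filename != NULL && (buffer != NULL || len == 0));
    n = snprintf(tmp, sizeof(tmp), "%s.tmp.XXXXXX", filename);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    fd = mkstemp(tmp);                  /* temp file in the same directory */
    if (fd < 0)
        return -1;

    while (len > 0) {                   /* handle short writes */
        ssize_t w = write(fd, p, len);
        if (w < 0)
            goto fail;
        p += w;
        len -= (size_t)w;
    }
    if (fsync(fd) != 0)                 /* data on disk before the rename */
        goto fail;
    n = close(fd);
    fd = -1;
    if (n != 0)
        goto fail;
    if (rename(tmp, filename) != 0)     /* atomically switch names */
        goto fail;

    /* fsync the directory so the rename itself is durable (Linux extension) */
    strcpy(dirbuf, filename);           /* fits: tmp was longer and fit */
    dirfd = open(dirname(dirbuf), O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    n = fsync(dirfd);
    close(dirfd);
    return n;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

Every return value is checked here, which is the point of hiding the sequence behind one call: the application only has to look at a single result.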
If user applications should always check errors, and if errors can't be reliably produced unless you fsync() before close(), then the correct behaviour for the kernel is to always flush buffers to disk before returning from close(). The reason we don't is that it would be an unacceptable performance hit to take in return for an uncommon case - in exactly the same way as always calling fsync() before close() is an If every application that does a clobbering rename has to call fbarrier() first, then the kernel should just guarantee to do so on the application's behalf. ext3, ext4 and btrfs all effectively do this, so we should just make it explicit that Linux filesystems are expected to behave this way. If people want to make their code Linux specific then that's their problem, not the kernel's. -- Matthew Garrett | mjg59@srcf.ucam.org --
You make a few assumptions here Unfortunately: - close() occurs many times on a file - the kernel cannot tell which close() calls need to commit data - there are many cases where data is written and there is a genuine situation where it is acceptable over a crash to lose data providing media failure is rare (eg log files in many situations - not banks obviously) The kernel cannot tell them apart, while fsync/close() as a pair allows the user to correctly indicate their requirements. Even "fsync on last close" can backfire horribly if you happen to have a handle that is inherited by a child task or kept for reading for a long period. For an event driven app you really want some kind of threaded or async fsync then close (fbarrier isn't quite enough because you don't get told when the barrier is passed). That could be implemented using threads in the relevant desktops libraries with the thread doing fsync() poke event thread exit (or indeed for most cases as part of the more general Rename is a different problem - and a nastier one. Unfortunately even in posix fsync says nothing about how metadata updating is handled or what the ordering rules are between two fsync() calls on different files. There were problems with trying to order rename against data writeback. fsync ensures the file data and metadata is valid but doesn't (and cannot) connect this with the directory state. So if you need to implement write data ensure it is committed rename it after the rename is committed then ... you can't do that in POSIX. Linux extends fsync() so you can fsync a directory handle but that is an extension to fix the problem rather than a standard behaviour. (Also helpful here would be fsync_range, fdatasync_range and Agreed - which is why close should not happen to do an fsync(). That's their problem for writing code that's specific to some random may happen behaviour on certain Linux releases - and unfortunately with no obvious cheap ...
Alan. Repeat after me: "fsync()+close() is basically useless for any app that expects user interaction under load". Don't be silly. If you want data corruption, then you make people write threaded applications. Yes, you may work for Intel now, but that doesn't mean that you have to drink the insane cool-aid. Threading is HARD. Async stuff is HARD. We kernel people really are special. Expecting normal apps to spend the kind of effort we do (in scalability, in error handling, in security) is I do agree that close() shouldn't do an fsync - simply for performance reasons. But I also think that the "we write meta-data synchronously, but then the actual data shows up at some random later time" is just crazy talk. That's simply insane. It _guarantees_ that there will be huge windows of times where data simply will be lost if something bad happens. And expecting every app to do fsync() is also crazy talk, especially with the major filesystems _sucking_ so bad at it (it's actually a lot more realistic with ext2 than it is with ext3). So look for a middle ground. Not this crazy militant "user apps must do fsync()" crap. Because that is simply not a realistic scenario. Linus --
Which is why you do it once in a library and express it as events. The gtk desktop already does this and the event model it provides is rather Agreed - apps not checking for errors is sloppy programming however given they make errors we don't want to make it worse. I wouldn't argue with that - for the same reason that cars are designed on the basis that their owners are not competent to operate them ;) --
This is a fact for ext3 with data=ordered mode. Which is the default and dominant filesystem today, yes. But it's not true for most other filesystems. Hopefully at some point we will migrate people off of ext3 to something better. Ext4 is available today, and is much better at this than ext3. In the long run, btrfs will be better yet. The issue then is how do we transition people away from making assumptions that were essentially only true for ext3's data=ordered mode. Ext4, btrfs, XFS, all will have the property that if you fsync() a small file, it will be fast, and it won't inflict major delays for other programs running on the same system. You've said for a long time that ext3 is really bad in that it inflicts this --- I agree with you. People should use other filesystems which are better. This includes ext4, which is completely format compatible with ext3. They don't even have to switch on extents support to get better behaviour. Just mounting an ext3 filesystem with ext4 will result in better behaviour. So maybe we can't tell application writers, *today*, that they should use fsync(). But in the future, we should be able to tell them that. Or maybe we can tell them that if they want, they can use some new interface, such as a proposed fbarrier() that will do the right thing (including perhaps being a no-op on ext3) no matter what the filesystem might be. I do believe that the last thing we should do is tell people that because of the characteristics of ext3, which you yourself have said sucks, and which we've largely fixed for ext4, and which isn't a problem with other filesystems, including some that may likely replace ext3 *and* ext4, that we should give people advice that will lock applications into doing some very bad things for the indefinite future. And I'm not blaming userspace; this is at least as much, if not entirely, ext3's fault. What that means is we need to work on a way of providing a transition path back to a better place for the ...
Would making close imply fbarrier() rather than fsync() work for this ? That would give people the ordering they want even if they are less careful but wouldn't give the media error cases - which are less interesting. Alan --
The thought that I had was to create a new system call, fbarrier() which has the semantics that it will request the filesystem to make sure that (at least) changes that have been made to data blocks to date should be forced out to disk when the next metadata operation is committed. For ext3 in data=ordered mode, this would be a no-op. For other filesystems that had fast/efficient fsync()'s, it could simply be an fsync(). For other filesystems, it could trigger an asynchronous writeout, if the journal commit will wait for the writeout to complete. For yet other filesystems, it might set a flag that will cause the filesystem to start a synchronous writeout of the file as part of the commit operations. The bottom line was that what we could *then* tell application programmers to do is open/write/fbarrier/close/rename. (And for operating systems where they don't have fbarrier, they can use autoconf magic to replace fbarrier with fsync.) We could potentially make close() imply fbarrier(), but there are plenty of times when that might not be such a great idea. If we do that, we're back to requiring synchronous data writes for all files on close(), which might lead to huge latencies, just as ext3's data=ordered mode did. And in many cases, where the files in questions can be easily regenerated (such as object files in a kernel tree build), there really is no reason why it's a good idea to force the blocks to disk on close(). In the highly unusual case where we crash in the middle of a kernel build; we can do a "make clean; make" and regenerate the object files. The fundamental idea here is not all files need to be forced to disk on close. Not all files need fsync(), or even fbarrier(). We can make the system go much more quickly if we can make a distinction between these two cases. It can also make SSD drives last longer if we don't force blocks to disk for non-precious files. If people disagree with this premise, we can go back to something very much like ext3's ...
fbarrier() on close() would only mean, that the data shouldn't be written after the metadata and new metadata shouldn't be written _before_ old metadata, so you can also delay the committing of the "dirty" metadata until the real data are written. You don't need An fbarrier() on close() would reflect the thinking of a lot of developers. You might call them stupid and incompetent, but they surely are the majority. When closing A before creating B, they don't expect seeing B without a completed A, even though they might expect that neither A nor B may be written yet, if the system crashes. If you have smart developers, you might give them something new, so they could speed things up with some extra code, e.g. when they create data, which may be restored by other means, but the default behavior of automatic fbarrier() on close() would be better. Andreas --
It also happens to be what pretty much all network filesystems end up implementing. That said, there's a reason many people prefer local filesystems to even high-performance NFS - latency (especially for metadata which even modern versions of NFS cannot cache effectively) just sucks when you have to go over the network. It pretty much doesn't matter _how_ fast your network or server is. One thing that might make sense is to make "close()" start background writeout for that file (modulo issues like laptop mode) with low priority. No, it obviously doesn't guarantee any kind of filesystem coherency, but it _does_ mean that the window for the bad cases goes from potentially 30 seconds down to fractions of seconds. That's likely quite a bit of improvement in practice. IOW, no "hard barriers", but simply more of a "even in the absence of fsync we simply aim for the user to have to be _really_ unlucky to ever hit any bad cases". Linus --
I'm curious about the exact semantics that you are suggesting.
Do you mean that
1/ any data block in any file will be forced out before any metadata
for any file? or
2/ any data block for 'this' file will be forced out before any
metadata for any file? or
3/ any data block for 'this' file will be forced out before any
metadata for this file?
I assume the contents of directories are metadata. If 3 is the case,
do we include the metadata of any directories known to contain this
file? Recursively?
I think that if we do introduce new semantics, they should be as weak
as possible while still achieving the goal, so that fs designers have
as much freedom as possible. It should also be as expressive as
possible so that we don't find we want to extend it later.
What would you think of:
fcntl(fd, F_BEFORE, fd2)
with the semantics that it sets up a transaction dependency between fd
and fd2 and more particularly the operations requested through each
fd.
So if 'fd' is a file, and 'fd2' is the directory holding that file,
then
fcntl(fd, F_BEFORE, fd2)
write(fd, stuff)
renameat(fd2, 'file', fd2, 'newname')
would ensure that the writes to the file were visible on storage
before the rename.
You could also do
fd1 = open("afile", O_RDWR);
fd2 = open("afile", O_RDWR);
fcntl(fd1, F_BEFORE, fd2);
then use write(fd1) to write journal updates to one part of the
(database) file, and write(fd2) to write in-place updates,
and it would just "do the right thing". (You might want to call
fcntl(fd2, F_BEFORE, fd1) as well ... I haven't quite thought through
the details of that yet).
If you gave AT_FDCWD as the fd2 in the fcntl, then operations on fd1
would be ordered before any namespace operations which did not specify a
particular directory, which would be fairly close to option 2 above.
A minimal implementation could fsync fd1 before allowing any operation
on fd2. A more sophisticated implementation could record set up
dependencies in ...
Ohkay. But in a 'make xconfig' of 2.6.28.9, how much of ext4 can be turned on without rendering the old ext3 fstab defaults incompatible should I be forced to boot a kernel with no ext4 support? -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Never look a gift horse in the mouth. -- Saint Jerome --
Ext4 doesn't make any non-backwards compatible changes to the filesystem. So if you just take an ext3 filesystem, and mount it as ext4, it will work just fine; you will get delayed allocation, you will get a slightly boosted write priority for kjournald, and then when you unmount it, that filesystem will work *just* *fine* on a kernel with no ext4 support. You can mount it as an ext3 filesystem. If you use tune2fs to enable various ext4 features, such as extents, etc., then when you mount the filesystem as ext4, you will get the benefit of extents for any new files which are created, and once you do that, the filesystem can't be mounted on an ext3-only system, since ext3 doesn't know how to deal with extents. And of course, if you want *all* of ext4's benefits, including the full factor of 6-8 improvement in fsck times, then you will be best served by creating a new ext4 filesystem from scratch and doing a backup/reformat/restore pass. But if you're just annoyed by the large latencies in Ingo's "make -j32" example, simply taking the ext3 filesystem and mounting it as ext4 should make those problems go away. And it won't make any incompatible changes to the filesystem. (This didn't use to be true in the pre-2.6.26 days, but I insisted on getting this fixed so people could always mount an ext2 or ext3 filesystem using ext4 without the kernel making any irreversible filesystem format changes behind the user's back.) - Ted --
Does a newly created ext4 partition have all the various goodies enabled that I'd want, or do I also need to tune2fs some parameters to get an "optimal" setup? -- Aaron --
A newly created ext4 partition created with e2fsprogs 1.41.x will have all of the various goodies enabled. Note that some of what "goodies" are enabled are controlled by the mke2fs.conf file, which some distribution packages treat as a config file, so you need to make sure it is appropriately updated when you update e2fsprogs. - Ted --
Thanks Ted, I will build 2.6.28.9 with this:
[root@coyote linux-2.6.28.9]# grep EXT .config
[...]
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
# CONFIG_EXT2_FS_SECURITY is not set
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
# CONFIG_EXT4DEV_COMPAT is not set
# CONFIG_EXT4_FS_XATTR is not set
CONFIG_GENERIC_FIND_NEXT_BIT=y
Anything there that isn't compatible? I'll build that, but only switch the /amandatapes mount in fstab for testing tonight unless you spot something above. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Losing your drivers' license is just God's way of saying "BOOGA, BOOGA!" --
Well, if you need extended attributes (if you are using SELinux, then you need extended attributes) you'll want to enable CONFIG_EXT4_FS_XATTR. If you want to use ext4 on your root filesystem, you may need to take some special measures depending on your distribution. Using the boot command-line option rootfstype=ext4 will work on many distributions, but I haven't tested all of them. It definitely works on Ubuntu, and it should work if you're not using an initial ramdisk. Oh yeah; the other thing I should warn you about is that 2.6.28.9 won't have the replace-via-rename and replace-via-truncate workarounds. So if you crash and your applications aren't using fsync(), you could end up seeing the zero-length files. I very much doubt that will make a big difference for your /amandatapes partition, but if you want to use this for the filesystem where you have your home directory, you'll probably want the workaround patches. I've heard reports of KDE users telling me that when they initially start up their desktop, literally hundreds of files are rewritten by their desktop, just starting it up. (Why? Who knows? It's not good for SSD endurance, in any case.) But if you crash while initially logging in, your KDE configuration files might get wiped out w/o the
OK, so you're not worried about your root filesystem, and presumably the issue with your home directory won't be an issue for you either. The only question then is whether you need extended attribute support. Regards, - Ted --
Thanks Ted, it's building w/o the extra CONFIG_EXT4_FS_XATTR atm, but I'll enable that and do it again before I reboot. I had just fired off the build when I saw your answer. NBD, my 'makeit' script is pretty complete. Thank you. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The only rose without thorns is friendship. --
Why are we even arguing about standards? POSIX, like all other standards, is a common _denominator_ and absolutely the _minimal_ requirement for a compliant operating system. It does not tell you how to design the best systems in the real world. For God's sake, can't we aim for something higher than a piece of literature written some 20 years ago? And stop making excuses please? The fact is, most software is crap, and most software developers are lazy and stupid. Same as most customers are stupid too. A technically correct operating system isn't necessarily the most successful and accepted operating system. Have a sense of pragmatism if you are developing something that is not just a fancy research project. And it's especially true for ext4. I bet nobody would care about what it did if it called itself bloody-fast-next-gen-fs, and of course probably nobody would use it either. But since it's putting the "ext" and "next default Linux filesystem in all distros" hat on, it'd better take both the glory and the crap with it. So, no matter whether ext3 made some mistakes, you can't just throw it all away while keeping its name to give people a false sense of comfort. I am really glad that Theodore changed ext4 to handle the common practice of truncate/rename sequences. It's absolutely necessary. It's not a "favor for stupid user space", but a mandatory requirement if you even remotely want it to be a general-purpose file system. In the end, it doesn't matter how standards-compliant you are - people will only choose the filesystem that is the most reliable, fastest, and works with the largest number of applications. Hua --
It would probably be good to think about something like this, because
there are currently really two totally different cases of "fsync()" users.
(a) The "critical safety" kind (aka the "traditional" fsync user), where
there is a mail server or similar that will reply "all done" to the
sender, and has to _guarantee_ that the file is on disk in order for
data to simply not be lost.
This is a very different case from most desktop uses, and it's a very
hard "we have to wait until the thing is physically on disk"
situation. And it's the only case where people really traditionally
used "fsync()".
(b) The non-traditional UNIX usage that people historically didn't use
fsync() for: people editing their config files either
programmatically or by hand.
And this one really doesn't need at all the same kind of hard "wait
for it to hit the disk" semantics. It may well want a much softer
kind of "at least don't delete the old version until the new version
is stable" kind of thing.
And Alan - you can argue that fsync() has been around forever, but you
cannot possibly argue that people have used fsync() for file editing.
That's simply not true. It has happened, but it has been very rare. Yes,
some editors (vi, emacs) do it, but even there it's configurable. And
outside of databases, server apps and big editors, fsync is virtually
unheard of. How many sed-scripts have you seen to edit files? None of them
ever used fsync.
And with the ext3 performance profile for it, it sure is not getting any
more common either. If you have a desktop app that uses fsync(), that
application is DEAD IN THE WATER if people are doing anything else on the
machine. Those multi-second pauses aren't going to make people happy.
So the fact is, "people should always use fsync" simply isn't a realistic
expectation, nor is it historically accurate. Claiming it is is just
obviously bogus. And claiming that people .....
and looking at history, it's even pretty modern. From the vim logs:
Patch 6.2.499
Problem: When writing a file and halting the system, the file might be lost when using a journalling file system.
Solution: Use fsync() to flush the file data to disk after writing a file. (Radim Kolar)
Files: src/fileio.c
so it looks (assuming those patch numbers mean what they would seem to mean) that 'fsync()' in vim is from after 6.2 was released. Some time in 2004. So traditionally, even solid "good" programs like major editors never tried to fsync() their files. Btw, googling for that 6.2.499 patch also shows that people were rather unhappy with it. Why? It causes disk spinups in laptop mode etc. Which is very much not what you want to see for power reasons. So there are other, really fundamental, reasons why applications that don't have the "mailspool must not be lost" kind of critical issues should absolutely NOT use fsync(). Those applications would be much better off with some softer hint that can take things like laptop mode into account. Linus --
Far too many people don't - and it is unfortunate but people should learn Rename is a really nasty case and the standards don't help at all here so I agree entirely. There *isn't* a way to write a correct portable application that achieves that guarantee without the kernel making it for you. --
You're ignoring reality. Your definition of "quality software" is PURE SH*T. Look at that laptop disk spinup issue. Look at the performance issue. Look at something as nebulous as "usability". If adding fsync's makes software unusable (and it does), then you shouldn't call that "quality software". Alan, just please face that reality, and think about it for a moment. If fsync() was instantaneous, this discussion wouldn't exist. But read the thread. We're talking 3-5s under NORMAL load, with peaks of minutes. Linus --
Actually "pure sh*t" is most of the software currently written. The more code I read the happier I get that the lawmakers are finally sick of it and going to make damned sure software is subject to liability law. Boy The peaks of minutes is a bug. The 3-5 seconds is the thread discussion. Alan --
I really think you're gilding the edges of those old memories. The software 20 years ago wasn't that great. I'd say it was on the whole a whole lot crappier than it is today. It's just that we have much higher expectations, and our problem sizes have grown a _lot_ faster than rotating disk latencies have improved. People didn't worry about having a hundred megs of dirty data and doing an 'fsync' twenty years ago. Even on big hardware (if you _had_ a hundred megs of dirty data you didn't worry about latencies of a few seconds), never mind in the Linux world. This particular problem really largely boils down to "average memory capacity has expanded a _lot_ more than harddisk speeds have gone up". Linus --
On Fri, Mar 27, 2009 at 8:40 PM, Linus Torvalds wrote:
We are looking at the wrong problem. The problem is not "should userspace apps do fsync"; the problem is "how do we ensure reliable data where it's needed". It would be great if as a user I could have the option to set an fsync level and say: look, I have a fast fs, and I really care about data reliability in this server, so, level=0; or, hmm, what is this data reliability thing? I just want my phone to not be so damn slow, level=5. -- Felipe Contreras --
On the other side of the coin, major desktop apps Firefox and Thunderbird already use it: Firefox uses sqlite to log open web pages in case of a crash, and sqlite in turn sync's its journal as any good database app should. [I think tytso just got them to use fdatasync and a couple other improvements, to make this not-quite-so-bad] Thunderbird hits the disk for each email received -- always wonderful with those 1000-email git-commit-head downloads... :) So, arguments about "people should..." aside, existing desktops apps _do_ fsync and we get to deal with the bad performance :/ Jeff --
I had a very productive hour-long conversation with the Sqlite maintainer last weekend. He's already checked in a change to use fdatasync() everywhere, and he's looking into other changes that would help avoid needing to do a metadata sync because i_size has changed. One thing that will definitely help is if applications send the sqlite-specific SQL command "PRAGMA journal_mode = PERSIST;" when they first start up the Sqlite database connection. This will cause Sqlite to keep the rollback journal file around instead of deleting and then recreating it for each Sqlite transaction. This avoids at least one fsync() of the directory containing the rollback journal file. Combined with the change in Sqlite's development branch to use fdatasync() everywhere that fsync() is used, this should definitely be a huge improvement. In addition, Firefox 3.1 is reportedly going to use a union of an on-disk database and an in-memory database, and every 15 or 30 minutes or so (presumably tunable via some config parameter), the in-memory database changes will be synched out to the on-disk database. This will *definitely* help a lot, and also help improve SSD endurance. (Right now Firefox 3.0 writes 2.5 megabytes each time you click on a URL, not counting the Firefox cache; I have my Firefox cache directory symlinked to /tmp to save on unnecessary SSD writes, and I was still recording 2600k written to the filesystem each time I clicked on an HTML link. This means that for every 400 pages that I visit, Firefox is currently generating a full gigabyte of (in my view, unnecessary) writes to my SSD, all in the name of maintaining Firefox's "Awesome Bar". This rather nasty behaviour should hopefully be significantly improved with Firefox 3.1, or so the Sqlite maintainer tells me.) - Ted --
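To illustrate why a persistent journal file and fdatasync() go together, here is a minimal sketch of one way an application can avoid the i_size-driven metadata sync mentioned above: size the file once up front, then overwrite in place and flush with fdatasync(). This is an assumption about a possible technique, not a description of what Sqlite does; journal_open() and journal_write() are hypothetical names.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open (or create) a journal file and reserve `size` bytes once, so that
 * later in-place writes do not change the file size. Returns fd or -1. */
int journal_open(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    /* posix_fallocate() returns an error number, not -1, on failure. */
    if (posix_fallocate(fd, 0, size) != 0 || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Overwrite `len` bytes at `off` inside the reserved region and flush.
 * fdatasync() may skip metadata that is not needed to read the data
 * back; with an unchanged file size there is less metadata to flush. */
int journal_write(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);
}
```

The one-time fsync() in journal_open() pays the metadata cost at setup, which is the same trade the PERSIST journal mode makes by not deleting and recreating the journal for each transaction.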
Definitely, though it will be an interesting balance once user feedback starts to roll in... Firefox started doing this stuff because, when it or the window system or OS crashed, users like my wife would not lose the 50+ tabs they've opened and were actively using. :) So it's hard to see how users will react to going back to the days when firefox crashes once again mean lost work. [referring to the 15-30 min delay, not fsync(2)] Jeff --
You do know that Firefox had to _disable_ fsync() exactly because not disabling it was unacceptable? That whole "why does firefox stop for 5 No they don't. Read up on it. Really. Guys, I don't understand why you even argue. I've been complaining about fsync() performance for the last five years or so. It's taken you a long time to finally realize, and you still don't seem to "get it". PEOPLE LITERALLY REMOVE 'fsync()' CALLS BECAUSE THEY ARE UNACCEPTABLE FOR USERS. It really is that simple. Linus --
What is in Fedora 10 and Debian lenny's iceweasel both definitely sync to disk, as of today, according to my own tests. I'm talking about what's in real world user's hands today, not some hoped-for future version in developer CVS somewhere, depending on build options and who knows what else... Jeff --
Hmm. Go to "about:config" and check your "toolkit.storage.synchronous" setting. It _should_ say default integer 0 and that is what it says for me (yes, on Fedora 10). The values are: 0 = off, 1 = normal, 2 = full. If you don't have that "toolkit.storage.synchronous" entry, that means that you have an older version of firefox-3. And if you have some other value, it either means somebody changed it, or that Fedora is shipping with multiple different versions (the "official" Firefox source code defaults to 1, I think, but they suggested distributions change the default to 0). Linus --
Of course, I don't actually know that "off" really means "never fsync". It may be that it only cuts down on the number of fsync's. I do know that firefox with the original defaults ("fsync everywhere") was totally unusable, and that got fixed. But maybe it got fixed to "only pauses occasionally" rather than "every single page load brings everything to a screeching halt". Of course, your browsing history database is an excellent example of something you should _not_ care about that much, and where performance is a lot more important than "ooh, if the machine goes down suddenly, I need to be 100% up-to-date". Using fsync on that thing was just stupid, even regardless of any ext3 issues. Linus --
If you are doing a ton of web-based work with a bunch of tabs or windows open, you really like the post-crash restoration methods that Firefox now employs. Some users actually do want to checkpoint/restore their web work, regardless of whether it was the browser, the window system or the OS that crashed. You may not care about that, but others do care about the integrity of the database that stores the active FF state (Web URLs currently open), a database which necessarily changes for each URL visited. As an aside, I find it highly ironic that Firefox gained useful session management around the same time that some GNOME jarhead no-op'd GNOME session management[1] in X. Jeff [1] http://np237.livejournal.com/22014.html --
From: Jeff Garzik <jeff@garzik.org> Great, now all the KDE boo-birds might have to switch back, or even go to xfce4. If KDE and GNOME both make a bad release at the same time, then we'll really be in trouble. :-) --
.. fsync() isn't going to affect that one way or another unless the entire kernel freezes and dies. Firefox locks up the GUI here from time to time, but the kernel still flushes pages to disk, and even more quickly when alt-sysrq-s is used. Cheers --
To get work done which one really cares about, one can always choose a system which does not crash frequently. Those who run unstable drivers for thrills surely do it on boxes on which nothing important is being done, one would think. -- Stefan Richter -=====-=-=== -=-= -==-= http://arcgraph.de/sr/ --
Once software is perfect, there is definitely a lot of useless crash protection code to remove. Jeff --
Well, for the time being, why not base considerations for performance, interactivity, energy consumption, graceful restoration of application state etc. on the assumption that kernel crashes are suitably rare? (At least on systems where data loss would be of concern.) -- Stefan Richter -=====-=-=== -=-= -==-= http://arcgraph.de/sr/ --
The better solution seems to be the rather obvious one: the filesystem should commit data to disk before altering metadata. Much easier and more reliable to centralize it there, rather than rely (falsely) upon thousands of programs each performing numerous performance-killing fsync's. Cheers --
In more general terms: If overall system reliability is known insufficient, attempt to increase reliability of lower layers first. If this approach alone would be too costly in implementation or use, then also look at how to increase reliability of upper layers too. (Example: Running a suitably reliable kernel on a desktop for "mission-critical web browsing" is possible at low cost, at least if early decisions, e.g. for well-supported video hardware, went right.) Sure. I forgot: Not only the frequency of I/O disruption (e.g. due to kernel crash) factors into system reliability; the particular impact of such disruption is a factor too. (How hard is recovery? Will at least old data remain available? ...) -- Stefan Richter -=====-=-=== -=-= -==-= http://arcgraph.de/sr/ --
I suspect (at least from my own anecdotal evidence) that a lot of system crashes are basically X hanging. If you use the system as a desktop, at that point it's basically dead - and the difference between an X hang and a kernel crash is almost totally invisible to users. Us kernel people may walk over to another machine and ping or ssh in to see, but ask yourself how many normal users would do that - especially since DOS and Windows have taught people that they need to power-cycle (and, in all honesty, especially since there usually is very little else you can do even under Linux if X gets confused). And then part of the problem ends up being that while in theory the kernel can continue to write out dirty stuff, in practice people press the power button long before it can do so. The 30 second thing is really too long. And don't tell me about sysrq. I know about sysrq. It's very convenient for kernel people, but it's not like most people use it. But I absolutely hear you - people seem to think that "correctness" trumps all, but in reality, quite often users will be happier with a faster system - even if they know that they may lose data. They may curse themselves (or, more likely, the system) when they _do_ lose data, but they'll make the same choice all over two months later. Which is why I think that if the filesystem people think that the "data=ordered" mode is too damn fundamentally hard to make fast in the presence of "fsync", and all sane people (definition: me) think that the 30-second window for either "data=writeback" or the ext4 data writeout is too fragile, then we should look into something in between. Because, in the end, you do have to balance performance vs safety when it comes to disk writes. You absolutely have to delay things for performance, but it is always going to involve the risk of losing data that you do care about, but that you aren't willing (or able - random apps and tons of scripting comes to mind) to do a fsync ...
What if you added another phase in the journaling, after the data is written to the kernel, but before block allocation? As I understand, the current scenario goes like this:
1) A program writes a bunch of data to a file.
2) The kernel holds the data in buffer cache, delaying allocation.
3) Kernel updates file metadata in journal.
4) Some time later, kernel allocates blocks and writes data.
If things go boom between 3 and 4, you have the files in an inconsistent state. If the program does an fsync(), then the kernel has to write ALL data out to be consistent. What if you could do this:
1) A program writes a bunch of data to a file.
2) The kernel holds the data in buffer cache, delaying allocation.
3) The kernel writes a record to the journal saying "This data goes with this file, but I've not allocated any blocks for it yet."
4) Kernel updates file metadata in journal.
5) Sometime later, kernel allocates blocks for data, and notes the allocation in the journal.
6) Sometime later still the kernel commits the data to disk and updates the journal.
It seems to me this would be a not-unreasonable way to have both the advantages of delayed allocation AND get the data onto disk quickly. If the user wants to have speed over safety, you could skip steps 3 and 5 (data=ordered). You want safety, you force everything through steps 3 and 5 (data=journaled). You want a middle ground, you only do steps 3 and 5 for files where the program has done an fsync() (data=ordered + program calls fsync()). And if you want both speed and safety, you get a big battery-backed up RAM disk as the journal device and journal everything. --
Firstly, the FS data/metadata write-out order says nothing about when the write-out is started by the OS. It only implies consistency in the face of a crash during write-out. Hooray for BSD soft-updates. If the write-out is started immediately during or after write(2), congratulations, you are on your way to reinventing synchronous writes. If the write-out does not start immediately, then you have a many-seconds window for data loss. And it should be self-evident that userland application writers will have some situations where design requirements dictate minimizing or eliminating that window. Secondly, this email sub-thread is not talking about thousands of programs, it is talking about Firefox behavior. Firefox is a multi-OS portable application that has a design requirement that user data must be protected against crashes. (same concept as your word processor's auto-save feature) The author of such a portable application must ensure their app saves data against Windows Vista kernel crashes, HPUX kernel crashes, OS X window system crashes, X11 window system crashes, application crashes, etc. Can a portable app really rely on what Linux kernel hackers think the underlying filesystem _should_ do? No, it is either (a) not going to care at all, or (b) going to use fsync(2) or FlushFileBuffers() because of the guarantees provided across the OS spectrum, in light of the myriad OS filesystem caching, flushing, and ordering algorithms. Was the BSD soft-updates idea of FS data-before-metadata a good one? Yes. Obviously. It is the cornerstone of every SANE journalling-esque database or filesystem out there -- don't leave a window where your metadata is inconsistent. "Duh" :) But that says nothing about when a userland app's design requirements include ordered writes+flushes of its own application data. That is the common case when a userland app like Firefox uses a transactional database such as sqlite or db4. Thus it is the height of ...
Your idea of 'consistent' seems a bit fuzzy. Soft updates, afaiu, leave plenty of windows and reasons to run fsck. They only guarantee that all those windows result in lost space - data allocations without any references. It certainly prevents the worst problems, but I would use a different word for it. :) Jörn -- Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats. -- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by Raph Levien, 1979 --
Generalities are bad. For example:
write();
unlink();
<do more stuff>
close();
This is a clear case where you want metadata changed before data is committed to disk. In many cases, you don't even want the data to hit the disk here. Similarly, rsync does the magic open,write,close,rename sequence without an fsync before the rename. And it doesn't need the fsync, either. The proposed implicit fsync on rename will kill rsync
The filesystem should batch the fsyncs efficiently. If the filesystem doesn't handle fsync efficiently, then it is a bad filesystem choice for that workload.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
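The write/unlink/close pattern in this message is the classic scratch-file idiom. A minimal sketch follows, assuming a /tmp location and a hypothetical helper name, scratch_roundtrip(); it shows why forcing the data to disk before the metadata change would be wasted work here, since the file is gone as soon as it is closed.

```c
#include <stdlib.h>
#include <unistd.h>

/* Write `len` bytes to a scratch file, unlink it while it is still open,
 * keep using it, then close it. Reads the data back into `out` to show
 * the descriptor stays usable after unlink(). Returns 0 or -1. */
int scratch_roundtrip(const char *data, size_t len, char *out)
{
    char name[] = "/tmp/scratch-XXXXXX";
    int fd = mkstemp(name);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(name);
        return -1;
    }

    /* Metadata change first: the name disappears, and the data never
     * needs to reach the disk at all. */
    if (unlink(name) != 0) {
        close(fd);
        return -1;
    }

    /* <do more stuff>: the open descriptor still refers to the file. */
    if (pread(fd, out, len, 0) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* The last close releases the blocks; nothing survives a crash. */
    return close(fd);
}
```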
I agree. But unfortunately, I think we're going to be bullied into data=ordered semantics for the open/write/close/rename sequence, at least as the default. Ext4 has a noauto_da_alloc mount option (which Eric Sandeen suggested we rename to "no_pony" :-), for people who mostly run sane applications that use fsync(). For people who care about rsync's performance and who assume that they can always restart rsync if the system crashes while the rsync is running, rsync could add Yet Another Rsync Option :-) which explicitly unlinks the target file before the rename, which would
All I can do is apologize to all other filesystem developers profusely for ext3's data=ordered semantics; at this point, I very much regret that we made data=ordered the default for ext3. But the application writers vastly outnumber us, and realistically we're not going to be able to easily roll back eight years of application writers being trained that fsync() is not necessary, and actually is detrimental for ext3. - Ted --
I am slightly confused by the "data=ordered" thing that everyone is mentioning of late. In theory, it made sense to me before I tried it. I switched to mounting my ext3 as ext4, and I'm still seeing seriously delayed fsyncs. Theodore, I used a modified version of your fsync-tester.c to bench 1M writes, while doing a dd, and I'm still getting *almost* as bad of "fsync" performance as I was on ext3. On ext3, the fsync would usually not finish until the dd was complete. I am currently using Linus' tree at v2.6.29, in x86_64 mode. If you need more info, let me know.
tdamac ~ # mount
/dev/mapper/s-sys on / type ext4 (rw)
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2000
Your modified fsync test renamed to fs-bench...
tdamac kernel-sluggish # ./fs-bench --sync
write (sync: 1) time: 0.0301
write (sync: 1) time: 0.2098
write (sync: 1) time: 0.0291
write (sync: 1) time: 0.0264
write (sync: 1) time: 1.1664
write (sync: 1) time: 4.0421
write (sync: 1) time: 4.3212
write (sync: 1) time: 3.5316
write (sync: 1) time: 18.6760
write (sync: 1) time: 3.7851
write (sync: 1) time: 13.6281
write (sync: 1) time: 19.4889
write (sync: 1) time: 15.4923
write (sync: 1) time: 7.3491
write (sync: 1) time: 0.0269
write (sync: 1) time: 0.0275
...
This topic is important to me, as it has been affecting my home machine quite a bit. I can test things as I have time. Lastly, is there any way data=ordered could be re-written to be "smart" about not making other processes wait on fsync? Or is that sort of thing only handled in the scheduler? (not a kernel hacker here) Sorry if I'm interrupting. Perhaps I should even be starting another thread? --
How much memory do you have? On my 4gig X61 laptop, using a 5400 rpm laptop drive, I see typical times of 1 to 1.5 seconds, with a few outliers at 4-5 seconds. With ext3, the fsync times immediately jumped up to 6-8 seconds, with the outliers in the 13-15 second range. (This is with a filesystem formatted as ext3, and mounted as either ext3 or ext4; if the filesystem is formatted using "mke2fs -t ext4", what you see is a very smooth 1.2-1.5 seconds fsync latency; indirect blocks for very big files end up being quite inefficient.) So I'm seeing a definite difference --- but also please remember that "dd if=/dev/zero of=bigzero.img" really is an unfair, worst-case scenario, since you are dirtying memory as fast as your CPU will dirty pages. Normally, even if you are running distcc, the rate at which you can dirty pages will be throttled at your local network speed. You might want to try more normal workloads and see whether you are seeing distinct fsync latency differences with ext4. Even with the worst-case dd if=/dev/zero, I'm seeing major differences in my testing. - Ted --
Oh. I thought I had read somewhere that mounting ext4 over ext3 would solve the problem. Not sure where I read that now. Sorry for wasting Yes, I realize that. When trying to find performance problems I try to be as *unfair* as possible. :D Thanks Ted. --
Well, I believe it should solve it for most realistic workloads (where I don't think "dd if=/dev/zero of=bigzero.img" is realistic). Looking more closely at the statistics, the delays aren't coming from trying to flush the data blocks in data=ordered mode. If we disable delayed allocation (mount -o nodelalloc), you'll see this when you look at /proc/fs/jbd2/<dev>/history:
R/C  tid  wait  run   lock  flush  log   hndls  block  inlog  ctime  write  drop  close
R    12   23    3836  0     1460   2563  50129  56     57
R    13   0     5023  0     1056   2100  64436  70     71
R    14   0     3156  0     1433   1803  40816  47     48
R    15   0     4250  0     1206   2473  57623  63     64
R    16   0     5000  0     1516   1136  61087  67     68
Note the amount of time in milliseconds in the flush column. That's time spent flushing the allocated data blocks to disk. This goes away once you enable delayed allocation:
R/C  tid  wait  run   lock  flush  log   hndls  block  inlog  ctime  write  drop  close
R    56   0     2283  0     10     1250  32735  37     38
R    57   0     2463  0     13     1126  31297  38     39
R    58   0     2413  0     13     1243  35340  40     41
R    59   3     2383  0     20     1270  30760  38     39
R    60   0     2316  0     23     1176  33696  38     39
R    61   0     2266  0     23     1150  29888  37     38
R    62   0     2490  0     26     1140  35661  39     40
You may see slightly worse times since I'm running with a patch (which will be pushed for 2.6.30) that makes sure that the blocks we are writing during the "log" phase are written using WRITE_SYNC instead of WRITE. (Without this patch, the huge amount of writes caused by the VM trying to keep up with pages being dirtied at CPU speeds via "dd if=/dev/zero..." will interfere with writes to the journal.) During the log phase (which is averaging around 2 seconds for nodelalloc, and 1 second with delayed allocation enabled), we write the metadata to the ...
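The averages Ted mentions can be checked against the quoted history rows with a short script. This is only a sketch: it assumes the "R" rows fill the first nine numeric columns of the header (tid through inlog), which matches where Ted points for the flush times, and the helper names are made up for illustration.

```python
# Columns assumed to apply to "R" rows, in the order given by the header
# quoted in the message (tid ... inlog). This mapping is an assumption.
R_COLUMNS = ("tid", "wait", "run", "lock", "flush", "log",
             "hndls", "block", "inlog")

# Rows copied from the message: delayed allocation disabled (nodelalloc).
NODELALLOC = """\
R 12 23 3836 0 1460 2563 50129 56 57
R 13 0 5023 0 1056 2100 64436 70 71
R 14 0 3156 0 1433 1803 40816 47 48
R 15 0 4250 0 1206 2473 57623 63 64
R 16 0 5000 0 1516 1136 61087 67 68
"""

# Rows copied from the message: delayed allocation enabled.
DELALLOC = """\
R 56 0 2283 0 10 1250 32735 37 38
R 57 0 2463 0 13 1126 31297 38 39
R 58 0 2413 0 13 1243 35340 40 41
R 59 3 2383 0 20 1270 30760 38 39
R 60 0 2316 0 23 1176 33696 38 39
R 61 0 2266 0 23 1150 29888 37 38
R 62 0 2490 0 26 1140 35661 39 40
"""


def parse_r_rows(text):
    """Turn each well-formed 'R ...' line into a dict keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(R_COLUMNS) + 1 or fields[0] != "R":
            continue  # skip headers, "C" rows and anything malformed
        rows.append(dict(zip(R_COLUMNS, (int(f) for f in fields[1:]))))
    return rows


def mean(rows, column):
    """Arithmetic mean of one column, in the file's own units (ms)."""
    return sum(row[column] for row in rows) / len(rows)


if __name__ == "__main__":
    for label, text in (("nodelalloc", NODELALLOC), ("delalloc", DELALLOC)):
        rows = parse_r_rows(text)
        print("%-10s flush %.1f ms, log %.1f ms"
              % (label, mean(rows, "flush"), mean(rows, "log")))
```

On these rows the mean log time is about 2 seconds without delayed allocation and a bit over 1 second with it, in line with the figures given in the message, while the mean flush time drops from over a second to a couple of dozen milliseconds.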
Pardon my french, but that is a fucking joke. You are making a judgement call that one application is more important than another application and trying to impose that on everyone. You are saying that we should perturb a well designed and written backup application that is embedded into critical scripts all around the world for the sake of desktop application that has developers that are too fucking lazy to fix their bugs. If you want to trade rsync performance for desktop performance, do it in the filesystem that is aimed at the desktop. Don't fuck rename up for filesystems that are aimed at the server market and don't want to implement performance sucking hacks to work around fucked up desktop applications. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
You are welcome to argue with the desktop application writers (and Linus, who has sided with them). I *knew* this was a fight I was not going to win, so I implemented the replace-via-rename workaround, even before I started trying to convince application writers that they should write more portable code that would be safe on filesystems such as, say, XFS. And it looks like we're losing that battle as well; it's hard to get people to write correct, portable code! (I *told* the application writers that I was the moderate on this one, even as they were flaming me to a crisp. Given that I'm taking flak from both sides, it's to me a good indication that the design choices made for What I did was create a mount option for system administrators interested in the server market. And an rsync option that unlinks the target filesystem first really isn't that big of a deal --- have you seen how many options rsync already has? It's been a running joke with the rsync developers. :-) If XFS doesn't want to try to support the desktop market, that's fine --- it's your choice. But at least as far as desktop application programmers, this is not a fight we're going to win. It makes me sad, but I'm enough of a realist to understand that. - Ted --
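The "correct, portable code" at issue is usually understood as the write-to-temporary, fsync, then rename sequence. The following is a minimal sketch of that pattern, not code from any application discussed in this thread; the function name is hypothetical, and the final directory fsync is an extra precaution that assumes a POSIX system where a directory can be opened and synced.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` so that a crash leaves the old or the new
    contents, never a truncated file. Hypothetical helper for illustration.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the rename below cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data is stable before the rename
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "settings.conf")
        replace_file_durably(target, b"version=1\n")
        replace_file_durably(target, b"version=2\n")
        with open(target, "rb") as handle:
            print(handle.read())
```

Applications that skip the fsync before the rename are the ones the replace-via-rename workaround is meant to protect. Note that mkstemp creates the file with mode 0600, so a real program would also have to carry over the original file's permissions.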
It seems you still didn't get the point. ext3 data=ordered is not the problem. The problem is that the average developer doesn't expect the fs to _re-order_ stuff. This is how most common fs worked long before ext3 was introduced. They just know that there is caching and they might lose recent data, but they expect the fs on disk to be a snapshot of the fs in memory at some time before the crash (except when crashing while writing). But the re-ordering brings it to a state that has never been in memory. data=ordered is just reflecting this thinking. With data=writeback as the default the users would have lost data and would have simply chosen a different fs instead of twisting the params. Or the distros would have made data=ordered the default to prevent being blamed for the data loss. And still I don't know any reason why it makes sense to write the metadata to non-existing data immediately instead of delaying that, too. --
No it isn't. Standard Unix file systems made no such guarantee and would write out data out of order. The disk scheduler would then further re-order things. If you think the "guarantees" from before ext3 are normal defaults you've been writing junk code --
You surely know that better: Did fs actually write "later" data quite long before "earlier" data? During the flush data may be re-ordered, but I'm still on ReiserFS since it was considered stable in some SuSE 7.x. And I expected it to be fairly ordered, but as a network protocol programmer I didn't rely on the ordering of fs write-outs yet. --
People keep forgetting that storage (even on your commodity s-ata class of drives) has very large & volatile cache. The disk firmware can hold writes in that cache as long as it wants, reorder its writes into anything that makes sense and has no explicit ordering promises. This is where the write barrier code comes in - for file systems that care about ordering for data, we use barrier ops to impose the required ordering. In a similar way, fsync() gives applications the power to impose their own ordering. If we assume that we can "save" an fsync cost with ordering mode, we have to keep in mind that the file system will need to do the expensive With reiserfs, you will have barriers on by default in SLES/opensuse which will keep (at least fs meta-data) properly ordered.... ric --
.. Hi Ric, No, we don't forget about those drive caches. But in practice, for nearly everyone, they don't actually matter. The kernel can crash, and the drives, in practice, will still flush their caches to media by themselves. Within a second or two. Sure, there are cases where this might not happen (total power fail), but those are quite rare for desktop users -- and especially for the most common variety of desktop user: notebook users (whose machines have built-in UPSs). Cheers --
Here I disagree - nearly everyone has their critical data being manipulated in large data centers on top of Linux servers. We all can routinely suffer when linux crashes and loses data at big sites like google, amazon, hospitals or your local bank. Even with desktops, I am not positive that the drive write cache survives a kernel crash without data loss. If I remember correctly, Chris's tests used crashes (not power outages) to display the data corruption that happened without Unless of course you push your luck with your battery and run it until really out of power, but in general, I do agree that laptops and notebook users have a reasonably robust built in UPS. ric --
.. Linux f/s barriers != drive write caches. Drive write caches are an almost total non-issue for desktop users, except on the (very rare) event of a total, sudden power failure during extended write outs. Very rare. Yes, a huge problem for server farms. No question. But the majority of Linux systems are probably (still) desktops/notebooks. Cheers --
Heck, even I have lost power on a plane, while a laptop in laptop mode But it doesn't really matter who is what majority, does it? At the present time at least, we have not designated any filesystems "desktop only", nor have we declared Linux a desktop-only OS. Any generalized decision that hurts servers to help desktops would be short-sighted. Robbing Peter, to pay Paul, is no formula for OS success. Jeff --
I am confused as to why you think that barriers (flush barriers specifically) are not equivalent to drive write cache. We disable barriers when the write cache is off, use them only to ensure that our ordering for fs transactions survives any power loss. No one should be enabling barriers on linux file systems if your write cache is disabled or if you have a battery backed write cache (say on an enterprise class disk array). Chris' test of barriers (with write cache enabled) did show for desktop class boxes that you would get file system corruption (i.e., need to fsck the disk) a huge percentage of the time. Sudden power failures are not rare for desktops in my personal experience, I see them several times a year in New England both at home (ice, tree limbs, etc) or at work (unplanned outages for repair, broken AC, etc). Ric --
.. Sure, no doubt there. But it's due to the kernel crash, not due to the write cache on the drive. Anything in the drive's write cache very probably made it to the media within a second or two of arriving there. So with or without a write cache, the same result should happen for those tests. Of course, if you disable barriers *and* write cache, then you are no longer testing the same kernel code. I'm not arguing against battery backup or UPSs, or *for* blindly trusting write caches without reliable power. Just pointing out that they're not the evil that some folks seem to believe they are. Cheers --
A modern S-ATA drive has up to 32MB of write cache. If you lose power or suffer a sudden reboot (that can reset the bus at least), I am pretty sure that your Here, I still disagree. All of the test that we have done have shown that write cache enabled/barriers off will provably result in fs corruption. It would be great to have Chris revise his earlier barrier/corruption test to I run with write cache and barriers enabled routinely, but would not run without working barriers on any desktop box when the drives have write cache enabled having spent too many hours watching fsck churn :-) ric --
At least traditionally, it's worth to note that 32MB of on-disk cache is not the same as 32MB of kernel write cache. The drive caches tend to be more like track caches - you tend to have a few large cache entries (segments), not something like a sector cache. And I seriously doubt the disk will let you fill them up with writes: it likely has things like the sector remapping tables in those caches too. It's hard to find information about the cache organization of modern drives, but at least a few years ago, some of them literally had just a single segment, or just a few segments (ie a "8MB cache" might be eight segments of one megabyte each). The reason that matters is that those disks are very good at linear throughput. The latency for writing out eight big segments is likely not really noticeably different from the latency of writing out eight single sectors spread out across the disk - they both do eight operations, and the difference between an op that writes a big chunk of a track and writing a single sector isn't necessarily all that noticeable. So if you have a 8MB drive cache, it's very likely that the drive can flush its cache in just a few seeks, and we're still talking milliseconds. In contrast, even just 8MB of OS caches could have _hundreds_ of seeks and take several seconds to write out. Linus --
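Linus's comparison can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the 10 ms per operation and the count of 300 scattered writes are illustrative assumptions, not measurements from this thread; only the "eight segments" figure comes from the message.

```python
# Assumed cost of one seek plus one write, whether the write is a whole
# cache segment or a single sector (the message argues these cost about
# the same). 10 ms is an illustrative assumption.
ASSUMED_MS_PER_OPERATION = 10.0


def flush_time_ms(operations, ms_per_operation=ASSUMED_MS_PER_OPERATION):
    """Estimated time to write out `operations` separately seeked chunks."""
    return operations * ms_per_operation


if __name__ == "__main__":
    # An "8MB cache" organised as eight one-megabyte segments.
    drive_cache = flush_time_ms(8)
    # 8MB of OS cache scattered over the disk: "hundreds" of seeks;
    # 300 is an arbitrary stand-in for that.
    os_cache = flush_time_ms(300)
    print("drive cache: %.0f ms, OS cache: %.0f ms" % (drive_cache, os_cache))
```

Under these assumptions the drive cache drains in well under a tenth of a second while the same amount of scattered OS cache takes seconds, which is the gap the message describes.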
.. I spent an entire day recently, trying to see if I could significantly fill up the 32MB cache on a 750GB Hitachi SATA drive here. With deliberate/random write patterns, big and small, near and far, I could not fill the drive with anything approaching a full second of latent write-cache flush time. Not even close. Which is a pity, because I really wanted to do some testing related to a deep write cache. But it just wouldn't happen. I tried this again on a 16MB cache of a Seagate drive, no difference. Bummer. :) --
Try it with laptop drives. You might get to a second, or at least hundreds of ms (not counting the spinup delay if it went to sleep, obviously). You probably tested desktop drives (that 750GB Hitachi one is not a low end one, and I assume the Seagate one isn't either). You'll have a much easier time getting long latencies when seeks take tens of ms, and the platter rotates at some pitiful 3600rpm (ok, I guess those drives are hard to find these days - I guess 4200rpm is the norm even for 1.8" laptop harddrives). And also - this is probably obvious to you, but it might not be immediately obvious to everybody - make sure that you do have TCQ going, and at full depth. If the drive supports TCQ (and they all do, these days) it is quite possible that the drive firmware basically limits the write caching to one segment per TCQ entry (or at least to something smallish). Why? Because that really simplifies some of the problem space for the firmware a _lot_ - if you have at least as many segments in your cache as your max TCQ depth, it means that you always have one segment free to be re-used without any physical IO when a new command comes in. And if I were a disk firmware engineer, I'd try my damndest to keep my problem space simple, so I would do exactly that kind of "limit the number of dirty cache segments by the queue size" thing. But I dunno. You may not want to touch those slow laptop drives with a ten-foot pole. It's certainly not my favorite pastime. Linus --
.. Oh yes, absolutely -- I tried with and without NCQ (the SATA replacement for old-style TCQ), and with varying NCQ queue depths. No luck keeping the darned thing busy flushing afterwards for anything more than perhaps a few hundred milliseconds. I wasn't really interested in anything under a second, so I didn't measure it exactly though. The older and/or slower notebook drives (4200rpm) tend to have smaller onboard caches, too. Which makes them difficult to fill. I suspect I'd have much better "luck" with a slow-ish SSD that has a largish write cache. Dunno if those exist, and they'll have to get cheaper before I pick one up to deliberately bash on. :) Cheers --
I had some fun trying things with this, and I've been able to reliably trigger stalls in write cache of ~60 seconds on my seagate 500GB sata drive. The worst I saw was 214 seconds. It took a little experimentation, and I had to switch to the noop scheduler (no idea why). Also, I had to watch vmstat closely. When the test first started, vmstat was reporting 500kb/s or so write throughput. After the test ran for a few minutes, vmstat jumped up to 8MB/s. My guess is that the drive has some internal threshold for when it decides to only write in cache. The switch to 8MB/s is when it switched to cache only goodness. Or perhaps the attached program is buggy and I'll end up looking silly...it was some quick coding. The test forks two procs. One proc does 4k writes to the first 26MB of the test file (/dev/sdb for me). These writes are O_DIRECT, and use a block size of 4k. The idea is that we fill the cache with work that is very beneficial to keep in cache, but that the drive will tend to flush out because it is filling up tracks. The second proc O_DIRECT writes to two adjacent sectors far away from the hot writes from the first proc, and it puts in a timestamp from just before the write. Every second or so, this timestamp is printed to stderr. The drive will want to keep these two sectors in cache because we are constantly overwriting them. (It's worth mentioning this is a destructive test. Running it on /dev/sdb will overwrite the first 64MB of the drive!!!!) Sample output: # ./wb-latency /dev/sdb Found tv 1238434622.461527 starting hot writes run starting tester run current time 1238435045.529751 current time 1238435046.531250 ... current time 1238435063.772456 current time 1238435064.788639 current time 1238435065.814101 current time 1238435066.847704 Right here, I pull the power cord. The box comes back up, and I run: # ./wb-latency -c /dev/sdb Found tv 1238435067.347829 When -c is passed, it just reads the timestamp out of the ...
.. I'd be more interested in how you managed that (above), than the quite different test you describe below. Yes, different, I think. The test below just times how long a single chunk of data might stay in-drive cache under constant load, rather than how long it takes to flush the drive cache on command. Right? --
That's right, it is testing for starvation in a single sector, not for
how long the cache flush actually takes. But, your remark from higher
up in the thread was this:
>
> Anything in the drive's write cache very probably made
> it to the media within a second or two of arriving there.
>
Sorry if I misread things. But the goal is just to show that it really
does matter if we use a writeback cache with or without barriers. The
test has two datasets:
1) An area that is constantly overwritten sequentially
2) A single sector that stores a critical bit of data.
#1 is the filesystem log, #2 is the filesystem super. This isn't a
specialized workload ;)
-chris
--
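The timestamp check at the heart of this test (stamp a fixed sector over and over, then read it back after a power cut to see how stale it is) can be sketched as follows. This is not the wb-latency program, whose source is not included here: it works on an ordinary file rather than /dev/sdb, does not use O_DIRECT, and the names, sector size and on-disk layout are all assumptions.

```python
import os
import struct
import tempfile
import time

SECTOR_SIZE = 512            # assumed sector size
STAMP = struct.Struct("<d")  # assumed layout: one little-endian double


def write_stamp(fd, offset, now=None):
    """Overwrite the sector at `offset` with the current time; return it."""
    now = time.time() if now is None else now
    sector = STAMP.pack(now).ljust(SECTOR_SIZE, b"\0")
    os.pwrite(fd, sector, offset)
    return now


def read_stamp(fd, offset):
    """Read back the timestamp stored at `offset`."""
    return STAMP.unpack(os.pread(fd, STAMP.size, offset))[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stamp.img")  # a plain file, not a disk
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            written = write_stamp(fd, 32 * 1024 * 1024)
            found = read_stamp(fd, 32 * 1024 * 1024)
            print("Found tv %.6f (written %.6f)" % (found, written))
        finally:
            os.close(fd)
```

In the real test the interesting number is the gap between the last timestamp the program reported writing before the power was cut and the one found on the media afterwards; reading back through the page cache as above cannot show that, so this only illustrates the bookkeeping.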
.. Yeah, but that was in the context of how long the drive takes to clear out its cache when there's a (brief) break in the action. Still, it's really good to see hard data on a drive that actually .. Good points. I'm thinking of perhaps acquiring an OCZ Vertex SSD. The 120GB ones apparently have 64MB of RAM inside, much of which is used to cache data heading to the flash. I wonder how long it takes to empty out that sucker! Cheers --
I remember cfq having a bug (or a feature?) that prevents queue depths deeper than 1.. so with noop you get more ios to the queue. -- Pasi --
Well, when it comes to disk caches, it really does make sense to start looking at what breaks. For example, it is obviously true that any half-way modern disk has megabytes of caches, and write caching is quite often enabled by default. BUT! The write-caches on disk are rather different in many very fundamental ways from the kernel write caches. One of the differences is that no disk I've ever heard of does write- caching for long times, unless it has battery back-up. Yes, yes, you can probably find firmware that has some odd starvation issue, and if the disk is constantly busy and the access patterns are _just_ right the writes can take a long time, but realistically we're talking delaying and re-ordering things by milliseconds. We're not talking seconds or tens of seconds. And that's really quite a _big_ difference in itself. It may not be qualitatively all that different (re-ordering is re-ordering, delays are delays), but IN PRACTICE there's an absolutely huge difference between delaying and re-ordering writes over milliseconds and doing so over 30s. The other (huge) difference is that the on-disk write caching generally fails only if the drive power fails. Yes, there's a software component to it (buggy firmware), but you can really approximate the whole "disk write caches didn't get flushed" with "powerfail". Kernel data caches? Let's be honest. The kernel can fail for a thousand different reasons, including very much _any_ component failing, rather than just the power supply. But also obviously including bugs. So when people bring up on-disk caching, it really is a totally different thing from the kernel delaying writes. So it's entirely reasonable to say "leave the disk doing write caching, and don't force flushing", while still saying "the kernel should order the writes it does". Thinking that this is somehow a black-and-white issue where "ordered writes" always has to imply "cache flush commands" is simply wrong. It is ...
Largely correct above - most disks will gradually destage writes from their cache. Large, sequential writes might entirely bypass the write cache and be sent (more or less) immediately out to permanent storage. I still disagree strongly with the don't force flush idea - we have an absolute and critical need to have ordered writes that will survive a power failure for any file system that is built on transactions (or data base). The big issues are that for s-ata drives, our flush mechanism is really, really primitive and brutal. We could/should try to validate a better and less onerous I spent a very long time looking at huge numbers of installed systems (millions of file systems deployed in the field), including taking part in weekly analysis of why things failed, whether the rates of failure went up or down with a given configuration, etc. so I can fully appreciate all of the ways drives (or SSD's!) can magically eat your data. What you have to keep in mind is the order of magnitude of various buckets of failures - software crashes/code bugs tend to dominate, followed by drive failures, followed by power supplies, etc. I have personally seen a huge reduction in the "software" rate of failures when you get the write barriers (forced write cache flushing) working properly with a Again, you have to focus on the errors that happen in order of the prevalence. The number of boxes, over a 3 year period, that have an unexpected power loss is much, much higher than the number of boxes that have a disk head crash (probably the number one cause of hard disk failure). I do agree that we need to do other (background) tasks to detect things like the that drives can have (lots of neat terms that give file system people nightmare in the drive industry: "adjacent track erasures", "over powered seeks", "hi fly writes" just to name my favourites). Having full checksumming for data blocks and metadata blocks in btrfs will allow This is pretty much a double ...
Read that sentence of yours again. In particular, read the "we" part, and ponder. YOU have that absolute and critical need. Others? Likely not so much. The reason people run "data=ordered" on their laptops is not just because it's the default - rather, it's the default _because_ it's the one that avoids most obvious problems. And for 99% of all people, that's what they want. And as mentioned, if you have to have absolute requirements, you absolutely MUST be using real RAID with real protection (not just RAID0). Not "should". MUST. If you don't do redundancy, your disk _will_ eventually eat your data. Not because the OS wrote in the wrong order, or the disk cached writes, but simply because bad things do happen. But turn that around, and say: if you don't have redundant disks, then pretty much by definition those drive flushes won't be guaranteeing your That's one of the issues. The cost of those flushes can be really quite high, and as mentioned, in the absence of redundancy you don't actually Well, I can go mainly by my own anecdotal evidence, and so far I've actually had more catastrophic data failure from failed drives than anything else. OS crashes in the middle of a "yum update"? Yup, been there, done that, it was really painful. But it was painful in a "damn, I need to force a re-install of a couple of rpms". Actual failed drives that got read errors? I seem to average almost one a year. It's been overheating laptops, and it's been power outages that Sure. And those "write flushes" really only cover a rather small percentage. For many setups, the other corruption issues (drive failure) are not just more common, but generally more disastrous anyway. So why The software rate of failures should only care about the software write barriers (ie the ones that order the OS elevator - NOT the ones that actually tell the disk to flush itself). Linus --
My "we" is meant to be the file system writers - we build our journalled file systems on top of these assumptions about ordering. Not having them punts this Simply not true. To build reliable systems, you need reliable components. It is perfectly normal to build non-raided systems that are components of a larger storage pool that don't do raid. Easy example would be two desktops using rsync, most "cloud" storage systems do something similar at the whole file level (i.e., write out my file 3 times). If you acknowledge back to a client a write, then have a power outage, the They do in fact provide that promise for the extremely common case of power I have measured the costs of the write flushes on a variety of devices, routinely, a cache flush is on the order of 10-20 ms with a healthy s-ata drive. Compared to the write speed of writing any large file from DRAM to storage, one 20ms cost to make sure it is on disk is normally in the noise. The trade off is clearly not as good for small files. And I will add, my data is built on years of real data from commodity hardware running normal Linux kernels - no special hardware. There are also a lot of good papers that the USENIX FAST people have put out (looking at failures in NetApp gear, the HPC servers in national labs and at google) that can help provide Heat is a major killer of spinning drives (as is severe cold). A lot of times, drives that have read errors only (not failed writes) might be fully recoverable if you can re-write that injured sector. What you should look for is a peak in the remapped sectors (via hdparm) - that usually is a moderately good indicator The elevator does not issue write barriers on its own - those write barriers are sent down by the file systems for transaction commits. I could be totally confused at this point, but I don't know of any sequential ordering needs that CFQ, etc have for their internal needs. ric --
.. Err, no. Yes, the flush itself will be very quick, since the drive is nearly always keeping up with the I/O already (as we are discussing in a separate subthread here!). But.. the cost of that FLUSH_CACHE command can be quite significant. To issue it, we first have to stop accepting R/W requests, and then wait for up to 32 of them currently in-flight to complete. Then issue the cache-flush, and wait for that to complete. Then resume R/W again. And FLUSH_CACHE is a PIO command for most libata hosts, so it has a multi-microsecond CPU hit as well as the I/O hit, whereas regular R/W commands will usually use less CPU because they are usually done via an automated host command queue. Tiny, but significant. And more so on smaller/slower end-user systems like netbooks than on datacenter servers, perhaps. Cheers --
No they really effectively don't. Not if the end result is "oops, the whole track is now unreadable" (regardless of whether it happened due to a write during power-out or during some entirely unrelated disk error). Your "flush" didn't result in a stable filesystem at all, it just resulted in a dead one. That's my point. Disks simply aren't that reliable. Anything you do with It's not worked for me, and yes, I've tried. Maybe I've been unlucky, but every single case I can remember of having read failures, that drive has been dead. Trying to re-write just the sectors with the error (and around it) didn't do squat, and rewriting the whole disk didn't work either. I'm sure it works for some "ok, the write just failed to take, and the CRC was bad" case, but that's apparently not what I've had. I suspect either the track markers got overwritten (and maybe a disk-specific low-level reformat would have helped, but at that point I was not going to trust the drive anyway, so I didn't care), or there was actual major physical damage You yourself said that software errors were your biggest issue. The write Right. But "elevator write barrier" vs "sending a drive flush command" are two totally independent issues. You can do one without the other (although doing a drive flush command without the write barrier is admittedly kind of pointless ;^) And my point is, IT MAKES SENSE to just do the elevator barrier, _without_ the drive command. If you worry much more about software (or non-disk component) failure than about power failures, you're better off just doing the software-level synchronization, and leaving the hardware alone. Linus --
They actually are reliable in this way, I have not seen disks fail as you seem to think that they do after a simple power failure. With barriers (and barrier flushes enabled), you don't get that kind of bad reads for tracks after a normal power outage. Some of the odd cases come from hot spotting of drives (say, rewriting the same sector over and over again) which can over many, many writes impact the integrity of the adjacent tracks. Or, you can get IO errors from temporary vibration (dropped the laptop or rolled a new machine down the data center). Those temporary errors are the ones that can be repaired. I don't know how else to convince you (lots of good wine? beer? :-)), but I have personally looked at this in depth. Certainly, "Trust me, I know disks" is not Laptop drives are more likely to fail hard - you might have really just had a bad head or similar issue. Mark Lord hacked in support for doing low level writes into hdparm - might be How you bucket software issues in a hardware company (old job, not here at Red Hat) would include things like "file system corrupt, but disk hardware good" which results from improper barrier configuration. A disk hardware failure would be something like the drive does not spin up, it has bad memory in the write cache, a broken head (actually, one of the most I guess we have to agree to disagree. File systems need ordering for transactions and recoverability. Doing barriers just in the elevator will appear to work well for casual users, but in any given large population (including desktops here), will produce more corrupted file systems, manual recoveries after power failure, etc. File systems people can work harder to reduce fsync latency, but getting rid of these fundamental building blocks is not really a good plan in my opinion. I am pretty sure that we can get a safe and high performing file system balance here that will not seem as bad as you have experienced. Ric --
But this is apples and oranges isn't it? All of the effort that goes into metadata journalling in ext3, ext4, xfs, reiserfs, jfs ... is to save us from the fsck time on restart, and ensure a consistent filesystem framework (metadata, that is, in general), after an unclean shutdown. That could be due to a system crash or a power outage. This is much more common in my personal experience than a drive failure. That journalling requires ordering guarantees, and with large drive write caches, and no ordering, it's not hard for it to go south to the point where things *do* get corrupted when you lose power or the drive resets in the middle of basically random write cache destaging. See Chris Mason's tests from a year or so ago, proving that ext3 is quite vulnerable to this - it likely explains some of the random htree corruption that occasionally gets reported to us. And yes, sometimes drives die, and then you are really screwed, but that's orthogonal to all of the above, I think. -Eric --
It's worked here. It would be nice to have a device mapper module that can just insert itself between the disk and the higher device mapper layer and "scrub" the disk, fetching unreadable sectors from Maybe a stupid question, but aren't tracks so small compared to the disk head that a physical head crash would take out multiple tracks at once? (the last one I experienced here took out a major part of the disk) Another case I have seen years ago was me writing data to a disk while it was still cold (I brought it home, plugged it in and started using it). Once the drive came up to temperature, it could no longer read the tracks it just wrote - maybe the disk expanded by more than it is willing to seek around for tracks due to thermal correction? Low level formatting the drive made it work perfectly and I kept using it until it was just No argument there. I have seen NCQ starvation on SATA disks, with some requests sitting in the drive for seconds, while the drive was busy handling hundreds of requests/second elsewhere... -- All rights reversed. --
If certain requests are hanging out in the drive's wbcache longer than others, that increases the probability that OS filesystem-required, elevator-provided ordering becomes skewed once requests are passed to drive firmware. The sad, sucky fact is that NCQ starvation implies FLUSH CACHE is more important than ever, if filesystems want to get ordering correct. IDEALLY, according to the SATA protocol spec, we could issue up to 32 NCQ commands to a SATA drive, each marked with the "FUA" bit to force the command to hit permanent media before returning. In theory, this NCQ+FUA mode gives the drive maximum ability to optimize parallel in-progress commands, decoupling command completion and command issue -- while also giving the OS complete control of ordering by virtue of emptying the SATA tagged command queue. In practice, NCQ+FUA flat out did not work on early drives, and performance was way under what you would expect for parallel write-thru command execution. I haven't benchmarked NCQ+FUA in a few years; it might be worth revisiting. Jeff --
Jeff Garzik wrote:
But are there drives out there that actually support FUA?
The only cases I've seen dmesg DIFFERENT from something like
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
^^^^^^^^^^^^^^^^^^^^^^^^^^
is with SOME SCSI drives. Even most modern SAS drives I've seen
report lack of support for DPO or FUA. Or at least the kernel
reports that.
In the SATA world, I've not seen a single case. Seagate (7200.9..7200.11,
Barracuda ES and ES2), WD (Caviar CE, Caviar Black, Caviar Green,
RE2 GP), Hitachi DeskStar and UltraStar (old and new), some others --
all the same, no DPO or FUA.
/mjt
--
Depends on your source of information: if you judge from probe messages, libata_fua==0 will imply !FUA-support. Jeff --
..
As your other post points out, lots of drives already support FUA,
but libata deliberately disables it by default (due to the performance
impact, similar to mounting a f/s with -osync).
For the curious, you can use this command to see if your hardware has FUA:
hdparm -I /dev/sd? | grep FUA
It will show lines like this for the drives that support it:
* WRITE_{DMA|MULTIPLE}_FUA_EXT
Cheers
--
If your drive supports NCQ, it is highly likely it supports FUA. By default, the libata driver _pretends_ your drive does not support FUA. grep the kernel source for libata_fua and check out the module parameter 'fua' Jeff --
Probably. My experiences (not _that_ many drives, but more than one) have

I've had one drive that just stopped spinning. On power-on, it would make these pitiful noises trying to get the platters to move, but not actually ever work. If I recall correctly, I got the data off it by letting it just cool down, then powering up (successfully) and transferring all the data

I _thought_ we stopped feeding new requests while the flush was active, so if you actually do a flush, that should never actually happen. But I didn't check. Linus --
You want to start using 'md' :-) With raid0,1,4,5,6,10, if it gets a read error, it finds the data from elsewhere and tries to over-write the read error and then read back. If that all works, then it assumes the drive is still good. This happens during normal IO and also when you 'scrub' the array, which e.g. Debian does on the first Sunday of the month by default. NeilBrown --
The really sad thing about that one is that the SCSI vendors had this problem over ten years ago with TCQ - and fixed it in the drives. --
How about the far more regular crash case ? We may be pretty reliable but we are hardly indestructible especially on random boxes with funky BIOSes or low grade hardware builds. For the generic sane low end server/high end desktop build with at least two drive software RAID the hardware failure for data loss case is pretty rare. Crashes yes, having to reboot to recover from a RAID failure sure but data loss far less so --
The regular crash case doesn't need to care about the disk write-cache AT ALL. The disk will finish the writes on its own long after the kernel crashed. That was my _point_. The write cache on the disk is generally a whole lot safer than the OS data cache. If there's a catastrophic software failure (outside of the disk firmware itself ;), then the OS data cache is gone. But the disk write cache will be written back. Of course, if you have an automatic and immediate "power-off-on-oops", you're screwed, but if so, you have bigger problems anyway. You need to wait at _least_ a second or two before you power off. Linus --
BSD FFS/UFS and earlier file systems could leave you with all sorts of ordering that was not guaranteed - you did get data written within about 30 seconds but no order guarantees and a crash/fsck could give you interesting partial updates .. really interesting. renaming was one fairly safe case as BSD FFS/UFS did rename synchronously for the most part. --
Here I have the same question, I don't expect or demand that anything be done in a particular order unless I force it so, and I expect there to be some corner case where the data is written and the metadata doesn't reflect that in the event of a failure, but I can't see that it is ever a good idea to have the metadata reflect the future and describe what things will look like if everything goes as planned. I have had enough of that BS from financial planners and politicians, metadata shouldn't try to predict the future just to save a ms here or there. It's also necessary to have the metadata match reality after fsync(), of course, or even the well behaved applications mentioned in this thread haven't a hope of staying consistent. Feel free to clarify why clairvoyant metadata is ever a good thing... -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
it's not that it's deliberately pushing metadata out ahead of file data, but say you have the following sequence

write to file1
update metadata for file1
write to file2
update metadata for file2

if file1 and file2 are in the same directory your software can finish all four of these steps before _any_ of the data gets pushed to disk. then when the system goes to write the metadata for file1 it is pushing the then-current copy of that sector to disk, which includes the metadata for file2, even though the data for file2 hasn't been written yet.

if you try to say 'flush all data blocks before metadata blocks' and have a lot of activity going on in a directory, and have to wait until it all stops before you write any of the metadata out, you could be blocked from writing the metadata for a _long_ time. Also, if someone does a fsync on any of those files you can end up waiting a long time for all that other data to get written out (especially if the files are still being modified while you are trying to do the fsync). As I understand it, this is the fundamental cause of the slow fsync calls on ext3 with data=ordered. David Lang --
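The shared-sector effect described above can be illustrated with a toy model. This is only a sketch: `ToyFS` and all of its fields are hypothetical names invented for the illustration, not kernel structures, and a Python dict stands in for the directory block.

```python
# Toy model: one directory block holds the metadata for every file in the
# directory, so flushing it for file1 also flushes whatever is recorded for
# file2 at that moment. All names are illustrative, not kernel structures.

class ToyFS:
    def __init__(self):
        self.dir_block = {}    # in-memory directory block: name -> size
        self.dirty_data = {}   # file data that is still only in memory
        self.disk_dir = {}     # directory block as last written to disk
        self.disk_data = {}    # file data that has reached the disk

    def write(self, name, data):
        self.dirty_data[name] = data       # "write to fileN"
        self.dir_block[name] = len(data)   # "update metadata for fileN"

    def flush_data(self, name):
        self.disk_data[name] = self.dirty_data.pop(name)

    def flush_dir_block(self):
        # The whole block goes out as one unit, in its then-current state.
        self.disk_dir = dict(self.dir_block)

    def exposed_after_crash(self):
        # Files whose on-disk metadata claims data that never reached disk.
        return sorted(name for name, size in self.disk_dir.items()
                      if size and name not in self.disk_data)


fs = ToyFS()
fs.write("file1", b"aaaa")
fs.write("file2", b"bbbb")
fs.flush_data("file1")     # only file1's data has been pushed out
fs.flush_dir_block()       # metadata for file1 *and* file2 hits the disk
print(fs.exposed_after_crash())   # ['file2']
```

A crash right after `flush_dir_block()` leaves file2 with metadata on disk but no data, which is the window the message is describing.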
Understood that it's not deliberate just careless. The two behaviors
which are reported are (a) updating a record in an existing file and
having the entire file content vanish, and (b) finding someone else's
old data in my file - a serious security issue. I haven't seen any
report of the case where a process unlinks or truncates a file, the disk
space gets reused, and then the system fails before the metadata is
updated, leaving the data written by some other process in the file
If you mean "write all data for that file" before the metadata, it would
seem to behave the way an fsync would, and the metadata should go out in
Your analysis sounds right to me,
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
"You are disgraced professional losers. And by the way, give us our money back."
- Representative Earl Pomeroy, Democrat of North Dakota
on the A.I.G. executives who were paid bonuses after a federal bailout.
--
ext3 eliminates this security issue by writing the data before the metadata. ext4 (and I think XFS) eliminate this security issue by not allocating the blocks until it goes to write the data out. I don't know except if another file in the directory gets modified while it's writing out the first two, that file now would need to get written out as well, before the metadata for that directory can be written. if you have a busy system (say a database or log server), where files are getting modified pretty constantly, it can be a long time before all the file data is written out and the system is idle enough to write the metadata. --
Thank you, David, for this use case, but I think the problem could be solved quite easily: At any write-out time, e.g. after collecting enough data for delayed allocation or at fsync() 1) copy the metadata in memory, i.e. snapshot it 2) write out the data corresponding to the metadata-snapshot 3) write out the snapshot of the metadata In that way subsequent metadata changes should not interfere with the metadata-update on disk. Andreas --
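The three steps proposed above can be sketched as a toy model. Everything here is an assumption made for illustration: `SnapshotCommitFS`, its dict-based "disk", and the `concurrent` hook that simulates another writer running mid-commit are hypothetical and say nothing about how a real filesystem would implement this.

```python
# Toy model of "snapshot metadata, write data, write snapshot".
# All names are hypothetical; dicts stand in for memory and disk.

class SnapshotCommitFS:
    def __init__(self):
        self.meta = {}        # live in-memory metadata: name -> size
        self.dirty = {}       # file data not yet on disk
        self.disk_meta = {}   # metadata as last committed to disk
        self.disk_data = {}   # file data that has reached the disk

    def write(self, name, data):
        self.dirty[name] = data
        self.meta[name] = len(data)

    def commit(self, concurrent=None):
        snapshot = dict(self.meta)              # 1) snapshot the metadata
        for name in snapshot:                   # 2) write the data it refers to
            if name in self.dirty:
                self.disk_data[name] = self.dirty.pop(name)
        if concurrent is not None:
            concurrent()                        # other writers keep going
        self.disk_meta = snapshot               # 3) write the snapshot itself


fs = SnapshotCommitFS()
fs.write("file1", b"aaaa")
fs.write("file2", b"bbbb")
# A third file is written while the commit is in progress.
fs.commit(concurrent=lambda: fs.write("file3", b"cccc"))
print(sorted(fs.disk_meta))   # ['file1', 'file2']
```

The late write to file3 changes the live metadata but not the snapshot, so the on-disk metadata only ever describes data that was written in step 2.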
the problem with this approach is that the dcache has no provision for there being two (or more) copies of the disk block in its cache, adding this would significantly complicate things (it was mentioned briefly a few days ago in this thread) David Lang --
It seems that it's obviously the "right way" to solve the problem though. How much does the dcache need to know about this "in flight" block (ok, blocks - I can imagine a pathological case where there were a stack of them all slightly different in the queue)? You'd be basically reinventing MVCC-like database logic with transactional commits at that point - so each fs "barrier" call would COW all the affected pages and write them down to disk. Bron. --
but if only one filesystem needs this capability is it really worth one aspect of mvcc systems is that they eat up space and require 'garbage collection' type functions. that could cause deadlocks if you aren't careful. David Lang --
No, it's not necessary. It should be possible for the specific fs to keep the metadata copy internally. And as long as these blocks are written immediately after writing the data, there should be no "queue" of copies, depending on how fsyncs are handled while the fs is committing. There might be one copy for the current commit and (at most) one copy corresponding to the most recent pending fsync. If there are multiple fsyncs before the commit is finished, the "pending copy" could simply be overwritten. Andreas --
Depends if that one filesystem is expected to have 90% of the
installed base or not, I guess. If not, then it's not worth
it. If having something like this makes that one filesystem
I guess the nice thing here is that the only consumer for the older
versions is the disk flushing thread, so figuring out when to cleanup
wouldn't be so hard as in a concurrent-users database.
But I'm speculating with no little hands-on experience with the
code. I just know I'd like the result...
Bron ( creating consistent pages on disk that never really
existed in memory sounds... exciting )
--
I think the sync point should be between the file system and the dcache,
with the data only going into the dcache when it's time to write it.
That also opens the door to doing atime better at no cost, atime changes
would be kept internal to the file system, and only be written at close
or fsync, even on a mount which does not use noatime or relatime. The
file system can keep that information and only write it when appropriate.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
"You are disgraced professional losers. And by the way, give us our money back."
- Representative Earl Pomeroy, Democrat of North Dakota
on the A.I.G. executives who were paid bonuses after a federal bailout.
--
I've been wondering about that during the last days. How about JFS and data loss (files containing zeroes after a crash), as compared to ext3, ext4, ordered and writeback journal modes? Is it safe? -- Hilsen Harald. --
El Thu, 02 Apr 2009 00:00:04 +0200 i have had zeroed conf files with jfs (shell history) and corrupted firefox history files too after power outages and the like. --
if you don't do a fsync you can (and will) lose data if there is a crash. period, end of statement, with all filesystems.

for all filesystems except ext3 in data=ordered or data=journal modes, journaling does _not_ mean that your files will have valid data in them. all it means is that your metadata will not be inconsistent (things like one block on disk showing up as being part of two different files).

this guarantee means that a crash is not likely to scramble your entire disk, but any data written shortly before the crash may not have made it to disk (and the files may contain garbage in the space that was allocated but not written). as such it is not necessary to do a fsck after every crash (it's still a good idea to do so every once in a while). that's _ALL_ that journaling is protecting you from.

delayed allocation and data=ordered are ways to address the security problem that the garbage data that could end up as part of the file could contain sensitive data that had been part of other files in the past. data=ordered and data=journal address this security risk by writing the data before they write the metadata (at the cost of long delays in writing the metadata out, and therefore long fsync times). XFS and ext4 solve the problem by not allocating the data blocks until they are actually ready to write the data. David Lang --
.. Err, no actually. I want a consistent disk state, either all old or all new data after a crash. Not loss of BOTH new and old data. And the example above is trying to show, what?? Looks like a temporary file case, except the code is buggy and should be doing the unlink() before the write() call. But thanks for looking at this stuff! --
Dave is right that if you write to a file and then unlink the same file, the data are orphaned. In that case you don't want the orphaned data to be written on disk. But Mark is right, too. Because in that case you probably also don't want any metadata to be written to the disk, unless the open() was already committed. You might have to update timestamps for the directory. So rephrasing it: The filesystem should not alter the metadata before writing the _linked_ data. --
Sorry, I'm afraid that rsync falls into the same category as the kde/gnome apps here. There are a lot of backup programs built around rsync, and every one of them risks losing the old copy of the file by renaming an unflushed new copy over it. rsync needs the flushing about a million times more than gnome and kde, and it doesn't have any option to do it automatically. It does have the option to create backups, which is how a percentage of people are using it, but I wouldn't call its current setup safe outside of ext3. -chris --
I wouldn't make it the default, but as an option, if the backup script would take responsibility for restarting rsync if the server crashes, and if the rsync process executes a global sync(2) call when it is complete, an option to make rsync delete the target file before doing the rename to defeat the replace-via-rename heuristic could be justifiable. - Ted --
If you crash while rsync is running, then the state of the copy is garbage anyway. You have to restart from scratch and rsync will detect such failures and resync the file. gnome/kde have no And therein lies the problem with a "flush-before-rename" semantic.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
If this were the recovery system they had in mind, then why use rename at all? They could just as easily overwrite the original in place. Using rename implies they want to replace the old with a complete new version. There's also the window where you crash after the rsync is done but Here I was just talking about a rsync --flush-after-rename or something, not an option from the kernel. -chris --
It is not a recovery system. The renaming procedure is almost atomic with e.g. reiser or ext3 (ordered), but simple overwriting would always Sure, but in that case you have only lost some of your _mirrored_ data. The original will usually be untouched by this. So after the restart you just start the mirroring process again, and hopefully, this time you get a perfect copy. In KDE and lots of other apps the _original_ config files (and not any copies) are "overlinked" with the new files by the rename. That's the difference. Andreas --
Well, we're considering a future where ext3 and reiser are no longer used, and applications are responsible for the flushing if they want renames atomic for data as well as metadata.

If we crash during the rsync, the backup logs will yell. If we crash just after the rsync, the backup logs won't know. The data could still

We don't run backup programs because we can use the original as a backup for the backup ;) From an rsync-for-backup point of view, the backup is the only copy. Yes, rsync could easily be fixed. Or maybe people just aren't worried, it's hard to say. Having the ext3 style flush with the rename makes the system easier to use, and easier to predict how it will react. rsync was originally brought up when someone asked about applications that do renames and don't care about atomic data replacement. If the flushing is a horrible thing, there must be a lot more examples? -chris --
As long as you only consider it, all will be fine ;-). As a user I don't want to use a filesystem which leaves a long gap between renaming the metadata and writing the data for it, that is having dirty, inconsistent metadata overwriting clean metadata. So Ted's quick pragmatic approach to patch it in the first step was good, even if it's possible that it won't be the final solution. Flushing in applications is not a suitable solution. Maybe barriers could be a solution, but to get something like this into _all_ the multitude of applications is very unlikely. There might be filesystems which use a delayed, but ordered mode. They could provide "atomic" renames, and perform much better, if applications do not flush with every file update. Andreas --
So have rsync call the sync() system call before it exits. Not a big deal, and not all that costly. So basically what I would suggest doing for people who are really worried about rsync performance with flush-on-rename is to create a patch to rsync which creates a new flag, --unlink-before-rename, which will defeat the flush-on-rename heuristic; and if this patch also causes rsync to call sync() when it is done, it should be quite safe. - Ted --
sync() isn't guaranteed to be synchronous. Treating it as such isn't portable. -- Matthew Garrett | mjg59@srcf.ucam.org --
Absolutely! That's what I thought all the time when following this (meanwhile quite grotesque) discussion. Even for ordinary home/office/laptop/desktop users (!=kernel developers), kernel crashes are simply not a realistic scenario any more to optimize anything for (which is due to the good work you guys are doing in making/keeping the kernel stable). Alex --
Good point. We should throw away all the journaling junk and just go back to ext2. Why pay the extra cost for something we shouldn't optimize for? --
The previous two posts were about assumptions at the level of application software, not at the kernel level. -- Stefan Richter -=====-=-=== -=-= -==-= http://arcgraph.de/sr/ --
as one of those users with many windows tabs open (a couple hundred normally), even the current firefox behavior isn't good enough because it doesn't let me _not_ load everything back in when a link I go to triggers a crash in firefox every time it loads. so what I do is do a git commit in cron every min of the history file. git can do the fsync as needed to get it to disk reasonably without firefox needing to do it _for_every_click_ like laptop mode, you need to be able to define "I'm willing to lose this much activity in the name of performance/power" ted's suggestion (in his blog) to tweak fsync to 'misbehave' when laptop mode is enabled (only pushing data out to disk when the disk is awake anyway, or the time has hit) would really work well for most users. servers (where you have the data integrity fsync usage) don't use laptop mode. desktops could use 'laptop mode' with a delay of 0.5 or 1 second and get pretty close to the guarantee that users want without a huge performance hit. --
The existential struggle is overall amusing: Application writers start using userland transactional databases for crash recovery and consistency, and in response, OS writers work to undercut the consistency guarantees currently provided by the OS. More seriously, if we get sqlite, db4 and a few others behaving sanely WRT fsync, you cover a wide swath of apps all at once. I absolutely agree that db4, sqlite and friends need to be smarter in the case of laptop mode or overall power saving. Jeff --
Actually, it makes a lot of sense, if you think about it in this way. The requirement is this; by default, data which is critical shouldn't be lost. (Whether this should be done by the filesystem performing magic, or the application/database programmer being careful about using fsync --- and whether we should treat all files as critical and to hell with performance, or only those which the application has designated as precious or nonprecious --- there is some dispute.) However, the system administrator should be able to say, "I want laptop mode functionality", and with the turn of a single dial, be able to say, "In order to save batteries, I'm OK with losing up to X seconds/minutes worth of work." I would envision a control panel GUI where there is one checkbox, "enable laptop mode", and another checkbox, "enable laptop mode only when on battery" (which is greyed out unless the first checkbox is enabled), and then a slidebar which allows the user to set how many seconds and/or minutes the user is willing to lose if the system crashes. At that point, it's up to the user. Maybe the defaults should be something like 15 seconds; maybe the defaults should be 5 seconds. Maybe the defaults should be automatically set to different values by different distributions, depending on whether said distro is willing to use badly unstable proprietary binary video drivers that crash if you look at them funny. The advantage of such a scheme is that there's a single knob for the user to control, instead of one for each application. And fundamentally, it should be OK for a user of the desktop and/or the system administrator to make this tradeoff. That's where the choice belongs; not to the application writer, and not to the filesystem maintainer, or OS programmers in general. If I have a Lenovo X61s which is rock solid stable, with Intel video drivers, I might be willing to risk losing up to 10 minutes of work, secure in the knowledge it's highly unlikely to happen. If I'm ...
Overall I agree, but I would rewrite that as: it's fair game as long as the OS doesn't undercut the deliberate write ordering performed by the userland application. When the "laptop mode fsync plug" is uncorked, writes should not be merged across an fsync(2) barrier; otherwise it becomes impossible to build transactional databases with any consistency guarantees at all. Jeff --
This is all about tradeoff. I guess everybody can afford losing the last 30 seconds of history (or 5mn ...). That's not that much of lost work... --
Definitely a difference! 1 for both, here. Deb is a fresh OS install and fresh homedir, but my F10 has been through many OS and ff config upgrades over the years. Jeff --
Hmm. I wonder where firefox gets its defaults then. I can well imagine that Debian has a different firefox build, with different defaults. But if your F10 thing also is set to 1, and still shows as "default", then that's odd, considering that mine shows 0. I have 'rpm -q firefox': firefox-3.0.7-1.fc10.x86_64. Is yours a 32-bit one? Maybe it comes with different defaults? And maybe firefox just has a very odd config setup and I don't understand what "default" means at all. Gene says he doesn't have that toolkit.storage.synchronous thing at all. Linus --
In my case the toolkit.storage.synchronous is present in both, set to 1 in Deb and bolded and set to 1 in F10 (firefox-3.0.7-1.fc10.x86_64). The latter's bold typeface makes me think my F10 FF toolkit.storage.synchronous setting is NOT set to the F10 default -- although I have never heard of this setting, and have certainly not manually tweaked it. The only FF setting I manually tweak is cache directory. Jeff --
I just let FF update itself to 3.0.8 (from mozilla, not fedora) and there is no 'toolkit' stuff whatsoever in about:config. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Jacquin's Postulate on Democratic Government: No man's life, liberty, or property are safe while the legislature is in session. --
El Fri, 27 Mar 2009 23:55:06 -0400 I do not have it either FF 3.0.8 Ubuntu 8.10. it does not appear searching with --
.. Okay, I'll bite. Exactly which version of FF has that variable? Cuz it ain't in the FF 3.0.8 that I'm running here. Thanks --
I _thought_ it was there since rc2 of FF-3, but clearly there are odd things afoot. You're the second person to report it not there. I'd suspect that I mistyped it, but I just cut-and-pasted it from my email to make sure. Maybe you did. What happens if you just write "sync" in the Filter: box? Nothing matches? Do you see firefox pausing a lot under disk load? If you just add that "toolkit.storage.synchronous" value by hand (right-click in the preference window, do "New" -> "Integer"), and write it in as zero, does it change behavior? Linus --
No, not with my iceweasel 3.0.7 (Debian/testing).
I couldn't find anything in the Debian patch to the source code, but the
source code contains
toolkit/components/contentprefs/src/nsContentPrefService.js 733-746:
// Turn off disk synchronization checking to reduce disk churn and speed up
// operations when prefs are changed rapidly (such as when a user repeatedly
// changes the value of the browser zoom setting for a site).
//
// Note: this could cause database corruption if the OS crashes or machine
// loses power before the data gets written to disk, but this is considered
// a reasonable risk for the not-so-critical data stored in this database.
//
// If you really don't want to take this risk, however, just set the
// toolkit.storage.synchronous pref to 1 (NORMAL synchronization) or 2
// (FULL synchronization), in which case mozStorageConnection::Initialize
// will use that value, and we won't override it here.
if (!this._prefSvc.prefHasUserValue("toolkit.storage.synchronous"))
  dbConnection.executeSimpleSQL("PRAGMA synchronous = OFF");
Probably they preferred the default value "off" so much that they even
I see iceweasel pausing/blocking a lot when loading stalling webpages,
but that's a different topic.
--
Are you telling us that the "Linux compatible" really means "Linux compatible, but only on ext3, only on x86, only on Ubuntu, only Gnome or KDE [1]"? If a program crashes on other setups, is it not a problem of the program but of the environment? sigh cate [1] Yes, I just saw an installation script that expects one of the two environments. --
This is a fairly narrow view of correct and possible. How can you make
"cat" fsync? grep? sort? How do they know they're not dealing with
critical data? Apps in general don't know, because "criticality" is a
property of the data itself and how it's used, not the tools operating on it.
My point isn't that "there should be a way of doing fsync from a shell
script" (which is probably true anyway), but that authors can't
generally anticipate when their program is going to be dealing with
something important. The conservative approach would be to fsync all
data on every close, but that's almost certainly the wrong thing for
everyone.
If the filesystem has reasonably strong inherent data-preserving
properties, then that's much better than scattering fsync everywhere.
fsync obviously makes sense in specific applications; it makes sense to
fsync when you're guaranteeing that a database commit hits stable
storage, etc. But generic tools can't reasonably perform fsyncs, and
it's not reasonable to say that "important data is always handled by
special important data tools".
J
--
Isn't it possible to compile a program that simply calls open()/fsync()/close() on a given file name? If yes, then in your scripts, you can do whatever you want with existing tools on a _scratch_ file, then call your fsync program on that scratch file and then rename it to the real file. No? In other words, given that you know that your data is critical, you will write processed data to another file, while preserving the original, store the new file safely and then rename it to the original. Just like the apps that know that their files are critical are supposed to do using the API. -- Bojan --
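The scratch-file recipe described above (write, fsync, then rename over the original) can be sketched as follows. `replace_durably` is a hypothetical helper name; the final fsync of the containing directory is an extra step not mentioned in the message, added on the assumption of a POSIX system where a directory can be opened read-only and fsynced.

```python
import os
import tempfile


def replace_durably(path, data):
    """Write data to a scratch file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The scratch file lives in the same directory so rename stays on one fs.
    fd, scratch = tempfile.mkstemp(dir=directory, prefix=".scratch-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer to the kernel
            os.fsync(f.fileno())    # ask the kernel to push it to the disk
        os.replace(scratch, path)   # rename(2): swaps the name atomically
    except BaseException:
        # Leave the original untouched and clean up the scratch file.
        try:
            os.unlink(scratch)
        except FileNotFoundError:
            pass
        raise
    # Assumption: POSIX. fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as `replace_durably("settings.conf", new_bytes)`; until the rename happens, the old file is still intact under its original name.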
And yet, FreeBSD seems to have a command just like that: http://www.freebsd.org/cgi/man.cgi?query=fsync&sektion=1&manpath=FreeBSD+7.1-R... -- Bojan --
I was thinking something like "munge_important_stuff | fsync > output" -
ie, cat which fsyncs on close. In fact, it's vaguely surprising that GNU
cat doesn't have this already.
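A "cat which fsyncs on close" could look roughly like the sketch below. `cat_fsync` is a hypothetical name, and the defaults of file descriptors 0 and 1 assume the shell has redirected stdout to a regular file, since fsync fails on a pipe or terminal.

```python
import os


def cat_fsync(src_fd=0, dst_fd=1, chunk=65536):
    """Copy src_fd to dst_fd, then fsync dst_fd before returning."""
    while True:
        buf = os.read(src_fd, chunk)
        if not buf:                 # empty read means end of input
            break
        view = memoryview(buf)
        while view:                 # os.write may write less than asked
            written = os.write(dst_fd, view)
            view = view[written:]
    # Raises OSError if dst_fd is a pipe or terminal rather than a file.
    os.fsync(dst_fd)
```

Saved as a script that calls `cat_fsync()` with no arguments, it would be used as the last stage of the pipeline in the message, with `> output` making descriptor 1 a regular file.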
J
--
Yeah, after I wrote my initial comment, I noticed you were saying essentially the same thing in your original post. I know, I should I have no idea why we don't have that either. FreeBSD code seems really straightforward. -- Bojan --
I just tried using dd with conv=fsync option and that kinda does what you mentioned. I see this at the end of strace:
---------------------------------
write(1, "<some data...>"..., 512) = 512
read(0, ""..., 512) = 0
fsync(1) = 0
close(0) = 0
close(1) = 0
---------------------------------
So, maybe GNU folks just don't want to have yet another tool for this. -- Bojan --
Huh, didn't know dd had grown that. Confusingly similar to the
completely different conv=sync, so it's a perfect dd addition. Ooh,
fdatasync too.
J
--
Well... fsync is quite expensive. If your disk is spun down, it costs 3 seconds+ and 3J+. If your disk is up, it will only take 20msec+. OTOH the rename trick on ext3 costs approximately nothing... Imagine those desktops where they want windows layout preserved. Having 30 second old layout is acceptable, losing layout altogether is not. If you add fsync to the window manager, user will see those 3 seconds+ delays, unless window manager gets multithreaded. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
IOW, what we could reasonably do is something along the lines of:

 - start off with some reasonable value for max background dirty (per block device) that defaults to something sane (quite possibly based on simply memory size).

 - assume that "foreground dirty" is just always 2* background dirty.

 - if we hit the "max foreground dirty" during memory allocation, then we shrink the background dirty value (logic: we never want to have to wait synchronously)

 - if we hit some maximum latency on writeback, shrink dirty aggressively and based on how long the latency was (because at that point we have a real _measure_ of how costly it is with that load).

 - if we start doing background dirtying, but never hit the foreground dirty even in dirty balancing (ie when a writer is actually _writing_, as opposed to hitting it when allocating memory by a non-writer), then slowly open up the window - we may be limiting too early.

.. add heuristics to taste. The point being, that if we do this based on real loads, and based on hitting the real problems, then we might actually be getting somewhere. In particular, if the filesystem sucks at writeout (ie the limiter is not the _disk_, but the filesystem serialization), then it should automatically also shrink the max dirty state.

The tunable then could become the maximum latency we accept or something like that. Or the hysteresis limits/rules for the soft "grow" or "shrink" events. At that point, maybe we could even find something that works for most people. Linus --
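The self-tuning rules listed above can be sketched as a small controller. This is only an illustration: `DirtyLimit`, its method names, and the constants 0.75 and 1.05 are all invented here, and the message itself leaves the actual shrink and grow rates open ("add heuristics to taste").

```python
# Sketch of an adaptive per-device dirty limit. Class name, method names and
# the shrink/grow factors are assumptions made for this illustration.

class DirtyLimit:
    def __init__(self, background, floor, ceiling, max_latency):
        self.background = float(background)    # pages before background writeback
        self.floor = float(floor)
        self.ceiling = float(ceiling)
        self.max_latency = float(max_latency)  # seconds; the single tunable

    @property
    def foreground(self):
        # "foreground dirty" is just always 2 * background dirty.
        return 2 * self.background

    def _set(self, value):
        self.background = min(self.ceiling, max(self.floor, value))

    def hit_foreground_in_alloc(self):
        # A memory allocation had to wait synchronously: shrink.
        self._set(self.background * 0.75)

    def writeback_latency(self, seconds):
        # Shrink in proportion to how far the latency overshot the target.
        if seconds > self.max_latency:
            self._set(self.background * self.max_latency / seconds)

    def background_only_cycle(self):
        # Writers never hit the foreground limit: slowly open the window.
        self._set(self.background * 1.05)


limit = DirtyLimit(background=1000, floor=100, ceiling=4000, max_latency=1.0)
limit.hit_foreground_in_alloc()
limit.writeback_latency(3.0)
print(limit.background, limit.foreground)
```

Shrinking multiplicatively and growing by a small factor is one way to get the asymmetric "shrink aggressively, open up slowly" behaviour; other rates would fit the description equally well.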
hm. It may not be too hard to account for seekiness. Simplest case: if we dirty a page and that page is file-contiguous to another already dirty page then don't increment the dirty page count by "1": increment it by 0.01. Another simple case would be to keep track of the _number_ of dirty inodes rather than simply lumping all dirty pages together. And then there's metadata. The dirty balancing code doesn't account for dirty inodes _at all_ at present. (Many years ago there was a bug wherein we could have zillions of dirty inodes and exactly zero dirty pages, and the writeback code wouldn't trigger at all - the inodes would just sit there until a page got dirtied - this might still be there). Then again, perhaps we don't need all those discrete heuristic things. Maybe it can all be done in mark_buffer_dirty(). Do some clever math+data-structure to track the seekiness of our dirtiness. Delayed allocation would mess that up though. --
Which is "all the time" in some configurations. It really needs to be self tuning internally based on the observed achieved rates (just as you don't use a script to tune your network bandwidth each day) --
We do a lot of dirty accounting on a per-backing_device basis. This
was added to stop slow devices from sucking up too much of the "40%
dirty" space. The allowable dirty space is now shared among all
devices in rough proportion to how quickly they write data out.
My memory of how it works isn't perfect, but we count write-out
completions both globally and per-bdi and maintain a fraction:
my-writeout-completions
--------------------------
total-writeout-completions
That device then gets a share of the available dirty space based on
the fraction.
The counts decay somehow so that the fraction represents recent
activity.
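A rough sketch of that proportional share, in plain C. It is not the kernel's implementation; the fixed device count, the halving decay, and every name here are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stdint.h>

#define NDEV 2	/* hypothetical number of backing devices */

/* Write-out completion counts, per device and globally. */
struct writeout_stats {
	uint64_t per_dev[NDEV];
	uint64_t total;
};

/* Record one completed write-out on device 'dev'. */
void note_writeout_completion(struct writeout_stats *s, int dev)
{
	assert(dev >= 0 && dev < NDEV);
	s->per_dev[dev]++;
	s->total++;
}

/* Halve every counter so the fraction tracks recent activity.
 * The total is recomputed so it stays the exact sum of the parts. */
void decay_writeout_stats(struct writeout_stats *s)
{
	s->total = 0;
	for (int i = 0; i < NDEV; i++) {
		s->per_dev[i] /= 2;
		s->total += s->per_dev[i];
	}
}

/* Share of the dirty limit for 'dev':
 * dirty_limit * my-writeout-completions / total-writeout-completions. */
uint64_t dirty_share(const struct writeout_stats *s, int dev,
		     uint64_t dirty_limit)
{
	assert(dev >= 0 && dev < NDEV);
	if (s->total == 0)
		return dirty_limit / NDEV;	/* no history: split evenly (assumption) */
	return dirty_limit * s->per_dev[dev] / s->total;
}
```

A device that completes three times as many write-outs as its neighbour ends up with three quarters of the limit, and a device that speeds up after a decay step regains share quickly because the old counts have been halved.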
It shouldn't be too hard to add some concept of total time to this.
If we track the number of write-outs per unit time and use that together
with a "target time for fsync" to scale the 'dirty_bytes' number, we
might be able to auto-tune the amount of dirty space to fit the speeds
of the drives.
We would probably start with each device having a very low "max dirty"
number which would cause writeouts to start soon. Once the device
demonstrates that it can do n-per-second (or whatever) the VM would
allow the "max dirty" number to drift upwards. I'm not sure how best
to get it to move downwards if the device slows down (or the kernel
over-estimated). Maybe it should regularly decay so that the device
keeps having to "prove" itself.
We would still leave the "dirty_ratio" as an upper-limit because we
don't want all of memory to be dirty (and 40% still sounds about
right). But we would now have a time-based value to set a more
realistic limit when there is enough memory to keep the devices busy
for multiple minutes.
Sorry, no code yet. But I think the idea is sound.
NeilBrown
--
I have not had this problem since I applied Arjan's (for some reason repeatedly rejected) patch to change the ioprio of the various writeback daemons. Under some loads changing to the noop I/O scheduler also seems to help.
If this is a VM problem, why does fixing the I/O priority of the various daemons seem to cure at least some of it?
Alan
--
"Give kjournald a IOPRIO_CLASS_RT io priority" October 2007 (yes its that old) And do the same as per discussion to the writeback tasks. Which isn't to say there are not also vm problems - look at the I/O patterns with any kernel after about 2.6.18/19 and there seems to be a serious problem with writeback from the mm and fs writes falling over each other and turning the smooth writeout into thrashing back and forth as both try to write out different bits of the same stuff. <Rant> Really someone needs to sit down and actually build a proper model of the VM behaviour in a tool like netlogo rather than continually keep adding ever more complex and thus unpredictable hacks to it. That way we might better understand what is occurring and why. </Rant> Alan --
thx. A more recent submission from Arjan would be:
http://lkml.org/lkml/2008/10/1/405
Resolution was that Tytso indicated it went into some sort of ext4
patch queue:
| I've ported the patch to the ext4 filesystem, and dropped it into
| the unstable portion of the ext4 patch queue.
|
| ext4: akpm's locking hack to fix locking delays
but 6 months down the line and i can find no trace of this upstream
anywhere.
<let-me-rant-too>
The thing is ... this is a _bad_ ext3 design bug affecting ext3
users in the last decade or so of ext3 existence. Why is this issue
not handled with the utmost high priority and why wasn't it fixed 5
years ago already? :-)
It does not matter whether we have extents or htrees when there are
_trivially reproducible_ basic usability problems with ext3.
Ingo
--
It's all there in that Oct 2008 thread.
The proposed tweak to kjournald is a bad fix - partly because it will
elevate the priority of vast amounts of IO whose priority we don't _want_
elevated.
But mainly because the problem lies elsewhere - in an area of contention
between the committing and running transactions which we knowingly and
reluctantly added to fix a bug in
commit 773fc4c63442fbd8237b4805627f6906143204a8
Author: akpm <akpm>
AuthorDate: Sun May 19 23:23:01 2002 +0000
Commit: akpm <akpm>
CommitDate: Sun May 19 23:23:01 2002 +0000
[PATCH] fix ext3 buffer-stealing
Patch from sct fixes a long-standing (I did it!) and rather complex
problem with ext3.
The problem is to do with buffers which are continually being dirtied
by an external agent. I had code in there (for easily-triggerable
livelock avoidance) which steals the buffer from checkpoint mode and
reattaches it to the running transaction. This violates ext3 ordering
requirements - it can permit journal space to be reclaimed before the
relevant data has really been written out.
Also, we do have to reliably get a lock on the buffer when moving it
between lists and inspecting its internal state. Otherwise a competing
read from the underlying block device can trigger an assertion failure,
and a competing write to the underlying block device can confuse ext3
journalling state completely.
was not a fix at all. It was a known-buggy hack which I proposed simply to
remove that contention point to let us find out if we're on the right
track. IIRC Ric was going to ask someone to do some performance testing of
that hack, but we never heard back.
The bottom line is that someone needs to do some serious rooting through
the very heart of JBD transaction logic and nobody has yet put their hand
up. If we do that, and it turns out to be just too hard to fix then yes,
perhaps that's the time to start looking at palliative ...
It's a huge improvement in practice because it both fixes the stupid stalls and smooths out the rest of the I/O traffic. I spend a lot of my time looking at what the disk driver is getting fed and it's not a good mix. Even more revealing is the noop scheduler and the fact this frequently outperforms all the fancy I/O scheduling we do even on relatively dumb hardware (as well as showing how mixed up our I/O
Which is all the more reason to use a temporary fix in the meantime so the OS is usable. I think it's pretty poor that for over a year those in the know who need a good performing system are having to apply out of tree trivial patches rejected on the basis that "eventually like maybe whenever perhaps we'll possibly some day you know consider fixing this, but don't hold your breath"
There is a second reason to do this: If ext4 is the future then it is far better to fix this stuff in ext4 properly and leave ext3 clear of extremely invasive high risk fixes when a quick bandaid will do just fine for the remaining lifetime of fs/jbd
Also note kjournald is only one of the afflicted threads - the same is true of the crypto, and of the vm writeback.
Also note the other point about the disk scheduler defaults being terrible for some streaming I/O patterns and the patch for that is also stuck in bugzilla. If picking "no-op" speeds up my generic x86 box with random onboard SATA we are doing something very non-optimal
--
Well, let's be clear here. The contention between committing and running transaction is an issue, but even if we solved this problem, it wouldn't solve the issue of fsync() taking a long time in ext3's data=ordered mode in the case of massive write starvation caused by a read-heavy workload, or a vast number of dirty buffers associated with an inode which is about to be committed, and a process triggers an fsync(). So fixing this issue wouldn't have solved the problem which Ingo complained about (which was an editor calling fsync() leading to long delay when saving a file during or right after a distcc-accelerated kernel compile) or the infamous Firefox 3.0 bug.
Fixing this contention *would* fix the problem where a normal process which is doing normal file I/O could end up getting stalled unnecessarily, but that's not what most people are complaining about --- and shortening the amount of time that it takes to do a commit (either with ext4's delayed allocation or ext3's data=writeback mount option) would also address this problem.
That doesn't mean that it's not worth it to fix this particular contention, but there are multiple issues going on here. (Basically we're here: http://www.kernel.org/pub/linux/kernel/people/paulmck/Confessions/FOSSElephant.html ... in Paul McKenney's version of the parable of the blind men and the elephant:
Ric did do some preliminary performance testing, and it wasn't encouraging. It's still in the unstable portion of the ext4 patch queue, and it's in my "wish I had more time to look at it; I don't get
I disagree that they are _just_ palliative bandaids, because you need these in order to make sure fsync() completes in a reasonable time, so that people like Ingo don't get cranky. :-) Fixing the contention between the running and committing transaction is a good thing, and I hope someone puts up their hand or I magically get the time I need to really dive into the jbd layer, but it won't help the Firefox 3.0 problem or Ingo's problem with ...
I've looked at this a bit. I suppose you mean the contention arising from us taking the buffer lock in do_get_write_access()? But it's not obvious to me why we'd be contending there... We call this function only for metadata buffers (unless in data=journal mode) so there isn't huge amount of these blocks. This buffer should be locked for a longer time only when we do writeout for checkpoint (hmm, maybe you meant this one?). In particular, note that we don't take the buffer lock when committing this block to journal - we lock only the BJ_IO buffer. But in this case we wait when the buffer is on BJ_Shadow list later so there is some contention in this case. Also when I emailed with a few people about these sync problems, they wrote that switching to data=writeback mode helps considerably so this would indicate that handling of ordered mode data buffers is causing most of the slowdown... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
There isn't a huge number of those blocks, but if inode #1220 was
modified in the previous transaction which is now being committed, and
we then need to modify and write out inode #1221 in the current
transaction, and they share the same inode table block, that would
cause the contention. That probably doesn't happen that often in a
synchronous code path, but it probably happens more often than you're
thinking. I still think the fsync() problem is the much bigger deal,
and solving the contention problem isn't going to solve the fsync()
Yes, but we need to be clear whether this was an fsync() problem or
some other random delay problem. If it's the fsync() problem,
obviously data=writeback will solve the fsync() latency delay problem.
(As will using delayed allocation in ext4 or XFS.)
- Ted
--
The fsync() problem is really annoying, but what is doubly annoying is that sometimes one process doing fsync() (or sync) seems to cause other processes to hiccup too.
Now, I personally solved that problem by moving to (good) SSD's on my desktop, and I think that's indeed the long-term solution. But it would be good to try to figure out a solution in the short term for people who don't have new hardware thrown at them from random companies too.
I suspect it's a combination of filesystem transaction locking, together with the VM wanting to write out some unrelated blocks or inodes due to the system just being close to the dirty limits. Which is why the system-wide hiccups then happen especially when writing big files.
The VM _tries_ to do writes in the background, but if the writepage() path hits a filesystem-level blocking lock, that background write suddenly becomes largely synchronous.
I suspect there is also some possibility of confusion with inter-file (false) metadata dependencies. If a filesystem were to think that the file size is metadata that should be journaled (in a single journal), and the journaling code then decides that it needs to do those meta-data updates in the correct order (ie the big file write _before_ the file write that wants to be fsync'ed), then the fsync() will be delayed by a totally irrelevant large file having to have its data written out (due to data=ordered or whatever).
I'd like to think that no filesystem designer would ever be that silly, but I'm too scared to try to actually go and check. Because I could well imagine that somebody really thought that "size" is metadata.
Linus
--
Bug #5942 (interaction with anticipatory io scheduler)
Bug #9546 (with reproducer & logs)
Bug #9911 including a rather natty tester (albeit in java)
Bug #7372 (some info and figures on certain revs it seemed to get worse)
Bug #12309 (more info, including kjournald hack fix using ioprio)
General consensus seems to be 2.6.18 is where the manure intersected with the air impeller
--
On Wed, Mar 25, 2009 at 10:29 AM, Linus Torvalds Throwing SSDs at it only increases the limit before which it becomes an issue. They hide the underlying issue and are only a workaround. Create enough dirty data and you'll get the same latencies, it's just that that limit is now a lot higher. Your Intel SSD will write streaming data 2-4 times faster than your typical disk - and can be an It certainly "feels" like that is the case from the workloads I have that generate high latencies. -Dave --
Don't even bother with streaming data. The problem is _never_ streaming data. Even a suck-ass laptop drive can write streaming data fast enough that people don't care. The problem is invariably that writes from different Umm. More like two orders of magnitude or more. Random writes on a disk (even a fast one) tends to be in the hundreds of kilobytes per second. Have you worked with an Intel SSD? It does tens of MB/s on pure random writes. The problem really is gone with an SSD. And please realize that the problem for me was never 30-second stalls. For me, a 3-second stall is unacceptable. It's just very annoying. Linus --
Actually, not just writes. The IO priority thing is almost certainly that _reads_ (which get higher priority by default due to being synchronous) get interspersed with the writes, and then even if you _could_ be having streaming writes, what you actually end up with is lots of seeking. Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC disk array that can eat 300MB/s when streaming - once you start seeking, that 300MB/s goes down like a rock. Battery-protected write caches will help - but not a whole lot when streaming more data than they have RAM. Basic queuing theory. Linus --
Subtly more complex than that. If your mashed up I/O streams fit into the 2GB or so of cache (minus one stream to disk) you win. You also win because you take a lot of fragmented OS I/O and turn it into bigger chunks of writing better scheduled. The latter win arguably shouldn't happen but it does occur (I guess in part that says we suck) and it occurs big time when you've got multiple accessors to a shared storage system (where the host OS's can't help) Alan --
The other thing that can impact random writes on arrays is their internal "track" size - if the random write is of a partial track, it forces a read-modify-write with a back end disk read. Some arrays have large internal tracks, others have smaller ones. Again, not unlike what you see with some SSD's and their erase block size - give them even multiples of that and they are quite happy. Ric --
This is actually not really true - random writes to an enterprise disk array will make your Intel SSD look slow. Effectively, they are extremely large, battery backed banks of DRAM with lots of fibre channel ports. Some of the bigger ones can have several hundred GB of DRAM and dozens of fibre channel ports to feed them. Of course, if your random writes exceed the cache capacity and you fall back to their internal disks (SSD or traditional), your random write speed will drop. Ric --
It's not just the file size; it's the block allocation decisions. Ext3 doesn't have delayed allocation, so as soon as you issue the write, we have to allocate the block, which means grabbing blocks and making changes to the block bitmap, and then updating the inode with those block allocation decisions. It's a lot more than just i_size. And the problem is that if we do this for the big file write, and the small file write happens to also touch the same inode table block and/or block allocation bitmap, then when we fsync() the small file, we end up pushing out the metadata updates associated with the big file write, and thus we need to flush out the data blocks associated with the big file write as well.
Now, there are three ways of solving this problem. One is to use delayed allocation, where we don't make the block allocation decisions until the very last minute. This is what ext4 and XFS do. The problem with this is that unrelated filesystem operations can end up causing zero-length files, either before the file write (i.e., replace-via-truncate, where the application does open/truncate/write/close) or after the file write (i.e., replace-via-rename, where the application does open/write/close/rename), when the application omits the fsync(). So with ext4 we have workarounds that start pushing out the data blocks for the replace-via-rename and replace-via-truncate cases, while XFS will do an implied fsync for replace-via-truncate only, and btrfs will do an implied fsync for replace-via-rename only.
The second solution is we could add a huge amount of machinery to try to track these logical dependencies, and then be able to "back out" the changes to the inode table or block allocation bitmap for the big file write when we want to fsync out the small file. This is roughly what the BSD Soft Updates mechanism does, and it works, but at the cost of a *huge* amount of complexity.
The amount of accounting data you have to track so that you can partially back ...
The XFS one and the ext4 one that I saw only start an _asynchronous_ writeout. Which is not an implied fsync but snake oil to make the most common complaints go away without providing hard guarantees. IFF we want to go down this route we should better provide strong guaranteed semantics and document the property. And of course implement
Note that the rename for atomic commits trick originated in mail servers which always did the proper fsync. When the word spread into the desktop world it looks like this wisdom got lost.
--
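For reference, the replace-via-rename sequence with "the proper fsync" looks roughly like this from userspace. It is a minimal sketch under stated assumptions: the helper name and its arguments are hypothetical, and fsync() of the containing directory (needed if the rename itself must be durable) is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace 'path' with 'len' bytes from 'buf' by writing
 * a temporary file, forcing it to disk, and renaming it over the target.
 * Returns 0 on success, -1 on failure (the old file is left untouched). */
int replace_file_atomically(const char *path, const char *tmp_path,
			    const void *buf, size_t len)
{
	const char *p = buf;
	int fd;

	assert(path && tmp_path && buf);

	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;

	/* write() may be short or interrupted, so loop until done. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Data must be on disk before the rename makes it visible. */
	if (fsync(fd) < 0)
		goto fail;

	/* close() can report errors too, so check it. */
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}

	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

Dropping the fsync() call turns this into the open/write/close/rename pattern discussed in this thread, which then relies on the filesystem to order the data before the rename.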
It actually does the right thing for ext4, because once we allocate the blocks, the default data=ordered mode means that we flush the datablocks before we execute the commit. Hence, in the case of open/write/close/rename, the rename will trigger an async writeout, but before the commit block is actually written, we'll have flushed out the data blocks.
I was under the impression that XFS was doing a synchronous fsync before allowing the close() to return, but if all it is doing is triggering an async writeout, then yes, your concern is correct. The bigger problem from my perspective is that XFS is only doing this for the truncate case, and (from what I've been told) not for the rename case. The truncate is fundamentally racy and application writers that don't do this definitely don't deserve our solicitude, IMHO. But people who do open/write/close/rename, and omit the fsync before the rename, are at least somewhat more deserving for some kind of workaround than the
That's something we should talk about at LSF. I'm not all that eager (or happy) about doing this, but I think that, given that the application writers massively outnumber us, we are going to be bullied
Yep, agreed. To be fair, though, one problem which Matthew Garrett has pointed out is that if lots of applications issue fsync(), it will have the tendency to wake up the hard drive a lot, and do a real number on power utilization. I believe the right solution for this is an extension to laptop mode which synchronizes the filesystem at a clean point, and then suppresses fsync()'s until the hard drive wakes up, at which point it should flush all dirty data to the drive, and then freeze writes to the disk again. Presumably that should be OK, because people who are using laptop mode are inherently trading off a certain amount of safety for power savings; but then other people who want to run a mysql server on a laptop get cranky, and then if we start implementing ways that applications can exempt themselves from ...
I disagree with this approach. If fsync() means anything other than "Get my data on disk and then return" then we're breaking guarantees to applications. The problem is that you're insisting that the only way applications can ensure that their requests occur in order is to use fsync(), which will achieve that but also provides guarantees above and beyond what the majority of applications want. I've done some benchmarking now and I'm actually fairly happy with the behaviour of ext4 now - it seems that the real world impact of doing the block allocation at rename time isn't that significant, and if that's the only practical way to ensure ordering guarantees in ext4 then fine. But given that, I don't think there's any reason to try to convince application authors to use fsync() more. -- Matthew Garrett | mjg59@srcf.ucam.org --
Due to lack of storage dev writeback cache flushing, we are indeed That remains a true statement... without the *sync* syscalls, you still do not have a _guarantee_ writes occur in a certain order. Jeff --
The interesting case is whether data hits disk before metadata when renaming over the top of an existing file, which appears to be guaranteed in the default ext4 configuration now? I'm sure there are filesystems where this isn't the case, but that's mostly just an argument that it's not sensible to use those filesystems if your system's at any risk of crashing. -- Matthew Garrett | mjg59@srcf.ucam.org --
Then you have just reinvented the transactional userspace API that people often want to replace POSIX API with. Maybe one day they will succeed. But "POSIX API replacement" is an area never short of proposals... :) Jeff --
Well, I think the goal is not to *replace* the POSIX API or even provide "transactional" guarantees. The performance penalty for atomic transactions is pretty high, and most programs (like GIT) don't really give a damn, as they provide that on a higher level. It's like the difference between a modern SMP system that supports memory barriers and write snooping and one of the theoretical "transactional memory" designs that have never caught on. To be honest I think we could provide much better data consistency guarantees and remove a lot of fsync() calls with just a basic per-filesystem barrier() call. Cheers, Kyle Moffett --
Speaking with my 'git' hat on, I can tell that
- git was designed to have almost minimal requirements from the filesystem, and to not do anything even half-way clever.
- despite that, we've hit an absolute metric sh*tload of filesystem bugs and misfeatures. Some very much in Linux. And some I bet git was the first to ever notice, exactly because git tries to be really anal, in ways that I can pretty much guarantee no normal program _ever_ is.
For example, the latest one came from git actually checking the error code from 'close()'. Tell me the last time you saw anybody do that in a real program. Hint: it's just not done. EVER. Git does it (and even then, git does it only for the core git object files that we care about so much), and we found a real data-loss CIFS bug thanks to that. Afaik, the bug has been there for a year and half. Don't tell me nobody uses cifs.
Before that, we had cross-directory rename bugs. Or the inexplicable "pread() doesn't work correctly on HP-UX". Or the "readdir() returns the same entry multiple times" bug. And all of this without ever doing anything even _remotely_ odd. No file locking, no rewriting of old files, no lseek()ing in directories, no nothing.
Anybody who wants more complex and subtle filesystem interfaces is just crazy. Not only will they never get used, they'll definitely not be
The problem is not that we have a lot of fsync() calls. Quite the reverse. fsync() is really really rare. So is being careful in general. The number of applications that do even the _minimal_ safety-net of "create new file, rename it atomically over an old one" is basically zero. Almost everybody ends up rewriting files with something like
	open(name, O_CREAT | O_TRUNC, 0666)
	write();
	close();
where there isn't an fsync in sight, nor any "create temp file", nor likely even any real error checking on the write(), much less the close().
And if we have a Linux-specific magic system call or sync action, ...
From: Linus Torvalds <torvalds@linux-foundation.org> Emacs does it too, and I know that you consider GNU emacs to be the definition of abnormal :-) That's how we found some misbehaviors in NFS a while ago, we used to return -EAGAIN or something like that from close() on NFS files. This was like 12 years ago and it gave emacs massive heartburn. --
On Wed, Mar 25, 2009 at 11:40 PM, Linus Torvalds
Really, I think virtually all of the database programs would be perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd" points to a regular file or directory then it instructs the underlying filesystem to do whatever internal barrier it supports, and if not just fail with -ENOTSUPP (so you can fall back to fdatasync(), etc). Perhaps "flags" would allow a "data" or "metadata" barrier, but if not it's not a big issue.
I've ended up having to write a fair amount of high-performance filesystem library code which almost never ends up using fsync() quite simply because the performance on it sucks so badly. This is one of the big reasons why so many critical database programs use O_DIRECT and reinvent the wheel^H^H^H^H^H^H pagecache. The only way you can actually use it in high-bandwidth transaction applications is by doing your own IO-thread and buffering system.
You have to have your own buffer ordering dependencies and call fdatasync() or fsync() from individual threads in-between specific ordered IOs. The threading helps you keep other IO in flight while waiting for the flush to finish. For big databases on spinning media (SSDs don't work precisely because they are small and your databases are big) the overhead of a full flush may still be too large. Even with SSDs, with multiple processes vying for IO bandwidth you still want some kind of application-level barrier to avoid introducing bubbles in your IO pipeline.
It all comes down to a trivial calculation: if you can't get (bandwidth * latency-to-stable-storage) bytes of data queued *behind* a flush then your disk is going to sit idle waiting for more data after completing it. If a user-level tool needs to enforce ordering between IOs the only tool right now is a full flush; when database-oriented tools can use a barrier()-ish call instead, they can issue the op and immediately resume keeping the IO queues full.
Cheers,
Kyle Moffett
--
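The "trivial calculation" above is a bandwidth-delay product. A small sketch in C, where the function name and the example figures (100 MB/s, 20 ms) are assumptions picked only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must already be queued behind a flush so the device does not
 * go idle once the flush completes: bandwidth * latency-to-stable-storage.
 * Bandwidth is in bytes per second, latency in microseconds. */
uint64_t bytes_to_queue_behind_flush(uint64_t bandwidth_bytes_per_sec,
				     uint64_t flush_latency_usec)
{
	/* Refuse inputs whose product would overflow 64 bits. */
	assert(flush_latency_usec == 0 ||
	       bandwidth_bytes_per_sec <= UINT64_MAX / flush_latency_usec);
	return bandwidth_bytes_per_sec * flush_latency_usec / 1000000;
}
```

For a hypothetical device streaming 100,000,000 bytes per second with a 20,000 microsecond flush, the function returns 2,000,000: about 2 MB has to be waiting behind every flush to keep the queue full.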
The issue is that sync_file_range doesn't seem to be documented to have any inter-file barrier semantics. Even then, from the manpage it doesn't look like write(fd)+sync_file_range(fd,SYNC_FILE_RANGE_WRITE)+write(fd) would actually prevent the second write from occurring before the first has actually hit disk (assuming both are within the specified range). Cheers, Kyle Moffett --
That's an option, but what would benefit? If rename is expected to preserve ordering (which I think it has to, in order to avoid breaking existing code) then are there any other interesting use cases? -- Matthew Garrett | mjg59@srcf.ucam.org --
The use cases would be programs like GIT (or any other kind of database) where you want to ensure that your new pulled packfile has fully hit disk before the ref update does. If that ordering constraint is applied, then we don't really care when we crash, because either we have a partial packfile update (and we have to pull again) or we have the whole thing. The rename() barrier would ensure that we either have the old ref or the new ref, but it would not check to ensure that the whole packfile is on disk yet. I would imagine that databases like MySQL could also use such support to help speed up their database transaction support, instead of having to run a bunch of threads which fsync() and buffer data internally. Cheers, Kyle Moffett --
You seem to disregard the "write in the right order" approach. Or is that Yes. but at least one problem is, as mentioned, that when the VM calls writepage[s]() to start async writeback, many filesystems do seem to just _block_ on it. So the VM has a really hard time doing anything sanely early - the filesystems seem to take a perverse pleasure in synchronizing things using blocking semaphores. Linus --
Um, no, ext3 shouldn't block on writepage(). Since it doesn't do delayed allocation, it should always be able to push out a dirty page to the disk. - Ted --
Umm. Maybe I'm mis-reading something, but they seem to all synchronize with the journal with "ext3_journal_start/stop".
Which will at a minimum wait for 'j_barrier_count == 0' and 't_state != T_LOCKED'. Along with making sure that there are enough transaction buffers.
Do I understand _why_ ext3 does that? Hell no. The code makes no sense to me. But I don't think I'm wrong.
Look at the sane case (data=ordered): it still does
	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	...
	err = ext3_journal_stop(handle);
around all the IO starting. Never mind that the IO shouldn't be needing any journal activity at all afaik in any common case.
Yes, yes, it may need to allocate backing store (a page that was dirtied by mmap), and I'm sure that's the reason for it all, but the point is, most of the time there should be no journal activity at all, yet it looks very much like a simple writepage() will synchronize with a full journal and wait for the journal to get space.
No?
So tell me again how the VM can rely on the filesystem not blocking at random points.
Linus
--
Yes, you got it right. Furthermore in ordered mode we need to attach buffers to the running transaction if they aren't there (but for checking whether they are we need to pin the running transaction and we are basically where we started.. damn). But maybe there's a way out of it. We don't have to guarantee data written via mmap are on disk when "the transaction running when somebody decided to call writepage" commits (in case no block allocation happen) and so we could just submit those buffers I can write a patch to make writepage() in the non-"mmapped creation" case non-blocking on journal. But I'll also have to find out whether it really helps something. But it's probably worth trying... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Actually, it really should be easier to make a patch that just does the journal thing if ->set_page_dirty() is called, and buffers weren't already allocated.
Then ext3_[ordered|writeback]_writepage() _should_ just become something like
	if (test_opt(inode->i_sb, NOBH))
		return nobh_writepage(page, ext3_get_block, wbc);
	return block_write_full_page(page, ext3_get_block, wbc);
and that's it. The code would be simpler to understand to boot.
Linus
--
_all_ the problems i ever had with ext3 were 'collateral damage'
type of things: simple writes (sometimes even reads) getting
serialized on some large [but reasonable] dirtying activity
elsewhere - even if the system was still well within its
hard-dirty-limit threshold.
So it sure sounds like an area worth improving, and it's not that
hard to reproduce either. Take a system with enough RAM but only a
single disk, and do this in a kernel tree:
sync
echo 3 > /proc/sys/vm/drop_caches
while :; do
date
make mrproper 2>/dev/null >/dev/null
make defconfig 2>/dev/null >/dev/null
make -j32 bzImage 2>/dev/null >/dev/null
done &
Plain old kernel build, no distcc and no icecream. Wait a few
minutes for the system to reach equilibrium. There's no tweaking
anywhere, kernel, distro and filesystem defaults used everywhere:
aldebaran:/home/mingo/linux/linux> ./compile-test
Thu Mar 26 10:33:03 CET 2009
Thu Mar 26 10:35:24 CET 2009
Thu Mar 26 10:36:48 CET 2009
Thu Mar 26 10:38:54 CET 2009
Thu Mar 26 10:41:22 CET 2009
Thu Mar 26 10:43:41 CET 2009
Thu Mar 26 10:46:02 CET 2009
Thu Mar 26 10:48:28 CET 2009
And try to use the system while this workload is going on. Use Vim
to edit files in this kernel tree. Use plain _cat_ - and i hit
delays all the time - and it's not the CPU scheduler but all IO
related.
I have such an ext3 based system where i can do such tests and where
i don't mind crashes and data corruption either, so if you send me
experimental patches against latest -git i can try them immediately.
The system has 16 CPUs, 12GB of RAM and a single disk.
Btw., i had this test going on that box while i wrote some simple
scripts in Vim - and it was a horrible experience. The worst wait
was well above one minute - Vim just hung there indefinitely. Not
even Ctrl-Z was possible. I captured one such wait, it was hanging
right here:
aldebaran:~/linux/linux> cat /proc/3742/stack
[<ffffffff8034790a>] ...
It happened when i tried to Ctrl-C the compile job as well:
Thu Mar 26 11:04:05 CET 2009
Thu Mar 26 11:06:30 CET 2009
^CThu Mar 26 11:07:55 CET 2009
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^Caldebaran:/home/mingo/linux/linux>
a single Ctrl-C is rarely enough to stop busy Bash shell scripts on
Linux. Why?
Ingo
--
Did you capture the trace of the long delays in the read test case? It can be two things, at least. One is that each little read takes much longer than it should, the other is that we get stuck waiting on a dirty page and hence that slows down the reads a lot. -- Jens Axboe --
On Thu, 26 Mar 2009 12:08:15 +0100 would be interesting to run latencytop during such runs... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Things just drown in the noise in there sometimes, is my experience... And disappear. Perhaps I'm just not very good at reading the output, but it was never very useful for me. I know, not a very useful complaint, I'll try and use it again and come up with something more productive :-) But in this case, I bet it's also the atime updates. If Ingo turned those off, the read results would likely be a lot better (and consistent). -- Jens Axboe --
On Thu, 26 Mar 2009 15:36:18 +0100 -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Ingo,
Interesting. I wonder if the problem is the journal is cycling fast
enough that it is checkpointing all the time. If so, it could be that
a bigger-sized journal might help. Can you try this as an experiment?
Mount the filesystem using ext4, with the mount option nodelalloc.
With a filesystem formatted as ext3, and with delayed allocation
disabled, it should behave mostly the same as ext3; try and make sure
you're still seeing the same problems.
Then could you grab /proc/fs/jbd2/<dev>:8/history and
/proc/fs/jbd2/<dev>:8/info while running your test workload?
^^
So there would have been nothing to ^C; I assume you were running this
with a variant that didn't have the ampersand, which would have run
the whole shell pipeline in a detached background process?
In any case, the workaround for this is to ^Z the script, and then
"kill %" it.
I'm pretty sure this is actually a bash problem. When you send a
Ctrl-C, it sends a SIGINT to all of the members of the tty's
foreground process group. Under some circumstances, bash sets the
signal handler for SIGINT to be SIG_IGN. I haven't looked at this
super closely (it would require diving into the bash sources), but you
can see it if you attach an strace to the bash shell driving a script
such as
#!/bin/bash
while /bin/true; do
date
sleep 60
done &
If you do a "ps axo pid,ppid,pgrp,args", you'll see that the bash and
the sleep 60 have the same process group. If you emulate hitting ^C
by sending a SIGINT to pid of the shell, you'll see that it ignores
it. Sleep also seems to be ignoring the SIGINT when run in the
background; but it does honor SIGINT in the foreground --- I didn't
have time to dig into that.
In any case, bash appears to SIG_IGN the INT signal if there is a child
process running, and only takes the ^C if bash itself is actually
"running" the shell script. For example, if you run the command
"date;sleep 10;date;sleep 10;date", the ^C only interrupts the sleep
command. ...That was just the example - the real script did not go into the It happens all the time - and it does look like a Bash bug. I reported it to the Bash maintainer one or two years ago. He said he does not see it as he's using MacOS X. Can dig into archives if needed. Ingo
Sure:
[root@aldebaran ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             40313964  11595520  26670560  31% /
/dev/sda2            403160732  35165460 347515212  10% /home
tmpfs                  6159096        48   6159048   1% /dev/shm
[root@aldebaran ~]# dumpe2fs -h /dev/sda2 | grep Journal
dumpe2fs 1.40.8 (13-Mar-2008)
Journal inode:            8
Journal backup:           inode blocks
Journal size:             128M
Stock Fedora 9 release/install, updated, and booted to 2.6.29.
Ingo
--
i tried it:
/dev/sda2 on /home type ext4 (rw,nodelalloc)
I still see similarly bad latencies in Vim:
aldebaran:~> cat /proc/10227/stack
[<ffffffff80370cad>] jbd2_log_wait_commit+0xbd/0x110
[<ffffffff8036bc70>] jbd2_journal_stop+0x1f3/0x221
[<ffffffff8036ccb0>] jbd2_journal_force_commit+0x28/0x2c
[<ffffffff80352660>] ext4_force_commit+0x2e/0x34
[<ffffffff80346682>] ext4_write_inode+0x3e/0x44
[<ffffffff802eb941>] __sync_single_inode+0xc1/0x2ad
[<ffffffff802ebc7a>] __writeback_single_inode+0x14d/0x15a
[<ffffffff802ebcb0>] sync_inode+0x29/0x34
[<ffffffff80343e16>] ext4_sync_file+0xf6/0x138
[<ffffffff802eef21>] vfs_fsync+0x78/0xaf
[<ffffffff802eef8f>] do_fsync+0x37/0x4d
[<ffffffff802eefcc>] sys_fsync+0x10/0x14
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Vim is still almost unusable during this workload - even if i dont
write out the source file just use it interactively to edit it.
The read-test is somewhat better. There are occasional blips of 4-5
seconds:
file # 928 (253560 bytes), reading it took: 0.76 seconds.
file # 929 (253560 bytes), reading it took: 3.98 seconds.
file # 930 (253560 bytes), reading it took: 3.45 seconds.
file # 931 (253560 bytes), reading it took: 0.04 seconds.
I have also written a 'vim open' test which does vim -c q, i.e. it
just opens a source file and closes it without writing the file.
That too takes a lot of time:
file # 0 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 1 (253560 bytes), Vim-opening it took: 2.39 seconds.
file # 2 (253560 bytes), Vim-opening it took: 2.03 seconds.
file # 3 (253560 bytes), Vim-opening it took: 2.81 seconds.
file # 4 (253560 bytes), Vim-opening it took: 2.11 seconds.
file # 5 (253560 bytes), Vim-opening it took: 2.44 seconds.
file # 6 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 7 (253560 bytes), Vim-opening it took: 3.59 seconds.
file # 8 (253560 bytes), Vim-opening it took: 2.06 seconds.
file # ...
grabbed them after the tests. (not much else ran before and after the
tests)
Here they are:
R/C tid     wait  run    lock  flush  log   hndls   block inlog ctime write drop close
R   642267  0     5000   0     84     3288  76104   560   562
R   642268  0     5000   956   2024   5760  68891   491   493
R   642269  956   7788   8     5104   6696  182270  667   669
R   642270  8     11800  216   7000   6816  186159  834   837
R   642271  60    13816  0     0      492   45115   2162  2169
R   642272  0     5000   0     0      2144  44278   1266  1270
R   642273  0     5000   0     80     3144  73604   444   446
R   642274  0     5000   0     276    3120  71741   488   490
R   642275  0     5000   0     288    3608  87334   526   528
R   642276  0     5000   0     112    2992  83061   512   514
R   642277  0     5000   0     84     5892  75029   468   470
R   642278  0     5976   0     848    8564  71693   483   485
R   642279  0     9412   340   5432   7104  167415  664   666
R   642280  340   12536  0     8764   3820  270409  906   909
R   642281  0     12584  0     0      576   38603   2175  2182
R   642282  0     5000   0     16     2620  51638   1275  1279
R   642283  0     5000   0     16     2364  58962   376   378
R   642284  0     5000   0     56     2812  66644   442   444
R   642285  0     5000   0     64     2744  61323   479   481
R   642286  0     5000   0     16     2328  61109   439   441
R   642287  0     5000   0     40     2752  69227   471   473
R   642288  0     5000   0     20     2536  60836   454   456
R   642289  0     5000   0     16     2612  63580   440   442
R   642290  0     5000   0     48     2528  72629   463   465
R   642291  0     5000   0     68     2848  75262   498   500
R   642292  0     5000   0     60     2688  77164   468   470
R   642293  0     5000   0     0      2188  60922   458   460
R   642294  0     5000   0     348    3124  79928   528   530
R   642295  0     5100   0     1896   3128  62695   672   674
R   642296  0     5024   0     8      4840  17110   90    91
...
That would have been a non-backward-compatible change. --
I assume Ingo means the patches to make relatime update atime at least once per day to ensure better compatibility with apps that do use or rely on access times. These patches are already being included by several distros and, FWIW, Debian would like to see them upstream as well because we feel . They were last submitted by Matthew Garrett: http://lkml.org/lkml/2008/11/27/234 http://lkml.org/lkml/2008/11/27/235 Loads of people seem to want this, but even though it's been submitted at least twice and discussed even more often, it never gets anywhere. Cheers, FJP --
Hard-wiring a 24-hour interval into the core VFS for all mounted filesystems is dumb. I (and others) pointed out that it would be better to implement this as a mount option. That suggestion was met with varying sillinesses and that is where things stand. --
Umm. I generally agree with the "leave policy to user space" people, but this is an area where (a) user space has shown itself to not get it right (ie people don't do even the existing relatime because distros don't) and (b) I'd suggest first just doing the 24 hour thing, and then, IF user space actually ever gets its act together, and people care, and they _ask_ for a mount option, that's when it's worth doing. Linus --
I thought at least some distro's were adding relatime by default; I could be wrong, but I thought Ubuntu was doing this. Personally, I actually think that if we're going to give up on POSIX, I'll go all the way to noatime since it helps even more. I've always thought the right approach would be to have a "atime dirty" flag, and update atime, but never flush it out to disk unless (a) we're about to unmount the disk, or (b) we need to update some other inode in the same inode table block, or (c) we have memory pressure and we're trying to evict the inode from the inode cache. That way we get full POSIX compliance, without taking the I/O hit of atime updates. The atime updates get lost if we crash, but that's allowed by POSIX, and most people don't care about losing atime updates after a crash. Since it's fully backwards (and POSIX) compatible, there would no question about enabling it by default. - Ted --
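The "atime dirty" idea above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: the struct, the enum and both function names are hypothetical, and the three flush reasons simply mirror cases (a), (b) and (c) from the message. The point it shows is that reads only touch memory, and the on-disk value catches up when one of those events happens anyway.

```c
#include <time.h>

/* Hypothetical flush triggers, mirroring cases (a)-(c) in the message. */
enum model_flush_reason {
    MODEL_UNMOUNT,         /* (a) filesystem is being unmounted */
    MODEL_NEIGHBOR_WRITE,  /* (b) same inode table block is written anyway */
    MODEL_EVICT            /* (c) inode is evicted under memory pressure */
};

/* Hypothetical stand-in for an in-core inode. */
struct model_inode {
    time_t atime;       /* in-memory value, always current */
    time_t disk_atime;  /* value that would survive a crash */
    int atime_dirty;    /* set when atime is newer than disk_atime */
    int writes;         /* number of simulated disk writes */
};

/* Record an access: update memory only, issue no I/O. */
void model_touch_atime(struct model_inode *ino, time_t now)
{
    ino->atime = now;
    ino->atime_dirty = 1;
}

/* Write atime back if it is dirty. Returns 1 if a write was issued. */
int model_flush_atime(struct model_inode *ino, enum model_flush_reason why)
{
    (void)why;  /* all three reasons flush; kept to document the caller */
    if (!ino->atime_dirty)
        return 0;
    ino->disk_atime = ino->atime;
    ino->atime_dirty = 0;
    ino->writes++;
    return 1;
}
```

In this model any number of reads between two flushes costs a single write, and a crash before the flush loses only the atime updates, which is the trade-off the message describes.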
Yes. -- Jose Celestino | http://japc.uncovering.org/files/japc-pgpkey.asc ---------------------------------------------------------------- "One man’s theology is another man’s belly laugh." -- Robert A. Heinlein --
No, not the "pure" relatime that's in the upstream kernel. And that's the whole point here. See my direct reply to Ted. Cheers, FJP --
I tried to do that a few years ago (ok, probably more than a few by now). It was surprisingly hard. Some of it is absolutely trivial: we already have multiple "dirty" flags for the inode (I_DIRTY_SYNC vs I_DIRTY_DATASYNC vs I_DIRTY_PAGES). Adding a I_DIRTY_ATIME bit for unimportant data was trivial. But at least back then, "sync_inode()" (or whatever) was called without the reason for doing the sync, so it was really hard to decide whether to write things out or not. That may actually have changed these days. We now have that "writeback_control" thing that we pass around for all the IO. Heh. I just looked back in the history. That writeback_control thing was added back in 2002, so it's a _really_ long time since I tried to do that whole atime thing. Maybe it's really easy these days. Linus --
They indeed do have relatime by default, but *only because* they have the additional patch with the 24 hour limit in their kernel [1]. The same is true for Fedora IIUC. Debian would like to activate it by default as well for new installations, but has so far been blocked from doing that because our kernel team has a policy of not including patches that are not upstream (or at least, in the process of being included upstream). And the Debian Installer team has so far felt that it would be irresponsible of activating it by default without this safeguard. Cheers, FJP [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199427 linux (2.6.24-12.18) hardy; urgency=low [...] * build/configs: Enable relatime config option for all flavors --
We wouldn't normally just enable the new feature by default because it changes kernel behaviour. Userspace needs to be changed in some manner to opt-in. One way it's `mount -o remount', the other way it's a poke in /proc. --
What change are you talking about here exactly? The "change relatime to have a 24 hour safeguard" of Matthew's first patch or the "enable relatime by default" options in the second patch? For the first I don't think it's that big a deal as it is a change that makes the behavior of relatime safer and not riskier. Also, it's something people have argued should have been part of the initial functionality of relatime (it was part of the discussion back then), and finally for a lot of users it's already current functionality as major distros already do include the patch. For the second, I can see your point and can understand reservations to make enabling relatime a kernel config option. Speaking exclusively for myself, I would be happy enough if only the first of Matthew's patches would get accepted. Cheers, FJP --
Oh, the feature itself is desirable. But the interface isn't. - It's a magic number. Maybe someone runs tmpwatch twice per day, or weekly, or... - That's fixable by making "24" tunable, but it's still a global thing. Better to make it per-fs. - mount(8) is the standard way of tuning fs behaviour. There's no need to deviate from that here. Note that none of this involves the default setting. With a per-mount tunable we can still make the default for each fs be "on, 24 hours" if we so decide. --
Patches welcome. When did we adopt a mindset that led to code having to satisfy every single user requirement before being accepted, rather than being happy with code that provides an incremental improvement over what exists already? If there are actually users who want to be able to tune this per filesystem then I'm sure someone (possibly even me) will write code to support them, but right now it just sounds like features for the sake of some sense of aesthetic correctness. -- Matthew Garrett | mjg59@srcf.ucam.org --
Shortcomings have been identified. Weaselly verbiage is not a suitable way of addressing shortcomings! Yes, we could (and do) merge things as a halfway step. But when the features are visible to userspace we just can't do that - we have to get the interface right on day one, because interfaces are for ever. A hard-wired global 24-hours constant is in no way superior to a per-mount tunable. If we're going to do this we should do it in the best way we know, and we certainly should not lock ourselves into the inferior implementation for all time by exposing it to userspace. --
What shortcomings? So far we have a hypothetical complaint that some users will want to choose a different value. Right now they have the choice of continuing to not use relatime. Things are no worse for them I don't claim that it's superior, merely that it deals with all the use cases I've had to worry about and so is good enough. If it turns out that there are people in the real world who need the better version then I can write that code, but I'm not going to while it's a hypothetical. -- Matthew Garrett | mjg59@srcf.ucam.org --
It seems to me that, rather than having the kernel maintain a timer (or multiple timers, one per mount) itself, it would make sense to have entries in /sys which, when written to, cause the file system layer to flush all atime data to the mounted volume. Something like /sys /sys/atime /sys/atime/all /sys/atime/<mountpoint id>/flush where <mountpoint id> would be the name of the file system (e.g. /sys/atime/usr/flush). The only sticky part would be how to describe "/" in such a system. (Better still would be a /sys/ system for each file system with the various parameters (e.g. uid, journal) as entries + an entry for flushing atime, but that is beyond the scope of this discussion.) That would truly let userspace set policy, while the kernel provides mechanism. Thus, a script that depends upon atime being accurate could simply tickle the sysfs entries as needed before running. --
Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..057c92b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1179,6 +1179,40 @@ sector_t bmap(struct inode * inode, sector_t block)
}
EXPORT_SYMBOL(bmap);
+/*
+ * With relative atime, only update atime if the previous atime is
+ * earlier than either the ctime or mtime or if at least a day has
+ * passed since the last atime update.
+ */
+static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
+ struct timespec now)
+{
+
+ if (!(mnt->mnt_flags & MNT_RELATIME))
+ return 1;
+ /*
+ * Is mtime younger than atime? If yes, update atime:
+ */
+ if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+ return 1;
+ /*
+ * Is ctime younger than atime? If yes, update atime:
+ */
+ if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+ return 1;
+
+ /*
+ * Is the previous atime value older than a day? If yes,
+ * update atime:
+ */
+ if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
+ return 1;
+ /*
+ * Good, we can skip the atime update:
+ */
+ return 0;
+}
+
/**
* touch_atime - update the access time
* @mnt: mount the inode is accessed on
@@ -1206,17 +1240,12 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
goto out;
- if (mnt->mnt_flags & MNT_RELATIME) {
- /*
- * With relative atime, only update atime if the previous
- * atime is earlier than either the ctime or mtime.
- */
- if ...
Good example of overcommented code.
--
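For readers who want to poke at the decision logic of relatime_need_update() outside the kernel, here is a minimal userspace sketch. It assumes whole-second timestamps instead of struct timespec, and the struct and function names are made up for the illustration; only the order and direction of the comparisons are taken from the patch above.

```c
#include <time.h>

#define MODEL_DAY_SECONDS (24 * 60 * 60)

/* Hypothetical stand-in for the three inode timestamps the patch reads. */
struct model_times {
    time_t atime_sec;
    time_t mtime_sec;
    time_t ctime_sec;
};

/*
 * Returns 1 when atime should be updated, 0 when the update can be
 * skipped. 'relatime' models the MNT_RELATIME mount flag.
 */
int model_relatime_need_update(int relatime, const struct model_times *t,
                               time_t now)
{
    if (!relatime)
        return 1;                        /* plain atime: always update */
    if (t->mtime_sec >= t->atime_sec)
        return 1;                        /* modified since last access */
    if (t->ctime_sec >= t->atime_sec)
        return 1;                        /* changed since last access */
    if ((long)(now - t->atime_sec) >= MODEL_DAY_SECONDS)
        return 1;                        /* atime is at least a day old */
    return 0;                            /* safe to skip */
}
```

With atime at 1000 and mtime/ctime at 500, a read one second later is skipped, while a read exactly 24 hours later updates atime again; that once-a-day refresh is what keeps tools like tmpreaper working.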
On Thu, 26 Mar 2009 17:32:14 +0000 And while I think forcing relatime on is a really dumb dangerous idea, providing it so you can enable it (or distro new releases can for new installs etc) is a *very good* one --
The relatime patches are upstream. Both noatime and relatime are handled at the VFS layer, not at the per-filesystem level. The reason why it isn't the default is because of a desire for POSIX compliance, I suspect. Most distributions are putting relatime into /etc/fstab by default, but we haven't changed the mount option. It wouldn't be hard to add an "atime" option to turn on atime updates, and make either "noatime" or "relatime" the default. This is a simple patch to No argument here. I use noatime, myself. It actually saves a lot more than relatime, and unless you are using mutt with local Maildir delivery, relatime isn't really that helpful, and the benefit of noatime is roughly double that of relatime vs normal atime update, in my measurements: http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/ - Ted --
I don't think this is true. Fedora certainly does not. Not in F10, not in F11. And quite frankly, even if you then _manually_ put 'relatime' in /etc/fstab, the default Fedora install will totally ignore it. Why? Because it mounts the root partition while using initrd, and totally ignores /etc/fstab. In other words, not only do distributions not do it, but you can't even do it by hand afterwards the sane way in the most common distro! There really is reason for the kernel to just say "user space has sh*t for brains, and we'd better change the default - and if some distro really _thinks_ about it, and decides that they really want old-fashioned atime, let them do that". Because right now, I do not believe for a moment that any distro that defaults to "atime" has spent lots of effort thinking about it. Quite the reverse. They probably default to "atime" because they spent no time AT I do agree that "noatime" is better, but with "relatime" you at least are likely to not break anything. A program has to be _really_ odd to care about the "relatime" vs "atime" behavior. Linus --
That works here in openSUSE 11.1. The initrd remounts the rootfs with any options it finds in /etc/fstab. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." --
You can, actually, but it requires hacking /boot/grub/menu.list. The boot command option "rootflags=noatime" should do it, if their initrd scripts are at all sane (and they honor rootfstype, so they probably do also honor rootflags). The question is whether we can make Fedora 11 and OpenSUSE do the right thing now that this has become a highly visible discussion. I'm actually fairly optimistic on this front. (Maybe some distro folks will care to chime in on whether upcoming releases of F11 and OpenSuSE can be changed to DTRT?) Actually, given where F11 is on its release schedule, I suspect it would be *easier* for them to make a change to default boot options in grub's menu.conf than it would be backport a kernel patch, since they will be releasing their beta release within the week, and their final development freeze is in less than two weeks. - Ted --
Not when I tried it. It just causes the initrd to be mounted noatime, and then the real root filesystem gets mounted atime again. And what's the argument for not doing it in the kernel? The fact is, "atime" by default is just wrong. Linus --
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
---
fs/namespace.c | 6 +++++-
include/linux/fs.h | 1 +
include/linux/mount.h | 1 +
3 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 06f8e63..d0659ec 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -780,6 +780,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+ { MNT_STRICTATIME, ",strictatime" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
@@ -1932,11 +1933,14 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+ if (flags & MS_STRICTATIME)
+ mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
- MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
+ MS_STRICTATIME);
/* ... and get the mountpoint */
retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..5bc81c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -141,6 +141,7 @@ struct inodes_stat_t {
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51f55f9 100644
--- ...
Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
---
fs/namespace.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index d0659ec..f0e7530 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1920,6 +1920,9 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
if (data_page)
((char *)data_page)[PAGE_SIZE - 1] = 0;
+ /* Default to relatime */
+ mnt_flags |= MNT_RELATIME;
+
/* Separate the per-mountpoint flags */
if (flags & MS_NOSUID)
mnt_flags |= MNT_NOSUID;
@@ -1931,8 +1934,6 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags |= MNT_NOATIME;
if (flags & MS_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;
- if (flags & MS_RELATIME)
- mnt_flags |= MNT_RELATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
--
Matthew Garrett | mjg59@srcf.ucam.org
--
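To see how the atime-related mount flags combine once relatime is the default, the flag handling from the do_mount() hunks above can be modelled in userspace. This is a sketch under assumptions: the MODEL_* bit values are invented for the example (the real MS_* and MNT_* constants live in include/linux/fs.h and include/linux/mount.h), and the function name is hypothetical. Only the order of the tests follows the patch.

```c
/* Illustrative request bits passed in by mount(2); values are made up. */
#define MODEL_MS_NOATIME      (1UL << 0)
#define MODEL_MS_NODIRATIME   (1UL << 1)
#define MODEL_MS_STRICTATIME  (1UL << 2)

/* Illustrative per-mount bits kept by the VFS; values are made up. */
#define MODEL_MNT_NOATIME     0x1u
#define MODEL_MNT_NODIRATIME  0x2u
#define MODEL_MNT_RELATIME    0x4u

/* Translate mount(2)-style request bits into per-mount atime flags. */
unsigned int model_resolve_atime_flags(unsigned long ms_flags)
{
    unsigned int mnt_flags = MODEL_MNT_RELATIME;  /* default to relatime */

    if (ms_flags & MODEL_MS_NOATIME)
        mnt_flags |= MODEL_MNT_NOATIME;
    if (ms_flags & MODEL_MS_NODIRATIME)
        mnt_flags |= MODEL_MNT_NODIRATIME;
    /* strictatime is tested last, so it overrides noatime and relatime */
    if (ms_flags & MODEL_MS_STRICTATIME)
        mnt_flags &= ~(MODEL_MNT_RELATIME | MODEL_MNT_NOATIME);
    return mnt_flags;
}
```

Two consequences fall out of the ordering: a mount with no atime options ends up with the relatime bit set, and passing both noatime and strictatime yields strict atime updates because the strictatime test runs last.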
On Thu, 26 Mar 2009 17:53:14 +0000 NAK this again There is an expected behaviour pattern that is standards compliant and suddenly breaking that on people when they upgrade could cause serious problems in some server environments. What you propose is basically a bogus ABI change. Fix it in user space (in fact all the distros *are* so this patch is silly and pointless) --
And I don't care. If the distro's had done this right in the year+ that this has been in, I might consider your NAK to have some weight. As it is, we know that didn't happen, and we've had three different people from different distributions say that they wanted to use relatime anyway, so it's now the default in my git tree. If you want to live in some dark ages, you can do so with the "strictatime" thing. Linus --
I obviously welcome the inclusion of the first patch and I'm neutral about the second one, but I'm not at all sure that making relatime the default (yet) is the right thing, especially as util-linux' mount command does not yet even support that "strictatime" thing. Shouldn't that at least happen first (and have been supported in distro's stable releases for some time)? I guess users and distros can still elect not to set it as default, but it still seems a bit like going from one extreme to another. --
Why? RELATIME has been around since 2006 now. Nothing has happened. People who think "we should leave it up to user land" lost their credibility long ago. Linus --
> Why? RELATIME has been around since 2006 now.
The workable fixes to relatime (the always update once per 24 hours) you
only just committed - and did come from a vendor.
It also looks btw that we don't want to have a "relatime" option and a
"strictatime" option and a "relatimebutdoitevery24hrs" option.
All three of these are the same thing so it should (regardless of default
choice) be
relatime=n
n = 0 ('update if its more than 0 seconds out of date') =
strictatime
n = MAXINT (basically equals relatime)
n = 24hrs (the new 'fixed' relatime but not too relative)
n = anything else - user tuned
Alan
--
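The single "relatime=n" knob proposed above can be sketched in a few lines of userspace C. This is a hedged illustration, not a patch: the function name and the use of a long threshold in seconds are assumptions, and timestamps are reduced to whole seconds. It also assumes the clock never runs behind the stored atime, since otherwise n = 0 would not behave exactly like strictatime.

```c
#include <limits.h>
#include <time.h>

/*
 * Hypothetical model of a per-mount "relatime=n" threshold, in seconds.
 * Returns 1 when atime should be updated, 0 when it can be skipped.
 */
int model_atime_due(long threshold, time_t atime, time_t mtime,
                    time_t chg_time, time_t now)
{
    /* The file was modified or changed since the last recorded access. */
    if (mtime >= atime || chg_time >= atime)
        return 1;
    /* Otherwise update only once the stored atime is old enough. */
    return (long)(now - atime) >= threshold;
}
```

Under those assumptions a threshold of 0 updates on every access, LONG_MAX leaves only the mtime/ctime comparison, and 24*60*60 gives the once-a-day variant, which matches the three cases listed in the message.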
As I think Andrew already noted, the discussion today is largely a rehash of one in 2007, summarized by lwn [1] and kerneltrap [2]. That's also when Ingo first submitted the patch (based on a suggestion from you). But it has been blocked by others twice, and for exactly the same reasons. relatime *without* the 24-hour safeguard has unanimously been deemed unsuitable as a default by distros. So the real problem is that nobody ever did the work needed to make Ingo's original patch acceptable to the fs devs and the resulting stalemate for the last 1 3/4 years. IMO that's mainly a kernel community failure and not a user land failure. You've now at least broken that stalemate. Your statement is also not quite true. At least Ubuntu has had relatime enabled by default for new installations for a couple of releases. And AFAICT they now even have it enabled by default now in their kernel config, but I'm not entirely sure. For Debian Lenny (current stable release), relatime is a mount option that can be activated during new installs (admittedly only if you look hard enough). All that would have been needed for Debian to enable relatime by default for new installs was to have something like the *first* patch of the three you've now committed to have been included in 2.6.26. Cheers, FJP [1] http://lwn.net/Articles/244829/ [2] http://kerneltrap.org/node/14148 --
'No changes of ABI in stable series?' If you tweaked defaults for ext4, before it was widely used... that would be acceptable I guess. By breaking old setups with kernel change is bad. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
(a) we don't have a stable series any more and (b) this isn't an abi change, it's a system management change. If you don't think we can make those, then I assume that you also claim that we can't do things like commit 1b5e62b42, which doubled the writeback dirty thresholds, or any of the things that changed how dirty accounting was done in the first place? I would _love_ for distros to do the sane thing, but they don't. That's a fact. Linus --
Well, stat() syscall no longer returns sane value in st_atime, while all the userland stayed the same; only kernel changed. I believe that Writeback dirty thresholds will only change timing, that was not part But is this a way to do it? Are there maybe better ways? a) Publicly call those distros broken? b) Add nasty printk() to mount to force their attention? bb) Add nasty printk() and mdelay(1000) to really force their attention? :-) c) Modify mount command to do the dirty work instead of changing default in kernel? [as mountflags are not passed as a string by sys_mount(), you are creating pretty nasty situation for users; users with old distro but new kernel will not be even able to get old behaviour back w/o updating /sbin/mount. This would prevent it]. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
No. If you want the old abi, use "strictatime" in your /etc/fstab. No ABI changed. Just the default mount options changed to be what most people (especially non-specialists) would likely want. Deal with it. Linus --
Maybe. But let's see what awaits me with the 2.6.30-rc1 update:
root@amd:~# mount /data -oremount,noatime
root@amd:~# mount /data -oremount,strictatime
mount: /data not mounted already, or bad option
...oh no, my mount is too old. My mount seems to be up-to-date with
debian testing.
Should I have to install mount from sources just to keep the compatible
system settings? There has to be a better way.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
NAK, but only because if we change the default it is better to change it to the real thing: noatime. I think this can be solved in userland but perhaps changing this in kernel would be a stronger message that atime is officially obsoleted. (and nothing will break, not even mutt users will notice, and if they really do it won't be anything more than aesthetic) About the open(destination, O_TRUNC); write; close, I think it's not worth changing the kernel or the VM in any way to hide buggy programming like that, to the contrary it's great it was found early on (instead of being filed as some obscure not reproducible bug lost in some bugzilla and hitting once in a while with an unlucky power-loss during boot). But solving this bug with fsync so it works for writeback mode too, would make me prefer to gamble and run the buggy version ;). Not sure if it is worth providing any ordering guarantee more than 'ordered' mode in the long term or some proper barrier, but at least ordered mode already allows for renaming the tempfile to be enough and that is clearly the best tradeoff. fsync really should be used only to avoid total loss of information (like when we need to avoid losing the delivery of an email after the smtp client is told the email was already received by the smtp server). Using fsync to tell the kernel in what order to write its dirty pagecache data to disk is as inefficient as driving a car to travel a distance of 10 meters, so rightfully people aren't using it for this even if it's the only way it could work for writeback and ext2 too. --
Actually I liked the previous relatime (without the 24h hack) pretty much. It would have preserved the atime functionality without making something like slocate dirty huge parts of the fs daily. I vote for relatime-without-24h-hack ! Xav --
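For readers who have not looked at it, the relatime decision can be modelled roughly as below. This is a sketch of my understanding of the rule, not the kernel code: the function name, the plain-number timestamps and the `day_rule` switch (which turns the "24h hack" on or off) are all assumptions for illustration.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now, day_rule=True):
    """Return True if a read should write a new atime under relatime.

    Timestamps are seconds since the epoch. day_rule=False models the
    "relatime without the 24h hack" variant asked for above.
    """
    # The file was modified or changed since it was last read.
    if mtime >= atime or ctime >= atime:
        return True
    # The "24h hack": refresh the atime anyway once it is a day old.
    if day_rule and now - atime >= DAY:
        return True
    return False
```

With `day_rule=False`, a daily scan that only reads unchanged files never triggers an atime write, which is the slocate case described in the message.
--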
This makes the assumption that atime is not used, which may be true on your system but isn't on others. I regularly move data between faster and slower storage based on atime, and promote reactivated projects to something faster, while retiring inactive project data elsewhere. Other admins use it to identify unused files which are candidates for backup to offline media or the bit bucket. Let people who want that behavior specify it for existing filesystems. If you want to remove functionality from ext4 or btrfs or some other place where people have no existing expectations, I still think it's wrong, but I couldn't say I think it might break anything. I did a patch a few years ago which only updated atime on open and write, and that worked about as well as relatime: the inode update on open is cheap, the head is already there, and it was only slightly slower than noatime. There were no programs which kept files open for days and just read them. The only storage hierarchy was "slow and cheap." ;-) -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
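The kind of atime-based housekeeping described above might look something like this. It is a sketch only: `stale_files` and its parameters are invented names, and the result is only meaningful on a filesystem that actually maintains atime (so not under noatime).

```python
import os
import time


def stale_files(root, days, now=None):
    """Yield (path, atime) for regular directory entries under `root`
    whose last access is more than `days` days before `now`."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat() reads the inode only; it does not touch atime.
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                yield full, st.st_atime
```

Something like `for path, atime in stale_files("/projects", 90): ...` would then pick the candidates for slower storage; the path and the 90-day threshold are placeholders.
--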
On Thu, 26 Mar 2009 17:49:56 +0000 NAK this is unnecessary complication from a broken ABI change that isn't safe to make anyway. --
It probably was a wrong default - twenty years ago. Actually it may well have been a wrong default in Unix v6 8) However - atime behaviour is SuS required - there are users with systems out there using atime and dependent on proper atime. So we can't change the ABI on them any more than we can decide that next week write() should return short values on writes to disk interrupted by signals... Letting distros flip to relatime means new installs and gradual migration occurs and nobody gets spectacularly blown up when their archiving system, their usage profiling and disk balancing tools and the like go wrong. --
SuS says "An implementation may update fields that are marked for update immediately, or it may update such fields periodically. At an update point in time, any marked fields shall be set to the current time and the update marks shall be cleared" but doesn't appear to specify any kind of time limit. A conforming implementation could wait a century before performing the update. So while relatime doesn't conform, the practical difference is meaningless. You can't depend on atime being updated in a timely manner. -- Matthew Garrett | mjg59@srcf.ucam.org --
POSIX says a disk write interrupted by a signal can be a short write. If you do this in practice all hell breaks loose. A conforming implementation needs to conform with expectations, not just play lawyer games with users' systems. Alan --
I agree, but arguing for something on the basis of a spec isn't terribly convincing if the spec allows effectively identical behaviour. SuS isn't a relevant consideration when it comes to deciding default atime policy. -- Matthew Garrett | mjg59@srcf.ucam.org --
I'd say it's of minor relevance. The default expected behaviour we had last week is however of major relevance and that is my big concern. --
Is this the same Alan Cox who thought a couple of months ago that having an insanely low default maximum number of epoll instances was a reasonable answer to a theoretical DoS risk, despite it breaking pretty much every reasonable user of the epoll interface? Bron ( what stable interface? ) --
In the short term yes - because security has to be a very high priority. Lesser of two evils. Alan --
So turn the machine off.
It seems to me that having atime turned on is a DoS risk. Any punk
can cause lots of disk IO that will make everyone else's fsync's
turn into molasses simply by reading lots of files. ZOMG (as the
kiddies of today would say) - we'd better fix this DoS risk by
disabling or rate limiting this dangerous vector (eleventyone!)
Bron ( ok, I'm getting a bit silly here - but if we blocked every
potential DoS by making sure a single user could only use a
small percentage of the machine's total capacity at maximum... )
--
> Bron ( ok, I'm getting a bit silly here Yes you are - completely. --
So I propose another mount option alongside strictatime: nowatime. It gives the current time as atime. It is totally useless, but fast *and* POSIX compatible:
- no disk writes on accesses
- POSIX doesn't mandate the behaviour of other processes, so we simulate that filesystems are scanned at every fs-tick.
- IMHO more programs break, but in this case only
This is the real problem.
ciao
cate
--
That's a difference between Fedora's initrd and Debian/Ubuntu's initramfs-tools then. We do respect the mount options in fstab for the root partition when root is mounted from the initrd:
$ cat /etc/fstab | grep " / "
/dev/mapper/main-root / ext3 relatime,errors=remount-ro 0 1
$ mount | grep " / "
/dev/mapper/main-root on / type ext3 (rw,relatime,errors=remount-ro)
$ cat /proc/cmdline
root=/dev/mapper/main-root ro vga=791 quiet
Cheers,
FJP
--
It should honor /etc/fstab changes, if the initramfs is rebuilt after the change is made. If it doesn't, that's a bug. Bill --
Why the hell should I rebuild initramfs? Anyway, I fixed it. I don't use initramfs any more, after all the idiocies it has done. I had to make everything primary partitions in order to do that, but hey, that solved a lot of other problems too, so that was no loss. Linus --
Well, it's got to find the root fs options somewhere. Pulling them from the modified /etc/fstab in the root fs before you mount it, well... As for why fstab options aren't applied with remount once the root fs has been mounted, 1) historical reasons 2) someone specifies 'data=writeback' or similar can't-be-applied-with-remount flag in /etc/fstab, and then mount refuses to remount it at all, and the system refuses to boot. Arguably pilot error, of course. Bill --
Umm. The _only_ sane thing to do is to mount the root read-only from initramfs, and then re-mount it with the options in the /etc/fstab later when you re-mount it read-write _anyway_ (which may possibly be immediately, of course). Anybody who thinks you should re-write initramfs for something like this really hasn't spent a single second thinking about it. Linus --
Sure, and as said, as soon as you try to specify journal options (and possibly others), this immediately fails. You can apply the options one at a time, and decide some aren't fatal, or you can actually have your later remount have code to drop specific options, requiring implementation knowledge of any filesystem to be used. Or you say people who specify journal options in fstab don't get to boot. But if you blindly attempt to apply fstab options later in the remount, some options will break. Bill --
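To make the "drop specific options on remount" approach concrete, here is a rough sketch. Everything in it is hypothetical: the helper names, and especially `SKIP_ON_REMOUNT`, which stands in for the per-filesystem knowledge the message says such code would need. The only example the thread gives of a flag that cannot be applied on remount is data=writeback.

```python
# Hypothetical list of option prefixes assumed not to be changeable on
# remount; a real tool would need per-filesystem knowledge here.
SKIP_ON_REMOUNT = ("data=",)


def root_options(fstab_text):
    """Return the mount options of the "/" entry in fstab text, as a list."""
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, split columns
        if len(fields) >= 4 and fields[1] == "/":
            return fields[3].split(",")
    return []


def remount_options(options, skip=SKIP_ON_REMOUNT):
    """Build the -o argument for a remount, leaving out skipped options."""
    kept = [opt for opt in options if not opt.startswith(skip)]
    return ",".join(["remount"] + kept)
```

Fed the fstab line quoted earlier in the thread (`relatime,errors=remount-ro`), this produces `remount,relatime,errors=remount-ro`; with `data=writeback,noatime` it produces `remount,noatime`, silently dropping the journal mode, which is exactly the trade-off being argued about.
--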
On Thu, 26 Mar 2009 13:32:38 -0400 Surely it should also look at the real /etc/fstab after mounting root r/o and then flip the options needed so you don't have to. --
When you say "similarly bad", how many seconds were you seeing? I understand that from the user's perspective, the 120 seconds you saw with ext3 isn't going to be that different from 15 seconds (which seems to be the maximum commit time in the jbd2 history file you sent me), but I'm curious if what you saw was just as bad with ext4, or was it somewhat better (i.e., 120 seconds vs 15 or so). Or were you also seeing a net time to save the file using vim of around 120 seconds with ext4? Ext4 in nodelalloc mode is mostly similar to ext3, but it does have some improvements, such as a slightly elevated I/O priority for kjournald, and ext4's writepage doesn't take the journal handle as it does in ext3. (That's why I was confused about Linus's assertion about ext3 waiting on the journal; ext4 doesn't any more, and I had ext4 on the brain.) Unfortunately, we don't have the /proc/fs/jbd/<dev>/history for ext3, so it would be interesting to compare whether the vim save latencies were improved or not with ext4. If they are, then it might be worth Jan's time to fix up ext3's writepage to not try to request journal access if it's not needed. It might also be worth backporting ext4's slightly raised I/O priority patch. Another thing that's worth trying: suppose you use ionice to raise the priority of kjournald to a real-time I/O priority (which is what Arjan's patch does). How much does that help? Is it more or less compared to what we're seeing with ext4's slightly raised I/O priority? And if we mount the filesystem noatime, does that change the results Presumably these go away once we mount the filesystem noatime, right? - Ted --
It was in the minute range, iirc. It was totally unusable interactively. I wrote the vim-test script during that workload and i'm still getting annoyed thinking back at the experience. Is that enough to consider it bad? :-) This isn't me streaming gigs of data in and out of the system dirtying 90% of all RAM. This is a trivial workload barely scratching the RAM and CPU capabilities of the system. Do you have a non-tweaked default Fedora install somewhere? These kinds of delays in Vim were easily reproducible in the last 5 years and i saw it reported frequently on various lists. Have you tried to reproduce it? Have you tried CONFIG_LATENCYTOP? We implemented that kernel feature specifically to make it easy for developers to instrument their kernel and keep system latencies down. This isn't some oddball workload or oddball system. These latencies are reproducible on just about any Linux development system i ever tried with ext3. And the thing is, to 99.9% of the people it doesn't matter how scalable we are to 16000 CPUs or whether a directory with 1 million files in it takes 10 or 200 msecs to parse. But it gives a permanent impression how much delay basic everyday operations on the system have. So latency optimizations (and i use the term loosely here) have to be the primary development metric in Linux IMHO. ( If i were doing filesystem development i'd sure already have my low-latency filesystem patchset ;-) Ingo --
Have you tried with maxcpus set to say, 2? My guess is you won't see the problems in that case. So I'm not sure saying "barely scratching the CPU capabilities of the system" is completely fair. I can probably get temporary access to a 16 CPU system, but that's not the kind of system that I normally get to use for my kernel My normal development is not all that different from yours (make -j<numcpus*2>) and I do edit and save files while the compile is going. I use emacs, but it calls fsync() when saving files, just like vim does. The big difference is that for me, numcpus is normally 2. And my machine has 4 gigs of memory, not 12 gigs. So I don't see these problems. I agree that what you have isn't an "oddball workload"; as far as whether it is an "oddball system", it is certainly a system I would lust after. And I acknowledge the world is a bit different from when Linus declared that 99% of the world was 1 or 2 CPU's. I suspect the percentage of machines with 16 CPU's is still somewhat small, though. So I'll try to reproduce it on a 16 CPU system, when I have a chance --- but it's something that I'm going to have to borrow and try to get remote access to play with such a system. Clearly your employer is way more generous with equipment than mine is, at least for personal development machines. :-) In the meantime, if you could run some of the tests and vary some of the variables I requested, I'd appreciate it, and thank you for your help. Otherwise, I'll try to run them when I get remote access to such a machine where I'm allowed to replace kernels and mount random test filesystems. - Ted P.S. Another interesting test would be to plot the vim save latencies versus the number of CPU's enabled when running the kernel build workload. P.P.S. I assume there's no way you could give me remote ssh access to your nice 16-way machine? :-) --
Note, my previous devel box was a single socket quad and it had such delays all the time as well. Havent tried it on a dual-core. (i dont have dual-core systems with enough RAM to be able to build a kernel purely in RAM) Ingo --
On Thu, 26 Mar 2009 19:59:36 -0400 Nope, I saw this with my dual CPU machine too (before I upgraded to quad core)... Just doing kernel builds and/or icecream and/or VMware. It didn't take much. I have 8G of memory now but I used to have less I'm surprised you haven't seen this then... Maybe your journal is bigger? Or some other config difference... -- Jesse Barnes, Intel Open Source Technology Center --
This is something that is really hard to get right. If the shell is running a program when SIGINT arrives, it needs to wait until the program exits, and then try to decide if the program died because of the signal, or actually caught the signal (from the user's perspective), did something useful, and then chose to exit. If the program's exit status shows that it died due to SIGINT, it is easy to know what to do. But lots of non-trivial programs, probably including 'make', catch SIGINT, do some quick cleanup and then exit. In that case the shell has a hard time deciding what to do. I wrote a job-controlling shell many years ago and I think the heuristic I came up with was that if the process exited with the SIGINT status, or with a non-zero error status in less than 3 seconds after the signal actually arrived, then react to the signal and abort any script. However, if the process takes longer to exit or returns a zero exit status, assume that it was interactive and handled the interrupt to the user's satisfaction, and continue with any script. I don't know what bash does, and it is possible that it could do a better job. But it is a problem for which there is no straightforward solution (a bit like filesystem data safety it would seem :-) NeilBrown --
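The heuristic described above is small enough to write down. This is only a sketch of the rule as stated in the message, not code from any real shell; the function name and the `grace` parameter are made up, and the "killed by SIGINT" check assumes Python's subprocess convention of a negative return code for death by signal.

```python
import signal


def should_abort_script(returncode, seconds_after_sigint, grace=3.0):
    """Decide whether a shell should abort the running script after SIGINT.

    returncode: child's exit status (negative means killed by that signal).
    seconds_after_sigint: how long the child took to exit after the signal.
    """
    # The child died from the signal itself: clearly an interrupt.
    if returncode == -signal.SIGINT:
        return True
    # The child caught SIGINT, cleaned up quickly and reported failure.
    if returncode != 0 and seconds_after_sigint < grace:
        return True
    # Slow exit or success: assume the program handled the interrupt itself.
    return False
```

So a `make` that catches the signal and exits with status 1 within a second aborts the script, while an interactive program that exits 0 some time later does not.
--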
Hi Ingo,
Just a data point: I've seen this exact same thing for a long time (1-2
years) too even with stock distribution kernels. Never bothered to
investigate it though.
Pekka
--
(gets deja-vu feelings) http://lkml.org/lkml/2003/2/21/10 Maybe you should be running a 2.5.61 kernel. --
I've played with it a bit. I don't have a fast enough machine so that a compile would feed my SATA drive fast enough (and I also have just 2 GB of memory) but copying kernel tree there and back seemed to load it reasonably. I've tried a kernel with and without attached patch which makes writepage So I observed long delays when VIM was saving a file but in all cases it was hanging in fsync() which was committing a large transaction (this was both with and without patch) - not a big surprise. Working on the machine seemed a bit better when the patch was applied - in the kernel with the patch VIM at least didn't hang when just writing into the file. Reads are measurably better with the patch - the test with cat you describe below took ~0.5s per file without the patch and always less than 0.02s with the patch. So it seems to help something. Can you check on your <snip> Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR
The patch looks OK to me. This will attach dirty buffers to a clean page, which is an invalid And if this error happens we'll go on to run redirty_page_for_writepage() which will do the right thing. However if PageMappedToDisk() is working right, we should be able to avoid that newly-added buffer walk. Possibly SetPageMappedToDisk() isn't being run in all the right places though, dunno. --
Yes - actually the page has been dirty just the moment before when we run clear_page_dirty_for_io() - and at this function could have also created Yes, SetPageMappedToDisk is set only by block_read_full_page(), mpage_readpage() and nobh_write_begin(). Obviously not enough... It would be nice to improve that but that's another story... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
That would seem to be a _huge_ improvement. Reads are the biggest issue for starting a new process (eg starting firefox while under load), and if cat'ing that small file improved by that much, then I bet there's a huge practical implication for a lot of desktop uses. The fundamental fsync() latency problem we sadly can't help much with, the way ext3 seems to work. But I do suspect that the whole "don't synchronize with the journal for normal write-outs" may end up helping even fsync just a bit, if only because I suspect it will improve writeout throughput too and thus avoid one particular bottleneck. Linus --
It's strange that we still don't have an ext3_writepages(). Open a transaction, do a large pile of writes, close the transaction again. We don't even have a data=writeback writepages() implementation, which should be fairly simple. Bizarre. Mingming had a shot at it a few years ago and I think Badari did as well, but I guess it didn't work out. Falling back to generic_writepages() on our main local fs is a bit lame. --
Doable but not fairly simple ;) Firstly you have to restart a transaction when you've used up all the credits you originally started with (easy), secondly ext3 uses lock order PageLock -> "transaction start" which is unusable for the scheme you suggest. So we'd have to revert that - which needs larger audit of our locking scheme and that's probably the reason Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
It's also not clear that ext3 can really do much better than the regular generic_writepages() logic. I mean, seriously, what's there to improve on? The transaction code is all normally totally pointless, and I merged the patch that avoids it when not necessary. It might be different if more people used "data=journal", but I doubt that is very common. For data=writeback and data=ordered, I bet generic_writepages() is as good as anything ext3-specific could be. Linus --
- opening a single transaction for many pages in the cases when a transaction _is_ needed. - single large BIO versus zillions of single-page BIOs. Relatively minor benefits, but it's a bit odd that we never got around to doing it. It just got quite a bit harder to do, so I expect we won't be doing it. --
This is an update to the ext3+CFQ latency measurements i did early in the merge window - originally with a v2.6.29 based kernel. Today i've repeated my measurements under v2.6.30-rc1, using the exact same system and the exact same workload. The quick executive summary:
Here are the specific details, as a reply to my earlier mail:
Under .30-rc1 i couldnt hit a single (!) annoying delay during half an hour of trying. The "Vim experience" is _totally_ smooth with a load average of 40+. And this is with default, untweaked ext3 - not even ext4. I'm
These delays are definitely below 300 msecs now. (100 msecs is
This test is totally smooth now:
file # 6 (253560 bytes), reading it took: 0.05 seconds.
file # 7 (253560 bytes), reading it took: 0.11 seconds.
file # 8 (253560 bytes), reading it took: 0.12 seconds.
file # 9 (253560 bytes), reading it took: 0.06 seconds.
file # 10 (253560 bytes), reading it took: 0.05 seconds.
file # 11 (253560 bytes), reading it took: 0.11 seconds.
file # 12 (253560 bytes), reading it took: 0.09 seconds.
file # 13 (253560 bytes), reading it took: 0.09 seconds.
file # 14 (253560 bytes), reading it took: 0.03 seconds.
file # 15 (253560 bytes), reading it took: 0.08 seconds.
file # 16 (253560 bytes), reading it took: 0.15 seconds.
file # 17 (253560 bytes), reading it took: 0.06 seconds.
file # 18 (253560 bytes), reading it took: 0.13 seconds.
file # 19 (253560 bytes), reading it took: 0.16 seconds.
file # 20 (253560 bytes), reading it took: 0.29 seconds.
file # 21 (253560 bytes), reading it took: 0.18 seconds.
file # 22 (253560 bytes), reading it took: 0.28 seconds.
file # 23 (253560 bytes), reading it took: 0.04 seconds.
290 msecs was the worst in this series above. The vim read+write test takes longer:
aldebaran:~/linux/linux/test-files/src> ./vim-test
file # 0 (253560 bytes), Vim-opening it took: 2.35 seconds.
file # 1 (253560 bytes), Vim-opening it took: 2.09 seconds.
file # 2 (253560 bytes), ...
Here's a quick test with xfs+CFQ on one of my testing machines, using fsync-tester while running Linus' "bigfile torture test". It underlines your results, showing significant improvement! The xfs partition is mounted with the defaults, and I'm expecting slightly more improvement after mounting with noatime and nobarrier.
2.6.30-rc1
fsync time: 0.4674
fsync time: 1.0473
fsync time: 0.4190
fsync time: 1.0800
fsync time: 1.0132
fsync time: 1.0193
fsync time: 1.0191
fsync time: 1.1318
fsync time: 0.9924
fsync time: 1.0568
fsync time: 1.0676
fsync time: 1.0241
fsync time: 1.0530
fsync time: 0.9709
fsync time: 0.4475
fsync time: 0.6320
fsync time: 1.0906
fsync time: 0.6344
fsync time: 1.0632
fsync time: 1.0455
fsync time: 1.0530
fsync time: 1.0655
fsync time: 1.0032
fsync time: 1.0644
fsync time: 1.1573
fsync time: 1.0197
fsync time: 1.0342
fsync time: 1.0643
fsync time: 0.0342
fsync time: 0.7603
fsync time: 1.0905
fsync time: 0.6340
2.6.29.1
fsync time: 2.1255
fsync time: 2.2851
fsync time: 1.9048
fsync time: 1.0999
fsync time: 2.0117
fsync time: 2.0819
fsync time: 2.0819
fsync time: 0.0225
fsync time: 0.2796
fsync time: 0.3879
fsync time: 0.6584
fsync time: 0.9287
fsync time: 0.2488
fsync time: 2.0994
fsync time: 2.0161
fsync time: 1.9736
fsync time: 2.0231
fsync time: 2.2888
fsync time: 2.1719
fsync time: 1.8452
fsync time: 0.3278
fsync time: 1.0881
fsync time: 0.5202
fsync time: 1.3339
fsync time: 0.4295
fsync time: 1.2772
fsync time: 1.9436
fsync time: 2.1048
fsync time: 1.9376
fsync time: 2.0786
fsync time: 1.9202
--
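For anyone wanting to collect numbers like the "fsync time" figures above, the measurement itself is simple. The sketch below is not the fsync-tester program used in the thread (its source is not shown here); it is just one plausible way to time a single small write followed by fsync(), with an invented function name and a 4 KiB payload chosen arbitrarily.

```python
import os
import time


def time_fsync(path, payload=b"x" * 4096):
    """Write `payload` to `path` and return how long fsync() took, in seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)  # blocks until the data has been pushed to the device
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Calling it once a second on the filesystem under test, while a heavy writer runs in the background, gives a series comparable in spirit to the ones posted; the absolute values depend entirely on hardware, filesystem and mount options.
--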
Hmm. Thinking about that, I'm not so sure. Shouldn't that backing store allocation happen when the page is actually dirtied on ext3? I _suspect_ that goes back to the fact that ext3 is older than the "aops->set_page_dirty()" callback, and nobody taught ext3 to do the bmap's at dirty time, so now it does it at writeout time. Anyway, there we are. Old filesystems do the wrong thing (block allocation while doing writeout because they don't do it when dirtying), and newer filesystems do the wrong thing (block allocations during writeout, because they want to do delayed allocation to do the inode dirtying after doing writeback). And in either case, the VM is screwed, and can't ask for writeout, because it will be randomly throttled by the filesystem. So we do lots of async bdflush threads, which then causes IO ordering problems because now the writeout is all in random order. Linus --
We don't do it currently. We could do it (it would also solve the problem that we currently silently discard the user's data when he reaches his quota or the filesystem gets ENOSPC) but there are problems with it as well:
1) We have to write out blocks full of zeros on allocation so that we don't expose unallocated data => slight slowdown
2) When blocksize < pagesize we must play nasty tricks for this to work (think about i_size = 1024, set_page_dirty(), truncate(f, 8192), writepage() -> uhuh, not enough space allocated)
3) We'll do allocation in the order in which pages are dirtied. Generally, I'd suspect this order to be less linear than the order in which writepages submit IO and thus it will result in larger fragmentation of the file.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Why? This is in _no_ way different from a regular "write()" system call. And there, we just attach the buffers to the page. If something crashes before the page actually gets written out, then we'll have hopefully never Good point. I suspect not enough people have played around with "set_page_dirty()" to find these kinds of things. The VFS layer probably doesn't help sufficiently with the half-dirty pages, although the FS can obviously always look up the previously last page and do things manually if it wants to. Yes, that may be the case. Of course, the approach of just checking whether the buffer heads already exist and are mapped (before bothering with anything else) probably works fine in practice. In most loads, pages will have been dirtied by regular "write()" system calls, and then we will have the buffers pre-allocated regardless. Linus --
Yeah, I agree; solving the problem in the case of files being dirtied via write() is going to solve a much larger percentage of the cases compared to those cases where the pages are dirtied via mmap()'ed pages. I thought we were doing this already, but clearly I should have looked at the code first. :-( - Ted --
Sorry, I wasn't exact enough. We'll attach buffers to the running transaction and they'll get written out at the transaction commit which is usually earlier than when the writepage() is called and then later writepage() will write the data again (this is a consequence of the fact that JBD commit code just writes buffers without calling clear_page_dirty_for_io())... At least ext4 has this fixed because JBD2 already writes out ordered data via writepages(). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Andrew really didn't like Arjan's patch because it forces non-synchronous writes to have a real-time I/O priority. He suggested an alternative approach which I coded up as "akpm's locking hack to fix locking delays"; unfortunately, it doesn't work. In ext4, I quietly put in a mount option, journal_ioprio, and set the default to be slightly higher than the default I/O priority (but not a real-time class priority) to prevent the write starvation problem. This definitely helps for some workloads (when some task is reading enough to starve out the writes). More recently (as in this past weekend), I went back to the ext3 problem, and found a better solution, here: http://lkml.org/lkml/2009/3/21/304 http://lkml.org/lkml/2009/3/21/302 http://lkml.org/lkml/2009/3/21/303 These patches cause the synchronous writes caused by an fsync() to be submitted using WRITE_SYNC, instead of WRITE, which definitely helps in the case where there is a heavy read workload in the background. They don't solve the problem where there is a *huge* amount of writes going on, though --- if something is dirtying pages at a rate far greater than the local disk can write it out, say, either "dd if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster driving a huge amount of data towards a single system or a wget over a local 100 megabit ethernet from a massive NFS server where everything is in cache, then you can have a major delay with the fsync(). However, what I've found is that if you're just doing a local copy from one hard drive to another, or downloading a huge iso file from an ftp server over a wide area network, the fsync() delays really don't get *that* bad, even with ext3. At least, I haven't found a workload that doesn't involve either dd if=/dev/zero or a massive amount of data coming in over the network that will cause fsync() delays in the > 1-2 second category. Ext3 has been around for a long time, and it's only been the last couple of years that ...
Nice, thanks for the update! The situation isnt nearly as bleak as i i think the problem became visible via the rise in memory size, combined with the non-improvement of the performance of rotational disks. The disk speed versus RAM size ratio has become dramatically worse - and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which takes an eternity to write out if you happen to sync on that. When we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out - and worse than that, chances are that it's spread out widely on the disk, the whole thing becoming seek-limited as well. That's where the main difference in perception of this problem comes from i believe. The problem was always there, but only in the last 1-2 years did 4G/8G systems become really common for people to notice. SSDs will save us eventually, but they will take up to a decade to trickle through for us to forget about this problem altogether. Ingo --
That's definitely a problem too, but keep in mind that by default the journal gets committed every 5 seconds, so the data gets flushed out that often. So the question is how quickly can you *dirty* 1.6GB of memory? "dd if=/dev/zero of=/u1/dirty-me-harder" will certainly do it, but normally we're doing something useful, and so you're either copying data from local disk, at which point you're limited by the read speed of your local disk (I suppose it could be in cache, but how common of a case is that?), *or*, you're copying from the network, and to copy in 1.6GB of data in 5 seconds, that means you're moving 320 megabytes/second, which if we're copying in the data from the network, requires a 10 gigabit ethernet. Hence my statement that this probably became much more visible with fast ethernets --- but you're right, the huge increase in memory sizes was also a key factor; otherwise, write throttling would have kicked in and the VM would have started pushing the dirty pages to disk much sooner. - Ted --
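The arithmetic in the last two messages is easy to check. The sketch below just restates it with decimal units (1 GB = 10^9 bytes), which is an assumption on my part; the function names are invented. The inputs (32 GB of RAM, a 5% dirty ratio, a 5 second journal commit interval) are the ones quoted in the thread.

```python
GB = 10**9  # decimal gigabyte, assumed for this back-of-the-envelope check
MB = 10**6


def dirty_limit_bytes(ram_bytes, dirty_ratio_percent):
    """Bytes of page cache allowed to be dirty for a given ratio."""
    return ram_bytes * dirty_ratio_percent // 100


def fill_rate_bytes_per_s(dirty_bytes, interval_s):
    """Rate needed to dirty `dirty_bytes` within one commit interval."""
    return dirty_bytes / interval_s


limit = dirty_limit_bytes(32 * GB, 5)         # 1.6 GB, as stated above
rate = fill_rate_bytes_per_s(limit, 5)        # bytes per second
print(limit / GB, "GB dirty limit")
print(rate / MB, "MB/s to fill it in 5 s")
print(rate * 8 / 10**9, "Gbit/s on the wire")
```

320 MB/s works out to 2.56 Gbit/s, which is more than gigabit ethernet can carry and hence the reference to 10 gigabit ethernet.
--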
Say it's a file that you already have read into the memory cache... there is plenty of space in 16GB for that... then you can dirty it at memory speed, in about ½ sec (correct me if I'm wrong). OK, this is probably unrealistic, but memory grows; the largest we have at the moment is 32GB, and it's steadily growing with the core counts. Then the available memory is used to cache the "active" portion of the filesystems. I would even say that on the NFS servers I depend on it to do this efficiently. (2.6.29-rc8 delivered 1050MB/s over a 10GbitE using nfsd - send speed to multiple clients). The current workload is based on an active dataset of 600GB where indexes are being generated and written back to the same disk. So there is a fairly high read/write load on the machine (as you said was required). The majority (perhaps 550GB) is only read once where the Increasingly the case as memory sizes grow. or just around being processed on the 16-32 cores on the system. Jesper -- Jesper --
Doesn't at least ext4 default to the _insane_ model of "data is less important than meta-data, and it doesn't get journalled"? And ext3 with "data=writeback" does the same, no? Both of which are - as far as I can tell - total braindamage. At least with ext3 it's not the _default_ mode. I never understood how anybody doing filesystems (especially ones that claim to be crash-resistant due to journalling) would _ever_ accept the No, you'll still have to get per-page locks etc. If you use mmap(), you'll page-fault on each page, if you use write() you'll do all the page lookups etc. But yes, it can be pretty quick - the biggest cost probably _will_ be the speed of memory itself (doing one-byte writes at each block would change that, and the bottle-neck would become the system call and page lookup/locking path, but it's probably in the same rough cost range as the cost of writing out one page). That said, this is all why we now have 'dirty_*bytes' limits too. The problem is that the dirty_[background_]bytes value really should be scaled up by the speed of IO. And we currently have no way to do that. Some machines can write a gigabyte in a second with some fancy RAID setups. Others will take minutes (or hours) to do that (crappy SSD's that get 25kB/s throughput on random writes). The "dirty_[background_]ratio" percentage doesn't scale up by the speed of IO either, of course, but at least historically there was generally a pretty good correlation between amount of memory and speed of IO. The machines that had gigs and gigs of RAM tended to always have fast IO too. So scaling up dirty limits by memory size made sense both in the "we have tons of memory, so allow tons of it to be dirty" sense _and_ in the "we likely have a fast disk, so allow more pending dirty data". Linus --
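To put numbers on the "scale the dirty limit by IO speed" point: the time to drain a given amount of dirty data is simply size divided by throughput. A tiny sketch, with an invented function name and decimal units assumed, using the two extremes mentioned in the message (a gigabyte per second versus 25 kB/s):

```python
def flush_seconds(dirty_bytes, bytes_per_second):
    """Seconds needed to write out `dirty_bytes` at a sustained throughput."""
    return dirty_bytes / bytes_per_second


one_gb = 10**9  # decimal gigabyte (assumption)

fast_raid = flush_seconds(one_gb, 10**9)    # fancy RAID: about a second
slow_ssd = flush_seconds(one_gb, 25_000)    # 25 kB/s random writes
print(fast_raid, "s on the fast array")
print(slow_ssd / 3600, "hours on the slow device")
```

The same dirty limit therefore means about a second of pending IO on one machine and roughly eleven hours on the other, which is why a fixed byte or percentage limit cannot be right for both.
--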
..
MythTV: rm /some/really/huge/video/file ; sync
## disk light stays on for several minutes..
Not quite the same thing, I suppose, but it does break
the shutdown scripts of every major Linux distribution.
Simple solution for MythTV is what people already do: use xfs instead.
--
It is indeed a different issue. ext3 does a fair bit of IO on a (here 60G file) delete: http://people.redhat.com/~esandeen/rm_test/ext3_rm.png ext4 is much better: and yes, xfs does it very quickly: http://people.redhat.com/~esandeen/rm_test/xfs_rm.png -Eric --
At very high rates other things seem to go pear shaped. I've not traced it back far enough to be sure but what I suspect occurs from the I/O at disk level is that two people are writing stuff out at once - presumably the vm paging pressure and the file system - as I see two streams of I/O I see it with a desktop when it pages hard and also when doing heavy desktop I/O (in my case the repeatable every time case is saving large images in the gimp - A4 at 600-1200dpi). The other one (#8636) seems to be a bug in the I/O schedulers as it goes Yes and in the server environment or for typical enterprise customers this is a *big issue*, especially the risk of it being undetected that they just inadvertently did something like put your medical data into the I need to, so that I can double check none of the open jbd locking bugs are there and close more bugzilla entries (#8147) Thanks for the reply - I hadn't realised a lot of this was getting fixed but in ext4 and quietly Alan --
Surely the elevator should have reordered the writes reasonably? (Or is that what you meant by "the other one -- #8636 (I assume this is a kernel Bugzilla #?) seems to be a bug in the I/O schedulers as it goes Yeah, I could see that doing it. How big is the image, and out of curiosity, can you run the fsync-tester.c program I posted while saving the gimp image, and tell me how much of a delay you end up Where's your bravery, man? :-) I've been using it on my laptop since July, and haven't lost significant amounts of data yet. (The only thing I did lose was bits of a git repository fairly early on, and I was able to repair by True enough; changing the defaults to be data=writeback for the server environment is probably not a good idea. (Then again, in the server environment most of the workloads generally don't end up hitting the nasty data=ordered failure modes; they tend to be More testing would be appreciated --- and yeah, we need to groom the bugzilla. For a long time no one in ext3 land was paying attention to bugzilla, and more recently I've been trying to keep up with the ext4-related bugs, but I don't get to do ext4 work full-time, and Yeah, there are a bunch of things, like the barrier=1 default, which akpm has rejected for ext3, but which we've fixed in ext4. More help in shaking down the bugs would definitely be appreciated. - Ted --
There are two cases there. One is a bug #8636 (kernel bugzilla) which is where things like dump show awful performance with certain I/O scheduler settings. That seems to be totally not connected to the fs but it is a problem (and has a patch) The second one the elevator is clearly trying to sort out but it's behaving as if someone is writing the file starting at say 0 and someone else is trying to write it back starting some large distance further down 150MB+ for the pnm files from gimp used as temporaries by Eve (Etch Added to the TODO list once I can set up a suitable test box (my new dev I'm currently doing this on a large scale (closed about 300 so far this run). Bug 8147 might be worth a look as it's a case where the jbd locking and the jbd comments seem to disagree (the comments say you must hold a lock but we don't seem to) --
There are different problems leading to this: 1) JBD commit code writes ordered data on each transaction commit. This is done in dirtied-time order which is not necessarily optimal in case of random access IO. IO scheduler helps here though because we submit a lot of IO at once. ext4 has at least the randomness part of this problem "fixed" because it submits ordered data via writepages(). Doing this change requires non-trivial changes to the journaling layer so I wasn't brave enough to do it with ext3 and JBD as well (although porting the patch is trivial). 2) When we do dirty throttling, there are going to be several threads writing out on the filesystem (if you have more pdflush threads which translates to having more than one CPU). Jens' per-BDI writeback threads could help here (but I haven't yet got to reading his patches in detail to be sure). These two problems together result in non-optimal IO pattern. At least that's where I got to when I was looking into why Berkeley DB is so slow. I was trying to somehow serialize more pdflush threads on the filesystem but a stupid solution does not really help much - either I was starving some throttled thread by other threads doing writeback or I didn't quite keep the disk busy. So something like Jens' approach This one is still there. I'll have a look at it tomorrow and hopefully will be able to answer... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs --
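[Editor's sketch] Point 1) above contrasts submitting ordered data in dirtied-time order with submitting it sorted. The toy model below is an illustration only and does not reflect JBD internals; all names are hypothetical. It uses the sum of distances between consecutive block numbers as a crude stand-in for head travel, and shows the effect of sorting a batch by block number before submission.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute distances between consecutively submitted blocks. */
uint64_t total_travel(const uint64_t *blocks, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 1; i < n; i++)
        sum += blocks[i] > blocks[i - 1] ? blocks[i] - blocks[i - 1]
                                         : blocks[i - 1] - blocks[i];
    return sum;
}

/* qsort() comparator for uint64_t, written to avoid overflow. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Reorder a batch by block number before it is submitted. */
void sort_by_block(uint64_t *blocks, size_t n)
{
    qsort(blocks, n, sizeof(*blocks), cmp_u64);
}
```

For a batch dirtied in the order 900, 10, 500, 20, 800 the travel is 2640 blocks; sorted, it drops to 890. A real IO scheduler does part of this reordering anyway, which is the "IO scheduler helps here" remark in the mail.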
Isn't that the same fix? ext4 just defaults to the crappy "writeback" behavior, which is insane. Sure, it makes things _much_ smoother, since now the actual data is no longer in the critical path for any journal writes, but anybody who thinks that's a solution is just incompetent. We might as well go back to ext2 then. If your data gets written out long after the metadata hit the disk, you are going to hit all kinds of bad issues if the machine ever goes down. Linus --
Technically, it's not data=writeback. It's more like XFS's delayed allocation; I've added workarounds so that files which are replaced via truncate or rename get pushed out right away, which should solve most of the problems involved with files becoming With ext2 after a system crash you need to run fsck. With ext4, fsck isn't an issue, but if the application doesn't use fsync(), yes, there's no guarantee (other than the workarounds for replace-via-truncate and replace-via-rename), but there's plenty of prior history that says that applications that care about data hitting the disk should use fsync(). Otherwise, it will get spread out over a few minutes; and for some files, that really won't make a difference. For precious files, applications that use fsync() will be safe --- otherwise, even with ext3, you can end up losing the contents of the file if you crash right before the 5 second commit window. At least back in the days when people were proud of their Linux systems having 2-3 year uptimes, and where jiffies could actually wrap from time to time, the difference between 5 seconds and 3 minutes really wasn't that big of a deal. People who really care about this can turn off delayed allocation with the nodelalloc mount option. Of course then they will have the ext3 slower fsync() problem. You are right that data=writeback and delayed allocation do both mean that data can get pushed out much later than the metadata. But that's allowed by POSIX, and it does give some very nice performance benefits. With either data=writeback or delayed allocation, we can also adjust the default commit interval and the writeback timer settings; if we, say, change the default commit interval to be 30 seconds, and change the writeback expire interval to be 15 seconds, it will also smooth out the writes significantly. So that's yet another solution, with a different set of tradeoffs. Depending on the set of applications someone is running on their system, running and the ...
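[Editor's sketch] The replace-via-rename pattern mentioned above, done by an application that does call fsync(), might look roughly like this. The function and its parameters are hypothetical names chosen for illustration; EINTR handling and syncing the containing directory are left out to keep it short.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace path with data via a temporary file on the same filesystem.
 * The data is fsync()ed before rename(), so the new name should never
 * point at a file whose contents have not been pushed out.
 * Returns 0 on success, -1 on error. */
int replace_file_durably(const char *path, const char *tmp_path,
                         const char *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Whether that fsync() actually reaches the platter is a separate question, and is what the FLUSH CACHE discussion further down the thread is about.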
Bah. A corrupt filesystem is a corrupt filesystem. Whether you have to fsck it or not should be a secondary concern. I personally find silent corruption to be _worse_ than the non-silent one. At least if there's some program that says "oops, your inode so-and-so seems to be scrogged" that's better than just silently having bad data in it. Of course, never having bad data _nor_ needing fsck is clearly optimal. data=ordered gets pretty close (and data=journal is unacceptable for performance reasons). But I really don't understand filesystem people who think that "fsck" is the important part, regardless of whether the data is valid or not. That's just stupid and _obviously_ bogus. Linus --
It is always interesting to try to explain to users that just because fsck ran cleanly does not mean anything that they care about is actually safely on disk. The speed that fsck can run at is important when you are trying to recover data from a really hosed file system, but that is thankfully relatively rare for most people. Having been involved in many calls with customers after crashes, what they really want to know is pretty routine - do you have all of the data I wrote? can you prove that it is the same data that I wrote? if not, what data is missing and needs to be restored? We can help answer those questions with checksums or digital hashes to validate the actual user data of files (open question is when to compute it, where to store, would the SCSI T10 DIF/DIX stuff be sufficient), putting in place some background scrubbers to detect corruptions (which can happen even without an IO error), etc. Being able to pinpoint what was impacted is actually enormously useful - for example, being able to map a bad sector back into some meaningful object like a user file, meta-data (translation, run fsck) or so on. Ric --
I think I can understand that point of view, at least: More customers complain about hours-long fsck times than they do about Amen. And, personal filesystem pet peeve: please encourage proper FLUSH CACHE use to give users the data guarantees they deserve. Linux's sync(2) and fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee a media write. Jeff P.S. Overall, I am thrilled that this ext3/ext4 transition and associated slashdotting has spurred debate over filesystem data guarantees. This is the kind of discussion that has needed to happen for years, IMO. --
I completely agree. This also applies to nfsd_sync, by the way. What's the right place to implement that? How about sync_blockdev? -- Benny Halevy Software Architect Panasas, Inc. bhalevy@panasas.com Tel/Fax: +972-3-647-8340 Mobile: +972-54-802-8340 Panasas: The Leader in Parallel Storage www.panasas.com --
Erm, no, you don't enable barriers on your drive, they are not a hardware feature. You enable barriers via your filesystem. Stating "fsync already does that" borders on false, because that assumes (a) the user has a fs that supports barriers (b) the user is actually aware of a 'barriers' mount option and what it means (c) the user has turned on an option normally defaulted to off. Or in other words, it pretty much never happens. Furthermore, a blatantly obvious place to flush data to media -- fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block layer to issue a FLUSH CACHE for __any__ filesystem. But that doesn't happen either. So, no, for 95% of Linux users, fsync does _not_ already do that. If you are lucky enough to use XFS or ext4, you're covered. That's it. Jeff --
Thanks for the lesson Jeff, I'm obviously not aware how that stuff That is true, except if you use xfs/ext4. And this discussion is fine, as was the one a few months back that got ext4 to enable barriers by default. If I had submitted patches to do that back in 2001/2 when the barrier stuff was written, I would have been shot for introducing such a slow down. After people found out that it just wasn't something silly, then you have a way to enable it. I'd still wager that most people would rather have a 'good enough fsync' on their desktops than incur the penalty of barriers or write The point is that you need to expose this choice somewhere, and that 'somewhere' isn't manually editing fstab and enabling barriers or fsync-for-real. And it should be easier. Another problem is that FLUSH_CACHE sucks. Really. And not just on ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and wait for the world to finish. Pretty hard to teach people to use a nicer fdatasync(), when the majority of the cost now becomes flushing the cache of that 1TB drive you happen to have 8 partitions on. Good luck with that. -- Jens Axboe --
And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE is per device (not file system). When you issue an fsync() on a disk with multiple partitions, you will flush the data for all of its partitions from the write cache.... ric --
Exactly, that's what my (vague) 8 partition reference was for :-) A range flush would be so much more palatable. -- Jens Axboe --
Tangential question, but am I right in thinking that BIO_RW_BARRIER similarly bars across all partitions, whereas its WRITE_BARRIER and DISCARD_BARRIER users would actually prefer it to apply to just one? Hugh --
All the barriers refer to just that range which the barrier itself references. The problem with the full device flushes is implementation on the hardware side, since we can't do small range flushes. So it's not as-designed, but rather the best we can do... -- Jens Axboe --
Ah, thank you: then I had a fundamental misunderstanding of them, and need to go away and work that out some more. Though I didn't read it before asking, doesn't the I/O Barriers section Right, that part of it I did get. Hugh --
I'm sensing a miscommunication here... The ordering constraint is across devices, at least that is how it is implemented. For file system barriers (like BIO_RW_BARRIER), it could be per-partition instead. Doing so would involve some changes at the block layer side, not necessarily trivial. So I think you were asking about ordering, I was answering about the write guarantee :-) -- Jens Axboe --
Ah, thank you again, perhaps I did understand after all. So, directing a barrier (WRITE_BARRIER or DISCARD_BARRIER) to a range of sectors in one partition interposes a barrier into the queue of I/O across (all partitions of) that whole device. I think that's not how filesystems really want barriers to behave, and might tend to discourage us from using barriers more freely. But I have zero appreciation of whether it's a significant issue worth non-trivial change - just wanted to get it out into the open. Hugh --
Per-partition definitely makes sense. The problem is that we do sorting on a per-device basis right now. But it's a good point, I'll try and take a look at how much work it would be to make it per-partition instead. It wont be trivial :-) -- Jens Axboe --
Ric Wheeler wrote:> And, as I am sure that you do know, to add insult to SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) pair. We could make use of that. And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could demonstrate clear benefit. Jeff --
What do you mean by well supported? The way the SCSI standard is written, a device can do a complete cache flush when a range flush is requested and still be fully standards compliant. There's no easy way to tell if it does a complete cache flush every time other than by taking the firmware apart (or asking the manufacturer). James --
That's the fear of range flushes, if it was added to t13 as well. Unless that Other OS uses range flushes, most firmware writers would most likely implement any range as 0...-1 and it wouldn't help us at all. In fact it would make things worse, as we would have done extra work to actually find these ranges, unless you went cheap and said 'just flush this partition'. -- Jens Axboe --
Quite true, though wondering aloud... How difficult would it be to pass the "lower-bound" LBA to SYNCHRONIZE CACHE, where "lower bound" is defined as the lowest sector in the range of sectors to be flushed? That seems like a reasonable optimization -- it gives the drive an easy way to skip sync'ing sectors lower than the lower-bound LBA, if it is capable. Otherwise, a standards-compliant firmware will behave as you describe, and do what our code currently expects today -- a full cache flush. This seems like a good way to speed up cache flush [on SCSI], while also perhaps experimenting with a more fine-grained way to pass down write barriers to the device. Not a high priority thing overall, but OTOH, consider the case of placing your journal at the end of the disk. You could then issue a cache flush with a non-zero starting offset: SYNCHRONIZE CACHE (max sectors - JOURNAL_SIZE, ~0) That should be trivial even for dumb disk firmwares to optimize. Jeff --
Actually, the implementation is designed to allow this. The standard says if the number of blocks is zero that means flush from the specified LBA to the end of the device. The sync cache we currently use has LBA 0 We could try it ... I'm still not sure how we'd tell the device is actually implementing it and not flushing the entire device. James --
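[Editor's sketch] For readers who want to see what "LBA plus number of blocks, with zero meaning to the end of the device" looks like on the wire, here is a minimal builder for a 10-byte SYNCHRONIZE CACHE command block. The helper name is hypothetical and the field layout is written from the SCSI block command set as the editor understands it (opcode 0x35, big-endian LBA in bytes 2-5, block count in bytes 7-8); it only fills a buffer and does not send anything to a device.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10 0x35  /* SCSI opcode */

/* Fill a 10-byte SYNCHRONIZE CACHE (10) CDB. lba is the first block
 * to flush; nblocks == 0 requests everything from lba to the end of
 * the device. */
void build_sync_cache10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = SYNCHRONIZE_CACHE_10;
    cdb[2] = (uint8_t)(lba >> 24);     /* LBA, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);  /* block count, big-endian */
    cdb[8] = (uint8_t)nblocks;
}
```

Calling it with lba 0 and nblocks 0 gives the whole-device flush described in the mail; a non-zero lba with nblocks 0 gives the "from here to the end" variant Jeff proposes for a journal at the end of the disk.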
Yeah, that feature of the spec was what got me thinking. "difficult" was referring more to the kernel side of things... if calculating the lowest LBA of a write barrier is difficult and/or CPU-consuming, the effort may not be worth it. But if we could stick a if (LBA < barrier-lower-bound) barrier-lower-bound = LBA somewhere, then pass that to SYNCHRONIZE CACHE, it could be a cheap way to increase sync-cache speed. It seems extremely unlikely that sync-cache speed would _decrease_: for Is that knowledge necessary? Assuming the lower-bound is super-cheap to calculate, then the two most likely outcomes are: sync-cache speed remains the same, or sync-cache speed increases. If the calculation of lower-bound is costly, I could see the need for that knowledge -- but if the cost is too high, the entire effort is likely to be scuttled, rather than worrying about detecting flush-everything firmwares. Jeff --
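[Editor's sketch] The one-line "if (LBA < barrier-lower-bound)" idea can be spelled out as a tiny tracker. This is a userspace illustration with hypothetical names, not block layer code, and it ignores locking entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Tracks the lowest LBA written since the last cache flush. */
struct flush_bound {
    uint64_t lowest;  /* UINT64_MAX means nothing has been recorded */
};

void flush_bound_reset(struct flush_bound *fb)
{
    fb->lowest = UINT64_MAX;
}

/* Record one write; this is the cheap comparison from the mail. */
void flush_bound_note_write(struct flush_bound *fb, uint64_t lba)
{
    if (lba < fb->lowest)
        fb->lowest = lba;
}

/* Return the starting LBA to pass to the next flush and reset the
 * tracker. With nothing recorded this returns 0, i.e. a full flush,
 * which is always the safe answer. */
uint64_t flush_bound_take(struct flush_bound *fb)
{
    uint64_t start = fb->lowest == UINT64_MAX ? 0 : fb->lowest;

    flush_bound_reset(fb);
    return start;
}
```

The per-write cost is a single compare and store, which is the "super-cheap to calculate" case Jeff is assuming; whether the firmware does anything useful with the resulting bound is the part nobody can verify.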
It's not impossible, though ... since the drive fw processor is probably pretty slow, but yes, it should hopefully be as fast or faster than full Yes, agreed ... we might as well tell the FW if it's cheap to know I really think, though, it's time to look again at how we implement barriers. Even properly implemented range flushing (if we can do it) is only decreasing the amount of overhead in a flush barrier. If we could make the filesystems tolerant or at least aware that there might be very rare periods during operation when barriers get violated (during error processing or queue full handling) we could look again at implementing barriers via ordered tags. James --
One more example of flexible, fine grain flush (though quite far out) are T10 OSDs with which you can flush a byte range of a single object (or collection, partition, or the whole device LUN) --
That's a strawman argument: The choice is not between "good enough fsync" and full use of barriers / write-through caching, at all. It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be issued, without adding full barrier support to a filesystem. It is likely doable to avoid touching per-filesystem code at all, if we issue the flush from a generic fsync(2) code path in the kernel. Thus, you have a "third way": fsync(2) gives the guarantee it is supposed to, but you do not take the full performance hit of barriers-all-the-time. Remember, fsync(2) means that the user _expects_ a performance hit. And they took the extra step to call fsync(2) because they want a guarantee, not a lie. Jeff --
We could easily do that. It would even work for most cases. The problematic ones are where filesystems do their own disk management, but I guess those people can do their own fsync() management too. Within reason, though. OS X, for example, doesn't do the disk barrier. It requires you to do a separate FULL_FSYNC (or something similar) ioctl to get that. Apparently exactly because users don't expect quite _that_ big of a performance hit. (Or maybe just because it was easier to do that way. Never attribute to malice what can be sufficiently explained by stupidity). Linus --
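[Editor's sketch] On OS X the separate request Linus refers to is, as far as the editor knows, the F_FULLFSYNC command to fcntl(2). A portable wrapper could be sketched as below; full_sync is a hypothetical name, and on systems without F_FULLFSYNC it simply degrades to fsync().

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Ask for the file's data to reach stable media. Where F_FULLFSYNC
 * exists it is tried first; otherwise, or if the filesystem rejects
 * it, fall back to plain fsync(). Returns 0 on success, -1 on error. */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not supported on this file: fall through to fsync(). */
#endif
    return fsync(fd);
}
```

This keeps the expensive request explicit at the call site, which is the trade-off being described: callers that want the drive cache flushed say so, and everyone else pays only for an ordinary fsync().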
One concern with doing this above the file system is that you are not in the context of a transaction so you have no clean promises about what is on disk and persistent when. Flushing the cache is primitive at best, but the way barriers work today is designed to give the transactions some pretty critical ordering semantics for journalling file systems at least. I don't see how you could use this approach to make a really robust, failure proof storage system, but it might appear to work most of the time for most people :-) --
You just do a write barrier after doing all the filesystem writing, and you return with the guarantee that all the writes the filesystem did are actually on disk. No gray areas. No questions. No "might appear to work". Sure, there might be other writes that got flushed _too_, but nobody cares. If you have a crash later on, that's always true - you don't get crashes at nice well-defined points. Linus --
In this case, you have not gained anything - same number of barrier This is pretty much how write barriers work today - you carry down other transactions (even for other partitions on the same disk) with you... ric --
Um. Except you gained the fact that the filesystem doesn't have to care and screw it up. And then we can know that it gets done, regardless of what odd things the low-level fs does. Linus --
This is a simple step that would cover a lot of cases... sync(2)
calls sync_blockdev(), and many filesystems do as well via the generic
filesystem helper file_fsync (fs/sync.c).
XFS code calls sync_blockdev() a "big hammer", so I hope my patch
follows with known practice.
Looking over every use of sync_blockdev(), its most frequent use is
through fsync(2), for the selected filesystems that use the generic
file_fsync helper.
Most callers of sync_blockdev() in the kernel do so infrequently,
when removing and invalidating volumes (MD) or storing the superblock
prior to release (put_super) in some filesystems.
Compile-tested only, of course :) But it should work :)
My main concern is some hidden area that calls sync_blockdev() with
a high-enough frequency that the performance hit is bad.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..7b9f74a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
{
int ret = 0;
- if (bdev)
- ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
- return ret;
+ if (!bdev)
+ return 0;
+
+ ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+ if (ret)
+ return ret;
+
+ return blkdev_issue_flush(bdev, NULL);
}
EXPORT_SYMBOL(sync_blockdev);
--
What about when you're running over a big raid device with battery-backed cache, and you trust the cache as much as the disks? Wouldn't this unconditional cache flush be painful there on any of the callers even if they're rare? (fs unmounts, freezes, etc? Or a fat filesystem on that device doing an fsync?) xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are not enabled, I think for that reason... (I'm assuming these raid devices still honor a cache flush request even if they're battery-backed? I dunno.) -Eric --
What exactly do you think sync_blockdev() does? :) It is used right before a volume goes away. If that's not a time to flush the cache, I dunno what is. The _whole purpose_ of sync_blockdev() is to push out the data to permanent storage. Look at the users -- unmount volume, journal close, etc. Things that are OK to occur after those points include: power off, device unplug, etc. A secondary purpose of sync_blockdev() is as a hack, for simple/ancient bdev-based filesystems that do not wish to bother with barriers and all Enabling barriers causes slowdowns far greater than that of simply causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH CACHE issuance for all in-kernel filesystem journalled/atomic transactions, in addition to whatever syscalls userspace is issuing. The number of FLUSH CACHES w/ barriers is orders of magnitude larger than the number of fsync/fdatasync calls. Jeff --
It used to push os cached data to the storage. Now it tells the storage to flush cache too (with your patch). This seems fine in general, although it's not a panacea for all the various data integrity issues Sure. But I was thinking about enterprise raids with battery backup which may last for days. But, ok, I wasn't thinking quite right about the unmount situations etc; even on enterprise raids like this, flushing things out on unmount makes sense in the case where you lose power post-unmount and can't restore power before the battery backup dies. I also wondered if a cache flush on one lun issues a cache flush for the entire controller, or just for that lun. Hopefully the latter, in which I understand all that. My point is that the above filesystems (xfs, reiserfs, ext4) skip the blkdev flush on fsync when barriers are explicitly disabled. They do this because if an admin disables barriers, they are trusting that the write cache is nonvolatile and will be able to destage fully even if external power is lost for some time. In that case you don't need a blkdev_issue_flush on fsync either (or are at least willing to live with the diminished risk, thanks to the battery backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the barriers off setting). With this change to the simple generic fsync path, you can't turn it off for those filesystems that use it for fsync. But I suppose it's rare that anybody ever uses a filesystem which uses this generic sync method on any sort of interesting storage like I'm talking about, and it's not a big deal... (or maybe that interesting storage just ignores cache flushes anyway, I dunno). My main concerns were that these extra cache flushes for fsync aren't tunable, and that flushes on one lun might affect other luns. I guess I've talked myself out of those concerns in a couple different ways now. ;) -Eric --
Enterprise raid systems don't have this issue. They all have sufficient battery power to safely destage the volatile cache to persistent storage on power outage (i.e., they keep enough drives spinning and so on to empty the cache). Hardware RAID cards that you have in some servers (not external RAID Probably just for that LUN - LUN's are usually independent in almost all ways. Firmware of course could do anything, I would assume that most I do enthusiastically agree that we should not be doing barriers and the blkdev flush for file systems that do barriers correctly. ric --
I think that Jeff's patch misses the whole need to protect transactions, including meta data, in a precise way. Useful for things like unmount, not to give us strong protection for transactions or for fsync(). This patch will be adding overhead here - you will still need flushing at the transaction commit layer of the specific file systems to get any reliable transactions. Having looked at the timing of barrier flushes on slow s-ata drives with an analyser a few years back, the first one is expensive (as you would expect with a large drive cache of 16 or 32 MB) and the second was nearly free. Moving the expensive flush to this layer guts the transaction building blocks and costs the same.... ric --
What do you think sync_blockdev() does? What is its purpose? Twofold: (1) guarantee all user data is flushed out before a major event (unmount, journal close, unplug, poweroff, explosion, ...) (2) As a sledgehammer hack for simple or legacy filesystems that do not wish or need the complexity of transactional protection. sync_blockdev() is intentionally used in lieu of complexity for the following filesystems: HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, qnx4. My patch adds needed guarantees, only for the above filesystems, where sync_blockdev() is used as fsync(2) only in simple or legacy filesystems that do not want a transaction commit layer! Read the patch :) Jeff --
To be specific, I was referring to fsync(2) guarantees being added to HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, and qnx4. Other filesystems, besides those in the list, gain the flush-on-unmount action (a rare but useful addition) with my patch. Jeff --
Sorry for misunderstanding the scope of this before - this is certainly a net win for the file systems that don't have proper barrier support baked in already. Thanks! Ric --
It writes out data in the block device inode. Which does not include any user data, and might not contain anything at all for filesystems that have their own address space for metadata. It's definitely the wrong place for this kind of hack. --
I think most don't, as they realize it's a data integrity thing and that doesn't apply if you don't lose data on powerloss. But, I'm sure there are also "dumb" ones that DO flush the cache. In which case the flush is utterly hopeless and should not be done. -- Jens Axboe --
file_fsync probably needs to pass down more information so you can make this a mount option. It's going to depend on the application whether the flush is good bad or indifferent. --
file_fsync is only used by ancient legacy filesystems, who specifically don't want to bother with anything more complicated: HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, qnx4. IOW they _already_ consciously implement fsync(2) as "flush ENTIRE blockdev". I think it is worth it to simply wait and see if mount options are even wanted. Jeff --
Flush storage dev writeback cache, for each call to sync_blockdev().
sync_blockdev() is used primarily for two purposes:
1) To flush all data to permanent storage prior to a major event,
such as: unmount, journal close, unplug, poweroff, explosion, ...
2) As a "sledgehammer hack" to provide fsync(2) via file_fsync to
filesystems, generally simple or legacy filesystems, such as HFS,
HFS+, ADFS, AFFS, bfs, UFS, NTFS, qnx4 and FAT.
This change guarantees that the underlying storage device will have
flushed any pending data in its writeback cache to permanent media,
before it returns (...if underlying storage dev supports flushes).
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
---
Changes since last patch:
- do not return error, if storage dev does not support flushes at all
(-EOPNOTSUPP)
diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..e04d7a4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -173,8 +173,17 @@ int sync_blockdev(struct block_device *bdev)
{
int ret = 0;
- if (bdev)
- ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+ if (!bdev)
+ return 0;
+
+ ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+ if (ret)
+ return ret;
+
+ ret = blkdev_issue_flush(bdev, NULL);
+ if (ret == -EOPNOTSUPP)
+ ret = 0;
+
return ret;
}
EXPORT_SYMBOL(sync_blockdev);
--
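[Editor's sketch] The error handling in the revised sync_blockdev() patch above can be modelled in userspace to make the precedence explicit. Everything here is hypothetical scaffolding: the function pointers stand in for filemap_write_and_wait() and blkdev_issue_flush(), and no kernel API is being reproduced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*step_fn)(void *dev);

/* Model of the patched control flow: write back dirty pages, then
 * flush the device cache. A writeback error is returned as-is and
 * skips the flush; -EOPNOTSUPP from the flush counts as success. */
int sync_then_flush(void *dev, step_fn writeback, step_fn flush)
{
    int ret;

    if (!dev)
        return 0;

    ret = writeback(dev);
    if (ret)
        return ret;

    ret = flush(dev);
    if (ret == -EOPNOTSUPP)
        ret = 0;

    return ret;
}

/* Stubs standing in for the two kernel calls. */
int stub_ok(void *dev) { (void)dev; return 0; }
int stub_eio(void *dev) { (void)dev; return -EIO; }
int stub_unsupported(void *dev) { (void)dev; return -EOPNOTSUPP; }
```

A device without flush support therefore behaves exactly as before the patch, while a real flush failure such as -EIO is now reported to the caller.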
Jeff, FYI, I tried your patch; it causes the lvm process called out of the initramfs from an Ubuntu 8.10 system to blow up while trying to set up the root filesystem. The stack trace was:
generic_make_request+0x2a3/0x2e6
trace_hardirqs_on_caller+0x111/0x135
mempool_alloc_slab+0xe/0x10
mempool_alloc+0x42/0xe0
submit_bio+0xad/0xb5
bio_alloc_bioset+0x21/0xfc
blkdev_issue_flush+0x7f/0xfc
sync_blockdev+0x2a/0x36
__blkdev_put+0x44/0x131
blkdev_put+0xa/0xc
blkdev_close+0x2e/0x32
__fput+0xcf/0x15f
fput+0x19/0x1b
filp_close+0x51/0x5b
sys_close+0x73/0xad
- Ted --
hmmm, I wonder if DM/LVM doesn't like blkdev_issue_flush, or it's too early, or what. I'll toss Ubuntu onto a VM and check it out... Jeff --
I forgot to mention. The failure was the EIP was NULL; so it looks like we called a null function pointer. The only function pointer dereference I can find is q->make_request_fn() in line 1460 of blk-core.c, in __generic_make_request. But that doesn't seem to make any sense.... Anyway, maybe you can figure out what's going on. The problem disappeared as soon as I popped off this patch, though, so it was pretty clearly the culprit. - Ted --
Simple and legacy blkdev-based filesystems such as HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, and qnx4 all use file_fsync as their fsync(2) VFS helper implementation. Add a storage dev cache flush, to actually provide the guarantees that are promised with fsync(2).
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
---
Out of 18 other places that call sync_blockdev(), only 3-4 are in filesystems that arguably do not need or want a blkdev flush. This patch below clearly only addresses 1 out of ~15 callsites that really do want metadata, data, and everything in between flushed to disk at the sync_blockdev() callsite. It should be noted that other calls are NOT used in fsync(2), but rather to guarantee written data prior to major events such as unmount, journal close, MD consistency check, etc.
diff --git a/fs/sync.c b/fs/sync.c
index a16d53e..24bb2f4 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -5,6 +5,7 @@
#include <linux/kernel.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/blkdev.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/writeback.h>
@@ -72,6 +73,13 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
err = sync_blockdev(sb->s_bdev);
if (!ret)
ret = err;
+
+ err = blkdev_issue_flush(sb->s_bdev, NULL);
+ if (err == -EOPNOTSUPP)
+ err = 0;
+ if (!ret)
+ ret = err;
+
return ret;
}
--
Looks good except that we still need a tuning knob for it. Preferably one that works for these filesystems and all the existing barrier using ones. --
Christoph, I reworked my previous fsync() patches so that what was a mount option to trigger a storage device writeback cache flush becomes a sysfs knob. I still need to add support for automatic detection of underlying device's flushing capabilities but first I would like to know if you agree with the general approach. I'll be replying to this email with the new patches. - Fernando --
This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
--- linux-2.6.29-orig/fs/buffer.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/buffer.c 2009-03-28 20:43:51.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
put_bh(bh);
}
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+ int ret = 0;
+
+ ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+ return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
/*
* Write out and wait upon all the dirty data associated with a block
* device via its mapping. Does not take the superblock lock.
diff -urNp linux-2.6.29-orig/include/linux/buffer_head.h linux-2.6.29/include/linux/buffer_head.h
--- linux-2.6.29-orig/include/linux/buffer_head.h 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/buffer_head.h 2009-03-28 20:43:51.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
void buffer_init(void);
--
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/ext3/fsync.c linux-2.6.29/fs/ext3/fsync.c
--- linux-2.6.29-orig/fs/ext3/fsync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext3/fsync.c 2009-03-28 20:45:40.000000000 +0900
@@ -45,6 +45,8 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
*/
if (ext3_should_journal_data(inode)) {
ret = ext3_force_commit(inode->i_sb);
- goto out;
+ if (!(journal->j_flags & JFS_BARRIER))
+ block_flush_device(inode->i_sb);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ block_flush_device(inode->i_sb);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && !(journal->j_flags & JFS_BARRIER))
+ block_flush_device(inode->i_sb);
}
-out:
+
return ret;
}
--
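The decision the ext3 patch makes about when an explicit flush is needed can be restated as a small predicate. This is a sketch under stated assumptions, not the kernel code: the MODEL_DIRTY_* constants are hypothetical stand-ins for the I_DIRTY_* inode state bits, journal_barrier stands for the JFS_BARRIER test on the journal flags, and the journal NULL check from the patch is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's I_DIRTY_* inode state bits. */
enum {
	MODEL_DIRTY_SYNC     = 1 << 0,
	MODEL_DIRTY_DATASYNC = 1 << 1,
	MODEL_DIRTY_PAGES    = 1 << 2,
};

/*
 * Returns true when the patched ext3_sync_file() would call
 * block_flush_device() itself instead of relying on a journal barrier.
 */
static bool needs_explicit_flush(bool journal_data, bool journal_barrier,
				 bool datasync, unsigned int i_state)
{
	/* data=journal: a commit is forced; flush only if it emitted no barrier. */
	if (journal_data)
		return !journal_barrier;

	/* fdatasync() with nothing datasync-relevant dirty: no commit is
	 * started, so written-back pages need a flush of their own. */
	if (datasync && !(i_state & MODEL_DIRTY_DATASYNC))
		return (i_state & MODEL_DIRTY_PAGES) != 0;

	/* Dirty inode: it is written out; flush unless barriers are enabled. */
	if (i_state & (MODEL_DIRTY_SYNC | MODEL_DIRTY_DATASYNC))
		return !journal_barrier;

	return false;
}
```

Note that the patch samples inode->i_state once on entry, which is why the sketch takes i_state as a plain value.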
Your patches do not seem to propagate the issue-flush error code, even when it is easily available. Jeff --
Oops... you are right. I will fix that. Thanks! - Fernando --
I reflected your comments in the new version of the patch set. While at it, I also modified the respective reiserfs and xfs fsync functions so that, at least to some extent, they propagate the issue-flush error code. I'll be replying to this email with the new patches. Thanks, Fernando --
This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
--- linux-2.6.29-orig/fs/buffer.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/buffer.c 2009-03-30 15:27:04.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
put_bh(bh);
}
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct block_device *bdev)
+{
+ int ret = 0;
+
+ ret = blkdev_issue_flush(bdev, NULL);
+
+ return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
/*
* Write out and wait upon all the dirty data associated with a block
* device via its mapping. Does not take the superblock lock.
diff -urNp linux-2.6.29-orig/include/linux/buffer_head.h linux-2.6.29/include/linux/buffer_head.h
--- linux-2.6.29-orig/include/linux/buffer_head.h 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/buffer_head.h 2009-03-30 15:27:26.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
+int block_flush_device(struct block_device *bdev);
void buffer_init(void);
--
The problem lies in using NULL for the error_sector argument, which shows a subtle deficiency of the current implementation/usage of barriers based on write cache flushing. I intend to document the issue by adding a FIXME to the current users of blkdev_issue_flush() so the problem is at least known and not forgotten (fixing it would require some work from both block and fs sides and unfortunately there wasn't even a willingness to discuss possible solutions a few years back when the original code was added). Thanks, Bart --
The reason I used a wrapper is that I did not like the semantics provided by blkdev_issue_flush(). On the one hand, I did not want to pass -EOPNOTSUPP to filesystems (it is not an error filesystems should care about). On the other hand it is weird that some filesystems use blkdev_issue_flush() when they want to emit a barrier. blkdev_issue_flush() happens to be implemented as an empty (block layer) barrier, but I think that is an implementation detail filesystems should not need to know about. Indeed I am working on a patch that implements blkdev_issue_empty_barrier(), so that we can optimize fsync() flushes and filesystem-originated barriers independently in the block layer. Judging from your comments below, it seems we are on the same page regarding this issue. Again, thank you for your feedback! --
Btw, why do we do that silly EOPNOTSUPP at all? If the device doesn't support flushing, we should - set a flag in the device saying so, and not ever try to flush again on that device (who knows how long it took for the device to say "I can't do this"? We don't want to keep on doing it) - return "done". There's nothing sane the caller can do with the error code anyway, it just has to assume that the device basically doesn't reorder writes. So wouldn't it be better to just fix blkdev_issue_flush() to not do those crazy error codes? [ The same thing probably goes for those ENXIO errors, btw. If we don't have a bd_disk or a queue, why would the caller care about it? ] Jens? Linus --
The problem is that we may not know upfront, so it sort-of has to be this trial approach where the first barrier issued will notice and fail with -EOPNOTSUPP. Sure, we could cache this value, but it's pretty pointless since the filesystem will stop sending barriers in this case. As it also modifies fs behaviour, we need to pass the info back. For blkdev_issue_flush() it may not be very interesting, since there's not much we can do about that. Just seems like very bad style to NOT return an error in such a case. You can assume that ordering is fine, but it definitely won't be in all cases (eg devices that have write back caching on by default and don't support flush). So the nice thing to do there is actually tell the caller about it. So the same error is reused Right, that is pretty pointless. -- Jens Axboe --
Well, absolutely. Except I don't think you should use ENOTSUPP, you should just set a bit in the "struct request_queue", and then return 0. IOW, something like this
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
if (!q)
return -ENXIO;
+ if (is_queue_noflush(q))
+ return 0;
+
bio = bio_alloc(GFP_KERNEL, 0);
if (!bio)
return -ENOMEM;
@@ -339,7 +342,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
ret = 0;
if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
+ set_queue_noflush(q);
else if (!bio_flagged(bio, BIO_UPTODATE))
ret = -EIO;
which just returns 0 if we don't support flushing on that queue. (Obviously incomplete patch, which is why I also intentionally
Well no, it won't. Or rather, it will have to have such a stupid
So? The thing is, you can't _do_ anything about it. So what's the point in returning an error? The caller cannot possibly care - because there is nothing the caller can really do. Sure, the device may or may not re-order things, but since the caller can't know, and can't really do a thing about it _anyway_, you're just better off not even confusing anybody.
Linus
--
Sorry, I just don't see much point to doing it this way instead. So now the fs will have to check a queue bit after it has issued the flush, how Not for blkdev_issue_flush(), all they can do is report about the device. And even that would be a vague "Your data may or may not be I'd call that a pretty reckless approach to data integrity, honestly. You HAVE to issue an error in this case. Then the user/admin can at least check up on the device stack in question, and determine whether this is an issue or not. That goes for both blkdev_issue_flush() and the actual barrier write. And perhaps the cached value is then of some use, since you then know when to warn (bit not already set) and you can keep the warning in blkdev_issue_flush() instead of putting it in every call site. -- Jens Axboe --
AFAICS, the aim is simply to return zero rather than EOPNOTSUPP, for the not-supported case, rather than burdening all callers with such checks. Which is quite reasonable for Fernando's patch -- the direct call fsync case. But that leaves open the possibility that some people really do want the Indeed -- if the drive tells us it failed the cache flush, it seems self-evident that we should be passing that failure back to userspace where possible. And as the patches show, it is definitely possible to return a FLUSH CACHE error back to an fsync(2) caller [though, yes, I certainly recognize fsync is not the only generator of these requests]. Jeff --
As far as I know, reiserfs is the only one actively using it to choose different code. It moves a single wait_on_buffer when barriers are on, which I took out once to simplify the code. Ric saw it in some benchmark numbers and I put it back in. Given that it was a long time ago, I don't have a problem with changing it to work like all the other filesystems. -chris --
When it was a win on reiserfs back then maybe it would be a win on ext4 or xfs today too? -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It could be, but you get into some larger changes. The theory behind the code was that writeback cache is on, so wait_on_buffer isn't really going to give you a worthwhile error return anyway. Might as well do the wait_on_buffer some time later and fix up the commit blocks if it didn't work out. We're still arguing about barriers being a good idea all these years later, and the drives are better at them than they used to be. So, I'd rather see less complex code in the filesystems than more. -chris --
That's not what EOPNOTSUPP means! EOPNOTSUPP doesn't mean "the cache flush failed". It just means "I don't support cache flushing". No failure anywhere. See? Maybe the operation isn't supported because there are no caches? Who the hell knows? Nobody. The layer just said "I don't support this". For example, maybe it just cannot translate the "flush cache" op into its own command set, because the thing doesn't _do_ anything like that. For a concrete example, look at the "loop" driver. It literally returns EOPNOTSUPP if the filesystem doesn't have a "fsync()" thing. Ok, so it can't do serialization - does that mean that the caller should fail entirely? No. But it means that the caller cannot serialize, so now the caller has two choices: - not work at all - ignore it, and assume that a device without serialization is serialized enough as-is. Those are the two only choices. The caller knows that it can't flush. What would you _suggest_ it do? Just stop, and do nothing at all? I really don't think that's a useful or valid approach. And notice - at NO TIME did anything actually fail. It's just that the particular protocol didn't support that empty flush op. (Also note that block/blk-barrier.c really does an empty barrier command. If we were to be talking about a real IO with a real payload and the "barrier" bit set, that would be different. But we really aren't.) Linus --
Hence my statement of the aim is simply to return zero rather than EOPNOTSUPP [...] which is quite reasonable
I think we are all getting a bit confused whether we are discussing (a) EOPNOTSUPP return value, or (b) _all possible_ blkdev_issue_flush() error return values. As I read it, you are talking about (a) and Jens responded to (b). But maybe I am wrong. So I have these observations:
1) fsync(2) should not return EOPNOTSUPP, if the block device does not support cache flushing. This seems to agree with Linus's patch.
2) A Linux filesystem MIGHT care about EOPNOTSUPP return value, as that return value does provide information about the future value of cache flushes.
3) However, at present NONE of the blkdev_issue_flush() callers use EOPNOTSUPP in any way. In fact, none of the current callers check the return value at all.
4) Furthermore, handling lack of cache flush support at the block layer, rather than per-filesystem, makes more sense to me. But I am biased towards storage, so what do I know :)
5) Based on observation #3, the current kernel should be changed to return USEFUL blkdev_issue_flush() return values back to userspace. Fernando's patches head in this direction, as does my most recent file_fsync patch.
Jeff
--
No. Now the fs SHOULD NEVER CHECK AT ALL. Either it did the ordering, or the FS cannot do anything about it. That's the point. EOPNOTSUPP is not a useful error message. You can't It has _nothing_ to do with 'reckless'. It has everything to do with 'you No. Returning an error just means that now the box is useless. Nobody can do anything about it. Not the admin, not the driver writer, not anybody. Ok, so a device didn't support flushing. We don't know why, we don't know if it needed it, we simply don't know. There's nothing to do. But returning an error to user mode is unacceptable, because that will result in everything just -failing-. And total failure is much worse than "we don't know whether the thing serialized". Linus --
My point is that some file systems may or may not have different paths
or optimizations depending on whether barriers are enabled and working
or not. Apparently that's just reiserfs and Chris says we can remove it,
so it is probably a moot point.
And that is for the barrier write btw, NOT blkdev_issue_flush(). For the
latter it obviously doesn't matter if you return -EOPNOTSUPP or not, as
What, that's nonsense. The admin can certainly check whether it's an
issue or not, and he should. That's different from handling it in the
kernel or in the application, but you have to inform about it. I
That is not what I meant with returning it to the user. My point was
that you have to notify that the error occurred, which means putting a
printk() (or whatever) in that blkdev_issue_flush(). I guess most of the
miscommunication stems from this, I don't want -EOPNOTSUPP returned to
user space, but I want some notification that tells the admin that this
device doesn't support flushes. And if the file systems use the same
path for barrier or no barriers, then it's perfectly fine to have them
share the very same "flush doesn't work bit" and the same single warning
that we don't know whether ordering is preserved on this device or not.
IOW, what I'm advocating is just a simple:
@@
if (err == -EOPNOTSUPP) {
+ if (!is_queue_noflush(q)) {
+ warn();
set_queue_noflush(q);
}
}
change to the pseudo-patch you posted.
--
Jens Axboe
--
Well, if that's the issue, then just add a printk to that
'blkdev_issue_flush()', and now you have that informational message in
_one_ place, instead of each filesystem having to do it over and
If it's just informational, then again - why should the filesystem care?
Returning an error to the caller is never the right thing to do. The
caller can't do anything sane about it.
If you argue that the admin wants to know, then sure, make that
if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
+ set_queue_noflush(q);
"set_queue_noflush()" function print a warning message when it sets the
bit.
I cannot fathom why you can _possibly_ think that this is something that
can and must be done something about in the caller. When the caller
obviously has no real option except to ignore the error _anyway_.
That was always my point. Returning an error is INSANE, because there is no
valid thing that the caller can possibly do.
If you want it logged, fine. But THAT DOES NOT CHANGE ANYTHING. It would
still be wrong to return the error, since the caller _still_ can't do
anything about it.
Linus
--
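The behaviour both sides converge on above (remember that the queue cannot flush, log once, return 0, and still pass real failures back) can be modelled in userspace. This is only a sketch under stated assumptions: struct model_queue, device_flush() and model_issue_flush() are hypothetical names, and the warning text is made up; the real patch operates on struct request_queue inside blkdev_issue_flush().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a request queue. */
struct model_queue {
	bool supports_flush; /* what the simulated device can do */
	bool flush_fails;    /* simulate a genuine flush failure */
	bool noflush;        /* set once the device has rejected a flush */
	int flushes_sent;    /* flush commands actually issued */
	int warnings;        /* warnings logged so far */
};

/* Simulated driver: rejects, fails, or completes the flush. */
static int device_flush(struct model_queue *q)
{
	q->flushes_sent++;
	if (!q->supports_flush)
		return -EOPNOTSUPP;
	return q->flush_fails ? -EIO : 0;
}

/* Never returns -EOPNOTSUPP; warns only on the first rejection. */
static int model_issue_flush(struct model_queue *q)
{
	int err;

	if (q->noflush)
		return 0; /* known not to support flushing: do not retry */

	err = device_flush(q);
	if (err == -EOPNOTSUPP) {
		q->noflush = true;
		q->warnings++;
		fprintf(stderr, "flush not supported, write ordering unknown\n");
		return 0;
	}
	return err; /* real failures such as -EIO are still reported */
}
```

Keeping the warning next to the code that sets the bit means no call site has to test for -EOPNOTSUPP, which is the point Linus makes about having the message in one place.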
One thing the caller could do is to disable the write cache on the device. A second would be to stop using the transactions - skip the journal, just go back to ext2 mode or BSD like soft updates. Basically, it lets the file system know that its data integrity building blocks are not really there and allows it (if it cares) to try and minimize the chance of data loss. Ric --
First off, that's not the callers job. If the sysadmin enabled it, some random filesystem shouldn't disable it. Secondly, this whole insane belief that "write cache" has anything to do f*ck me, what's so hard with understanding that EOPNOTSUPP doesn't mean "no ordering". It means what it says - the op isn't supported. For all you know, ALL WRITES MAY BE TOTALLY ORDERED, but perhaps there is no way to make a _single_ write totally atomic (ie the "set barrier on a command that actually does IO"). Besides, why the hell do you think the filesystem (again) should do something that the admin didn't ask it to do. If the admin wants the thing to fall back to ext2, then he can ask to Your whole idiotic "as a filesystem designer I know better than everybody else" model where the filesystem is in total control is total crap. The fact is, it's not the filesystems job to make that decision. If the admin wants to have write caching enabled, the filesystem should get the hell out of the way. What about laptop mode? Do you expect your filesystem to always decide that "ok, the user wanted to spin down disks, but I know better"? What about people who have UPS's and don't worry about that part? They want write caching on the disk, and simply don't want to sync? They still worry about OS crashing, since they run random -git development kernels? In short, stop this IDIOTIC notion that you know better. YOU DO NOT KNOW BETTER. The filesystem DOES NOT KNOW BETTER. It should damn well not do those kinds of decisions that are simply not filesystem decisions to make! Linus --
Completely agree with that, that is why I want the error logged instead of returned. The write cache MAY be involved, but it may also be something entirely different. The cache may be perfectly fine and ordered but just not supporting flush cache because it doesn't need to (it has battery backing). The important bit is informing the admin of the situation, then it's up to the admin to look into the storage stack and determine if this is a real problem or not. There's nothing the kernel can do about it. -- Jens Axboe --
First I have heard anyone (other than you above) claim that "unable to flush" is tied to the write cache on disks. What I was responding to is your objection to exposing the proper error codes to the file system layer instead of hiding them in the block layer. True, the write cache example I used is pretty contrived, but it would be a valid strategy if your sacred sys admin had mounted with the "I do care about my data" mount option and left it up to the file system Now you are just being silly. The drive and the write cache - without barriers or similar tagged operations - will almost certainly reorder all of the IO's internally. No one designs code based on the "it might be ordered" basis. The way the barriers work does absolutely give you full ordering. All previous IO's are sent to the drive and flushed (barrier flush 1), the commit record is sent down followed by a second barrier flush. There is This is not me being snotty - this is really very basic to how transactions work. You need ordering and file systems (or data bases) that use transactions must have these building blocks to do the job right. Your argument seems to be, "Well, it will mostly be ordered anyway, as long as you don't lose power" which I simply don't agree is a good assumption. The logic conclusion of that argument is that we really should not use transactions at all - basically remove the journal from ext3/4, xfs, btrfs, etc. That is a point of view - drives are crap, journalling does Laptop mode is pretty much a red herring here. Mount it without barriers enabled - your drive will still spin up occasionally, but as you argued above, that existing options allows you the user/admin to make that If you run with a UPS or have a battery backed write cache, you should run without barriers since both of those mechanisms give you the required promise of ordering even in face of power outage. Again, mount with barriers disabled (or rely on the storage target to ignore ...
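The ordering Ric describes (everything before the commit record is made durable, then the commit record itself is made durable) has a userspace analogue built from write(2) and fsync(2). The following is a minimal sketch, not how jbd implements it: commit_transaction() and write_all() are hypothetical helpers, the log is a plain append-only file, and whether each fsync() really reaches stable media depends on the filesystem and device, which is the subject of this thread.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, continuing after short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Append payload, sync, append the commit record, sync again.
 * The commit record is only written after the payload has been
 * synced, so a reader that finds it can assume the payload precedes it.
 * Returns 0 on success, -1 on any error.
 */
static int commit_transaction(const char *path, const char *payload,
			      const char *commit_rec)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	int ret = -1;

	if (fd < 0)
		return -1;

	if (write_all(fd, payload, strlen(payload)) == 0 &&
	    fsync(fd) == 0 &&                         /* "barrier flush 1" */
	    write_all(fd, commit_rec, strlen(commit_rec)) == 0 &&
	    fsync(fd) == 0)                           /* second flush */
		ret = 0;

	if (close(fd) != 0)
		ret = -1;
	return ret;
}
```

The design choice to check every step and report the first failure mirrors the argument elsewhere in the thread that a real flush failure should reach the caller.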
You do realize that the "drive" may not be a drive at all? But apparently you don't. You really seem to see just your own case, and have blinders on for everything else. That "drive" may be some virtualized device. It may be some super-fancy memory mapped and largely undocumented random flash thing. It might be a network block device, it may be somebody's IO trace dummy layer, it may be anything at all. Your filesystem doesn't know. It damn well should not even _try_ to know, because it isn't the low-level driver. The low-level driver - which you don't have a friggin clue about - may say that it doesn't support barrier IO for any random reason that has absolutely _nothing_ to do with any write caches or anything else. Maybe the device has the same ordering semantics as an Intel CPU has: writes are always seen in order on the disk, and reads are always speculated but will snoop in write buffers, and there is no way to not do that. See? EOPNOTSUPP means just that - it means that the driver doesn't support the notion of ordered IO. But that does not necessarily mean that the writes aren't always in order. It may well just mean that the drive is a thin shimmy layer over something else (for example, just a user level pipe), and the driver has NO IDEA what the end result is, and the protocol is simplistic and is just 'read' and 'write' and absolutely nothing else. But you seem to NOT UNDERSTAND THIS. I'm not interested in your inane drivel. Let's just say that your lack of understanding just means that your input is irrelevant, and leave it at that. Ok? Until you can see the bigger picture, just don't bother. Linus --
The part that we seem to be skipping over in talking about EOPNOTSUPP is not what do we do when a barrier isn't supported (print a warning and move on), it's what do we do when a barrier works. I very much agree that EOPNOTSUPP tells us almost nothing. The idea behind the original implementation was that when barriers did work, we could make some assumptions about how IO would be ordered around the barrier, and those assumptions would let us optimize things for the lying cheating cache enabled storage that we all know and love. It turns out 6 years later that very few people are interested in those optimizations, and we're probably better off skipping them in favor of reducing the complexity of the code involved. Jens has a little burial site all prepped for pdflush in his yard, dumping EOPNOTSUPP in there too wouldn't be a bad thing. -chris --
Of course I realize that. Most of the SSD devices, including ones that don't speak normal S-ATA/SCSI/etc, they have a write cache and will combine and re-order IO's. Some of them have non-volatile write caches and those don't need barriers (flush, fua, what ever) because of batteries, capacitors or other magic hardware people came up with. For the ones that do have a volatile write cache and can reorder IO's, transactions will still need the ordering primitives to survive a power failure reliably. If you don't need or want to pay the price of ordering, you can today easily disable this by mounting without barriers. As Mark pointed out, most S-ATA/SAS drives will flush the write cache when they see a bus reset so even without barriers, the cache will be preserved (or flushed) after a reboot or panic. Power outages are the problem If the low level device returns EOPNOTSUPP on a barrier op, that is fine. Running a transactional file system on that storage might or might not be a good idea, but at least we can log that and move on. I agree with Chris that what happens when the device does not support the primitives is not the core issue. The question is really what we do when you have a storage device in your box with a volatile write cache that does support flush or fua or similar. Using barriers & ordered transactions for these types of devices will give you a more reliable file system - less fsck time needed and better data integrity support for the (few?) applications that use fsync properly. Ric --
Ok. Then you are talking about a different case - not EOPNOTSUPP. [ Although it may be related in that maybe the admin can _force_ a EOPNOTSUPP thing for when he wants to disable any "write barrier implies flush" thing. IOW, we may end up with an _implementation_ detail where we overload a potential QUEUE_FLUSH_EOPNOTSUPP flag with two meanings - either "the driver told me a barrier isn't supported" or "the admin set that same flag by hand to disable barrier-related flush commands". But that's just an implementation detail, of course. We could use two Sure. And it still shouldn't be the filesystem that _requires_ use of it. The user (or low-level driver) may simply know better. The user may know that he trusts the disk more than anything else, and prefers to not actually emit the "FLUSH" command. Again, that's not something that the filesystem should know about, or care about. If the user trusts the disk subsystem and wants the performance, it's the users choice. Even the _driver_ may know better. Knowing the kinds of firmware bugs those drives have, it could even be a driver that simply black-lists certain disks as having known-broken FLUSH commands. We have _CPU's_ that corrupt memory on cache writeback ("wbinvl"), and those things are a lot more tested than most driver firmware is. Do you realize just how buggy some of those flash drives are? Some of them will literally (a) report the wrong size and (b) lock up if you try to read from the last sector. Oops. Do you really expect such crap to even bother to honor some flush command? Good luck with that. They're designed as a floppy replacement. Now, you can tell me that I shouldn't put a reliable filesystem on an el-cheapo flash drive and expect it to work, but I'm sorry, you're wrong. People _are_ supposed to be able to move their data around, and the filesystem shouldn't make judgement calls. If you want judgement calls, call your mom. Not your filesystem. For ...
So here's a test patch that attempts to just ignore such a failure to
flush the caches. It will still flag the bio as BIO_EOPNOTSUPP, but
that's merely maintaining the information in case the caller does want
to see if that barrier failed or not. It may not actually be useful, in
which case we can just kill that flag.
But it'll return 0 for a write, getting rid of hard retry logic in the
file systems. It'll also ensure that blkdev_issue_flush() does not see
the -EOPNOTSUPP and pass it back.
The first time we see such a failed barrier, we'll log a warning in
dmesg about the block device. Subsequent failed barriers with
-EOPNOTSUPP will not warn.
Now, there's a follow up to this. If the device doesn't support barriers
and the block layer fails them early, we should still do the ordering
inside the block layer. Then we will at least not reorder there, even if
the device may or may not order. I'll test this patch and provide a
follow up patch that does that as well, before asking for any of this to
be included. So that's a note to not apply this patch, it hasn't been
tested!
commit 78ab31910c8c7b8853c1fd4d78c5f4ce2aebb516
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Tue Mar 31 18:42:42 2009 +0200
barrier: Don't return -EOPNOTSUPP to the caller if the device does not support barriers
The caller cannot really do much about the situation anyway. Instead log
a warning if this is the first such failed barrier we see, so that the
admin can look into whether this poses a data integrity problem or not.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..8660146 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -338,9 +338,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
*error_sector = bio->bi_sector;
ret = 0;
- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
+ if ...
Updated version, the previous missed most of the buffer_eopnotsupp()
checking. So this one also gets rid of the file system retry logic.
Thanks to gfs2 Steve for pointing out that I missed gfs2, made me
realize that I missed a lot more as well.
block/blk-barrier.c | 8 ++------
block/blk-settings.c | 13 +++++++++++++
block/ioctl.c | 4 +---
fs/bio.c | 12 +++++++++++-
fs/btrfs/disk-io.c | 5 -----
fs/btrfs/extent_io.c | 9 ++-------
fs/buffer.c | 23 ++---------------------
fs/fat/misc.c | 5 +----
fs/gfs2/log.c | 18 ++++++------------
fs/jbd2/commit.c | 22 ----------------------
fs/reiserfs/journal.c | 15 ---------------
fs/xfs/linux-2.6/xfs_aops.c | 1 -
include/linux/blkdev.h | 2 ++
include/linux/buffer_head.h | 2 --
14 files changed, 40 insertions(+), 99 deletions(-)
commit 74e725b7f2e5f3f073abe84c5823026a6f1e33ce
---
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Tue Mar 31 19:00:53 2009 +0200
barrier: Don't return -EOPNOTSUPP to the caller if the device does not support barriers
The caller cannot really do much about the situation anyway. Instead log
a warning if this is the first such failed barrier we see, so that the
admin can look into whether this poses a data integrity problem or not.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..8660146 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -338,9 +338,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
*error_sector = bio->bi_sector;
ret = 0;
- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
+ if (!bio_flagged(bio, BIO_UPTODATE))
ret = -EIO;
bio_put(bio);
@@ -408,9 +406,7 @@ int blkdev_issue_discard(struct ...
Wouldn't it be cleaner to simply finish with success status from blk_do_ordered()? That is the single place that all flush/barrier ops go through and semantically better place too. Thanks. -- tejun --
I suspect this part is just wrong. I could easily imagine a driver that returns EOPNOTSUPP only for a certain _kind_ of bio. For example, if the drive doesn't support FUA, then you cannot do a serialized IO operation, but you can still mostly do a serialized op without any IO attached to it. IOW, the "empty flush" really _is_ special. And this check should not be in the generic "bio_endio()" case, it should only be in the special blkdev_issue_flush() case. I think. No? Linus --
FUA we should be able to reliably detect, it's really the cache flush operation itself that has caused headaches in the past. The -EOPNOTSUPP really comes from the block layer, not from the device driver. That's mainly due to the fact that we only send down the actual barrier, if the driver already said it supported them. If they do fail them, we probably need to pick up the -EIO bits and pieces and pretend it didn't happen as well. So it definitely needs more looking into, auditing, and testing. The empty flush is special and it is easy to fix that by itself. That should probably be the first patch in the series. But the retry logic and such for actual write barriers are the majority of the problems involved with supporting barriers, and those I want to get rid of. I think it'll be more clear when I post a real patch series with the individual steps outlined. -- Jens Axboe --
Hello, Yeah, we need to implement some kind of fallback logic such that filesystems get errors iff the underlying device actually failed to flush. For the most part, this shouldn't be too difficult. There is a corner case for tag ordered requests in that retrying might end up putting the barrier on the platter after writes following it. Well, the problem isn't specific to fallback tho. The root problem is that later commands get issued before the previous ones are finished and SCSI ordered tag doesn't mandate failure of an earlier request to abort all the following ones, so by the time block layer knows about the failure, writes after the barrier might already be on the platter. I guess we'll have to ignore that for the time being. Thanks. -- tejun --
That sounds reasonable enough. The key thing is how to squeeze as much True - high end arrays (as you mention below) will probably ack a flush request Sure - really cheap & crappy storage is easy enough to find. Definitely I agree File systems should try to do their best job with what they have, but we might also want to use a non-transaction based file system (ext2? ext4 w/o the journal like google?). Again, as you suggest, users (or distro installers?) can make For non-volatile write caches like these, you don't need to "flush" the storage write cache, you just need to move the data to the storage in the correct order. As far as I know, none of this kind of information is exposed to higher levels in a standard way, so what people do today is to disable barriers (or assume, No room for a pony in my yard in any case :-) ric --
Ric Wheeler wrote: .. I still see barriers as a separate issue from flushes. Flushes are there for power failures and hot-removable devices. Barriers are there for that, but also for better odds of data integrity in the event of a filesystem or kernel crash. Even if I don't want the kernel needlessly flushing my battery-backed write caches, I still do want the barrier ordering that improves the odds of filesystem consistency in the event of a kernel crash. Cheers --
And this is really what it boils down to. Abstraction. Bigger picture. And the fact that the filesystem should DAMN WELL NOT THINK IT KNOWS WHAT IS GOING ON! This is also fundamentally why returning that particular error is pointless. If the driver returns EOPNOTSUPP, there is simply never _any_ possible reason for upper layers to ever be informed about it - because there is not _any_ possible situation where they can do anything about it. Even _thinking_ that they can do something about it is fundamentally flawed. It misses the entire point of having layering and abstraction and having a "block layer" there to do these kinds of things. If you want to write your filesystem so that it interacts with the low-level device directly, go and write an MTD filesystem instead. Don't even _claim_ to care about generic filesystems like 'ext3' or something like that. But if you try to be a "real" filesystem (ie general-purpose, meant to work on any random block device), don't come and whine about it when the block device then doesn't really do anything but read or write, or when the driver literally doesn't even _know_ how to serialize something because it doesn't even make sense in its world-view. Don't mix up block layer and low-level driver issues with filesystem issues. The filesystem should say "block layer: flush the pending writes". And the block layer should try its best, but if the low-level driver says "that operation doesn't make sense for me", the block layer should just say "ok, whatever". And the filesystem shouldn't know, and it most definitely must not act any differently. Because that's behind the abstraction, and there's no sane way to bring it _out_ of the abstraction that isn't fundamentally flawed (like thinking that it's always a SATA-II drive). Linus --
How the file system responds has to depend upon what the user's intents are with regards to still having their data. In a lot of cases "flush if you can" makes good sense. In higher integrity cases you want a way to tell the device "flush if you can, do whatever else is needed to fake a flush if not" and in some cases you genuinely want to propagate errors back at mount time to say "sorry can't do this" Agreed entirely that this shouldn't be expressed down the stack in terms of things like 'tags' or 'write with fua', but unless the different versions of it can be expressed, or refused you can't build a good enough abstraction. Throw and pray the block layer can fake it simply isn't a valid model for serious enterprise computing, and if people understood the worst cases, for a lot of non enterprise computing. The second problem is who has sufficient information to efficiently handle decisions around ordering/barriers/flushes/single outstanding command and other strategies. I am skeptical that in the case where the underlying block subsystem provides suboptimal ordering/barrier facilities that it falling back to alternatives without letting the fs also change strategies will be efficient. Alan --
I don't want to return -EOPNOTSUPP, I think this thread has become increasingly confusing. And it's probably largely due to me mixing write barriers into it, if we stick purely to blkdev_issue_flush(), then logging a warning and returning 0 is perfectly fine with me. My objection was to ignoring the "I can't flush" error in the first place, not returning 0. -- Jens Axboe --
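For readers following along, the behaviour described above (log a warning once, return 0 instead of -EOPNOTSUPP, still report real I/O errors) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the kernel implementation: struct fake_queue, its fields, and fake_issue_flush() are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a request queue; not a kernel structure. */
struct fake_queue {
	int flush_result;	/* what the "device" answers to a flush */
	bool warned;		/* has the warning been logged already? */
	int warnings;		/* number of warnings printed, for inspection */
};

/*
 * Model of a flush helper that hides -EOPNOTSUPP from its caller.
 * Genuine failures such as -EIO are still passed up unchanged.
 */
static int fake_issue_flush(struct fake_queue *q)
{
	int ret;

	assert(q != NULL);
	ret = q->flush_result;

	if (ret == -EOPNOTSUPP) {
		if (!q->warned) {
			fprintf(stderr, "cache flush not supported, check data integrity\n");
			q->warned = true;
			q->warnings++;
		}
		return 0;	/* the caller cannot do anything about it */
	}
	return ret;
}
```

The point of the sketch is the asymmetry: "not supported" is swallowed with a single log line, while a failed flush on a device that does support flushing remains visible to the caller.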
.. XFS appears to have something along those lines. I believe it tries to disable the drive write caches if it discovers that it cannot do cache flushes. I'll check next time my MythTV box boots up. It has a RAID0 under XFS, and the md raid0 code doesn't appear to pass the cache flushes to libata for raid0, so XFS complains and tries to turn off the write caches. And I have a script to damn well turn them back ON again after it does so. Stupid thing tries to override user policy again. Cheers --
Perhaps; but speaking specifically about blkdev_issue_flush() -- nobody checks its return value at the present time. Jeff --
If we get EOPNOTSUPP back from a submit_bh/submit_bio, the IO didn't happen. So, all the filesystems have code to try again without the barrier flag, and then stop doing barriers from then on. I'm not saying this is a good or bad API, just explaining for this one XFS does print a warning about not doing barriers any more, but the write cache should still be on. Especially with MD in front of it, the storage stack is pretty complex, a mounted filesystem would have a hard time knowing where to start to turn off write caches on each drive in the stack. You can test this pretty easily: dd if=/dev/zero of=foo bs=4k count=10000 oflag=direct If that runs faster than 1MB/s the write cache is still on. -chris --
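The retry pattern described above (the I/O did not happen, so resubmit without the barrier flag and stop using barriers afterwards) looks roughly like the following user-space model. It is a sketch only: struct fake_fs, fake_submit() and fake_write_commit() are hypothetical names, and no real filesystem code is reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mount state; the field names are illustrative only. */
struct fake_fs {
	bool use_barriers;		/* cleared after the first -EOPNOTSUPP */
	bool dev_supports_barrier;	/* what the simulated device can do */
	int writes_done;		/* writes that actually reached the device */
};

/* Hypothetical submit path: a barrier write fails if unsupported. */
static int fake_submit(struct fake_fs *fs, bool barrier)
{
	if (barrier && !fs->dev_supports_barrier)
		return -EOPNOTSUPP;	/* the I/O did not happen */
	fs->writes_done++;
	return 0;
}

/* Write a commit block, falling back to a plain write if needed. */
static int fake_write_commit(struct fake_fs *fs)
{
	int ret;

	assert(fs != NULL);
	ret = fake_submit(fs, fs->use_barriers);
	if (ret == -EOPNOTSUPP && fs->use_barriers) {
		fs->use_barriers = false;	/* no more barriers from now on */
		ret = fake_submit(fs, false);	/* retry without the flag */
	}
	return ret;
}
```

Every filesystem carrying its own copy of this fallback is the duplication that the patch at the top of the thread tries to remove.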
.. Or simply: hdparm -W /dev/sd? ## (for SATA/PATA drives) --
I'm afraid I tend to hammer on the drive instead of asking it politely, but I guess hdparm is trustworthy these days ;) -chris --
No, it just stops issuing barriers if the initial mount-time test finds It does not do this. -Eric --
.. Okay. My apologies to the XFS folks! I'll have to dig deeper to find out who/what is disabling the drive write caches, then. Thanks --
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/ext3/fsync.c linux-2.6.29/fs/ext3/fsync.c
--- linux-2.6.29-orig/fs/ext3/fsync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext3/fsync.c 2009-03-30 15:31:34.000000000 +0900
@@ -45,6 +45,8 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
*/
if (ext3_should_journal_data(inode)) {
ret = ext3_force_commit(inode->i_sb);
- goto out;
+ if (!ret && !(journal->j_flags & JFS_BARRIER))
+ ret = block_flush_device(inode->i_sb->s_bdev);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ ret = block_flush_device(inode->i_sb->s_bdev);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (!ret && journal && !(journal->j_flags & JFS_BARRIER))
+ ret = block_flush_device(inode->i_sb->s_bdev);
}
-out:
+
return ret;
}
--
NACK.
As Eric commented on linux-ext4 (and I think it was Chris Mason who
deserves the credit for originally pointing this out), we don't need
to call blkdev_issue_flush() after calling sync_inode(). That's
because sync_inode() eventually (after going through a very deep call
tree inside fs/fs-writeback.c) calls __sync_single_inode(), which calls
write_inode(), which calls the filesystem-specific ->write_inode()
function, which for both ext3 and ext4, ends up calling
ext[34]_force_commit. Which, if barriers are enabled, will end up
issuing a barrier after writing the commit block.
In the code paths that don't end up calling sync_inode() or
ext4_force_commit(), (i.e., in the fdatasync() case) calling
block_flush_device is appropriate. But as it stands, this patch (and
the related one for ext4) will result in multiple unnecessary barrier
requests being sent to the block layer.
So two out of the three places where this patch adds
block_flush_device() are not necessary; as far as I can tell, only
A similar fixup is needed for the ext4 patch.
(And can we please start a new thread for these patches? Thanks!!)
Regards,
- Ted
--
I'm not sure we want to stick Fernando with changing how barriers are done in individual filesystems, his patch is just changing the existing call points. The ext34 code is especially tricky because there's no way to tell if a commit was actually done by sync_inode, so there's no way to know if an extra flush is really required. I think we'll be better off if he keeps the existing logic and a different patch set is made for tuning the ext3 and ext4 code. -chris --
Well, his patch actually added some calls to block_issue_flush(). But yes, it's probably better if he just changes the existing call points, and we can have the relevant filesystem maintainers double check to Yes, good point. What we need to do is to save inode->i_state *before* the call to sync_inode(), and issue the flush if the original value of (inode->i_state & I_DIRTY) == I_DIRTY_PAGES. But yeah, Agreed. - Ted --
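The check suggested above, (inode->i_state & I_DIRTY) == I_DIRTY_PAGES evaluated on a value saved before sync_inode(), can be written as a small predicate. The sketch below is stand-alone user-space C; the numeric flag values are assumptions chosen for the example, and needs_explicit_flush() is a hypothetical helper name, not something from the patches.

```c
#include <stdbool.h>

/* Illustrative flag values for this sketch (assumed, not authoritative). */
#define I_DIRTY_SYNC		1UL
#define I_DIRTY_DATASYNC	2UL
#define I_DIRTY_PAGES		4UL
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/*
 * Hypothetical helper: given the i_state saved *before* sync_inode(),
 * report whether only data pages were dirty. In that case no journal
 * commit (and hence no barrier) is expected, so an explicit flush is
 * the only thing that pushes the data out of the drive's write cache.
 */
static bool needs_explicit_flush(unsigned long saved_i_state)
{
	return (saved_i_state & I_DIRTY) == I_DIRTY_PAGES;
}
```

Saving the state first matters because sync_inode() clears the dirty bits, so the same test on inode->i_state afterwards would no longer tell the two cases apart.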
Hello, How about having something like blk_ensure_cache_flushed() which issues flush iff there hasn't been any write since the last flush? It'll be easy to implement and will filter out duplicate flushes in most cases. Thanks. -- tejun --
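A minimal model of the blk_ensure_cache_flushed() idea, assuming a per-device "written since last flush" bit, might look like this. Everything here is hypothetical user-space code (struct fake_dev, fake_write(), fake_flush(), fake_ensure_cache_flushed()); it only illustrates the bookkeeping, not how the block layer would track writes.

```c
#include <stdbool.h>

/* Hypothetical device model: tracks whether a flush would do anything. */
struct fake_dev {
	bool dirty_since_flush;	/* a write happened after the last flush */
	int flushes_issued;	/* flushes actually sent to the "device" */
};

/* Any write marks the cache as possibly holding unflushed data. */
static void fake_write(struct fake_dev *d)
{
	d->dirty_since_flush = true;
}

/* An unconditional flush: always sent, clears the dirty bit. */
static void fake_flush(struct fake_dev *d)
{
	d->flushes_issued++;
	d->dirty_since_flush = false;
}

/*
 * Send a flush only if something was written since the last one.
 * Returns true if a flush was actually issued.
 */
static bool fake_ensure_cache_flushed(struct fake_dev *d)
{
	if (!d->dirty_since_flush)
		return false;
	fake_flush(d);
	return true;
}
```

Note that the bit is per device, so a write from an unrelated process sets it too; the filter can only suppress strictly back-to-back flushes.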
I thought about such a thing, but my concern is that while this might
suppress most unnecessary double flushes, some intervening write from
another process might slip in which doesn't need to be flushed out.
In other words "in most cases" means that "in some cases" we will take
a performance hit thanks to the duplicate flushes. So this isn't
something we should depend upon, although if we do detect back-to-back
flushes, obviously we should filter them out.
So if we did something like this, it would be good if we had a
debugging option which would detect double flushes, and printk a
warning identifying the call sites of the first and second flushes (by
function name and line number), so that filesystem developers could
detect the double flushes, and work to eliminate them.
Does that make sense?
- Ted
--
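As an illustration of the debugging option proposed above, the following user-space sketch records the call site of each flush and prints a warning when a second flush arrives with no write in between. The names (struct dbg_dev, dbg_write(), dbg_flush_at(), the dbg_flush() macro) are hypothetical, and fprintf() stands in for printk().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-device debug state. */
struct dbg_dev {
	bool dirty_since_flush;	/* a write happened after the last flush */
	const char *last_func;	/* function that issued the last flush */
	int last_line;		/* line number of the last flush */
	int dup_warnings;	/* duplicate flushes detected so far */
};

static void dbg_write(struct dbg_dev *d)
{
	d->dirty_since_flush = true;
}

/* Record a flush and warn if nothing was written since the previous one. */
static void dbg_flush_at(struct dbg_dev *d, const char *func, int line)
{
	if (!d->dirty_since_flush && d->last_func) {
		fprintf(stderr, "duplicate flush: first at %s:%d, second at %s:%d\n",
			d->last_func, d->last_line, func, line);
		d->dup_warnings++;
	}
	d->dirty_since_flush = false;
	d->last_func = func;
	d->last_line = line;
}

/* Capture the caller's function name and line automatically. */
#define dbg_flush(d) dbg_flush_at((d), __func__, __LINE__)
```

This only detects duplicates after the fact; as noted in the thread, it cannot find a flush that is missing.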
Hello, Yeah, well, it all comes down to how most the "most" is. If all that's between the first flush and the second one are some code the cpu has to eat through, I don't think there's high chance of an IO going in between unless the IO was already there and gets scheduled in between (which can be avoided). The thing is that detecting dup is possible but missing is not. If flush is missing in certain corner paths, nobody would know till somebody reviews the code. Even when the problem triggers, it would be rare and obscure enough to avoid proper diagnosis, so I think if the "most" is most enough, it could be the better way to do it. But, then again, I'm not a FS guy, so if such thing can be guaranteed in FSes without too much problem, no need to pull such a stunt at the Yeap, that definitely sounds like a good idea. I'll put it on my todo list. Thanks. -- tejun --
My original ide implementation of flushes actually did this. My memory is a little hazy on why it was dropped, I'm guessing because it basically never triggered anyway. -- Jens Axboe --
Yeah, and it probably wouldn't trigger today unless we add new code that starts generating enough duplicate cache flushes for this to be significant... And since duplicate cache flushes are harmless to the drive, you're only talking about no-op ATA command overhead. Which is only mildly notable on legacy IDE (eight or so inb/outb operations). I would put duplicate cache flush filtering way, way down on the priority list, IMO. Jeff --
Yeap, unless FS guys need it, there's no reason to push it. Although having dup flush detection Theodore described (w/ callstack saving at issue time) would be nice for debugging. Thanks. -- tejun --
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/ext4/fsync.c linux-2.6.29/fs/ext4/fsync.c
--- linux-2.6.29-orig/fs/ext4/fsync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext4/fsync.c 2009-03-30 15:35:26.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
- goto out;
+ if (!ret && !(journal->j_flags & JBD2_BARRIER))
+ ret = block_flush_device(inode->i_sb->s_bdev);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ ret = block_flush_device(inode->i_sb->s_bdev);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ if (!ret && journal && !(journal->j_flags & JBD2_BARRIER))
+ ret = block_flush_device(inode->i_sb->s_bdev);
}
-out:
+
return ret;
}
--
To ensure that bits are truly on-disk after an fsync or fdatasync we
should force a disk flush explicitly. This is necessary to have data
integrity guarantees in filesystems such as FAT which do not provide
their own fsync implementation and use the vfs helper instead.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/sync.c linux-2.6.29/fs/sync.c
--- linux-2.6.29-orig/fs/sync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/sync.c 2009-03-30 15:43:59.000000000 +0900
@@ -72,6 +72,11 @@ int file_fsync(struct file *filp, struct
err = sync_blockdev(sb->s_bdev);
if (!ret)
ret = err;
+
+ err = block_flush_device(sb->s_bdev);
+ if (!ret)
+ ret = err;
+
return ret;
}
--
Add a sysfs knob to disable storage device writeback cache flushes.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
--- linux-2.6.29-orig/block/blk-barrier.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-barrier.c 2009-03-30 17:08:28.000000000 +0900
@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
if (!q)
return -ENXIO;
+ if (blk_queue_nowbcflush(q))
+ return -EOPNOTSUPP;
+
bio = bio_alloc(GFP_KERNEL, 0);
if (!bio)
return -ENOMEM;
diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
--- linux-2.6.29-orig/block/blk-core.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-core.c 2009-03-30 17:08:28.000000000 +0900
@@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
goto end_io;
}
if (bio_barrier(bio) && bio_has_data(bio) &&
- (q->next_ordered == QUEUE_ORDERED_NONE)) {
+ (blk_queue_nowbcflush(q) ||
+ q->next_ordered == QUEUE_ORDERED_NONE)) {
err = -EOPNOTSUPP;
goto end_io;
}
diff -urNp linux-2.6.29-orig/block/blk-sysfs.c linux-2.6.29/block/blk-sysfs.c
--- linux-2.6.29-orig/block/blk-sysfs.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-sysfs.c 2009-03-30 17:08:28.000000000 +0900
@@ -151,6 +151,27 @@ static ssize_t queue_nonrot_store(struct
return ret;
}
+static ssize_t queue_wbcflush_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(!blk_queue_nowbcflush(q), page);
+}
+
+static ssize_t queue_wbcflush_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nm;
+ ssize_t ret = queue_var_store(&nm, page, count);
+
+ spin_lock_irq(q->queue_lock);
+ if (nm)
+ queue_flag_clear(QUEUE_FLAG_NOWBCFLUSH , q);
+ else
+ queue_flag_set(QUEUE_FLAG_NOWBCFLUSH , q);
+ spin_unlock_irq(q->queue_lock);
+
+ return ret;
+}
+
static ssize_t ...
This (and the above hunk) should be changed. -EOPNOTSUPP means the target does not support barriers, that is a different thing to flushes not being needed. A file system issuing a barrier and getting -EOPNOTSUPP back will disable barriers, since it now thinks that ordering cannot be guaranteed. A more appropriate change would be to successfully complete a flush without actually sending it down to the device if blk_queue_nowbcflush() is true. Then blkdev_issue_flush() would just work as well. It also needs to take stacking into account, or stacked drivers will have to propagate the settings up the stack. If you allow simply the barrier to Naming is also pretty bad, perhaps something like "honor_cache_flush" would be better, or perhaps "cache_flush_needed". At least something that is more descriptive of what this setting actually controls, wbcflush does not do that. -- Jens Axboe --
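The alternative suggested above, completing the flush successfully without sending it to the device when the knob says flushes are not needed, can be contrasted with the -EOPNOTSUPP approach in a short user-space model. struct knob_queue, its no_cache_flush field and knob_issue_flush() are hypothetical names for this sketch; the real flag in the patch is QUEUE_FLAG_NOWBCFLUSH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue model with an administrator-controlled knob. */
struct knob_queue {
	bool no_cache_flush;	/* knob: flushes are not needed on this device */
	int flushes_sent;	/* flushes that actually reached the "device" */
};

/*
 * Flush helper following the suggestion in the review: when the knob is
 * set, report success without sending anything, so callers never see
 * -EOPNOTSUPP and never conclude that barriers are unsupported.
 */
static int knob_issue_flush(struct knob_queue *q)
{
	assert(q != NULL);
	if (q->no_cache_flush)
		return 0;	/* nothing to do, but not an error */
	q->flushes_sent++;
	return 0;
}
```

With this shape the filesystem behaves identically whether or not the knob is set, which is what keeps it from disabling barriers as a side effect.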
The reason I decided to use -EOPNOTSUPP was that I wanted to keep barriers and device flushes from entering the block layer when they are not needed. I feared that if we pass them down the block stack (knowing in advance they will not be actually submitted to disk) we may end up slowing things down unnecessarily. As you mentioned, filesystems such as ext3/4 will disable barriers if they get -EOPNOTSUPP when issuing one. I was planning to add a notifier mechanism so that we can notify filesystems that there has been a change in the barrier settings. This might be over-engineering, though. Especially considering that "-o Aren't we risking slowing things down? Does the small optimization above make sense (especially taking the remount trick into account)? You are right, wbcflush is a pretty ugly name. I will use "honor_cache_flush" in the next iteration of the patches. Thanks, Fernando --
But that's just wrong, you need to make sure that the block layer / io scheduler doesn't reorder as well. It's a lot more complex than just the device end. So just returning -EOPNOTSUPP and pretending that you need It's not, I think you are missing the bigger picture. -- Jens Axboe --
I should have mentioned that in this patch set I was trying to tackle the blkdev_issue_flush() case only. As you pointed out, with the code above requests may get silently reordered across barriers inside the block layer. The follow-up patch I am working on implements blkdev_issue_empty_barrier(), which should be used by filesystems that want to emit an empty barrier (as opposed to just triggering a device flush). Doing this we can optimize fsync() flushes (block_flush_device()) and filesystem-originated barriers (blkdev_issue_empty_barrier()) independently in the block layer. I agree with you that we should pass barriers down in __generic_make_request, but the optimization above for fsync()-originated blkdev_issue_flush()'s seems valid to me. Sorry for not explaining myself properly. I will add a changelog and better documentation for the patches. Thank you for your feedback! - Fernando --
Not sure it makes sense to abstract that out into an api, it's basically just a bio_alloc(gfp, 0); with setting the bio fields and then submitting. Otherwise you'd have to either pass a ton of parameters, the caller will want to set end_io, bdev, etc anyway. And after that it's Of course, we need to do that. Anything else would be broken. The blkdev_issue_flush() should be changed to return 0, with the -EOPNOTSUPP being flag cached. -- Jens Axboe --
I am currently cooking a new iteration of these patches that do just that. I will be reposting in a new thread and keep you all CCed. - Fernando --
The horde of casual desktop users (with me included) would probably prefer having two settings -- one for filesystem barriers and one for fsync(). IOW I prefer higher performance at the cost of risking losing few last seconds/minutes of work in case of crash / powerfailure but I would still like to have the filesystem in the consistent state after such accident. Thanks, Bart --
The knob is meant to control whether we really need to send a flush to the device or not, so it's an orthogonal issue to what you are talking about. For battery backed caches, we never need to flush. This knob is useful IFF we have devices with write back caches that STILL do a cache flush. As such, I'd also prefer waiting with adding such a knob until such a device has actually been observed. No point in adding something just in case it may exist. And even then, it's probably even better handled in the driver. -- Jens Axboe --
How do installers and/or kernels detect a battery-backed cache that does not need flush? Jeff --
They obviously can't, otherwise it would not be an issue at all. And whether it's an issue is up for debate, until someone can point at such a device. You could add a white/blacklist. So either that knob has to be turned by an administrator (yeah...), or the in-kernel info would have to be updated. Or a udev rule. -- Jens Axboe --
Sorry, I guess I misinterpreted your dual "IFF" statement :) I completely agree that the suggested knob, for disabling cache flush for these battery-backed devices, is at the present time addressing an entirely theoretical argument AFAICS. Jeff --
Guys, please look at the patch in the context of whole patchset posted
not the current Linus' tree context only.
Patch #4 adds mandatory cache flush to fsync() (based on earlier Jeff's
submission I guess) and patch #5 (this patch) adds a knob to disable cache
flushing completely.
If patch #4 is going to be ever applied I want to have an option to disable
mandatory cache flushing on fsync() -- I don't need it and I don't want it
on my desktop (+ I somehow believe I'm not the only one). OTOH I do need it
and I do want it on my server (+ to be on by default).
Actually legacy fsync() syscall is pretty bad interface in itself because:
a) it is synchronous
b) operates only on a single fd
and it encourages some pretty stupid (performance wise) usages like
calling fsync() after every mail fetched. Adding mandatory cache flush
to it only makes things worse (again looking from performance POV).
BTW in Linux world we never made any guarantees for fsync() on devices
using write caching:
$ man fsync
...
If the underlying hard disk has write caching enabled, then the data
may not really be on permanent storage when fsync() / fdatasync()
return.
...
aio_fsync() looks a bit better on paper but no filesystem implements
it currently...
--
Quite true, but I've always thought that was trading away correctness for performance... at a critical juncture where a consistency checkpoint was explicitly requested by the app. My ideal would probably be blkdev cache flushing by default on fsync(2), with a block layer "desktop mode" knob to turn it off if you don't want it. The current alternatives -- mount sync or disable blkdev writeback cache -- are far, far slower and punish the entire system just to provide a consistency checkpoint for a handful of fsync-needful apps. Jeff --
blkdev_issue_flush() may fail (i.e. due to media error on FLUSH CACHE
command execution) so its users should check for the return value.
(This issue was first spotted by Bartlomiej Zolnierkiewicz)
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c 2009-03-30 14:44:34.000000000 +0900
@@ -1446,6 +1446,7 @@ xfs_free_buftarg(
{
xfs_flush_buftarg(btp, 1);
if (mp->m_flags & XFS_MOUNT_BARRIER)
+ /* FIXME: check return value */
xfs_blkdev_issue_flush(btp);
xfs_free_bufhash(btp);
iput(btp->bt_mapping->host);
diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c 2009-03-30 15:16:42.000000000 +0900
@@ -721,11 +721,11 @@ xfs_mountfs_check_barriers(xfs_mount_t *
}
}
-void
+int
xfs_blkdev_issue_flush(
xfs_buftarg_t *buftarg)
{
- blkdev_issue_flush(buftarg->bt_bdev, NULL);
+ return block_flush_device(buftarg->bt_bdev);
}
STATIC void
diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h 2009-03-30 14:46:31.000000000 +0900
@@ -89,7 +89,7 @@ struct block_device;
extern __uint64_t xfs_max_file_offset(unsigned int);
-extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
+extern int xfs_blkdev_issue_flush(struct xfs_buftarg *);
extern const struct export_operations xfs_export_operations;
extern struct xattr_handler *xfs_xattr_handlers[];
diff -urNp ...
This is different from my original patch which preserved the original This is also different and is a change in behavior (it makes sense IMHO but please document it). Thanks, Bart --
That is wrong. Even if there was an error, we still need to What happens if we get an EOPNOTSUPP here? That is broken, too. The realtime device is a different device, so always should be flushed regardless of the return from the log device. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
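The policy argued for above (each device is flushed regardless of what the previous one returned, and the first error is what gets reported) can be sketched as follows. This is a user-space model with hypothetical names (struct fake_target, fake_target_flush(), fake_flush_all()); it does not reproduce any XFS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device: flush_result is what its flush would return. */
struct fake_target {
	int flush_result;	/* 0 on success, negative errno on failure */
	int flush_calls;	/* how many times a flush was attempted */
};

static int fake_target_flush(struct fake_target *t)
{
	t->flush_calls++;
	return t->flush_result;
}

/*
 * Flush every target even if an earlier one failed, and report the
 * first error seen (0 if all flushes succeeded).
 */
static int fake_flush_all(struct fake_target *targets, size_t n)
{
	int ret = 0;
	size_t i;

	assert(targets != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int err = fake_target_flush(&targets[i]);

		if (!ret)
			ret = err;
	}
	return ret;
}
```

This is the same "err, then if (!ret) ret = err" idiom that the fs/sync.c patch in this series uses for file_fsync().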
If any of the previous writes failed there is no way to know what we are actually flushing. When we know things went awry I do not see the point in flushing the device since part of the data we were trying to sync might not have made it to the device. Anyway this is a minor nitpick/policy issue that can be easily reverted to keep Please look at the code again. xfs_blkdev_issue_flush() calls blkdev_issue_flush() which turns EOPNOTSUPP into 0 to hide that error from filesystems. It is the non-EOPNOTSUPP errors that XFS should handle: the underlying device may support write cache flushes and still fail to flush (due to hardware errors)! Does it still make sense when writes to the log have failed? Thanks! - Fernando --
blkdev_issue_flush() may fail (i.e. due to media error on FLUSH CACHE
command execution) so its users should check for the return value.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/reiserfs/file.c linux-2.6.29/fs/reiserfs/file.c
--- linux-2.6.29-orig/fs/reiserfs/file.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/reiserfs/file.c 2009-03-30 16:19:19.000000000 +0900
@@ -146,8 +146,9 @@ static int reiserfs_sync_file(struct fil
reiserfs_write_lock(p_s_inode->i_sb);
barrier_done = reiserfs_commit_for_inode(p_s_inode);
reiserfs_write_unlock(p_s_inode->i_sb);
- if (barrier_done != 1 && reiserfs_barrier_flush(p_s_inode->i_sb))
- blkdev_issue_flush(p_s_inode->i_sb->s_bdev, NULL);
+ if (!n_err && barrier_done != 1 &&
+ reiserfs_barrier_flush(p_s_inode->i_sb))
+ n_err = block_flush_device(p_s_inode->i_sb->s_bdev);
if (barrier_done < 0)
return barrier_done;
return (n_err < 0) ? -EIO : 0;
--
This is again different from my original patch (the change in behavior should be documented). Thanks, Bart --
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/ext4/fsync.c linux-2.6.29/fs/ext4/fsync.c
--- linux-2.6.29-orig/fs/ext4/fsync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext4/fsync.c 2009-03-28 20:48:17.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
- goto out;
+ if (!(journal->j_flags & JBD2_BARRIER))
+ block_flush_device(inode->i_sb);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ block_flush_device(inode->i_sb);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ if (journal && !(journal->j_flags & JBD2_BARRIER))
+ block_flush_device(inode->i_sb);
}
-out:
+
return ret;
}
--
To ensure that bits are truly on-disk after an fsync or fdatasync we
should force a disk flush explicitly. This is necessary to have data
integrity guarantees in filesystems such as FAT which do not provide
their own fsync implementation and use the vfs helper instead.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/fs/sync.c linux-2.6.29/fs/sync.c
--- linux-2.6.29-orig/fs/sync.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/sync.c 2009-03-28 20:58:54.000000000 +0900
@@ -72,6 +72,11 @@ int file_fsync(struct file *filp, struct
err = sync_blockdev(sb->s_bdev);
if (!ret)
ret = err;
+
+ err = block_flush_device(sb);
+ if (!ret)
+ ret = err;
+
return ret;
}
--
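The "if (!ret) ret = err;" idiom in the patch above keeps the first error seen and ignores later ones. A minimal userland sketch of that idiom follows. The function names are hypothetical and not part of any kernel API, and treating -EOPNOTSUPP from the flush as success is an assumption borrowed from the earlier remark in this thread that blkdev_issue_flush() hides that error from filesystems.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: fold one step's result into the running result,
 * keeping the first failure (the "if (!ret) ret = err;" idiom). */
int keep_first_error(int ret, int err)
{
	return ret ? ret : err;
}

/* Hypothetical helper, an assumption for this sketch: a device that
 * cannot flush its cache at all is not reported as a failure. */
int filter_flush_error(int err)
{
	return err == -EOPNOTSUPP ? 0 : err;
}

/* Combine the results of write-out, wait, and cache flush. The three
 * arguments stand in for the return values of the real calls. */
int combine_sync_results(int write_err, int wait_err, int flush_err)
{
	int ret = write_err;

	ret = keep_first_error(ret, wait_err);
	ret = keep_first_error(ret, filter_flush_error(flush_err));
	return ret;
}
```

With this shape a flush failure such as -EIO is reported only when the earlier steps succeeded, so the caller always sees the earliest failure.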
Add a sysfs knob to disable storage device writeback cache flushes.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
--- linux-2.6.29-orig/block/blk-barrier.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-barrier.c 2009-03-29 17:55:45.000000000 +0900
@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
if (!q)
return -ENXIO;
+ if (blk_queue_nowbcflush(q))
+ return -EOPNOTSUPP;
+
bio = bio_alloc(GFP_KERNEL, 0);
if (!bio)
return -ENOMEM;
diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
--- linux-2.6.29-orig/block/blk-core.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-core.c 2009-03-29 18:09:18.000000000 +0900
@@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
goto end_io;
}
if (bio_barrier(bio) && bio_has_data(bio) &&
- (q->next_ordered == QUEUE_ORDERED_NONE)) {
+ (blk_queue_nowbcflush(q) ||
+ q->next_ordered == QUEUE_ORDERED_NONE)) {
err = -EOPNOTSUPP;
goto end_io;
}
diff -urNp linux-2.6.29-orig/block/blk-sysfs.c linux-2.6.29/block/blk-sysfs.c
--- linux-2.6.29-orig/block/blk-sysfs.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-sysfs.c 2009-03-30 10:21:38.000000000 +0900
@@ -151,6 +151,27 @@ static ssize_t queue_nonrot_store(struct
return ret;
}
+static ssize_t queue_wbcflush_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(!blk_queue_nowbcflush(q), page);
+}
+
+static ssize_t queue_wbcflush_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nm;
+ ssize_t ret = queue_var_store(&nm, page, count);
+
+ spin_lock_irq(q->queue_lock);
+ if (nm)
+ queue_flag_clear(QUEUE_FLAG_NOWBCFLUSH , q);
+ else
+ queue_flag_set(QUEUE_FLAG_NOWBCFLUSH , q);
+ spin_unlock_irq(q->queue_lock);
+
+ return ret;
+}
+
static ssize_t ...
I can understand that, more from an admin standpoint than anything... ATA disks' FLUSH CACHE is horribly coarse-grained, all-or-nothing. SCSI's SYNCHRONIZE CACHE at least gives us an optional (LBA, length) pair that can be used to avoid flushing everything in the cache. Microsoft has publicly proposed a WRITE BARRIER command for ATA, to try and improve the situation: http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command... but that isn't in the field yet (if ever?) Jeff --
Here's a simple patch that does that. Not even tested, it compiles. Note
that file systems that currently do blkdev_issue_flush() in their
It'd be better to have a knob to control whether fsync() should care
about the hardware side as well, instead of trying to teach applications
to use FULL_FSYNC.
diff --git a/fs/sync.c b/fs/sync.c
index ec95a69..7a44d4e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/writeback.h>
+#include <linux/blkdev.h>
#include <linux/syscalls.h>
#include <linux/linkage.h>
#include <linux/pagemap.h>
@@ -104,6 +105,7 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
{
const struct file_operations *fop;
struct address_space *mapping;
+ struct block_device *bdev;
int err, ret;
/*
@@ -138,6 +140,13 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
err = filemap_fdatawait(mapping);
if (!ret)
ret = err;
+
+ bdev = mapping->host->i_sb->s_bdev;
+ if (bdev) {
+ err = blkdev_issue_flush(bdev, NULL);
+ if (!ret)
+ ret = err;
+ }
out:
return ret;
}
--
Jens Axboe
--
That's going to be a mess. Ext3 implements an fsync() by requesting a journal commit, and then waiting for the commit to have taken place. The commit happens in another thread, kjournald. Knowing when it's OK not to do a blkdev_issue_flush() because the commit was triggered by an fsync() is going to be really messy. Could we at least have a flag in struct super which says, "We'll handle the flush correctly, please don't try to do it for us?" - Ted --
Doing it in vfs_fsync also is completely wrong layering. If people want it for simple filesystems add it to file_fsync instead of messing up the generic helper. Removing well-meaning but ill-behaved policy from the generic path has been costing me far too much time lately. And please add a tuneable for the flush. Preferably a generic one at the block device layer instead of the current mess where every filesystem has a slightly different option for barrier usage. --
I agree that we need to be careful not to put extra device flushes if the file system handles this properly. They can be quite expensive (say 10-20ms on a busy s-ata disk). I have also seen some SSD devices have performance that drops into the toilet when you start flushing their volatile caches. ric --
At the very least, IMO the block layer should be able to notice when barriers need not be translated into cache flushes. Most notably when wb cache is disabled on the drive, something easy to auto-detect, but probably a manual switch also, for people with enterprise battery-backed storage and such. Jeff --
On Fri, 27 Mar 2009 16:38:35 -0400 The storage drivers for those cases already generally know this and treat cache flush requests as "has hit nvram", even the non enterprise ones. --
Yeah, that's why I suggested to have the tuning knob in the block layer and not in the fses when this came up last time. --
The filesystems vary a bit, but in general the perfect fsync (in a mail server workload) works something like this:
step1: write out and wait for any dirty data
step2: join the running transaction
step3: hang around a bit and wait for friends and neighbors
step4: commit the transaction
step4a: write the log blocks
step4b: barrier. This barrier also makes sure the data is on disk
step4c: write the commit block
step4d: barrier. This barrier makes sure the commit block is on disk.
For ext3/4 and reiserfs, steps 4b,c,d are actually one call to submit_bh where two cache flushes are done for us, but they really are two cache flushes.
During step 3, we collect a bunch of other procs who are hopefully also running fsync. If we collect 50 procs, then the single barrier in step 4b does a cache flush on the data writes of all 50. The 50 flushes this patch does would be one flush if the FS did it right. In a multi-process fsync heavy workload, every extra barrier is going to have work to do because someone is always sending data down.
The flushes done by this patch also aren't helpful for the journaled filesystem. If we remove the barriers from step 4b or 4d, we no longer have a consistent FS on power failure. Log checksumming may allow us to get rid of the barrier in step 4b, but then we wouldn't know the data blocks were on disk before the transaction commit, and we've had a few discussions on that already over the last two weeks.
The patch also assumes the FS has one bdev, which isn't true for btrfs. xfs and btrfs at least want more control over that filemap_fdatawrite/wait step because we have to repeat it inside the FS anyway to make sure the inode is properly updated before the commit.
I'd much rather see a dumb fsync helper that looks like Jens' vfs_fsync, and then let the filesystems make their own replacement for the helper in a new address space operation or super operation. That way we could also run the fsync on directories without the directory ...
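The batching argument in the message above can be put into numbers. The sketch below is only a counting model under stated assumptions: each journal commit costs two cache flushes (the barriers in steps 4b and 4d), and a generic helper that flushes on its own adds one flush per fsync() caller. The function names are hypothetical.

```c
#include <assert.h>

/* Barriers per journal commit: steps 4b and 4d in the list above. */
#define BARRIERS_PER_COMMIT 2u

/* Cache flushes when the filesystem batches all callers of fsync()
 * into the given number of journal commits. */
unsigned int flushes_fs_batched(unsigned int commits)
{
	return commits * BARRIERS_PER_COMMIT;
}

/* Cache flushes when a generic helper additionally issues one flush
 * per fsync() caller on top of the journal's own barriers. */
unsigned int flushes_with_generic_flush(unsigned int commits,
					unsigned int callers)
{
	return flushes_fs_batched(commits) + callers;
}
```

For the 50-process case in the message, one commit costs 2 flushes when batched by the filesystem and 52 when every caller also flushes through the generic path.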
Jeff, if you drop my CC on reply, I won't see your messages for ages. Then let me rephrase that to "most users don't care about full integrity fsync()". If it kills their firefox performance, most will want to turn it off. Personally I'd never use it on my notebook or desktop box, simply because I don't care strongly enough. I'd rather fix things up in Of course, it would be trivial. Just add a blkdev_issue_flush() to s/user/application. -- Jens Axboe --
(responding to an email way back near the start of the thread) I emailed Microsoft about their proposal to add a WRITE BARRIER command to ATA, documented at http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command... The MSFT engineer said they were definitely still pursuing this proposal. IMO we could look at this too, or perhaps come up with an alternate proposal like FLUSH CACHE RANGE(s). Jeff --
I agree that it is worth getting better mechanisms in place - the cache flush is really primitive. Now we just need a victim to sit in on T13/T10 standards meetings :-) ric --
Heck, we could even do a prototype implementation with the help of Mark Lord's sata_mv target mode support... Jeff --
.. Speaking of which.. you probably won't see the preliminary rev of sata_mv + target_mode until sometime this weekend. It's going to be something quite simple for 2.6.30, and we can expand on that in later kernels. Cheers --
On Tue, Mar 24, 2009 at 1:55 PM, Linus Torvalds
Not really... Regardless of any journalling, a power-fail or a crash is almost certainly going to cause "data loss" of some variety. We simply didn't get to sync everything we needed to (otherwise we'd all be shutting down our computers with the SCRAM switches just for kicks). The difference is, with ext3/4 (in any journal mode) we guarantee our metadata is consistent. This means that we won't double-allocate or leak inodes or blocks, which means that we can safely *write* to the filesystem as soon as we replay the journal. With ext2 you *CAN'T* do that at all, as somebody may have allocated an inode but not yet marked it as in use. The only way to safely figure all that out without journalling is an fsck run.
The difference between ext4 and ext3-in-writeback-mode is this: If you get a crash in the narrow window *after* writing initial metadata and before writing the data, ext4 will give you a zero length file, whereas ext3-in-writeback-mode will give you a proper-length file filled with whatever used to be on disk (might be the contents of a previous /etc/shadow, or maybe somebody's finance files). In that same situation, ext3 in data-ordered or data-journal mode will "close" the window by preventing anybody else from making forward progress until the data and the metadata are both updated.
The thing is, even on ext3 I can get exactly the same kind of behavior with an appropriately timed "kill -STOP $dumb_program", followed by a power failure 60 seconds later. It's a relatively obvious race condition... When you create a file, you can't guarantee that all of that file's data and metadata has hit disk until after an fsync() call returns.
The only *possible* exceptions are in cases like the previously-mentioned (and now patched) open(A)+write(A)+close(A)+rename(A,B), where the rename-over-existing-file should act as an implicit filesystem barrier. It should ensure that all writes to the file get flushed before it ...
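For applications that do not want to rely on the rename acting as an implicit barrier, the explicit form of the open+write+close+rename sequence puts an fsync() before the rename. The following is a minimal sketch using only POSIX calls. The function name replace_file_durably is hypothetical, and the error handling is deliberately simple (for example, an interrupted write is treated as a failure rather than retried).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes of buf to tmp_path, fsync the file, then rename it
 * over final_path. Returns 0 on success, -1 on failure. */
int replace_file_durably(const char *tmp_path, const char *final_path,
			 const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	/* write() may be short, so loop until everything is written */
	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);

		if (n < 0) {
			close(fd);
			unlink(tmp_path);
			return -1;
		}
		done += (size_t)n;
	}

	/* the data must be on stable storage before the rename makes
	 * the new contents visible under the final name */
	if (fsync(fd) < 0) {
		close(fd);
		unlink(tmp_path);
		return -1;
	}
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}
	if (rename(tmp_path, final_path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;
}
```

Both paths need to be on the same filesystem for rename() to replace the target atomically. Making the rename itself survive a crash would additionally require an fsync() on the containing directory, which this sketch leaves out.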
The point is, if you write your metadata earlier (say, every 5 sec) and the real data later (say, every 30 sec), you're actually MORE LIKELY to see corrupt files than if you try to write them together. And if you write your data _first_, you're never going to see corruption at all. This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it. Linus --
It's pretty easy to reproduce it these days. Here's my setup, and it's not even that fancy: Dual core Xeon, 8GB RAM, SATA RAID1 array, GigE network. All it takes is a single client writing a large file using Samba or NFS to introduce huge latencies.
Looking at the raw throughput, the server's disks can sustain 30-60MB/s writes (older disks), but the network can handle up to ~100MB/s. Throw in some other random seeky IO on the server, a bunch of fragmentation and its sustained write throughput in reality for these writes is more like 10-25MB/s, far slower than the rate at which a client can throw data at it.
5% dirty_ratio * 8GB is 400MB. Let's say in reality the system is flushing 20MB/s to disk, this is a delay of up to 20 seconds. Let's say you have a user application which needs to fsync a number of small files (and unfortunately they are done serially) and now I've got applications (like Firefox) which basically remain unresponsive the
Thanks - I'll give the program a shot later with my test case and see what it reports. My simple test case[1] for reproducing this has reported 6-45 seconds depending on the system. I'll try it with the previously mentioned workload as well.
-Dave
[1] http://bugzilla.kernel.org/show_bug.cgi?id=12309#c249
--
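The back-of-the-envelope estimate in the message above (5% of 8GB is about 400MB, which takes about 20 seconds to drain at 20MB/s) can be written down as a small calculation. This is only a sketch of that arithmetic; the function names are hypothetical and 8GB is rounded to 8000MB as in the message.

```c
#include <assert.h>

/* Amount of dirty page cache allowed before writers are throttled,
 * given total RAM in MB and the dirty ratio in percent. */
double dirty_limit_mb(double ram_mb, double dirty_ratio_pct)
{
	return ram_mb * dirty_ratio_pct / 100.0;
}

/* Worst-case time to drain a full dirty limit at the sustained disk
 * write rate; an fsync() queued behind that data may wait this long. */
double worst_case_flush_seconds(double ram_mb, double dirty_ratio_pct,
				double disk_mb_per_s)
{
	return dirty_limit_mb(ram_mb, dirty_ratio_pct) / disk_mb_per_s;
}
```

Calling worst_case_flush_seconds(8000.0, 5.0, 20.0) reproduces the 20 second figure; at the lower end of the quoted 10-25MB/s range the same limit takes 40 seconds to drain.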
OK, two simple tests on this system produce latencies well over 1-2s using your fsync-tester. The network client writing to disk scenario (~1GB file) resulted in this:
fsync time: 6.5272
fsync time: 35.6803
fsync time: 15.6488
fsync time: 0.3570
One thing to note - writing to this particular array seems to have higher than expected latency without the big write, on the order of 0.2 seconds or so. I think this is because the system is not idle and has a good number of programs on it doing logging and other small bits of IO. vmstat 5 shows the system writing out about 300-1000 under the bo column.
Copying that file to a separate disk was not as bad, but there were still some big spikes:
fsync time: 6.8808
fsync time: 18.4634
fsync time: 9.6852
fsync time: 10.6146
fsync time: 8.5015
fsync time: 5.2160
The destination disk did not have any significant IO on it at the time. The system is running Fedora 10 2.6.27.19-78.2.30.fc9.x86_64 and has two RAID1 arrays attached to an aacraid controller. ext3 filesystems mounted with noatime.
-Dave
--
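The numbers above come from the fsync-tester program referred to in the thread, whose source is not shown here. As a rough illustration of what such a measurement involves, the sketch below times a single small write followed by fsync() using POSIX calls only. It is not the actual tester, and the function name timed_fsync is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Write one small buffer to path and return the seconds spent inside
 * fsync(), or -1.0 if any step fails. */
double timed_fsync(const char *path)
{
	static const char buf[] = "fsync latency probe\n";
	struct timespec start, end;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1.0;
	if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
		close(fd);
		return -1.0;
	}

	/* a monotonic clock is not affected by wall-clock adjustments */
	clock_gettime(CLOCK_MONOTONIC, &start);
	if (fsync(fd) < 0) {
		close(fd);
		return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling this in a loop, once a second, on a file that lives on the filesystem under test while a large sequential write runs in the background should show whether latencies of the kind reported above occur on a given system.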
On Tue, 24 Mar 2009 09:20:32 -0400 You make it sound like this is hard to do... I was running into this problem *every day* until I moved to XFS recently. I'm running a fairly beefy desktop (VMware running a crappy Windows install w/AV junk on it, builds, icecream and large mailboxes) and have a lot of RAM, but it became unusable for minutes at a time, which was just totally unacceptable, thus the switch. Things have been better since, but are still a little choppy. I remember early in the 2.6.x days there was a lot of focus on making interactive performance good, and for a long time it was. But this I/O problem has been around for a *long* time now... What happened? Do not many people run into this daily? Do all the filesystem hackers run with special mount options to mitigate the problem? -- Jesse Barnes, Intel Open Source Technology Center --
On Tue, 24 Mar 2009 16:03:53 -0700 the people that care use my kernel patch on ext3 ;-) (or the userland equivalent tweak in /etc/rc.local) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
There's a couple of comments in bug 12309 [1] which confirm that increasing the priority of kjournald reduces latency significantly since I posted your tweak there yesterday. I hope to do some testing today on my systems to see if it helps on them, too. -Dave [1] http://bugzilla.kernel.org/show_bug.cgi?id=12309 --
Ok, I'll bite: what is the userland tweak? -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
I have 4 gigs of memory on my laptop, and I've never seen these sorts of issues. So maybe filesystem hackers don't have enough memory; or we don't use the right workloads? It would help if I understood how to trigger these disaster cases. I've had to work *really* hard (as in dd if=/dev/zero of=/mnt/dirty-me-harder) in order to get even a 30 second fsync() delay. So understanding what sort of things you do that cause that many files' data blocks to be dirtied, and/or what is causing a major read workload, would be useful. It may be that we just need to tune the VM to be much more aggressive about pushing dirty pages to the disk sooner. Understanding how the
All I can tell you is that *I* don't run into them, even when I was using ext3 and before I got an SSD in my laptop. I don't understand why; maybe because I don't get really nice toys like systems with 32G's of memory. Or maybe it's because I don't use icecream (whatever that is). Whatever it is, it would be useful to get some solid reproduction information, with details about hardware configuration, and information collected using sar and scripts that gather /proc/meminfo every 5 seconds, and what the applications were doing at the time. It might also be useful for someone to try reducing the amount of memory the system is using by using mem= on the boot line, and see if that changes things, and to try simplifying the application workload, and/or using iotop to determine what is most contributing to the problem. (And of course, this needs to be done with someone using ext3, since both ext4 and XFS use delayed allocation, which will largely make this problem go away.) - Ted --
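For the suggestion above of gathering /proc/meminfo every 5 seconds, the fields of most interest for this problem are probably Dirty and Writeback. The sketch below shows one way to pull a field out of meminfo-style text. It is an illustration only: the function name is hypothetical, and reading the file and sleeping between samples are left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up a "Key:   <value> kB" line in meminfo-style text and return
 * the value in kB, or -1 if the key is missing or malformed. */
long meminfo_field_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* require the ':' so that "Writeback" does not match
		 * a longer key such as "WritebackTmp" */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long val;

			if (sscanf(line + klen + 1, "%ld", &val) == 1)
				return val;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A sampling loop would read /proc/meminfo into a buffer, call this for "Dirty" and "Writeback", log the two values with a timestamp, and sleep for 5 seconds before repeating.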
On Tue, 24 Mar 2009 22:09:15 -0400 Well I think that's part of the problem; this is bigger than just filesystems; I've been using ext3 since before I started seeing this, icecream is a distributed compiler system. Like distcc but a bit more Yep, and that's where my blame comes in. I whined about this to a few people, like Arjan, who provided workarounds, but never got beyond that. Some real debugging would be needed to find & fix the root cause(s). -- Jesse Barnes, Intel Open Source Technology Center --
Well, I always had the feeling that at some point from one 2.6.x to another I/O latencies increased a lot. At first I thought I was just imagining this, and by the time I became convinced that it was real, I had forgotten since when I had been observing these increased latencies. This is on IBM ThinkPad T42 and T23 with XFS. I/O latencies are pathetic when dpkg reads in the database or I do tar -xf linux-x.y.z.tar.bz2. I never got down to what is causing these higher latencies, although I tried different I/O schedulers, tuned XFS options, and used relatime. What I found, though, is that on XFS at least a tar -xf linux-kernel / rm -rf linux-kernel operation is way slower with barriers and write cache enabled than with no barriers and no write cache enabled. And frankly I never got that. XFS crawls to a stop on metadata operations when barriers are enabled. According to the XFS FAQ disabling drive write cache should be as safe as enabling barriers. And I always understood barriers as a feature to have *some* ordering constraints, i.e. writes before the barrier go before the barrier and writes after it go after it - even when a drive's hardware write cache is involved. But when this cache is disabled, ordering will always be as issued from the Linux block layer, because all I/Os issued to the drive are write-through and synchronous without write cache, whereas only barrier requests are synchronous with barriers and write cache. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
One issue discussed back then (also for a similar XFS patch) was that having the kernel use the RT priorities by default makes them useless as user override. The proposal was to have a new priority level between normal and RT for this, but no one implemented this. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yesterday about half of my testboxes (3 out of 7) started getting weird networking failures: their network interface just got stuck completely - no rx and no tx at all. Restarting the interface did not help. The failures were highly sporadic and not reproducible - they triggered in distcc workloads, and on random kernels and seemingly random .config's.
After spending most of today trying to find a good reproducer (my regular tests weren't specific enough to catch it in any bisectable manner), i settled down on 4 parallel instances of TCP traffic:
nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &
[ over gigabit, forcedeth driver. ]
If the box hung within 15 minutes, the kernel was deemed bad. Using that method i arrived at this upstream networking fix which was merged yesterday:
| 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
| commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
| Author: Herbert Xu <herbert@gondor.apana.org.au>
| Date: Tue Mar 17 13:11:29 2009 -0700
|
| gro: Fix legacy path napi_complete crash
Applying the straight revert below cured the problem - i now have 10 million packets and 30 minutes of uptime and the box is still fine.
bisection log:
[ 10 iterations ] good: 73bc6e1: Merge branch 'linus'
[ 3 iterations ] bad: 4eac7d0: Merge branch 'irq/threaded'
[ 6.0m packets ] good: e17bbdb: Merge branch 'tracing/core'
[ 0.1m packets ] bad: 8e0ee43: Linux 2.6.29
[ 0.1m packets ] bad: e2fc4d1: dca: add missing copyright/license headers
[ 0.2m packets ] bad: 4783256: virtio_net: Make virtio_net support carrier detection
[ 0.4m packets ] bad: 4ada810: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf
[ 7.0m packets ] good: ec8d540: netfilter: conntrack: fix dropping packet after l4proto->packet()
[ 4.0m packets ] good: d1238d5: netfilter: conntrack: check for NEXTHDR_NONE before header sanity ...
Darn, does this patch help?
net: Fix netpoll lockup in legacy receive path
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
int work = 0;
struct softnet_data *queue = &__get_cpu_var(softnet_data);
unsigned long start_time = jiffies;
+ struct sk_buff *skb;
napi->weight = weight_p;
do {
- struct sk_buff *skb;
-
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
- if (!skb) {
- local_irq_enable();
- napi_complete(napi);
- goto out;
- }
local_irq_enable();
+ if (!skb)
+ break;
napi_gro_receive(napi, skb);
} while (++work < quota && jiffies == start_time);
napi_gro_flush(napi);
+ if (skb)
+ goto out;
+
+ local_irq_disable();
+ __napi_complete(napi);
+ local_irq_enable();
out:
return work;
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
thanks, added it to the test mix. Should know the result in 1-2 hours test time. Ingo --
This commit breaks nfsroot booting on i.MX27 and other ARM boxes with different network cards here in a reproducible way. rsc -- Pengutronix e.K. | | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | --
Can you confirm that Herbert's fix (see it below) solves the
problem?
Ingo
--------------->
From b8b66ac07cab1b45aac93e4f406833a1e0d7677e Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 24 Mar 2009 21:35:42 +0800
Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090324133542.GA29046@gondor.apana.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
net/core/dev.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
int work = 0;
struct softnet_data *queue = &__get_cpu_var(softnet_data);
unsigned long start_time = jiffies;
+ struct sk_buff *skb;
napi->weight = weight_p;
do {
- struct sk_buff *skb;
-
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
- if (!skb) {
- local_irq_enable();
- napi_complete(napi);
- goto out;
- }
local_irq_enable();
+ if (!skb)
+ break;
napi_gro_receive(napi, skb);
} while (++work < quota && jiffies == start_time);
napi_gro_flush(napi);
+ if (skb)
+ goto out;
+
+ local_irq_disable();
+ __napi_complete(napi);
+ local_irq_enable();
out:
return work;
--
Actually, this patch is still racy. If some interrupt comes in
and we suddenly get the maximum amount of backlog we can still
hang when we call __napi_complete incorrectly. It's unlikely
but we certainly shouldn't allow that. Here's a better version.
net: Fix netpoll lockup in legacy receive path
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.
This patch fixes this by essentially open-coding __napi_complete.
Note we no longer need the memory barrier because this function
is per-cpu.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
if (!skb) {
+ list_del(&napi->poll_list);
+ clear_bit(NAPI_STATE_SCHED, &napi->state);
local_irq_enable();
- napi_complete(napi);
- goto out;
+ break;
}
local_irq_enable();
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
napi_gro_flush(napi);
-out:
return work;
}
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Yes, this one works. I always wanted to give a Tested-by: Sascha Hauer <s.hauer@pengutronix.de> Thanks -- Pengutronix e.K. | | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | --
test failure on one of the boxes, interface got stuck after ~100K
packets:
eth1 Link encap:Ethernet HWaddr 00:13:D4:DC:41:12
inet addr:10.0.1.13 Bcast:10.0.1.255 Mask:255.255.255.0
inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2435071 (2.3 MiB) TX bytes:503790 (491.9 KiB)
Interrupt:11 Base address:0x4000
i'm going back to your previous version for now - it might still be
racy but it worked well for about 1.5 hours of test-time.
Ingo
--
What's the NIC and config on this one? If it's still using the legacy/netif_rx path, where GRO is off by default, this patch should make it exactly the same as with my original patch reverted. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Same forcedeth box i reported before. Config below. (note: if you want to use it you need to run it through 'make oldconfig', with all defaults accepted) Ingo
Hm, i just had a test failure (hung interface) with this too. I'll go back to the original straight revert of "303c6a0: gro: Fix legacy path napi_complete crash", and will test it overnight - to establish a baseline of stability again. (to make sure there are no other bugs interacting) Ingo --
FYI, this plain revert is holding up fine in my tests so far - 50 random iterations - the previous one failed after 5 iterations. Ingo --
From: Ingo Molnar <mingo@elte.hu> Something must be up with respect to letting interrupts in during certain windows of time, or similar. I'll take a look at this and hopefully Herbert or myself will be able to figure it out. --
It definitely did not show usual patterns of bug behavior - i'd have found it yesterday morning if it did. I spent most of the time trying to find a reliable reproducer .config and system. Sometimes the bug went away with a minor change in the .config. Until today i didnt even suspect a mainline change causing this. Also, note that i have reduced the probability of UP kernels in my randconfigs artificially to about 12.5% (it is 50% upstream). Still, despite that measure, the 'best' .config i found was an UP config - i dont think that's an accident. Also, i had to fully saturate the target CPU over gigabit to hit the bug best. Which suggests to me (empirically) that it's indeed a race and that it needs a saturated system with lots of IRQs to trigger, and perhaps that it needs saturated/overloaded network device queues and complex userspace/softirq/hardirq interactions. Ingo --
Was this with NAPI on or off? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
This means that we shouldn't even invoke netif_rx/process_backlog, so something else is going on. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> There is always loopback which does netif_rx(). Combine that with the straight NAPI receive that forcedeth is doing here and I'm sure there are all kinds of race scenarios possible :-) You're right about GRO not being relevant here. To be honest I wouldn't be disappointed if GRO was simply on by default even for the legacy paths. --
From: Herbert Xu <herbert@gondor.apana.org.au> I think the problem is that we need to do the GRO flush before the list delete and clearing the NAPI_STATE_SCHED bit. You can't disown the NAPI context until you've squared away the GRO state, I think. Ingo's case stresses TCP a lot so I think he's hitting these GRO cases a lot as well as hitting the backlog maximum. So this mis-ordering of completion operations could explain why he still sees problems. --
From: David Miller <davem@davemloft.net> Ok Herbert, I'm even more sure of this because in your original commit log message you mention: This simply doesn't work since we need to flush the held GRO packets first. We are certainly in a pickle here, actually. We can't run the GRO flush until we re-enable interrupts. But if we re-enable interrupts, more packets get queued to the input_pkt_queue and we end up back where we started. --
That's only because I was calling __napi_complete, which is used by drivers in general so I added the check to ensure that GRO packets have been flushed. Now that we're open-coding it this is no longer a requirement. But what's more GRO should be off on Ingo's test machines because we haven't added anything to turn it on by default for non-NAPI drivers. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Well first of all GRO shouldn't even be on in Ingo's case, unless he enabled it by hand with ethtool. Secondly the only thing that touches the GRO state for the legacy path is process_backlog, and since this is per-cpu, I can't see how another instance can run while the first is still going. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Right. I think the conditions Ingo is running under are that both loopback (using legacy paths) and his NAPI based device (forcedeth) are processing a lot of packets at the same time. Another thing that seems to be critical is he can only trigger this on UP, which means that we don't have the damn APIC potentially moving the cpu target of the forcedeth interrupts around. And this means also that all the processing will be on one cpu's backlog queue only. --
I tested the plain revert i sent in the original report overnight (with about 12 hours of combined testing time), and all systems held up fine. The system that would reproduce the bug within 10-20 iterations did 210 successful iterations. Other systems held up fine too. So if there's no definitive resolution for the real cause of the bug, the plain revert looks like an acceptable interim choice for .29.1 - at least as far as my systems go. Ingo --
From: Ingo Molnar <mingo@elte.hu> Then we get back the instant OOPS that patch fixes :-) I'm sure Herbert will look into fixing this properly. --
OK, let's just do the revert and disable GRO for the legacy path.
This should be the safest option for 2.6.29.
GRO: Disable GRO on legacy netif_rx path
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.
Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,18 +2588,15 @@ static int process_backlog(struct napi_struct *napi, int quota)
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
if (!skb) {
+ __napi_complete(napi);
local_irq_enable();
- napi_complete(napi);
- goto out;
+ break;
}
local_irq_enable();
- napi_gro_receive(napi, skb);
+ netif_receive_skb(skb);
} while (++work < quota && jiffies == start_time);
- napi_gro_flush(napi);
-
-out:
return work;
}
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
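One way to read the lock-up described in the commit message above is with a toy userspace model. The sketch below is not kernel code: `struct backlog`, `model_netif_rx()`, `model_poll_drain()` and `run_scenario()` are hypothetical names, `MAX_BACKLOG` is an arbitrary small limit, and the rule that the enqueue side only tries to schedule the poll routine when it finds the queue empty is an assumption of the model, a simplification of what netif_rx does. It compares completing while IRQs are still off against completing after an interrupt has slipped in.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names and the limit below are hypothetical. */
enum { MAX_BACKLOG = 4 };

struct backlog {
    int  qlen;       /* packets waiting in the input queue */
    bool scheduled;  /* stands in for NAPI_STATE_SCHED */
    int  dropped;    /* packets refused because the queue was full */
};

/* Enqueue side, as run from "interrupt" context. Assumption: it only
 * tries to schedule the poll routine when the queue is empty, and that
 * attempt has no effect if the scheduled flag is already set. */
static void model_netif_rx(struct backlog *b)
{
    if (b->qlen >= MAX_BACKLOG) {
        b->dropped++;
        return;
    }
    if (b->qlen == 0)
        b->scheduled = true;
    b->qlen++;
}

/* Poll side: runs only if scheduled, drains the queue, then completes. */
static void model_poll_drain(struct backlog *b)
{
    if (!b->scheduled)
        return;
    b->qlen = 0;
    b->scheduled = false;
}

/* The poll routine has just found the queue empty. One interrupt then
 * delivers a packet, either after the completion (completion done with
 * IRQs off) or before it (IRQs re-enabled first). Ten more packets
 * follow; the return value is how many packets were dropped. */
static int run_scenario(bool complete_with_irqs_off)
{
    struct backlog b = { .qlen = 0, .scheduled = true, .dropped = 0 };

    if (complete_with_irqs_off) {
        b.scheduled = false;     /* complete first ...             */
        model_netif_rx(&b);      /* ... then the interrupt arrives */
    } else {
        model_netif_rx(&b);      /* interrupt sees "scheduled" set */
        b.scheduled = false;     /* completion forgets that packet */
    }

    for (int i = 0; i < 10; i++) {
        model_poll_drain(&b);
        model_netif_rx(&b);
    }
    assert(b.qlen <= MAX_BACKLOG);
    return b.dropped;
}
```

Under these assumptions `run_scenario(true)` drops nothing, while `run_scenario(false)` leaves one packet queued with nobody scheduled to drain it, so the queue fills to the limit and 7 of the 10 later packets are dropped.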
ok - i have started testing the delta below, on top of the plain revert.
Ingo
diff --git a/net/core/dev.c b/net/core/dev.c
index c1e9dc0..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2594,11 +2594,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
}
local_irq_enable();
- napi_gro_receive(napi, skb);
+ netif_receive_skb(skb);
} while (++work < quota && jiffies == start_time);
- napi_gro_flush(napi);
-
return work;
}
--
Thanks! BTW Ingo, any chance you could help us identify the problem with the previous patch? I don't have a forcedeth machine here and the hang you had with my patch that open-coded __napi_complete appears intimately connected to forcedeth (with NAPI enabled). The simplest thing to try would be to build forcedeth.c with DEBUG and see what it prints out after it locks up. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
it's still fine btw, so: Sure, can try that. Probably the best would be if you sent me a combo patch with the precise patch you meant me to try (there were several patches, i'm not sure which one is the 'previous' one) plus the forcedeth debug enable change as well. Thanks, Ingo --
I saw your patch this morning and added it to my system too. 4 hours and 15 minutes and everything is still fine here.
CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set
with
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
--
Sure, here's the patch to do both.
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
if (!skb) {
+ list_del(&napi->poll_list);
+ clear_bit(NAPI_STATE_SCHED, &napi->state);
local_irq_enable();
- napi_complete(napi);
- goto out;
+ break;
}
local_irq_enable();
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
napi_gro_flush(napi);
-out:
return work;
}
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index b8251e8..101e552 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -64,7 +64,7 @@
#include <asm/uaccess.h>
#include <asm/system.h>
-#if 0
+#if 1
#define dprintk printk
#else
#define dprintk(x...) do { } while (0)
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
The box is not even able to boot up, so many messages are printed. Ingo --
From: Herbert Xu <herbert@gondor.apana.org.au> Applied, and will push to -stable. --
i didn't. (But it's randconfig - so please have a good look at all .config details - maybe something has an unexpected side-effect?) Ingo --
Hi Ingo, No, still doesn't work. It seems to have something to do with enabling interrupts between __skb_dequeue() and __napi_complete(). I reverted 303c6a0251852ecbdc5c15e466dcaff5971f7517 and added a local_irq_enable(); local_irq_disable(); right before __napi_complete() and this already breaks networking. -- Pengutronix e.K. | | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | --
It would be very nice, if you could start with a commit to Makefile, that reflects the new series: e.g.:
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 30
EXTRAVERSION = -pre
-pre for preparing state.
Thanks,
Pete
--
If you're using the kernel-of-the-day, you're probably using git, and CONFIG_LOCALVERSION_AUTO=y should be mandatory. My kernel is called 2.6.29-03321-gbe0ea69... With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 --
I sure hope it never becomes mandatory, I despise that thing. I don't even do -rc tags. .nn is .nn until baked and nn.1 appears. (would be nice if baked were immediately handed to stable .nn.0 instead of being in limbo for a bit, but I don't drive the cart, just tag along behind [w. shovel];) -Mike --
If you're a git user that changes kernels frequently, then enabling CONFIG_LOCALVERSION_AUTO is _really_ convenient when you learn to use it.
This is quite common for me:
    gitk v$(uname -r)..
and it works exactly due to CONFIG_LOCALVERSION_AUTO (and because git is rather good at figuring out version numbers). It's a great way to say "ok, what is in my git tree that I'm not actually running right now".
Another case where CONFIG_LOCALVERSION_AUTO is very useful is when you're noticing some new broken behavior, but it took you a while to notice. You've rebooted several times since, but you know it worked last Tuesday. What do you do?
The thing to do is
    grep "Linux version" /var/log/messages*
and figure out what the good version was, and then do
    git bisect start
    git bisect good ..that-version..
    git bisect bad v$(uname -r)
and off you go. This is _very_ convenient if you are working with some "random git kernel of the day" like I am (and like hopefully others are).
Note that the "v2.6.29[-rcX]" part is totally _useless_ in many cases, because if you're working past merges, and especially if you end up doing bisection, it is very possible that the main Makefile says "2.6.28-rc2", but the code you're working on wasn't actually _merged_ until after 2.6.29.
In other words, the main Makefile version is totally useless in non-linear development, and is meaningful _only_ at specific release times. In between releases, it's essentially a random thing, since non-linear development means that versioning simply fundamentally isn't some simple monotonic numbering. And this is exactly when CONFIG_LOCALVERSION_AUTO is a huge deal.
(It's even more so if you end up looking at "next" or merging other people's trees. If you only ever track my kernel, and you only ever fast-forward - no bisection, no nothing - then the release numbering looks "simple", and things like LOCALVERSION look just like noise).
Linus
--
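To make the structure of such auto-generated version strings concrete, here is a small sketch that splits one into its parts. It assumes the "-NNNNN-gID" suffix shape seen in the examples in this thread (for instance "2.6.29-03321-gbe0ea69"); `struct kver` and `parse_autoversion()` are hypothetical names, and requiring at least seven hex digits in the commit id is an assumption made here to avoid matching suffixes such as "-generic".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the pieces of "<base>-<count>-g<id>". */
struct kver {
    char base[32];  /* e.g. "2.6.29" or "2.6.29-rc8" */
    int  commits;   /* commits since the tagged base */
    char id[41];    /* abbreviated commit id */
};

/* Returns true and fills *out if release ends in a "-<count>-g<id>"
 * style suffix; returns false for a plain tagged version. */
static bool parse_autoversion(const char *release, struct kver *out)
{
    const char *g = NULL;

    assert(release && out);

    /* Remember the last "-g" in the string. */
    for (const char *p = release; (p = strstr(p, "-g")) != NULL; p += 2)
        g = p;
    if (!g)
        return false;

    /* Commit id: lowercase hex digits after "-g" (assumed >= 7). */
    size_t idlen = strspn(g + 2, "0123456789abcdef");
    if (idlen < 7 || idlen >= sizeof(out->id))
        return false;

    /* Commit count: the digits between the previous '-' and "-g". */
    const char *c = g;
    while (c > release && c[-1] != '-')
        c--;
    if (c == release || c == g)
        return false;
    if (strspn(c, "0123456789") != (size_t)(g - c))
        return false;

    /* Base version: everything before the '-' that precedes the count. */
    size_t baselen = (size_t)(c - 1 - release);
    if (baselen == 0 || baselen >= sizeof(out->base))
        return false;

    memcpy(out->base, release, baselen);
    out->base[baselen] = '\0';
    out->commits = (int)strtol(c, NULL, 10);
    memcpy(out->id, g + 2, idlen);
    out->id[idlen] = '\0';
    return true;
}
```

Fed the string that `uname -r` reports, the `id` field is the piece one would hand to git to find the exact commit a kernel was built from, while a plain release string such as "2.6.29" is rejected.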
That's why it irritates me. I build/test a lot, and do the occasional bisection, which makes a mess in /boot and /lib/modules. I use a quilt stack of git pull diffs as reference/rummage points. Awkward maybe, but effective (so no need for autoversion), and no mess. -Mike --
Well, you guys always see things from a deeply involved kernel developer _using git_ POV - which I do understand and accept (unlike hats nobody can change his head after all ;-), but there are other approaches to kernel source code, e.g. git is also really great for tracking the kernel development without any further involvement apart from using the resulting trees. I build kernel rpms from your git tree, and have a bunch of BUILDs lying around. Sure, I can always fetch the tarballs or fiddle with git, but why? Having a Makefile start commit allows to make sure with simplest tools, say "head Makefile" that a locally copied 2.6.29 tree is really a 2.6.29, and not something moving towards the next release. That's all, nothing less, nothing more, it's just a strong hint which blend is in the box. I always wonder, why Arjan does not intervene for his kerneloops.org project, since your approach opens a window of uncertainty during the merge window when simply using git as an efficient fetch tool. Ducks and hides now, Pete --
I would *love* it if Linus would, as first commit mark his tree as "-git0" (as per snapshots) or "-rc0". So that I can split the "final" versus "merge window" oopses. --
..which is an important difference. I still vote for -pre for "preparation state" as -git0 does imply some sort of versioning, which *is* meaningless in this state. Linus, this would be a small step for you, but makes a big difference for those of us, that miss it sorely. Junio: is it possible to automate this in git somehow: make sure, that the first commit after a release really happens for a "new" version (e.g. a version patch to Makefile)? Pete --
Can't you discern that from the v$VERSION tag? According to your definition, -git0 would simply be v2.6.29 commit + 1, correct? Jeff --
it needs to be something that is shown in the oops output... ... basically version or extraversion in the Makefile. --
Pretty please... I keep kernel binaries around and being able to tell what it is when it boots is useful. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
So you have a place where you have a git repository from which you copy
You may add a small script like this into .git/hooks/post-checkout:
-----
#!/bin/bash
if [ "$3" == 1 ]; then # don't do it for file checkouts
    sed -ri "s/^(EXTRAVERSION =.*)/\1$(scripts/setlocalversion)/" Makefile
fi
-----
That will append the EXTRAVERSION automatically with what
If you are working on a tagged version, the EXTRAVERSION won't be extended, on an untagged version it will have some ident for that intermediate version e.g.
    git checkout master -> EXTRAVERSION =-07100-g833bb30
    git checkout HEAD~1 -> EXTRAVERSION =-07099-g8b53ef3
    git checkout v2.6.29 -> EXTRAVERSION =
    git checkout HEAD~1 -> EXTRAVERSION = -rc8-00303-g0030864
    git checkout v2.6.29-rc8 -> EXTRAVERSION = -rc8
In that way your copies of the source tree will have the EXTRAVERSION set in the Makefile. You can detect an intermediate version easily in the Makefile and you can even check out that exact version from the git tree later, if you need to. Or just make a diff between two rpms by diffing the versions taken from the Makefiles e.g.
    git diff 07099-g8b53ef3..07100-g833bb30
or
    git diff 00303-g0030864..v2.6.29
Attention: Of course, the Makefile is changed in your working tree as if you had changed it yourself. Therefore you have to use "git checkout Makefile" to revert the changes before you can check out a different version from the git tree.
This is only a hack and there might be a better way to do it, but maybe it helps as a starting point in your special situation.
Andreas
--
If you have a git checkout, you can easily do this yourself:
    git checkout -b 2.6.30-rc master
    sed -i "/^SUBLEVEL/ s/29/30/; /^EXTRAVERSION/ s/$/ -rc0/" Makefile
    git add Makefile
    git commit -m "Mark as -rc0"
Then to get latest git head:
    git checkout master
    git pull
    git rebase master 2.6.30-rc
When Linus releases -rc1, the rebase will signal a conflict on that commit and you can just 'git rebase --skip' it.
Instead of sed you can also just edit the Makefile of course, or you can go the other way and create a simple script that automatically increases the existing sublevel by 1. I just do this manually, given that it's only needed once per three months or so.
Using a branch is something I do anyway as I almost always have a few minor patches on top of git head for various reasons.
Cheers,
FJP
--
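The "increase the existing sublevel by 1" variant mentioned above could look roughly like the following per-line rewrite. This is only a sketch: `mark_rc0()` is a hypothetical helper, it assumes each Makefile line is passed in without its trailing newline, and it mirrors the sed expression by appending " -rc0" to whatever the EXTRAVERSION line already holds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies one Makefile line into out, bumping
 * SUBLEVEL by one and appending " -rc0" to the EXTRAVERSION line.
 * Returns 1 if the line was changed, 0 if it was copied unchanged. */
static int mark_rc0(const char *line, char *out, size_t outlen)
{
    int sublevel;

    assert(line && out && outlen > 0);

    if (sscanf(line, "SUBLEVEL = %d", &sublevel) == 1) {
        snprintf(out, outlen, "SUBLEVEL = %d", sublevel + 1);
        return 1;
    }
    if (strncmp(line, "EXTRAVERSION =", 14) == 0) {
        snprintf(out, outlen, "%s -rc0", line);
        return 1;
    }
    snprintf(out, outlen, "%s", line);
    return 0;
}
```

Applied to the first few lines of the top-level Makefile, this turns "SUBLEVEL = 29" into "SUBLEVEL = 30" and "EXTRAVERSION =" into "EXTRAVERSION = -rc0", leaving every other line alone.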
