Good idea. The filesystems are all ext3 with default mount parameters.
The dmesgs say that the filesystems are mounted in ordered data mode and
that barriers are not enabled. mount says:

/dev/sda2 on / type ext3 (rw,errors=remount-ro,commit=0)
/dev/sda1 on /boot type ext3 (rw,commit=0)

Do you mean the partition sizes? Here's that:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              72G   52G   17G  77% /
tmpfs                 755M  4.0K  755M   1% /lib/init/rw
udev                  750M  212K  750M   1% /dev
tmpfs                 755M     0  755M   0% /dev/shm
/dev/sda1             274M  117M  143M  45% /boot

I don't have a test from the time I ran rsync (but I'll check that
tonight), but I traced the currently running emacs and iceweasel (a.k.a.
firefox) with "strace -p PID 2>&1 | grep sync". That didn't turn up any
sync-related calls. (I checked the firefox because I seem to remember
that it used to do fsync absurdly often, but I also seem to remember
that the outcry made them stop.)

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
 glorify the hunters.'  --African Proverb
--
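The mount lines quoted above can be checked mechanically. The following is only a sketch, assuming the `DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)` layout shown in that message; the names `parse_mount_line` and `explicit_data_mode` are made up for illustration. Note that it can only report a `data=` mode that was given explicitly: with default parameters the option is absent from the mount output, which is why the poster had to read the journalling mode from dmesg.

```python
import re

# Layout printed by mount(8), as quoted in the message above:
#   DEVICE on MOUNTPOINT type FSTYPE (OPT1,OPT2,...)
# Assumption: no spaces in the device or mount point names.
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \(([^)]*)\)$")


def parse_mount_line(line):
    """Return a dict describing one line of mount output, or None if it does not match."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        return None
    device, mountpoint, fstype, raw_options = match.groups()
    options = {}
    for item in raw_options.split(","):
        key, _, value = item.partition("=")
        # Flags such as "rw" carry no value; store True for them.
        options[key] = value if value else True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


def explicit_data_mode(entry):
    """Return the data= option if it was given explicitly, else None.

    None does not mean "no journalling mode"; it means the filesystem's
    compiled-in or superblock default applies.
    """
    value = entry["options"].get("data")
    return value if isinstance(value, str) else None


if __name__ == "__main__":
    for text in (
        "/dev/sda2 on / type ext3 (rw,errors=remount-ro,commit=0)",
        "/dev/sda1 on /boot type ext3 (rw,commit=0)",
    ):
        entry = parse_mount_line(text)
        print(entry["mountpoint"], entry["fstype"], explicit_data_mode(entry))
```
--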
I now personally think that this problem is the kernel not keeping
track of readers vs. writers properly, or not giving reading processes
as much time as writing ones, so that the writers look like they are
blocking the system....

If you want to do a simple test, do an unlimited dd (or two dd's of a
limited size, say 10 GB) and a find /

Tell me how it goes :) (the system will stall)
(obviously stop the dd after some time :) ).

http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.
--
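The dd-plus-find test suggested above could be approximated along the following lines. This is a rough sketch, not the poster's procedure: it writes zero-filled chunks from a background thread (standing in for dd) while timing a directory walk (standing in for find). The function names and the `scratch_file` and `max_bytes` parameters are hypothetical, `os.walk` does not reproduce every find option, and a meaningful measurement needs a cold cache and a write size far larger than RAM.

```python
import os
import threading
import time


def write_load(path, stop_event, max_bytes, chunk_size=1 << 20):
    """Write zero-filled chunks to `path` until `max_bytes` is reached or we are told to stop."""
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as out:
        while written < max_bytes and not stop_event.is_set():
            out.write(chunk)
            written += chunk_size


def timed_walk(root):
    """Walk `root` and return (entries_seen, elapsed_seconds)."""
    start = time.monotonic()
    count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
    return count, time.monotonic() - start


def walk_under_write_load(walk_root, scratch_file, max_bytes):
    """Time a walk of `walk_root` while a background thread writes to `scratch_file`."""
    stop = threading.Event()
    writer = threading.Thread(
        target=write_load, args=(scratch_file, stop, max_bytes)
    )
    writer.start()
    try:
        return timed_walk(walk_root)
    finally:
        # Always stop the writer and clean up, even if the walk fails.
        stop.set()
        writer.join()
        if os.path.exists(scratch_file):
            os.remove(scratch_file)
```

Comparing the result against a plain `timed_walk()` on the same tree, with the scratch file on the same filesystem and then on a different disk, would show whether the slowdown is confined to the loaded filesystem.
--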
The find runs at IO latency speed while the dd processes run at disk
bandwidth:

Device: rrqm/s wrqm/s    r/s     w/s rMB/s  wMB/s avgrq-sz avgqu-sz await svctm %util
vda       0.00   0.00   0.00    0.00  0.00   0.00     0.00     0.00  0.00  0.00  0.00
vdb       0.00   0.00  58.00 1251.00  0.45 556.54   871.45    26.69 20.39  0.72 94.32
sda       0.00   0.00   0.00    0.00  0.00   0.00     0.00     0.00  0.00  0.00  0.00

That looks pretty normal to me for XFS and the noop IO scheduler, and
there are no signs of latency or interactive problems in the system at
all. Kill the dd's and:

Device: rrqm/s wrqm/s    r/s     w/s rMB/s  wMB/s avgrq-sz avgqu-sz await svctm %util
vda       0.00   0.00   0.00    0.00  0.00   0.00     0.00     0.00  0.00  0.00  0.00
vdb       0.00   0.00 214.80    0.40  1.68   0.00    15.99     0.33  1.54  1.54 33.12
sda       0.00   0.00   0.00    0.00  0.00   0.00     0.00     0.00  0.00  0.00  0.00

And the find runs 3-4x faster, but ~200 iops is about the limit I'd
expect from 7200rpm SATA drives given a single thread issuing IO.

No, the system doesn't stall at all. It runs just fine. Sure, anything
that requires IO on the loaded filesystem is _slower_, but if you're
writing huge files to it that's pretty much expected. The root drive
(on a different spindle) is still perfectly responsive on a cold cache:

$ sudo time find / -xdev > /dev/null
0.10user 1.87system 0:03.39elapsed 58%CPU (0avgtext+0avgdata 7008maxresident)k
0inputs+0outputs (1major+844minor)pagefaults 0swap

So what you describe is not a systemic problem, but a problem that your
system configuration triggers. That's why we need to know

You're pointing to a "fsync-tester" program that exercises a well-known
problem with ext3 (sync-the-world-on-fsync). Other filesystems do not
have that design flaw so don't suffer from interactivity ...
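The iostat figures quoted in this message can be cross-checked with a little arithmetic. The sketch below assumes that avgrq-sz is in 512-byte sectors and that the MB/s columns are really MiB/s, as in the sysstat versions I am aware of; `derived_stats` is a name invented here. It recomputes throughput, queue depth (Little's law: requests per second times time in system) and utilisation from the other columns, and compares the time per IO with half a rotation of a 7200rpm disk.

```python
SECTOR_BYTES = 512  # assumption: iostat reports avgrq-sz in 512-byte sectors


def derived_stats(r_s, w_s, avgrq_sz, await_ms, svctm_ms):
    """Recompute iostat columns for one device from its other columns."""
    iops = r_s + w_s
    return {
        "iops": iops,
        # requests/s * sectors/request * bytes/sector, expressed in MiB/s
        "throughput_mib_s": iops * avgrq_sz * SECTOR_BYTES / 2**20,
        # Little's law: average queue length = arrival rate * time in system
        "queue_depth": iops * await_ms / 1000.0,
        # fraction of each second the device spent servicing requests
        "util_percent": iops * svctm_ms / 1000.0 * 100.0,
        "ms_per_io": 1000.0 / iops,
    }


# vdb while the dd processes are running, and after they are killed
loaded = derived_stats(58.00, 1251.00, 871.45, 20.39, 0.72)
idle = derived_stats(214.80, 0.40, 15.99, 1.54, 1.54)

# Average rotational latency of a 7200rpm disk: half a revolution.
half_rotation_ms = 60_000 / 7200 / 2

if __name__ == "__main__":
    for label, stats in (("loaded", loaded), ("idle", idle)):
        print(label, {key: round(value, 2) for key, value in stats.items()})
    print("half rotation (ms):", round(half_rotation_ms, 2))
```

With dd running, the recomputed numbers line up with the reported rMB/s + wMB/s, avgqu-sz and %util columns. Without dd, each IO takes a bit over half a rotation, which fits the remark that roughly 200 iops is what a single thread can get out of such a drive.
--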
Thank you for your reply.

Well I am not sure :) Is the answer "don't use ext3"? If it is, what
should I really be using instead?
--
As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

-- Jens Axboe
--
At least for ext3, more important than atimes is the "data=writeback"
setting. Especially since our atime default is sane these days (ie if
you don't specify anything, we end up using 'relatime').
If you compile your own kernel, answer "N" to the question
Default to 'data=ordered' in ext3?
at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
"data=writeback" is in the fstab (but I don't think everything honors
it for the root filesystem).
Linus
--
Don't forget to mention data=writeback is not the default because if
your system crashes or you lose power running in this mode it will
*CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention the
significant security issues (e.g. stale data exposure) that also occur
even if the filesystem is not corrupted by the crash. IOWs,
data=writeback is the "fast but I'll eat your data" option for ext3.

So I recommend that nobody follows this path because it only leads to
worse trouble down the road. Your best bet is to migrate away from ext3
to a filesystem that doesn't have such inherent ordering problems, like
ext4 or XFS....

Cheers,
Dave.
-- Dave Chinner david@fromorbit.com
--
OK, so all of us on ext3 (who want to avoid these problems) should just
up and move to ext4 ^ ^ ?
--
Is it safe to use "data=writeback" with ext4? At least, are there
security issues?

Why do you say that the fs can be corrupted? Metadata is still
journalled, so only data might be corrupted, but the FS should still be
consistent.

-- Evgeniy Ivanov
--
I believe the same issues exist with data=writeback in ext4, but you
probably should have an ext4 developer answer that question for

Data corruption is still a filesystem corruption.

Cheers,
Dave.
-- Dave Chinner david@fromorbit.com
--
As far as I understand, apps should not expect anything unless they
use fsync(). And fsync() still works in ext3...

-- (english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
You will lose your data, but the filesystem should still be

I agree on security issues.

Pavel
-- (english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
That is data that was freshly touched around the time the system went
down, right? I.e. data that was probably half-modified by user-space to
begin with.

Ingo
--
It's data that wasn't synced out yet, yes. Which isn't the problem per
se. With ext3/4 in ordered mode, or xfs, or btrfs, the file size won't
be incremented until the data is written. In ext3/4 in writeback mode
(or various non-journaling filesystems), however, the inode size is
updated and metadata changes are logged, possibly before the data
itself has been written. Besides exposing stale data, which is a
security risk in multi-user systems, it also means the inode looks
modified (by size and timestamps), but contains other data than
actually written.
--
Well, afaict that's traditional unix behaviour... while it is not user
friendly, I'd not call it 'corrupted filesystem'.

Pavel
-- (english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
This is strictly speaking not true. Using data=writeback will not
cause you to lose any data --- at least, not any more than you would
without the feature. If you have applications that write files in an
unsafe way, that data is going to be lost, one way or another. (i.e.,
with XFS in a similar situation you'll get a zero-length file)

The difference is that in the case of a system crash, there may be
unwritten data revealed if you use data=writeback. This could be a
security exposure, especially if you are using your system as a
time-sharing system, where you might see the contents of deleted files
belonging to another user.

So it is not an "eat your data" situation, but rather a "possibly
expose old data" one. Whether or not you care on a single-user
workstation is an individual judgement call. There's been a lot of
controversy about this.

The chance that this occurs using data=writeback in ext4 is much less,
BTW, because with delayed allocation we delay updating the inode until
right before we write the block. I have a plan for changing things so
that we write the data blocks *first* and then update the metadata
blocks second, which will mean that ext4 data=ordered will go away
entirely, and we'll get the safety while also avoiding the forced data
page writeouts during journal commits.

-- Ted
--
That's the scheme used by XFS and btrfs in one form or another. Chris
also had a patch to implement it for ext3, which unfortunately fell
under the floor.
--
It probably still applies, but by the time I had it stable I realized
that ext4 was really a better place to fix this stuff. ext3 is what it
is (good and bad), and a big change like my data=guarded code probably
isn't the best way to help.

-chris
--
In theory, that's all that is _supposed_ to happen. However, my recent
experience is that massive ext3 filesystem corruption occurs in
data=writeback mode when the system crashes and that does not happen in
ordered mode.

Why do you think I posted the patches to change the default back to
ordered mode a few months back? I basically trashed the root ext3
partitions on three test machines (to the point where >5000 files
across /sbin, /bin, /lib and /usr were corrupted or missing and I had
to reinstall from scratch) when I'd forgotten to set the
ordered-is-default config option in the kernel I was testing. And that
is when the only thing being written to the root filesystems was log
files...

The worst part about this was that I also had ext3 filesystems
corrupted by crashes in such a way that e2fsck didn't detect it but

My experience says otherwise....

Cheers,
Dave.
-- Dave Chinner david@fromorbit.com
--
You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.
So your argument is kind of dishonest. The thing is, if you have a
crash or power outage or whatever, the only data you can really rely
on is always going to be the data that you fsync'ed before the crash.
Everything else is just gravy.
Are there downsides to "data=writeback"? Absolutely. But anybody who
tries to push those downsides without taking the performance and
latency issues into account is just not thinking straight.
Too many people think that "correct" is somehow black-and-white. It's
not. "The correct answer too late" is not worth anything. Sane people
understand that "good enough" is important.
And quite frankly, "data=writeback" is not wonderful, but it's "good
enough". And it helps enormously with at least one class of serious
performance problems. Dismissing it because it doesn't have quite the
guarantees of "data=ordered" is like saying that you should never use
"pi=3.14" for any calculations because it's not as exact as
"pi=3.14159265". The thing is, for many things, three significant
digits (or even _one_ significant digit) is plenty.
ext3 [f]sync sucks. We know. All filesystems suck. They just tend to
do it in different dimensions.
Linus
--
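The point made above, that only data you fsync'ed before the crash can be relied on, is usually put into practice with a write, fsync, rename sequence. The following is a minimal sketch of that common pattern, assuming a POSIX system such as Linux; the function name `durable_write` and the ".tmp" suffix are hypothetical, and it is not claimed to be what any particular mail client in this thread does.

```python
import os


def durable_write(path, data):
    """Replace `path` with `data` (bytes) using write + fsync + rename.

    The intent is that a crash leaves either the old file or the complete
    new file, and that the new contents are on stable storage once this
    function returns.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temp file
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # move Python's userspace buffer into the kernel
        os.fsync(tmp.fileno())  # ask the kernel to write the file data out
    os.replace(tmp_path, path)  # rename over the old file
    # Also flush the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The `os.fsync()` call on the data file is the step whose cost differs so much between the filesystems and journalling modes discussed in this thread.
--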
On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds

Linus, are you running with data=writeback? Those of us who did
(without a UPS) will never do it again. The probability of non-trivial
FS corruption becomes so much bigger. I believe from my experience that
the average number of crashes before one loses the FS becomes a
single-digit number. With data=ordered, it's quite hard.
--
I used to, indeed. But since I upgrade computers fairly regularly, and
all the distros have moved towards ext4, I'm no longer using ext3 at
all.
But yes, to me ext3 was totally unusable with rotational media and
"data=ordered". Not just bad. Total crap. Whenever the mail client
wanted to write something out, the whole machine basically stopped.
Of course, part of that was that long ago I used reiserfs back when
SuSE had it as the default. So I didn't think that the hiccups were
"normal" like a lot of people probably do. I knew better. So it was
Before or after the change to make renaming on top of old files do the
IO flushing?
That made a big difference for some rather common cases.
Linus
--
On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds

That's good.
--
I've used ext2 and ext3 extensively on all of the boxes here, ever
since each first became available.

I developed Linux IDE, the first IDE DMA, lots of custom storage
drivers, and more recently worked on libata drivers. This meant a LOT
of sudden and catastrophic system failures, as the bugs and other kinks
were worked on.

Never lost a nibble. Totally, utterly reliable stuff for everyday use.
*WITH* the write-caches all enabled on all of the drives, too.

Sure, sudden power-failures could have a better chance of corrupting
data, but those are really rare, and the few that have happened were
again non-events here.

That's the difference between theory and practice.

Cheers
-ml
--
I've been using it for a looong time on my desktop box. Yeah, you can
be bitten easier than ordered, and I have been, but it's never been
anything major. The risk for me is worth it, as data=ordered sucked
really bad. If I didn't need to maintain compatibility with 30+ old
kernels for

That's not my experience. I've yet to have to rebuild my ext3 fs since
upgrading box to shiny new opensuse 11.1 (however long ago and how many
--
I crash kernels tens of times every day doing filesystem testing. With
data=ordered I have not seen a corrupted root filesystem as a result of
normal testing and crashing as long as I can remember. With
data=writeback, I'll have corrupted root ext3 partitions in under a
day. Hardly what I'd call stable or something you'd want to deploy in
production.

Cheers,
Dave.
-- Dave Chinner david@fromorbit.com
--
On Fri, 5 Nov 2010 08:48:13 -0400

btw few more things to try (from my standard rc.local script):

echo 4096 > /sys/block/sda/queue/nr_requests
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done
echo 75 > /proc/sys/vm/dirty_ratio

(replace sda with whatever your disk is of course)

-- Arjan van de Ven
Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
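Before trying the settings above, it may be worth recording the current values so they can be restored afterwards. A small read-only sketch, assuming the two sysfs/procfs paths named in that message; `snapshot_tunables` and its `root` parameter are hypothetical (the latter exists only so the function can be pointed at a test directory), and the ionice change to kjournald is not covered.

```python
import os


def tunable_paths(disk, root="/"):
    """Paths of the block-queue and VM tunables mentioned above, for `disk` (e.g. "sda")."""
    return {
        "nr_requests": os.path.join(root, "sys/block", disk, "queue/nr_requests"),
        "dirty_ratio": os.path.join(root, "proc/sys/vm/dirty_ratio"),
    }


def snapshot_tunables(disk, root="/"):
    """Read the current value of each tunable; None if it cannot be read."""
    values = {}
    for name, path in tunable_paths(disk, root).items():
        try:
            with open(path) as handle:
                values[name] = handle.read().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    print(snapshot_tunables("sda"))
```
--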
