Re: 2.6.36 io bring the system to its knees

From: Sanjoy Mahajan
Date: Friday, November 5, 2010 - 5:48 am

Good idea.  

The filesystems are all ext3 with default mount parameters.  The dmesgs
say that the filesystems are mounted in ordered data mode and that
barriers are not enabled.

mount says:

/dev/sda2 on / type ext3 (rw,errors=remount-ro,commit=0)
/dev/sda1 on /boot type ext3 (rw,commit=0)

Do you mean the partition sizes?  Here's that:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              72G   52G   17G  77% /
tmpfs                 755M  4.0K  755M   1% /lib/init/rw
udev                  750M  212K  750M   1% /dev
tmpfs                 755M     0  755M   0% /dev/shm
/dev/sda1             274M  117M  143M  45% /boot


I don't have a test from the time I ran rsync (but I'll check that
tonight), but I traced the currently running emacs and iceweasel
(a.k.a. firefox) with "strace -p PID 2>&1 | grep sync".  That didn't
turn up any sync-related calls.

(I checked firefox because I seem to remember that it used to call
fsync absurdly often, but I also seem to remember that the outcry made
them stop.)
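
A slightly more targeted variant of the strace check above, as a sketch only: instead of grepping the whole trace for "sync", strace can be asked to report just the sync-family system calls. The helper names trace_sync_calls and count_sync_calls are made up for this illustration; -f, -p and -e trace= are standard strace options.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original mail.

# Attach to a running process (and its threads, via -f) and print only
# the sync-family syscalls. Stop with Ctrl-C.
trace_sync_calls() {
    pid=$1
    strace -f -p "$pid" -e trace=fsync,fdatasync,sync,msync
}

# Count lines containing sync-family calls in a saved strace log (stdin).
# grep -c exits non-zero when nothing matches, hence the "|| true".
count_sync_calls() {
    grep -cE '(^|[^a-z_])(fsync|fdatasync|sync|msync)\(' || true
}

# Usage (1234 is a placeholder PID):
#   trace_sync_calls 1234 2>&1 | tee /tmp/sync.log
#   count_sync_calls < /tmp/sync.log
```

A count of 0 over a representative period would support the observation that neither program was issuing sync calls while being traced.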

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
 glorify the hunters.'  --African Proverb
--

From: dave b
Date: Saturday, November 6, 2010 - 7:10 am

I have now come to think that this problem is the kernel not keeping
track of readers vs. writers properly, or not giving reading processes
as much time as writing ones, which makes the writers look like they
are blocking the system....

If you want to do a simple test, run an unlimited dd (or two dd's of a
limited size, say 10GB) alongside a find /.
Tell me how it goes :) (the system will stall)
(obviously stop the dd after some time :) ).

http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.
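
The dd-plus-find test described above could be scripted roughly as follows. This is a sketch under assumptions: the function name, its arguments and the scratch file name are invented here, and the write is size-bounded so the disk cannot fill up. It assumes GNU dd (bs=1M).

```shell
#!/bin/sh
# Hypothetical wrapper around the "dd plus find" test.
#   $1 = directory on the filesystem under test (scratch file goes here)
#   $2 = size of the dd write in MiB
#   $3 = tree for find to walk (defaults to $1)
io_stall_test() {
    dir=$1
    mib=$2
    tree=${3:-$1}
    scratch="$dir/dd-stall-test.$$"

    # Background writer: streams zeroes into the scratch file.
    dd if=/dev/zero of="$scratch" bs=1M count="$mib" 2>/dev/null &
    writer=$!

    # Foreground reader: walk the tree and measure wall-clock time.
    start=$(date +%s)
    find "$tree" -xdev > /dev/null 2>&1 || true
    end=$(date +%s)

    # Stop the writer if it is still running, then clean up.
    kill "$writer" 2>/dev/null || true
    wait "$writer" 2>/dev/null || true
    rm -f "$scratch"

    echo "find took $((end - start))s while dd was writing"
}

# Usage (paths and size are placeholders):
#   io_stall_test /home 10240 /
```

Running the same find once more without the writer gives a baseline to compare against.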
--

From: Dave Chinner
Date: Saturday, November 6, 2010 - 8:12 am

The find runs at IO latency speed while the dd processes run at disk
bandwidth:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
vdb               0.00     0.00   58.00 1251.00     0.45   556.54   871.45    26.69   20.39   0.72  94.32
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

That looks pretty normal to me for XFS and the noop IO scheduler,
and there are no signs of latency or interactive problems in
the system at all. Kill the dd's and:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
vdb               0.00     0.00  214.80    0.40     1.68     0.00    15.99     0.33    1.54   1.54  33.12
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

And the find runs 3-4x faster, but ~200 iops is about the limit
I'd expect from 7200rpm SATA drives given a single thread issuing IO

No, the system doesn't stall at all. It runs just fine. Sure,
anything that requires IO on the loaded filesystem is _slower_, but
if you're writing huge files to it that's pretty much expected. The
root drive (on a different spindle) is still perfectly responsive on
a cold cache:

$ sudo time find / -xdev > /dev/null
0.10user 1.87system 0:03.39elapsed 58%CPU (0avgtext+0avgdata 7008maxresident)k
0inputs+0outputs (1major+844minor)pagefaults 0swap

So what you describe is not a systemic problem, but a problem that
your system configuration triggers. That's why we need to know

You're pointing to a "fsync-tester" program that exercises a
well-known problem with ext3 (sync-the-world-on-fsync). Other
filesystems do not have that design flaw so don't suffer from
interactivity ...
--

From: dave b
Date: Saturday, November 6, 2010 - 11:06 pm

Thank you for your reply.
Well I am not sure :)
Is the answer "don't use ext3" ?
If it is what should I really be using instead?
--

From: Jens Axboe
Date: Sunday, November 7, 2010 - 5:08 am

As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

-- 
Jens Axboe

--

From: Linus Torvalds
Date: Sunday, November 7, 2010 - 8:50 am

At least for ext3, more important than atimes is the "data=writeback"
setting. Especially since our atime default is sane these days (ie if
you don't specify anything, we end up using 'relatime').

If you compile your own kernel, answer "N" to the question

  Default to 'data=ordered' in ext3?

at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
"data=writeback" is in the fstab (but I don't think everything honors
it for the root filesystem).
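
Before changing anything it helps to see which journalling mode each filesystem is using. The following is a minimal sketch, with a made-up function name, that reads lines in /proc/mounts format and prints the data= option for ext3/ext4 mounts. If the option is not listed, the script says so rather than guessing the default.

```shell
#!/bin/sh
# Hypothetical helper: list ext3/ext4 mounts and their data= mode.
# Input format: device mountpoint fstype options dump pass
ext_data_mode() {
    awk '$3 == "ext3" || $3 == "ext4" {
        mode = "(not listed, kernel default applies)"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^data=/)
                mode = substr(opts[i], 6)
        print $1, $2, $3, mode
    }'
}

# Usage on a live system:
#   ext_data_mode < /proc/mounts
#
# An fstab entry asking for writeback mode carries the option in its
# fourth field, for example (device and mount point are placeholders):
#   /dev/sdXN  /data  ext3  defaults,data=writeback  0  2
```

Whether writeback mode is an acceptable trade-off is exactly what the rest of this thread argues about, so treat the fstab line as an illustration of the syntax, not a recommendation.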

                                   Linus
--

From: Dave Chinner
Date: Tuesday, November 9, 2010 - 6:32 pm

Don't forget to mention data=writeback is not the default because if
your system crashes or you lose power running in this mode it will
*CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
the significant security issues (e.g. stale data exposure) that also
occur even if the filesystem is not corrupted by the crash. IOWs,
data=writeback is the "fast but I'll eat your data" option for ext3.

So I recommend that nobody follows this path because it only leads
to worse trouble down the road.  Your best bet is to migrate away
from ext3 to a filesystem that doesn't have such inherent ordering
problems, such as ext4 or XFS....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: dave b
Date: Tuesday, November 9, 2010 - 7:01 pm

OK, so all of us on ext3 who want to avoid these problems should
just up and move to ext4 ^ ^ ?
--

From: Evgeniy Ivanov
Date: Wednesday, November 10, 2010 - 1:08 am

Is it safe to use "data=writeback" with ext4? At the very least, are
there security issues?
Why do you say that the FS can be corrupted? Metadata is still
journalled, so only data might be corrupted, but the FS should still be
consistent.


-- 
Evgeniy Ivanov
--

From: Dave Chinner
Date: Wednesday, November 10, 2010 - 1:24 am

I believe the same issues exist with data=writeback in ext4, but you
probably should have an ext4 developer answer that question for

Data corruption is still a filesystem corruption.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Pavel Machek
Date: Wednesday, November 10, 2010 - 7:22 am

As far as I understand, apps should not expect anything unless they
use fsync(). And fsync() still works in ext3...
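
For scripts, the "fsync before you rely on it" rule can be approximated as write-to-temp, fsync, then rename. This is a sketch assuming GNU dd (for conv=fsync) and mktemp; the function name safe_replace is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read new contents from stdin and replace $1 so
# that a crash leaves either the old file or the complete new one.
safe_replace() {
    target=$1
    tmp=$(mktemp "$target.XXXXXX") || return 1

    # conv=fsync makes dd fsync() the output file before it exits, so
    # the data is on disk before the rename makes it visible.
    if ! dd of="$tmp" conv=fsync 2>/dev/null; then
        rm -f "$tmp"
        return 1
    fi

    # rename() within one filesystem is atomic.
    mv -f "$tmp" "$target"
}

# Usage (path is a placeholder):
#   generate_config | safe_replace /etc/example.conf
```

Two caveats: mktemp creates the file with restrictive permissions, so the replaced file does not inherit the old mode, and a fully durable version would also fsync the containing directory after the rename.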

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Wednesday, November 10, 2010 - 7:20 am

You will lose your data, but the filesystem should still be

I agree on security issues.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 7:27 am

That is data that was freshly touched around the time the system went down, right?

I.e. data that was probably half-modified by user-space to begin with.

	Ingo
--

From: Christoph Hellwig
Date: Wednesday, November 10, 2010 - 7:55 am

It's data that wasn't synced out yet, yes.  Which isn't the problem per
se.  With ext3/4 in ordered mode, or xfs, or btrfs the file size won't
be incremented until the data is written.  In ext3/4 in writeback mode
(or various non-journaling filesystems) however the inode size is
updated, and metadata changes are logged, before the data reaches
disk.  Besides exposing stale data, which is a security risk in
multi-user systems, it also means the inode looks modified (by size
and timestamps), but contains other data than actually written.

--

From: Pavel Machek
Date: Wednesday, November 10, 2010 - 12:09 pm

Well, afaict that's traditional unix behaviour... while it is not user
friendly, I'd not call it 'corrupted filesystem'.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Theodore Tso
Date: Wednesday, November 10, 2010 - 7:33 am

This is strictly speaking not true.  Using data=writeback will not cause you to lose any data --- at least, not any more than you would without the feature.   If you have applications that write files in an unsafe way, that data is going to be lost, one way or another.  (i.e., with XFS in a similar situation you'll get a zero-length file)   The difference is that in the case of a system crash, there may be unwritten data revealed if you use data=writeback.  This could be a security exposure, especially if you are using your system as a time-sharing system, where you could see the contents of deleted files belonging to another user.

So it is not an "eat your data" situation,  but rather, a "possibly expose old data" one.   Whether or not you care about this on a single-user workstation is an individual judgement call.   There's been a lot of controversy about this.

The chance that this occurs using data=writeback in ext4 is much less, BTW, because with delayed allocation we delay updating the inode until right before we write the block.  I have a plan for changing things so that we write the data blocks *first* and then update the metadata blocks second, which will mean that ext4 data=ordered will go away entirely, and we'll get the safety as well as avoiding the forced data page writeouts during journal commits.

-- Ted

--

From: Christoph Hellwig
Date: Wednesday, November 10, 2010 - 7:57 am

That's the scheme used by XFS and btrfs in one form or another.  Chris
also had a patch to implement it for ext3, which unfortunately fell
under the floor.

--

From: Chris Mason
Date: Wednesday, November 10, 2010 - 8:00 am

It probably still applies, but by the time I had it stable I realized
that ext4 was really a better place to fix this stuff.  ext3 is what it
is (good and bad), and a big change like my data=guarded code probably
isn't the best way to help.

-chris
--

From: Dave Chinner
Date: Wednesday, November 10, 2010 - 4:36 pm

In theory, that's all that is _supposed_ to happen. However, my
recent experience is that massive ext3 filesystem corruption occurs
in data=writeback mode when the system crashes and that does not
happen in ordered mode.

Why do you think I posted the patches to change the default back to
ordered mode a few months back? I basically trashed the root ext3
partitions on three test machines (to the point where >5000 files
across /sbin, /bin, /lib and /usr were corrupted or missing and I
had to reinstall from scratch) when I'd forgotten to set the
ordered-is-default config option in the kernel I was testing.  And
that is when the only thing being written to the root filesystems
was log files...

The worst part about this was that I also had ext3 filesystems
corrupted by crashes in such a way that e2fsck didn't detect it but

My experience says otherwise....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Linus Torvalds
Date: Wednesday, November 10, 2010 - 8:59 am

You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

So your argument is kind of dishonest. The thing is, if you have a
crash or power outage or whatever, the only data you can really rely
on is always going to be the data that you fsync'ed before the crash.
Everything else is just gravy.

Are there downsides to "data=writeback"? Absolutely. But anybody who
tries to push those downsides without taking the performance and
latency issues into account is just not thinking straight.

Too many people think that "correct" is somehow black-and-white. It's
not. "The correct answer too late" is not worth anything. Sane people
understand that "good enough" is important.

And quite frankly, "data=writeback" is not wonderful, but it's "good
enough". And it helps enormously with at least one class of serious
performance problems. Dismissing it because it doesn't have quite the
guarantees of "data=ordered" is like saying that you should never use
"pi=3.14" for any calculations because it's not as exact as
"pi=3.14159265". The thing is, for many things, three significant
digits (or even _one_ significant digit) is plenty.

ext3 [f]sync sucks. We know. All filesystems suck. They just tend to
do it in different dimensions.

                         Linus
--

From: Alexey Dobriyan
Date: Wednesday, November 10, 2010 - 9:46 am

On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds

Linus, are you running with data=writeback?

Those of us who did (without a UPS) will never do it again.

The probability of non-trivial FS corruption becomes so much bigger.
From my experience, I believe the average number of crashes before
one loses the FS becomes a single-digit number.

With data=ordered, it's quite hard.
--

From: Linus Torvalds
Date: Wednesday, November 10, 2010 - 9:55 am

I used to, indeed. But since I upgrade computers fairly regularly, and
all the distros have moved towards ext4, I'm no longer using ext3 at
all.

But yes, to me ext3 was totally unusable with rotational media and
"data=ordered". Not just bad. Total crap. Whenever the mail client
wanted to write something out, the whole machine basically stopped.

Of course, part of that was that long ago I used reiserfs back when
SuSE had it as the default. So I didn't think that the hiccups were
"normal" like a lot of people probably do. I knew better. So it was

Before or after the change to make renaming on top of old files do the
IO flushing?

That made a big difference for some rather common cases.

                            Linus
--

From: Alexey Dobriyan
Date: Wednesday, November 10, 2010 - 10:10 am

On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds


That's good.
--

From: Mark Lord
Date: Wednesday, November 10, 2010 - 11:55 am

I've used ext2 and ext3 extensively on all of the boxes here,
ever since each first became available.   I developed Linux IDE,
the first IDE DMA, lots of custom storage drivers, and more recently
worked on libata drivers.  This meant a LOT of sudden and catastrophic
system failures, as the bugs and other kinks were worked on.

Never lost a nibble.  Totally, utterly reliable stuff for everyday use.
*WITH* the write-caches all enabled on all of the drives, too.

Sure, sudden power-failures could have a better chance of corrupting data,
but those are really rare, and the few that have happened were again non-events 
here.

That's the difference between theory and practice.

Cheers
-ml
--

From: Mike Galbraith
Date: Wednesday, November 10, 2010 - 11:27 am

I've been using it for a looong time on my desktop box.  Yeah, you can
be bitten more easily than with ordered, and I have been, but it's never
been anything major.  The risk for me is worth it, as data=ordered sucked
really badly.

If I didn't need to maintain compatibility with 30+ old kernels for

That's not my experience.  I've yet to have to rebuild my ext3 fs since
upgrading box to shiny new opensuse 11.1 (however long ago and how many


--

From: Dave Chinner
Date: Wednesday, November 10, 2010 - 4:43 pm

I crash kernels tens of times every day doing filesystem testing.
With data=ordered I have not seen a corrupted root filesystem as a
result of normal testing and crashing as long as I can remember.
With data=writeback, I'll have corrupted root ext3 partitions in
under a day. Hardly what I'd call stable or something you'd want
to deploy in production.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Arjan van de Ven
Date: Saturday, November 6, 2010 - 12:10 pm

On Fri, 5 Nov 2010 08:48:13 -0400

btw, a few more things to try (from my standard rc.local script):

echo 4096 > /sys/block/sda/queue/nr_requests

for i in `pidof kjournald` ; do ionice -c1 -p $i ; done

echo 75 >  /proc/sys/vm/dirty_ratio


(replace sda with whatever your disk is of course)
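
Since these settings change system-wide behaviour, it may be worth recording the current values first so they can be put back. A small sketch follows; the function name and its optional second argument are made up, and the scheduler file is an addition not mentioned in the mail above.

```shell
#!/bin/sh
# Hypothetical helper: print the current values of the tunables touched
# above, skipping any file that is missing or unreadable.
#   $1 = disk name, e.g. sda
#   $2 = optional prefix for /sys and /proc (only useful for testing)
show_io_tunables() {
    disk=$1
    root=${2:-}
    for f in "$root/sys/block/$disk/queue/nr_requests" \
             "$root/sys/block/$disk/queue/scheduler" \
             "$root/proc/sys/vm/dirty_ratio"; do
        if [ -r "$f" ]; then
            printf '%s = %s\n' "${f#"$root"}" "$(cat "$f")"
        fi
    done
}

# Usage:
#   show_io_tunables sda > /root/io-tunables.before
```

For orientation: nr_requests sets how many requests the block layer will queue per device, ionice -c1 puts kjournald in the realtime IO class, and dirty_ratio is the percentage of memory that may be dirty before writing processes are made to write back themselves.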

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
