Re: Status of buffered write path (deadlock fixes)

Previous thread: Relative atime (was Re: What's in ocfs2.git) by Valerie Henson on Monday, December 4, 2006 - 5:36 pm. (10 messages)

Next thread: [RFC][PATCH 0/2] Extent base online defrag (ver 0.2) by sho on Tuesday, December 5, 2006 - 4:18 am. (1 message)
From: Nick Piggin
Date: Monday, December 4, 2006 - 11:52 pm

Hi,

I'd like to try to state where we are WRT the buffered write patches,
and ask for comments. Sorry for the wide cc list, but this is an
important issue which hasn't had enough review.

Well the next -mm will include everything we've done so far. I won't
repost patches unless someone would like to comment on a specific one.

I think the core generic_file_buffered_write is fairly robust, after
fixing the efault and zerolength iov problems picked up in testing
(thanks, very helpful!).

So now I *believe* we have an approach that solves the deadlock and
doesn't expose transient or stale data, transient zeroes, or anything
like that.

Error handling is getting close, but there may be cases that nobody
has picked up, and I've noticed a couple which I'll explain below.

I think we do the right thing WRT pagecache error handling: a
!uptodate page remains !uptodate, an uptodate page can handle the
write being done in several parts. Comments in the patches attempt
to explain how this works. I think it is pretty straightforward.

But WRT block allocation in the case of errors, it needs more review.

Block allocation:
- prepare_write can allocate blocks
- prepare_write doesn't need to initialize the pagecache on top of
   these blocks where it is within the range specified in prepare_write
   (because the copy_from_user will initialise it correctly)
- In the case of a !uptodate page, unless the page is brought uptodate
   (ie the copy_from_user completely succeeds) and marked dirty, then
   a read that sneaks in after we unlock the page (to retry the write)
   will try to bring it uptodate by pulling in the uninitialised blocks.

Problem 1:
I think that allocating blocks outside i_size is OK WRT uninitialised
data, because we update i_size only after a successful copy. However,
I don't think we trim these blocks off (eg. perhaps the "prepare_write
may have instantiated a few blocks" path should be the normal error
path for both the copy_from_user and the commit_write error ...
From: Mark Fasheh
Date: Thursday, December 7, 2006 - 12:55 pm

Hi Nick,


I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
the following patches to get a reasonable idea of what the final product
would look like:

revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
generic_file_buffered_write-cleanup.patch
mm-only-mm-debug-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks-comment.patch
mm-fix-pagecache-write-deadlocks-xip.patch
mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
fs-prepare_write-fixes.patch

If this is incorrect, or I should apply further patches, please let me know.


In generic_file_buffered_write() we now do:

	status = a_ops->commit_write(file, page, offset,offset+copied);

Which tells the file system to commit only the amount of data that
filemap_copy_from_user() was able to pull in, despite our zeroing of 
the newly allocated buffers in the copied != bytes case. Shouldn't we be
doing:

        status = a_ops->commit_write(file, page, offset,offset+bytes);

instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
parts of the page which are properly allocated and zero'd but haven't been
copied into yet? I think that in the case of a crash just after the
transaction is closed in ->commit_write(), we might lose those guarantees,

For some reason, I'm not seeing where BH_New is being cleared in case with
no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
to clear the flag somewhere (perhaps in block_commit_write()?).


Ok, that's it for now. I have some thoughts regarding the asymmetry between
ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
save those thoughts until I know whether my comments above uncovered real
issues ...
From: Nick Piggin
Date: Thursday, December 7, 2006 - 8:28 pm

That looks right (there are fixes for the "cont" buffer scheme, but they

No, because we might be talking about buffers that haven't been newly
allocated, but are not uptodate (so the pagecache can contain junk).

We can't zero these guys and do the commit_write, because that exposes
transient zeroes to a concurrent reader.

What we *could* do, is to do the full length commit_write for uptodate
pages, even if the copy is short. But we still need to do a zero-length
commit_write if the page is not uptodate (this would reduce the number
of new cases that need to be considered).

Does a short commit_write cause a problem for filesystems? They can
still do any and all operations they would have with a full-length one.
But maybe it would be better to eliminate that case. OK.

How about a zero-length commit_write? In that case again, they should
be able to remain unchanged *except* that they are not to extend i_size

Hmm, it is a bit inconsistent. It seems to be anywhere from prepare_write
to block_write_full_page.

Where do filesystems need the bit? It would be nice to clear it in

Very helpful, thanks ;)

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


From: Mark Fasheh
Date: Friday, December 8, 2006 - 4:48 pm

Ahh ok - zeroing would populate with incorrect data and we can't write those
buffers because we'd write junk over good data.

And of course, the way it works right now will break ordered write mode.


If they've done allocation, yes. You're telling the file system to stop
early in the page, even though there may be BH_New buffers further on which
should be processed (for things like ordered data mode).

Hmm, I think we should just just change functions like walk_page_buffers()
in fs/ext3/inode.c and fs/ocfs2/aops.c to look for BH_New buffers outside
the range specified (they walk the entire buffer list anyway). If it finds
one that's buffer_new() it passes it to the journal unconditionally. You'd
also have to revert the change you did in fs/ext3/inode.c to at least always
make the call to walk_page_buffers().

I really don't like that we're hiding a detail of this interaction so deep
within the file system commit_write() callback. I suppose we can just do our

If we make the change I described above (looking for BH_New buffers outside
the range passed), then zero length or partial shouldn't matter, but zero
length instead of partial would be nicer imho just for the sake of reducing

->commit_write() would probably do fine. Currently, block_prepare_write()
uses it to know which buffers were newly allocated (the file system specific
get_block_t sets the bit after allocation). I think we could safely move
the clearing of that bit to block_commit_write(), thus still allowing us to
detect and zero those blocks in generic_file_buffered_write()

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-



From: Nick Piggin
Date: Monday, December 11, 2006 - 2:11 am

Well they must have done the allocation, right (or be prepared to do the
allocation at a later time) because from the point of view of the fs,
they don't know or care whether the copy has succeeded just so long as

Would this satisfy non jbd filesystems, though? How about data journalling

Well, supposing we do the full-length commit in the case of an uptodate
page, then the *only* thing we have to worry about is a zero length commit
to a !uptodate page.


We don't want to do zero length, because we might make the theoretical
livelock much easier to hit (eg. in the case of many small iovecs). But

OK, great, I'll make a few patches and see how they look. What did you
think of those other uninitialised buffer problems in my first email?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Nick Piggin
Date: Monday, December 11, 2006 - 7:20 am

On second thoughts, I think I'm wrong about that.

Consider the last page of a file, which is uptodate. A full length
commit, which extends the file, will expose transient zeroes if the
usercopy fails.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Nick Piggin
Date: Monday, December 11, 2006 - 8:52 am

Hmm, doesn't look like we can do this either because at least GFS2
uses BH_New for its own special things.

Also, I don't know if the trick of only walking over BH_New buffers
will work anyway, since we may still need to release resources on
other buffers as well. As you say, filesystems are simply not set up
to expect this, which is a problem.

Maybe it isn't realistic to change the API this way, no matter how
bad it is presently.

What if we tackle the problem a different way?

1. In the case of no page in the pagecache (or an otherwise
!uptodate page), if the operation is a full-page write then we
first copy all the user data *then* lock the page *then* insert it
into pagecache and go on to call into the filesystem.

2. In the case of a !uptodate page and a partial-page write, then
we can first bring the page uptodate, then continue (goto 3).

3. In the case of an uptodate page, we could perform a full-length
commit_write so long as we didn't expand i_size further than was
copied, and were sure to trim off blocks allocated past that
point.

This scheme IMO is not as "nice" as the partial commit patches,
but in practical terms it may be much more realistic.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Steven Whitehouse
Date: Monday, December 11, 2006 - 9:12 am

Hi,

What makes you say that? As far as I know we are not doing anything we
shouldn't with this flag, and if we are, then I'm quite happy to
consider fixing it up so that we don't,

Steve.


-



From: Nick Piggin
Date: Monday, December 11, 2006 - 9:39 am

Bad wording. Many other filesystems seem to only make use of buffer_new
between prepare and commit_write.

gfs2 seems to at least test it in a lot of places, so it is hard to know
whether we can change the current semantics or not. I didn't mean that
gfs2 is doing anything wrong.

So can we clear it in commit_write?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Steven Whitehouse
Date: Monday, December 11, 2006 - 10:18 am

Hi,

Are you perhaps looking at an older copy of the GFS2 code? We set it (or
clear it) in bmap.c:gfs2_block_map() as appropriate. It also seems to be
cleared when we release buffers (which may or may not be actually
required. I suspect not, but its harmless anyway). There is only one
test that I can find for it which is in bmap.c:gfs2_extent_map() where
its value is then later ignored in the only caller relevant to "normal"
files, which is ops_vm.c:alloc_page_backing(). So I think you should be
quite safe to clear it in commit_write().

Steve.


-



From: Mark Fasheh
Date: Tuesday, December 12, 2006 - 3:31 pm

Oh, my idea was that only the range passed to ->commit() would be walked,
but any BH_New buffers (regardless of where they are in the page) would be
passed to the journal as well. So the logic would be:

for all the buffers in the page:
  If the buffer is new, or it is within the range passed to commit, pass to
  the journal.


We definitely agree. It's not intuitive that the range should change
between ->prepare_write() and ->commit_write() and IMHO, all the issues
we've found are good evidence that this particular approach will be

Silly question - what's preventing a reader from filling the !uptodate page with disk

It seems more realistic in that it makes sure the write is properly setup
before calling into the file system. What do you think is not as nice about
it?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-



From: Nick Piggin
Date: Tuesday, December 12, 2006 - 5:53 pm

If the buffer is not new, outside the commit range (and inside the prepare),
then it does not get passed to the journal. I'm not familiar with jbd, but


Not silly -- I guess that is the main sticking point. Luckily *most*
!uptodate pages will be ones that we have newly allocated so will
not be in pagecache yet.

If it is in pagecache, we could do one of a number of things: either
remove it or try to bring it uptodate ourselves. I'm not yet sure if
either of these actions will cause other problems, though :P

If both of those are really going to cause problems, then we could
solve this in a more brute force way (assuming that !uptodate, locked
pages, in pagecache at this point are very rare -- AFAIKS these will
only be caused by IO errors?). We could allocate another, temporary
page and copy the contents into there first, then into the target

The corner cases. However I guess the current scheme was building up
corner cases as well, so you might be right.

-- 
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Trond Myklebust
Date: Tuesday, December 12, 2006 - 6:47 pm

We are NOT going to mandate read-modify-write behaviour on
prepare_write()/commit_write(). That would be a completely unnecessary
slowdown for write-only workloads on NFS.

Trond

-



From: Nick Piggin
Date: Tuesday, December 12, 2006 - 6:56 pm

Note that these pages should be *really* rare. Definitely even for normal
filesystems I think RMW would use too much bandwidth if it were required
for any significant number of writes.

I don't want to mandate anything just yet, so I'm just going through our
options. The first two options (remove, and RMW) are probably trickier
than they need to be, given the 3rd option available (temp buffer). Given
your input, I'm increasingly thinking that the best course of action would
be to fix this with the temp buffer and look at improving that later if it
causes a noticable slowdown.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Trond Myklebust
Date: Tuesday, December 12, 2006 - 7:31 pm

If file "foo" exists on the server, and contains data, then something
like

fd = open("foo", O_WRONLY);
write(fd, "1", 1);

should never need to trigger a read. That's a fairly common workload
when you think about it (happens all the time in apps that do random

What is the generic problem you are trying to resolve? I saw something
fly by about a reader filling the !uptodate page while the writer is
updating it: how is that going to happen if the writer has the page
locked?
AFAIK the only thing that can modify the page if it is locked (aside
from the process that has locked it) is a process that has the page
mmapped(). However mmapped pages are always uptodate, right?

Trond

-



From: Nick Piggin
Date: Tuesday, December 12, 2006 - 9:03 pm

Right. What I'm currently looking at doing in that case is two copies,
first into a temporary buffer. Unfortunate, but we'll see what the

The problem is that you can't take a pagefault while holding the page
lock. You can deadlock against another page, the same page, or the

That's right (modulo the pagefault vs invalidate race bug).

But we need to unlock the destination page in order to be able to take
a pagefault to bring the source user memory uptodate. If the page is
not uptodate, then a read might see uninitialised data.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

-



From: Trond Myklebust
Date: Wednesday, December 13, 2006 - 5:21 am

The NFS read code will automatically flush dirty data to disk if it sees
that the page is marked as dirty or partially dirty. We do this without
losing the page lock thanks to the helper function nfs_wb_page().

BTW: We will do the same thing in our mapping->release_page() in order
to fix the races you can have between invalidate_inode_pages2() and page
dirtying, however that call to test_and_clear_page_dirty() screws us
over for the case of mmapped files (although we're quite safe w.r.t.
prepare_write()/commit_write()).

Cheers
  Trond

-



From: Peter Staubach
Date: Wednesday, December 13, 2006 - 6:49 am

I have to admit that I've only been paying attention with one eye, but
why doesn't this require a read?  If "foo" is non-zero in size, then
how does the client determine how much data in the buffer to write to
the server?

Isn't RMW required for any i/o which is either not buffer aligned or
a multiple of the buffer size?

    Thanx...

       ps
-



From: OGAWA Hirofumi
Date: Monday, December 11, 2006 - 11:17 am

BTW, there are still some from==to users.

	fs/affs/file.c:affs_truncate
	fs/hfs/extent.c:hfs_file_truncate
	fs/hfsplus/extents.c:hfsplus_file_truncate
	fs/reiserfs/ioctl.c:reiserfs_unpack

I'll see those this weekend.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
-



Previous thread: Relative atime (was Re: What's in ocfs2.git) by Valerie Henson on Monday, December 4, 2006 - 5:36 pm. (10 messages)

Next thread: [RFC][PATCH 0/2] Extent base online defrag (ver 0.2) by sho on Tuesday, December 5, 2006 - 4:18 am. (1 message)