Hi, I'd like to try to state where we are WRT the buffered write patches, and ask for comments. Sorry for the wide cc list, but this is an important issue which hasn't had enough review. Well the next -mm will include everything we've done so far. I won't repost patches unless someone would like to comment on a specific one. I think the core generic_file_buffered_write is fairly robust, after fixing the efault and zerolength iov problems picked up in testing (thanks, very helpful!). So now I *believe* we have an approach that solves the deadlock and doesn't expose transient or stale data, transient zeroes, or anything like that. Error handling is getting close, but there may be cases that nobody has picked up, and I've noticed a couple which I'll explain below. I think we do the right thing WRT pagecache error handling: a !uptodate page remains !uptodate, an uptodate page can handle the write being done in several parts. Comments in the patches attempt to explain how this works. I think it is pretty straightforward. But WRT block allocation in the case of errors, it needs more review. Block allocation: - prepare_write can allocate blocks - prepare_write doesn't need to initialize the pagecache on top of these blocks where it is within the range specified in prepare_write (because the copy_from_user will initialise it correctly) - In the case of a !uptodate page, unless the page is brought uptodate (ie the copy_from_user completely succeeds) and marked dirty, then a read that sneaks in after we unlock the page (to retry the write) will try to bring it uptodate by pulling in the uninitialised blocks. Problem 1: I think that allocating blocks outside i_size is OK WRT uninitialised data, because we update i_size only after a successful copy. However, I don't think we trim these blocks off (eg. perhaps the "prepare_write may have instantiated a few blocks" path should be the normal error path for both the copy_from_user and the commit_write error ...
Hi Nick,
I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
the following patches to get a reasonable idea of what the final product
would look like:
revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
generic_file_buffered_write-cleanup.patch
mm-only-mm-debug-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks.patch
mm-fix-pagecache-write-deadlocks-comment.patch
mm-fix-pagecache-write-deadlocks-xip.patch
mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
fs-prepare_write-fixes.patch
If this is incorrect, or I should apply further patches, please let me know.
In generic_file_buffered_write() we now do:
status = a_ops->commit_write(file, page, offset,offset+copied);
Which tells the file system to commit only the amount of data that
filemap_copy_from_user() was able to pull in, despite our zeroing of
the newly allocated buffers in the copied != bytes case. Shouldn't we be
doing:
status = a_ops->commit_write(file, page, offset,offset+bytes);
instead, thus preserving ordered writeout (for ext3, ocfs2, etc) for those
parts of the page which are properly allocated and zero'd but haven't been
copied into yet? I think that in the case of a crash just after the
transaction is closed in ->commit_write(), we might lose those guarantees,
For some reason, I'm not seeing where BH_New is being cleared in case with
no errors or faults. Hopefully I'm wrong, but if I'm not I believe we need
to clear the flag somewhere (perhaps in block_commit_write()?).
Ok, that's it for now. I have some thoughts regarding the asymmetry between
ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
save those thoughts until I know whether my comments above uncovered real
issues ...That looks right (there are fixes for the "cont" buffer scheme, but they No, because we might be talking about buffers that haven't been newly allocated, but are not uptodate (so the pagecache can contain junk). We can't zero these guys and do the commit_write, because that exposes transient zeroes to a concurrent reader. What we *could* do, is to do the full length commit_write for uptodate pages, even if the copy is short. But we still need to do a zero-length commit_write if the page is not uptodate (this would reduce the number of new cases that need to be considered). Does a short commit_write cause a problem for filesystems? They can still do any and all operations they would have with a full-length one. But maybe it would be better to eliminate that case. OK. How about a zero-length commit_write? In that case again, they should be able to remain unchanged *except* that they are not to extend i_size Hmm, it is a bit inconsistent. It seems to be anywhere from prepare_write to block_write_full_page. Where do filesystems need the bit? It would be nice to clear it in Very helpful, thanks ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Ahh ok - zeroing would populate with incorrect data and we can't write those buffers because we'd write junk over good data. And of course, the way it works right now will break ordered write mode. If they've done allocation, yes. You're telling the file system to stop early in the page, even though there may be BH_New buffers further on which should be processed (for things like ordered data mode). Hmm, I think we should just just change functions like walk_page_buffers() in fs/ext3/inode.c and fs/ocfs2/aops.c to look for BH_New buffers outside the range specified (they walk the entire buffer list anyway). If it finds one that's buffer_new() it passes it to the journal unconditionally. You'd also have to revert the change you did in fs/ext3/inode.c to at least always make the call to walk_page_buffers(). I really don't like that we're hiding a detail of this interaction so deep within the file system commit_write() callback. I suppose we can just do our If we make the change I described above (looking for BH_New buffers outside the range passed), then zero length or partial shouldn't matter, but zero length instead of partial would be nicer imho just for the sake of reducing ->commit_write() would probably do fine. Currently, block_prepare_write() uses it to know which buffers were newly allocated (the file system specific get_block_t sets the bit after allocation). I think we could safely move the clearing of that bit to block_commit_write(), thus still allowing us to detect and zero those blocks in generic_file_buffered_write() Thanks, --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com -
Well they must have done the allocation, right (or be prepared to do the allocation at a later time) because from the point of view of the fs, they don't know or care whether the copy has succeeded just so long as Would this satisfy non jbd filesystems, though? How about data journalling Well, supposing we do the full-length commit in the case of an uptodate page, then the *only* thing we have to worry about is a zero length commit to a !uptodate page. We don't want to do zero length, because we might make the theoretical livelock much easier to hit (eg. in the case of many small iovecs). But OK, great, I'll make a few patches and see how they look. What did you think of those other uninitialised buffer problems in my first email? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
On second thoughts, I think I'm wrong about that. Consider the last page of a file, which is uptodate. A full length commit, which extends the file, will expose transient zeroes if the usercopy fails. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
Hmm, doesn't look like we can do this either because at least GFS2 uses BH_New for its own special things. Also, I don't know if the trick of only walking over BH_New buffers will work anyway, since we may still need to release resources on other buffers as well. As you say, filesystems are simply not set up to expect this, which is a problem. Maybe it isn't realistic to change the API this way, no matter how bad it is presently. What if we tackle the problem a different way? 1. In the case of no page in the pagecache (or an otherwise !uptodate page), if the operation is a full-page write then we first copy all the user data *then* lock the page *then* insert it into pagecache and go on to call into the filesystem. 2. In the case of a !uptodate page and a partial-page write, then we can first bring the page uptodate, then continue (goto 3). 3. In the case of an uptodate page, we could perform a full-length commit_write so long as we didn't expand i_size further than was copied, and were sure to trim off blocks allocated past that point. This scheme IMO is not as "nice" as the partial commit patches, but in practical terms it may be much more realistic. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
Hi, What makes you say that? As far as I know we are not doing anything we shouldn't with this flag, and if we are, then I'm quite happy to consider fixing it up so that we don't, Steve. -
Bad wording. Many other filesystems seem to only make use of buffer_new between prepare and commit_write. gfs2 seems to at least test it in a lot of places, so it is hard to know whether we can change the current semantics or not. I didn't mean that gfs2 is doing anything wrong. So can we clear it in commit_write? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
Hi, Are you perhaps looking at an older copy of the GFS2 code? We set it (or clear it) in bmap.c:gfs2_block_map() as appropriate. It also seems to be cleared when we release buffers (which may or may not be actually required. I suspect not, but its harmless anyway). There is only one test that I can find for it which is in bmap.c:gfs2_extent_map() where its value is then later ignored in the only caller relevant to "normal" files, which is ops_vm.c:alloc_page_backing(). So I think you should be quite safe to clear it in commit_write(). Steve. -
Oh, my idea was that only the range passed to ->commit() would be walked, but any BH_New buffers (regardless of where they are in the page) would be passed to the journal as well. So the logic would be: for all the buffers in the page: If the buffer is new, or it is within the range passed to commit, pass to the journal. We definitely agree. It's not intuitive that the range should change between ->prepare_write() and ->commit_write() and IMHO, all the issues we've found are good evidence that this particular approach will be Silly question - what's preventing a reader from filling the !uptodate page with disk It seems more realistic in that it makes sure the write is properly setup before calling into the file system. What do you think is not as nice about it? --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com -
If the buffer is not new, outside the commit range (and inside the prepare), then it does not get passed to the journal. I'm not familiar with jbd, but Not silly -- I guess that is the main sticking point. Luckily *most* !uptodate pages will be ones that we have newly allocated so will not be in pagecache yet. If it is in pagecache, we could do one of a number of things: either remove it or try to bring it uptodate ourselves. I'm not yet sure if either of these actions will cause other problems, though :P If both of those are really going to cause problems, then we could solve this in a more brute force way (assuming that !uptodate, locked pages, in pagecache at this point are very rare -- AFAIKS these will only be caused by IO errors?). We could allocate another, temporary page and copy the contents into there first, then into the target The corner cases. However I guess the current scheme was building up corner cases as well, so you might be right. -- Send instant messages to your online friends http://au.messenger.yahoo.com -
We are NOT going to mandate read-modify-write behaviour on prepare_write()/commit_write(). That would be a completely unnecessary slowdown for write-only workloads on NFS. Trond -
Note that these pages should be *really* rare. Definitely even for normal filesystems I think RMW would use too much bandwidth if it were required for any significant number of writes. I don't want to mandate anything just yet, so I'm just going through our options. The first two options (remove, and RMW) are probably trickier than they need to be, given the 3rd option available (temp buffer). Given your input, I'm increasingly thinking that the best course of action would be to fix this with the temp buffer and look at improving that later if it causes a noticable slowdown. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
If file "foo" exists on the server, and contains data, then something like fd = open("foo", O_WRONLY); write(fd, "1", 1); should never need to trigger a read. That's a fairly common workload when you think about it (happens all the time in apps that do random What is the generic problem you are trying to resolve? I saw something fly by about a reader filling the !uptodate page while the writer is updating it: how is that going to happen if the writer has the page locked? AFAIK the only thing that can modify the page if it is locked (aside from the process that has locked it) is a process that has the page mmapped(). However mmapped pages are always uptodate, right? Trond -
Right. What I'm currently looking at doing in that case is two copies, first into a temporary buffer. Unfortunate, but we'll see what the The problem is that you can't take a pagefault while holding the page lock. You can deadlock against another page, the same page, or the That's right (modulo the pagefault vs invalidate race bug). But we need to unlock the destination page in order to be able to take a pagefault to bring the source user memory uptodate. If the page is not uptodate, then a read might see uninitialised data. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
The NFS read code will automatically flush dirty data to disk if it sees that the page is marked as dirty or partially dirty. We do this without losing the page lock thanks to the helper function nfs_wb_page(). BTW: We will do the same thing in our mapping->release_page() in order to fix the races you can have between invalidate_inode_pages2() and page dirtying, however that call to test_and_clear_page_dirty() screws us over for the case of mmapped files (although we're quite safe w.r.t. prepare_write()/commit_write()). Cheers Trond -
I have to admit that I've only been paying attention with one eye, but
why doesn't this require a read? If "foo" is non-zero in size, then
how does the client determine how much data in the buffer to write to
the server?
Isn't RMW required for any i/o which is either not buffer aligned or
a multiple of the buffer size?
Thanx...
ps
-
BTW, there are still some from==to users. fs/affs/file.c:affs_truncate fs/hfs/extent.c:hfs_file_truncate fs/hfsplus/extents.c:hfsplus_file_truncate fs/reiserfs/ioctl.c:reiserfs_unpack I'll see those this weekend. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> -
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
| Linux Kernel Mailing List | md: move allocation of ->queue from mddev_find to md_probe |
| Linux Kernel Mailing List | md: raid0: Represent zone->zone_offset in sectors. |
| Linux Kernel Ma |
