Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems

Previous thread: [PATCH 3/3] memblock: Make find_memory_core_early() find from top-down by Yinghai Lu on Friday, December 17, 2010 - 5:59 pm. (1 message)

Next thread: [PATCH] s1d13xxxfb: drop unused code by Andres Salomon on Friday, December 17, 2010 - 8:00 pm. (2 messages)
From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

These are not for merge quite yet, need a bit more changelog and comments
polishing, and splitting out filesystems patches, but if I don't get negative
comments about them, I'll be asking to merge them in next window.

Thanks,
Nick

--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Filesystems appear to be using ->dirty_inode, expecting that the dirtying
operating is done and visible to all CPUs (eg. setting private inode dirty
bits, without any barriers themselves). So release the dirty "critical
section" with a barrier before calling ->dirty_inode.

Cost is not significantly changed, because we're just moving the barrier.
Those filesystems that do use ->dirty_inode should have to care slightly
less about barriers, which is a good thing.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-19 16:47:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-19 16:49:39.000000000 +1100
@@ -934,6 +934,15 @@ void __mark_inode_dirty(struct inode *in
 	bool wakeup_bdi = false;
 
 	/*
+	 * Make sure that changes are seen by all cpus before we test i_state
+	 * or mark anything as being dirty. Ie. all dirty state should be
+	 * written to the inode and visible. Like an "unlock" operation, the
+	 * mark_inode_dirty call must "release" our ordering window that is
+	 * opened when we started modifying the inode.
+	 */
+	smp_mb();
+
+	/*
 	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
 	 * dirty the inode itself
 	 */
@@ -942,12 +951,6 @@ void __mark_inode_dirty(struct inode *in
 			sb->s_op->dirty_inode(inode);
 	}
 
-	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
-	 */
-	smp_mb();
-
 	/* avoid the locking if we can */
 	if ((inode->i_state & flags) == flags)
 		return;


--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

It is incorrect to test inode dirty bits without participating in the inode
writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
writes out the particular bits, then clears I_SYNC when it is done. BTW. it
may not completely write all pages out, so I_DIRTY_PAGES would get set
again.

This is a standard pattern used throughout the kernel's writeback caches
(I_SYNC ~= I_WRITEBACK, if that makes it clearer).

And so it is not possible to determine an inode's dirty status just by
checking I_DIRTY bits. Especially not for the purpose of data integrity
syncs.

Missing the check for these bits means that fsync can complete while
writeback to the inode is underway. Inode writeback functions get this
right, so call into them rather than try to shortcut things by testing
dirty state improperly.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-11-19 16:44:39.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-11-19 16:49:42.000000000 +1100
@@ -895,11 +895,6 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
 	err = sync_inode_metadata(inode, 1);
 	if (ret == 0)
 		ret = err;
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-11-19 16:50:00.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-11-19 16:50:07.000000000 +1100
@@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	if (!(inode->i_state & I_DIRTY))
-		return 0;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return 0;
-
 	ret = sync_inode_metadata(inode, 1);
 
 	/* This is a good place to write the sb */


--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Inode dirty state cannot be securely tested without participating properly
in the inode writeback protocol. Some filesystems need to check this state,
so break out the code into helpers and make them available.

This could also be used to reduce strange interactions between background
writeback and fsync. Currently if we fsync a single page in a file, the
entire file gets requeued to the back of the background IO list, even if
it is due for writeout and has a large number of pages. That's left for
a later time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-16 18:33:14.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-12-16 18:33:17.000000000 +1100
@@ -299,26 +299,25 @@ static void inode_wait_for_writeback(str
 	}
 }
 
-/*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+/**
+ * inode_writeback_begin -- prepare to writeback an inode
+ * @indoe: inode to write back
+ * @wait: synch writeout or not
+ * @Returns: 0 if wait == 0 and this call would block (due to other writeback).
+ *           otherwise returns 1.
  *
- * If `wait' is set, wait on the writeout.
+ * Context: inode_lock must be held, may be dropped. Returns with it held.
  *
- * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
- * livelocks, etc.
+ * inode_writeback_begin sets up an inode to be written back (data and/or
+ * metadata). This must be called before examining I_DIRTY state of the
+ * inode, and should be called at least before any data integrity writeout.
  *
- * Called under inode_lock.
+ * If inode_writeback_begin returns 1, it must be followed by a call to
+ * ...
From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Optimise fsync by adding a datasync parameter to sync_inode_metadata to
DTRT with writing back the inode (->write_inode in theory should have a
datasync parameter too perhaps, but that's for another time).

Also, implement the metadata sync optimally rather than reusing the
normal data writeback path. This means less useless moving the inode around the
writeback lists, and less dropping and retaking of inode_lock, and avoiding
the data writeback call with nr_pages == 0.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-12-16 18:33:29.000000000 +1100
@@ -883,7 +883,7 @@ static int pohmelfs_fsync(struct file *f
 {
 	struct inode *inode = file->f_mapping->host;
 
-	return sync_inode_metadata(inode, 1);
+	return sync_inode_metadata(inode, datasync, 1);
 }
 
 ssize_t pohmelfs_write(struct file *file, const char __user *buf,
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-12-16 18:33:16.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-12-16 18:33:29.000000000 +1100
@@ -48,7 +48,7 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	ret = sync_inode_metadata(inode, 1);
+	ret = sync_inode_metadata(inode, datasync, 1);
 
 	/* This is a good place to write the sb */
 	/* TODO: Sechedule an sb-sync on create */
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/dir.c	2010-12-16 18:33:29.000000000 +1100
@@ -98,7 +98,7 @@ static int ext2_commit_chunk(struct page
 	if (IS_DIRSYNC(dir)) {
 		err = write_one_page(page, 1);
 		if ...
From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Checking I_DIRTY* bits is racy unless we're under I_SYNC, because there
might be a concurrent writeback thread that has cleared I_DIRTY, but
has not yet completed the ->write_inode call.

This means our (typically integrity/fsync) operation can finish before
previously dirty data has been written safely to disk.

Solve it by exporting inode_writeback_begin/end to filesystems, and have
them use that before checking dirty flags, where it matters.

I'm not thrilled at exporting inode_lock, however I didn't see a good way
to make that code fully generic (eg. some cases want to avoid taking
and dropping locks if there is nothing to do, future users might want to
be smarter about dirty bits and keep their own inode dirty bits in synch
under the same lock, etc).

Not signed off yet

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-12-18 03:04:10.000000000 +1100
@@ -373,6 +373,7 @@ static int pohmelfs_write_inode_create_c
 		dprintk("%s: parent: %llu, ino: %llu, inode: %p.\n",
 				__func__, parent->ino, n->ino, inode);
 
+		/* XXX: is this race WRT writeback? */
 		if (inode && (inode->i_state & I_DIRTY)) {
 			struct pohmelfs_inode *pi = POHMELFS_I(inode);
 			pohmelfs_write_create_inode(pi);
Index: linux-2.6/fs/gfs2/file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/gfs2/file.c	2010-12-18 03:04:10.000000000 +1100
@@ -557,7 +557,7 @@ static int gfs2_close(struct inode *inod
 static int gfs2_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
+	unsigned dirty, mask;
 	int ret = 0;
 
 	if (gfs2_is_jdata(GFS2_I(inode))) {
@@ -565,13 +565,35 @@ static int ...
From: Christoph Hellwig
Date: Wednesday, December 29, 2010 - 8:01 am

As mentioned last round I think the exporting of inode_lock and pusing
of the I_DIRTY* complexities into the filesystems can be avoided.  See
the patch below, which compiles and passes xfstests for xfs, but
otherwise isn't quite done yet.  The only code change vs the opencoded
variant in the patch is that we do a useless inode_lock roundtrip
for a non-dirty inode on gfs2, which is I think is acceptable,
especially once we have the lock split anyway.

The other thing I don't like yet is passing the datasync flag - the
callback shouldn't care about what we were called with, but rather
about which bits it needs to sync out - which the dirty flag already
tells us about.

IMHO the behaviour in ocfs2 and gfs2 that relies on it is plain wrong:

 - ocfs2 really should always force the journal if any bit we care about
   in the inode is dirty, and only do the pure cache flush is nothing
   we care about is dirty (similar to the more complex code in XFS)
 - gfs2 seems really weird.  Doesn't it need to do any log force
   if an inode has a pending transaction?  Currently it only does for
   stuffed inodes, and if datasync was set, which seems weird.  Also
   I can't see why we'd never want to call into ->write_inode to write
   out the inode for the datasync case - except for not catching
   timestamp updates fdatasync really isn't any different from fsync.


Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-29 15:45:37.153003373 +0100
+++ linux-2.6/fs/fs-writeback.c	2010-12-29 15:46:13.566021881 +0100
@@ -315,7 +315,7 @@ static void inode_wait_for_writeback(str
  * If inode_writeback_begin returns 1, it must be followed by a call to
  * inode_writeback_end.
  */
-int inode_writeback_begin(struct inode *inode, int wait)
+static int inode_writeback_begin(struct inode *inode, int wait)
 {
 	assert_spin_locked(&inode_lock);
 	if (!atomic_read(&inode->i_count))
@@ -356,7 +356,7 @@ ...
From: Steven Whitehouse
Date: Monday, January 3, 2011 - 8:03 am

Hi,

The algorithm was intended to be:

 - With "journaled data" files
   - Do a log flush conditional upon the inode's glock
   - The core code then writes back any dirty pages

 - With regular files/directories
  - If datasync is not set, we need to write back the metadata including
timestamp updates, so that is done via ->write_inode. Note that an extra
complication here is that we need to get the glock on the inode if we
don't already have it in order to check and conditionally update the
atime.The call to ->write_inode includes an implicit (conditional) log
flush.
 - If datasync is set, we assume that only the data pages need to be
written out. My understanding of datasync was that it was only supposed
to write out data and never any of the metadata. The reason for the call
to flush the log for "stuffed" files is that the data shares a disk
block with the inode metadata, so we cannot avoid the log flush in this
case, since we must unpin the block to write it back.

There is something strange going on here though since there should be a
metadata sync included as well I think - I'm just working back through
the changes to see where that was lost at the moment,

Steve.


--

From: Christoph Hellwig
Date: Monday, January 3, 2011 - 9:58 am

What happens to indirect blocks, inode size updates, etc?  In general
the only correct form to use the datasync argument is along the lines
of:

	if ((inode->i_state & I_DIRTY_DATASYNC) ||
	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
		/* write out the inode */
	} else {
		/*
		 * VFS inode not dirty, no need to write it out.
		 *
		 * If the filesystem support asynchronous inode writes,
		 * we may have to wait for them here.
		 */
	}

or rather mostly correct, as pointed out by Nick in this series, that's
why the above gets replaced with an equivalent check that also
participates in the writeback locking protocol in this series.

For gfs2 on current mainline an fsync respecting that would look like:

static int gfs2_fsync(struct file *file, int datasync)
{
	struct inode *inode = file->f_mapping->host;
	struct gfs2_inode *ip = GFS2_I(inode);
	int ret = 0;

	if (gfs2_is_jdata(ip) {
		gfs2_log_flush(GFS2_SB(inode), ip);
		return 0;
	}

	if ((inode->i_state & I_DIRTY_DATASYNC) ||
	    ((inode->i_state & I_DIRTY_SYNC) && !datasync))
		sync_inode_metadata(inode, 1);
	else if (gfs2_is_stuffed(ip))
		gfs2_log_flush(GFS2_SB(inode), ip->i_gl);
}

Note that the asynchronous write_inode_now is replaced with a
sync_inode_metadata, which doesn't incorrectly write data again, and
makes sure we do a synchronous write.

I'm still not quite sure how the gfs2_log_flush are supposed to work.
What's the reason we don't need the ->write_inode call for journaled
data mode?  Also is it guaranteed that we might not have an asynchronous
transaction that update the inode in the log, e.g. why doesn't gfs2
need some sort of log flush even if the VFS inode is not dirty, unlike
most other journaled filesystems.

--

From: Nick Piggin
Date: Tuesday, January 4, 2011 - 12:12 am

Just to recap, basically we have 2 main problems in vfs/filesystems:

- i_state dirtyness is checked outside the correct synchronization
  protcol, so it may be seen as clean before a concurrent writer
  has finished.

- .write_inode is only guaranteed to be called once regardless of sync
  or async mode, for a dirty inode at a sync point. Many filesystems
  were incorrectly assuming they would be called once *in synchronous
  mode*.

  The optimal approach for .write_inode seems to be clean the struct
  inode so that it may be eventually reclaimed. Then have your .fsync and
  .sync_fs implementations enforce the actual data integrity.

  Note that "clean struct inode" often means to copy the metadata
  somewhere else to be scheduled for asynch writeout. You have to be
  careful to note that if you allow the inode to be evicted at this
  point without data integrity point also in .evict_inode, then you need
  to keep in mind that .sync_fs (and subsequent .fsync, if the inode is
  re opened) need to still enforce integrity for these potentially
  evicted inodes.

Everyone happy with this? Please review your filesystems and look at
my patches :)

Thanks,
Nick

--

From: Steven Whitehouse
Date: Tuesday, January 4, 2011 - 7:22 am

Hi,


I think that has just been missed due to the way in which the code has
developed. It appears to be needed to me, but originally all the
timestamp updates were handled internally by GFS2 and in a synchronous
manner, so that there was no need for ->write_inode() in that case. I
think that needs to be added now the vfs looks after atime updates
though, in order to be correct.

After the log flush there should also be a write on the metadata mapping
as per the inode_go_sync() function which is very similar (but not quite
similar enough to use the same code, I think) function,

Steve.


--

From: Nick Piggin
Date: Monday, January 3, 2011 - 11:04 pm

Yes I did see that, I hoped to continue discussion of that detail.

Let me start out by saying OK I will agree to hold off that change
until inode_lock is removed at least, and concentrate on just the
fixes.

However I strongly believe that filesystems should be able to access
and manipulate the inode dirty state directly. If you agree with that,
then I think they should be able to access the lock required for that.
Filesystems will want to keep their internal state in synch with vfs
visible state most likely (eg. like your hfsplus patches), and _every_
time we do "loose" coupling between state bits like this (eg. page and
buffer state; page and pte state; etc), it turns out to be a huge mess

I dislike this style, except where it has some real advantages like
get_block case. I prefer just to make the existing inode_writeback_begin
into a "__special" variant, and make inode_writeback_begin just do the

The bigger issue IMO is if filesystems want to be smarter with dirty
bit handling and keep more internal state in sync with it. I don't
see any problem at all with allowing them to lock the dirty state.
(but will hold off the patch for now, as said).
 
Thanks,
Nick

--

From: Nick Piggin
Date: Tuesday, January 4, 2011 - 12:52 am

Agree and as I said I'll change inode_writeback_begin/end to do the

Right if you want a helper to get the correct mask of bits required
that's fine and I agree, but locking is a different issue too: if
filesystems are trying to keep private state in sync with vfs state,
then they _need_ to do it properly with the proper locking. I think
your hfsplus implementation had a bug or two in this area didn't it?
(although I ended up getting side tracked with all these bugs half

I disagree, but we'll explore it further later.

It is a couple of lines to check and clear dirty bits.  Not rocket
science and I think it is far better to make it explicit what is
happening to the filesystem. The disconnect between what the vfs is
doing and what the filesystems throught should be happening is what
caused all these bugs to start with.

Anyway, long story short, I'll drop the inode_lock export and move the
locking and manipulation into inode_writeback_begin/end for now.

Thanks,
Nick

--

From: Christoph Hellwig
Date: Tuesday, January 4, 2011 - 2:13 am

It's not locking, but ordering that was the issue.  It's an issue caused
by the VFS interfaces, but not really related to the area you work
on currently.  If ->dirty_inode told us what was dirtied it would be
a lot simpler.  Alternatively we should just stop requiring filesystems
to participate in the I_DIRTY_SYNC/DATASYNC protocol.  In general it's
much easier for the filesystem to keep that state by itself in proper
granularity.  But I lost the fight to have the timestamp updates go
through proper methods instead of just writing into the VFS inode and
marking the inode dirty long ago, so we'll have to live with it.
--

From: Nick Piggin
Date: Tuesday, January 4, 2011 - 2:28 am

Right, but ordering should probably not be an issue if filesystem could

Well it's one side of the issue, the other side is indeed the dirtying

I wouldn't mind revisiting that, once these correctness fixes are in
(and yes that's another good reason to hold off with allowing
filesystems to do the i_state locking just yet).

Allowing the filesystem to entirely manage the setting and clearing of
dirty bits would be a good idea. Then just have a single bit that
specifies they want background writeout to run a callback together when
it does data writeout for that inode. Everything else would be done in
the fs.

Page dirtying has similarly silly conventions like having the caller
clear dirty before calling into ->writepage, which means the filesystem
has to have hacks like redirty_page_for_io and has a hard time keeping
page dirty state in sync with page private metadata. (completely
different issue but similar general problem of having things half
managed by one side and half by the other).

--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Otherwise we think the inode is clean even if syncing failed.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-23 22:28:22.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 22:57:40.000000000 +1100
@@ -447,15 +447,25 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		int err = write_inode(inode, wbc);
+		int err;
+
+		spin_unlock(&inode_lock);
+		err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		spin_lock(&inode_lock);
+		if (err) {
+			/*
+			 * Inode writeout failed, restore inode metadata
+			 * dirty bits.
+			 */
+			inode->i_state |= dirty &
+					(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+		}
 	}
 
-	spin_lock(&inode_lock);
 	if (!inode_writeback_end(inode)) {
 		if (wbc->nr_to_write <= 0) {
 			/*


--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

Add a sync_inode i_op, for filesystems that can fsync with the inode and
not requiring the file (which is most of them), and use it in filesystems
and nfsd to correctly sync things.

Get rid of sync_inode, unexport sync_inodes_sb and make it internal, remove the
'wait' parameter from sync_inode_metadata, and document these things a little
better.

The core of the problem is that ->write_inode(inode, sync==1) is not
guaranteed to have the inode metadata on disk. This is because filesystems
that do asynchronous ->write_inode stuff don't "obey" sync==1. Thus, generic
code cannot assume that ->write_inode is all that is required for a full
sync. (filesystem specific code can, but that's a bit ugly I'd like to phase
it out or clarify with different function names).

Not signed off yet

---
 Documentation/filesystems/porting |    6 ++++
 drivers/staging/pohmelfs/inode.c  |    2 -
 fs/ext2/dir.c                     |    4 ++
 fs/ext2/ext2.h                    |    4 +-
 fs/ext2/file.c                    |   32 +++------------------
 fs/ext2/inode.c                   |   38 ++++++++++++++++++++++++-
 fs/ext2/namei.c                   |    1 
 fs/ext2/xattr.c                   |    4 ++
 fs/fat/file.c                     |   33 ++++++++++------------
 fs/fat/inode.c                    |   21 +++++---------
 fs/fs-writeback.c                 |   23 ---------------
 fs/internal.h                     |    5 +++
 fs/libfs.c                        |   57 ++++++++++++++++++++++++++++++++++++++
 fs/nfs/write.c                    |   16 ++++------
 fs/nfsd/vfs.c                     |    3 ++
 fs/sync.c                         |    8 +++++
 include/linux/fs.h                |   12 +++++++-
 include/linux/writeback.h         |    1 
 18 files changed, 172 insertions(+), 98 deletions(-)

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-12-18 ...
From: Christoph Hellwig
Date: Wednesday, December 29, 2010 - 8:12 am

- the sync_inodes_sb export removal looks fine to, but should be a
   separate patch with a good changelog
 - why is the sync_inode_metadata wait parameter removed?  Especially
   as we would need it for some callers that need to be converted to it.
   E.g. in this patch converts writeback_inode from non-blocking to
   blocking behaviour.
 - except for that the sync_inode() removal is fine, I had planned for
   that already.  But again, please a separate and well-documented
   patch.

As for the actualy sync_inode operation:  I don't think it's helpful.
The *_sync_inode helpers in ext2 and fat are fine, but there's little
point in going through an iops vector for it.  Also adding the file
syncing really complicates the API for now reason, epecially with
the range interface.

--

From: Nick Piggin
Date: Monday, January 3, 2011 - 11:27 pm

Basically the blocking or non blocking behaviour of ->write_inode is
irrelevant and will go away. Specific behaviour cannot be relied upon from
generic code, only filesystems. But even filesystems should not
differentiate between blocking and non-blocking (where they do, they
have been buggy). That is the *other* big bug in the code (the first one
being syncing code not waiting for writeback).

So after my series, we guarantee ->write_inode is called after the inode
has been dirtied. We make no guarantees about what mode it is called in
(blocking or non blocking). So the filesystem should either have _all_
write_inode calls do sync writeout, or have them all do part of the op
(eg. clean struct inode by dirtying pagecache) and then syncing in fsync
and sync_fs. In the latter scheme, it doesn't make sense to do anything
more in the case of a "sync" call.

Basically you can spot the bugs by looking for where the .write_inode
functions differentiate the writeback mode.

So, that's all very confusing with the current set of APIs that are
called sync_inode, sync_inode_metadata, etc. So I'm trying to reduce and


It does have a reason, which is the nfsd syncing callback -- using
sync_inode_metadata there is actually wrong and should really be
replaced with a warning that the fs cannot support such syncing.

See the problem I explained above -- it really needs to do a full
fsync call. But it doesn't have a file descriptor, and most filesystems
don't need one, so I think a sync_inode operation is nice.

Also giving filesystems the flexibility to do the data writeout can
be more optimal by doing all writeout at once or ordering to suit their
writeback patterns. Having range data could give some performance
advantages writing back mappings or commit operations over network. I
don't see it as a big complexity. It gives a chance to do range fsyncs
and sync_file_range and such properly too.


--

From: Christoph Hellwig
Date: Monday, January 3, 2011 - 11:57 pm

There is absolutely no problem with writing out inodes asynchronously,
it's just that the writeback code can't really cope with it right now.
The fix is to only update i_state on I/O completion.

The blocking vs non-blocking difference is actually quite important for
performance still.  In fact for most modern filesystems we simply don't
every want to do a normal writeout for the !sync case.  Either it would
be a no-op, or do something like we do in xfs currently, where we make
sure to read the inode to start the read/modify/write cycle early, and
put it at the end of the delayed write queue.  The sync one on the other
hand just logs the inode, so that the log force in ->sync_fs/etc can

It doesn't need to do an fsync, it's actually a quite different
operations.  That's why we added the ->commit_metadata operations.  What
NFSD really wants is to guarantee synchronous metadata operations.
Traditionally unix required the filesystems to be mounted using -o wsync
for that, but we try to emulate that from NFSD without affecting other
access to the filesystem.  The ->commit_metadata is there to ensure any
previously started async operations completes.  It does not have to
push all dirty state to disk like fsync.  Just compare the complexities
of xfs_file_fsync and xfs_fs_nfs_commit_metadata - the latter simply
checks if the inode has been logged, and if yes forces the log to disk
up to the last operation on this inode.  fsync does the same force if
the inode is not dirty, but otherwise has to start a new transactions.
It also has to wait for pending I/O completions before dealing with
metadata, and issue barriers to data devices, which is completly

We currently do all that just fine before calling into ->fsync.  That
doesn't mean we can't move it into ->fsync if a good reason for it comes
up, but

 a) it needs a good rational
 b) it should stay in ->fsync instead of beeing mixed up with the method
    used for other metadata writeback

--

From: Nick Piggin
Date: Tuesday, January 4, 2011 - 1:03 am

I don't know what you mean by this. As a filesystem generic API,
.write_inode says almost nothing. Some filesystems will do one thing,
some may do something in future after patches to the vfs to allow more
fancy work, others do other things, generic code can't assume any of

I don't see how it could be. All integrity points still have to wait for
all potentially non blocking write_inode, so if that's the case just make
*all* the write_inode non blocking to begin with. Simpler code and
you'd get more performance at least in the sync(3) case where writeback


Thanks for the interesting comments, but I didn't say it needs an fsync,
I said sync_inode_metadata is not sufficient by definition because it

What do you mean we do all that just fine? A filesystem can't schedule
data submission and waiting optimally versus metadata, it can't do
metadata operations or remote commits corresponding to data ranges, and
it doesn't support nfs sync operations.

--

From: Nick Piggin
Date: Tuesday, January 4, 2011 - 2:49 am

Right, but that would be a bit of a change to generic code and a really
big change to make all filesystems support the API.

What I am saying right now is that .write_inode doesn't mean anything
to generic code, except that it does something "nice" for
writeback/reclaim purposes, and it needs to be called within the
sync / fsync protocol.

But .write_inode on its own doesn't mean anything. Not even if 99% of
the filesystems implement it in a particular way. Hence,

It doesn't matter, the integrity still has to be enforced in .sync_fs,
because sync .write_inode may *never* get called, because of the fact
that async .write_inode also clears the inode metadata dirty bits.

So if .sync_fs has to enforce integrity anyway, then you don't ever need

The patches are working towards that. It's a lot less broken that it
was, and (barring my oversights, or existing fs bugs) it should actually
"work" now.

In cases where particular calls remain, it is for example where the

Well then you have a bug, because a sync .write_inode *may never get
called*. You may only get an async one, even in the case of fsync,


Just if I'm adding this API it makes sense to add the obvious
parameters. A sync range does not "hugely complicate the API" at

The one that is attempted to be implemented with sync_inode_metadata.

--

From: Nick Piggin
Date: Friday, December 17, 2010 - 6:46 pm

There is a big fuckup with inode metadata writeback (I suspect in a lot
of filesystems, but I've only dared to look at a couple as yet).

ext2 relies on ->write_inode being called from sync_inode_metadata in
fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
after a call to this guy, and it doesn't actually write back and wait
on the inode block unless it is called for sync. This means that write_inode
from background writeback can kill the inode dirty bits without the data
getting to disk. Fsync will subsequently miss it.

The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
write back the dirty data. In the full filesystem sync case, buffercache
writeback in the generic code will write back the dirty data. Other
filesystems could use ->sync_fs for this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -1211,7 +1211,7 @@ static int ext2_setsize(struct inode *in
 	return 0;
 }
 
-static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
 					struct buffer_head **p)
 {
 	struct buffer_head * bh;
@@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
 	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
-	if (do_sync) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh)) {
-			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
-				sb->s_id, (unsigned long) ino);
-			err = -EIO;
-		}
-	}
-	ei->i_state &= ~EXT2_STATE_NEW;
 	brelse (bh);
+	ei->i_state &= ~EXT2_STATE_NEW;
 	return err;
 }
 
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- ...
Previous thread: [PATCH 3/3] memblock: Make find_memory_core_early() find from top-down by Yinghai Lu on Friday, December 17, 2010 - 5:59 pm. (1 message)

Next thread: [PATCH] s1d13xxxfb: drop unused code by Andres Salomon on Friday, December 17, 2010 - 8:00 pm. (2 messages)