[RFC PATCH v7] ext4: Coordinate data-only flush requests sent by fsync

Previous thread: Ensure Your Account by WEBMASTER SERVICE on Monday, November 29, 2010 - 2:23 pm. (1 message)

Next thread: [PATCH 0/2] x86, nmi watchdog: Fixes from removing old nmi watchdog by Don Zickus on Monday, November 29, 2010 - 3:07 pm. (10 messages)
From: Darrick J. Wong
Date: Monday, November 29, 2010 - 3:05 pm

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time.  Typically, these are simple storage systems with write cache
enabled and no battery to save that cache after a power failure.  When we
encounter a system with many I/O threads that write data and then call fsync
after more transactions accumulate, ext4_sync_file performs a data-only flush,
the performance of which is suboptimal because each of those threads issues its
own flush command to the drive instead of trying to coordinate the flush,
thereby wasting execution time.

Instead of each fsync call initiating its own flush, there's now a flag to
indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a flush.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are any
other threads that are close behind in ext4_sync_file.  After that wait, the
state transitions to 2 and the flush is issued.  Once that's done, the state
goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled.  Instead of issuing a flush themselves, they simply
wait for that first thread to do it for them.  If they see that the flag is 2,
they wait for the current flush to finish, and start over.

However, there are a couple of exceptions to this rule.  First, there exist
high-end storage arrays with battery-backed write caches for which flush
commands take very little time (< 2ms); on these systems, performing the
coordination actually lowers performance.  Given the earlier patch to the block
layer to report low-level device flush times, we can detect this situation and
have all threads issue flushes without coordinating, as we did before.  The
second case is when there's a single thread issuing flushes, in which case it
can skip the ...
From: Darrick J. Wong
Date: Monday, November 29, 2010 - 3:05 pm

This patch adds to the block layer the ability to intercept flush requests for
the purpose of measuring the round-trip times of the write-cache flush command
on whatever hardware sits underneath the block layer.  The eventual intent of
this patch is for higher-level clients (filesystems) to be able to measure
flush times to tune their use of them.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---
 block/blk-core.c          |   13 +++++++++++++
 block/genhd.c             |   39 +++++++++++++++++++++++++++++++++++++++
 fs/bio.c                  |   11 +++++++++++
 include/linux/blk_types.h |    2 ++
 include/linux/blkdev.h    |    2 ++
 include/linux/genhd.h     |   23 +++++++++++++++++++++++
 6 files changed, 90 insertions(+), 0 deletions(-)


diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..233573a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1540,6 +1540,9 @@ static inline void __generic_make_request(struct bio *bio)
 
 		trace_block_bio_queue(q, bio);
 
+		if (bio->bi_rw & REQ_FLUSH)
+			bio->flush_start_time_ns = ktime_to_ns(ktime_get());
+
 		ret = q->make_request_fn(q, bio);
 	} while (ret);
 
@@ -1954,6 +1957,9 @@ void blk_start_request(struct request *req)
 		req->next_rq->resid_len = blk_rq_bytes(req->next_rq);
 
 	blk_add_timer(req);
+
+	if (req->cmd_flags & REQ_FLUSH)
+		req->flush_start_time_ns = ktime_to_ns(ktime_get());
 }
 EXPORT_SYMBOL(blk_start_request);
 
@@ -2182,6 +2188,13 @@ EXPORT_SYMBOL_GPL(blk_unprep_request);
  */
 static void blk_finish_request(struct request *req, int error)
 {
+	if (!error && req->cmd_flags & REQ_FLUSH) {
+		u64 delta;
+
+		delta = ktime_to_ns(ktime_get()) - req->flush_start_time_ns;
+		update_flush_times(req->rq_disk, delta);
+	}
+
 	if (blk_rq_tagged(req))
 		blk_queue_end_tag(req->q, req);
 
diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..465bd00 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -538,6 +538,9 @@ void add_disk(struct gendisk *disk)
 	 */
 ...
From: Lukas Czerner
Date: Thursday, December 2, 2010 - 2:49 am

Hi,

This is not exactly helpful name for the function since it does not
store anything, but rather clear the counters. Would not be
disk_avg_flush_time_ns_clear a better name ?

Thanks!

--

From: Darrick J. Wong
Date: Monday, November 29, 2010 - 3:05 pm

For dm devices which are composed of other block devices, a flush is mapped out
to those other block devices.  Therefore, the average flush time can be
computed as the average flush time of whichever device flushes most slowly.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---
 drivers/md/dm.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)


diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7cb1352..62aeeb9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -846,12 +846,38 @@ static void start_queue(struct request_queue *q)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+static void measure_flushes(struct mapped_device *md)
+{
+	struct dm_table *t;
+	struct dm_dev_internal *dd;
+	struct list_head *devices;
+	u64 max = 0, samples = 0;
+
+	t = dm_get_live_table(md);
+	devices = dm_table_get_devices(t);
+	list_for_each_entry(dd, devices, list) {
+		if (dd->dm_dev.bdev->bd_disk->avg_flush_time_ns <= max)
+			continue;
+		max = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
+		samples = dd->dm_dev.bdev->bd_disk->flush_samples;
+	}
+	dm_table_put(t);
+
+	spin_lock(&md->disk->flush_time_lock);
+	md->disk->avg_flush_time_ns = max;
+	md->disk->flush_samples = samples;
+	spin_unlock(&md->disk->flush_time_lock);
+}
+
 static void dm_done(struct request *clone, int error, bool mapped)
 {
 	int r = error;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
 
+	if (clone->cmd_flags & REQ_FLUSH)
+		measure_flushes(tio->md);
+
 	if (mapped && rq_end_io)
 		r = rq_end_io(tio->ti, clone, error, &tio->info);
 
@@ -2310,6 +2336,8 @@ static void dm_wq_work(struct work_struct *work)
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else
+			if (c->bi_rw & REQ_FLUSH)
+				measure_flushes(md);
 			__split_and_process_bio(md, c);
 
 		down_read(&md->io_lock);

--

From: Mike Snitzer
Date: Monday, November 29, 2010 - 10:21 pm

On Mon, Nov 29 2010 at  5:05pm -0500,

I share Neil's concern about having to track such fine grained
additional state in order to make the FS behave somewhat better.  What
are the _real_ fsync-happy workloads which warrant this optimization?


You're checking all devices in a table rather than all devices that will
receive a flush.  The devices that will receive a flush is left for each
target to determine (target exposes num_flush_requests).  I'd prefer to
see a more controlled .iterate_devices() based iteration of devices in
each target.

dm-table.c:dm_calculate_queue_limits() shows how iterate_devices can be
used to combine device specific data using a common callback and a data
pointer -- for that data pointer we'd need a local temporary structure

You're missing important curly braces for the else in your dm_wq_work()
change...

But the bio-based call to measure_flushes() (dm_wq_work's call) should
be pushed into __split_and_process_bio() -- and maybe measure_flushes()
could grow a 'struct dm_table *table' argument that, if not NULL, avoids
getting the reference that __split_and_process_bio() already has on the
live table.

Mike
--

From: Darrick J. Wong
Date: Monday, November 29, 2010 - 3:06 pm

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time.  Typically, these are simple storage systems with write cache
enabled and no battery to save that cache after a power failure.  When we
encounter a system with many I/O threads that write data and then call fsync
after more transactions accumulate, ext4_sync_file performs a data-only flush,
the performance of which is suboptimal because each of those threads issues its
own flush command to the drive instead of trying to coordinate the flush,
thereby wasting execution time.

Instead of each fsync call initiating its own flush, there's now a flag to
indicate if (0) no flushes are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a flush.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are any
other threads that are close behind in ext4_sync_file.  After that wait, the
state transitions to 2 and the flush is issued.  Once that's done, the state
goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled.  Instead of issuing a flush themselves, they simply
wait for that first thread to do it for them.  If they see that the flag is 2,
they wait for the current flush to finish, and start over.

However, there are a couple of exceptions to this rule.  First, there exist
high-end storage arrays with battery-backed write caches for which flush
commands take very little time (< 2ms); on these systems, performing the
coordination actually lowers performance.  Given the earlier patch to the block
layer to report low-level device flush times, we can detect this situation and
have all threads issue flushes without coordinating, as we did before.  The
second case is when there's a single thread issuing flushes, in which case it
can skip the ...
From: Darrick J. Wong
Date: Monday, November 29, 2010 - 3:05 pm

For md devices which are composed of other block devices, a flush is spread out
to those other block devices.  Therefore, the average flush time can be
computed as the average flush time of whichever device flushes most slowly.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---
 drivers/md/md.c |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 43243a4..af25c96 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -357,6 +357,28 @@ EXPORT_SYMBOL(mddev_congested);
  * Generic flush handling for md
  */
 
+static void measure_flushes(mddev_t *mddev)
+{
+	mdk_rdev_t *rdev;
+	u64 max = 0, samples = 0;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= 0 &&
+		    !test_bit(Faulty, &rdev->flags)) {
+			if (rdev->bdev->bd_disk->avg_flush_time_ns <= max)
+				continue;
+			max = rdev->bdev->bd_disk->avg_flush_time_ns;
+			samples = rdev->bdev->bd_disk->flush_samples;
+		}
+	rcu_read_unlock();
+
+	spin_lock(&mddev->gendisk->flush_time_lock);
+	mddev->gendisk->avg_flush_time_ns = max;
+	mddev->gendisk->flush_samples = samples;
+	spin_unlock(&mddev->gendisk->flush_time_lock);
+}
+
 static void md_end_flush(struct bio *bio, int err)
 {
 	mdk_rdev_t *rdev = bio->bi_private;
@@ -365,6 +387,7 @@ static void md_end_flush(struct bio *bio, int err)
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		measure_flushes(mddev);
 		/* The pre-request flush has finished */
 		queue_work(md_wq, &mddev->flush_work);
 	}

--

From: Ric Wheeler
Date: Monday, November 29, 2010 - 4:48 pm

Hi Darrick,

Just curious why we would need to have batching in both places? Doesn't your 
patch set make the jbd2 transaction batching redundant?

I noticed that the patches have a default delay and a mount option to override 
that default. The jbd2 code today tries to measure the average time needed in a 
transaction and automatically tune itself. Can't we do something similar with 
your patch set? (I hate to see yet another mount option added!)

Regards,

Ric


--

From: Darrick J. Wong
Date: Monday, November 29, 2010 - 5:19 pm

The code path that I'm changing is only executed when ext4_sync_file determines
that the flush can't go through the journal, i.e. whenever the previous
sequence of data writes hasn't resulted in any metadata updates, or if the

The mount option is no longer the delay time, as it was in previous patches.
In the (unreleased) v5 patch, the code automatically tuned the delay based on
the average flush time.  However, we then observed very low flush times (< 2ms)
and about a 6% regression on our arrays with battery-backed write cache, so the
auto-tune code was then adapted in v6 to skip the coordination if the average
flush time falls below that threshold, as it does on our arrays.

Therefore, the new mount option exists to override the default threshold.

--D
--

From: Mingming Cao
Date: Tuesday, November 30, 2010 - 5:14 pm

We hoped JBD2 could take care of the batching too. But ftrace shows
there are a fair amount of barriers (440 barriers/second, if I remember

I don't like a new mount option too. Darrick's new mount option is for
the threshold to turn on/off batching. Probably could make it a tunables
instead of a mount option.

Regards,


--

From: Neil Brown
Date: Monday, November 29, 2010 - 5:39 pm

On Mon, 29 Nov 2010 14:05:36 -0800 "Darrick J. Wong" <djwong@us.ibm.com>

I haven't seen any of the preceding discussion do I might be missing
something important, but this seems needlessly complex and intrusive.
In particular, I don't like adding code to md to propagate these timings up
to the fs, and I don't the arbitrary '2ms' number.

Would it not be sufficient to simply gather flushes while a flush is pending.
i.e
  - if no flush is pending, set the 'flush pending' flag, submit a flush,
    then clear the flag.
  - if a flush is pending, then wait for it to complete, and then submit a
    single flush on behalf of all pending flushes.

That way when flush is fast, you do a flush every time, and when it is slow
you gather multiple flushes together.
I think it would issues a few more flushes than your scheme, but it would be
a much neater solution.  Have you tried that and found it to be insufficient?

Thanks,
NeilBrown



--

From: Ric Wheeler
Date: Monday, November 29, 2010 - 5:48 pm

The problem with that is that you can introduce a wait for the next flush longer 
than it would take to complete the flush. Having the wait adjust itself 
according to the speed of the device is much better I think....

Ric

--

From: Neil Brown
Date: Monday, November 29, 2010 - 6:26 pm

Hi Ric,

  You certainly can introduce a wait longer than the flush would take, but
  the proposed code does that too.

  The proposed code waits "average flush time", then submits a flush for
  all threads that are waiting.
  My suggestion submits a flush (Which on average takes 'average flush time')
  and then submits a flush for all threads that started waiting during that
  time.

  So we do get two flushes rather than one, but also the flush starts
  straight away, so will presumably finish sooner.

  I do have the wait 'adjust itself according to the speed of the device' by
  making flushes wait for one flush to complete.


  I'm guessing that the situation where this is an issue is when you have a
  nearly constant stream of flush requests - is that right?

  In that case:
     - the current code would submit lots of over-lapping flush requests,
       reducing the opportunity for optimising multiple requests in the
       device,
     - my proposal would submit a steady sequence of flushes with minimal
       gaps
     - the code from Darrick would submit a steady sequence of flush requests
       which would be separated by pauses of 'average flush time'.
       This would increase latency, but might increase throughput too, I
       don't know.

  So it seems to me that the performance differences between my suggesting
  and Darrick's proposal are not obvious. So some testing would be
  interesting.

 I was going to suggest changes to Darrick's code to demonstrate my idea, but
 I think that code is actually wrong, so it wouldn't be a good base to start.

 In particular, the usage of a continuation seems racy.
 As soon as one thread sets EXT4_FLUSH_IDLE, another thread could call
 INIT_COMPLETION, before some other threads has had a chance to wake up and
 test the completion status in wait_for_completion.
 The effect is that the other thread would wait for an extra time+flush which
 it shouldn't have to.  So it isn't a serious race, but it looks ...
From: Darrick J. Wong
Date: Tuesday, November 30, 2010 - 4:32 pm

Yes.

--

From: Tejun Heo
Date: Tuesday, November 30, 2010 - 6:45 am

Hello,


Heh, I was about to suggest exactly the same thing.  Unless the delay
is gonna be multiple times longer than avg flush time, I don't think
the difference between the above scheme and the one w/ preemptive
delay would be anything significant especially now that the cost of
flush is much lower.  Also, as Neil pointed out in another message,
the above scheme will result in lower latency for flushes issued while
no flush is in progress.

IMO, this kind of optimization is gonna make noticeable difference
only when there are a lot of simulatenous fsyncs, in which case the
above would behave in mostly identical way with the more elaborate
timer based one anyway.

Thanks.

-- 
tejun
--

From: Ric Wheeler
Date: Tuesday, November 30, 2010 - 6:58 am

When we played with this in ext3/4, it was important to not wait when doing 
single threaded fsync's (a pretty common case) since that would just make them 
slower.

Also, the wait time for multi-threaded fsync's should be capped at some fraction 
of the time to complete a flush. For example, we had ATA_CACHE_FLUSH_EXT 
commands that took say 16ms or so to flush and waited one jiffie (4ms) and that 
worked well. It tanked when we used that fixed waiting time for a high speed 
device that could execute a flush in say 1ms (meaning we waited 4 times as long 
as it would have taken to just submit the fsync().

I am still not clear that the scheme that you and Neil are proposing would 
really batch up enough flushes to help though since you effectively do not wait.

The workload that we used years back was single threaded fs_mark (small files), 
2 threads, 4 threads, 8 threads, 16 threads.

Single threaded should show no slow down with various schemes showing 
multi-threaded writes grow with the number threads to some point....

Ric

--

From: Christoph Hellwig
Date: Tuesday, November 30, 2010 - 9:43 am

We can even optimize the second flush away if no other I/O has
completed since the first flush has started.  That will always be the
case for SATA devices as the cache flush command is not queued.

--

From: Darrick J. Wong
Date: Tuesday, November 30, 2010 - 4:31 pm

Some time ago I actually did test the patchset with the schedule_hrtimeout
removed, which I think is fairly close to what you've suggested.  As I recall,
it did help a bit, though not as much as also instituting the wait to limit the
% of disk execution time spent on flushes.  That said, I think you might be
right about the completion races, so I'll try to code up your suggestion to
see what happens with a newer kernel.

--

From: Christoph Hellwig
Date: Tuesday, November 30, 2010 - 9:41 am

FYI, here's an updated version of my patch to not run pure flushes
that also works on SCSI/ATA and not just virtio.  I suspect the patch
alone might not be enough, but together with a variant of Neil's
suggestion might do the trick, with my patch taking care of the
highend-devices and Neil's scheme of taking care of stupid SATA disks.


Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c	2010-11-30 17:27:33.108254088 +0100
+++ linux-2.6/block/blk-flush.c	2010-11-30 17:27:38.790004333 +0100
@@ -143,6 +143,17 @@ struct request *blk_do_flush(struct requ
 	unsigned skip = 0;
 
 	/*
+	 * Just issue pure flushes directly.
+	 */
+	if (!blk_rq_sectors(rq)) {
+		if (!do_preflush) {
+			__blk_end_request_all(rq, 0);
+			return NULL;
+		}
+		return rq;
+	}
+
+	/*
 	 * Special case.  If there's data but flush is not necessary,
 	 * the request can be issued directly.
 	 *
Index: linux-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_lib.c	2010-11-30 17:27:33.120254298 +0100
+++ linux-2.6/drivers/scsi/scsi_lib.c	2010-11-30 17:27:38.791003634 +0100
@@ -1060,7 +1060,7 @@ int scsi_setup_blk_pc_cmnd(struct scsi_d
 	 * that does not transfer data, in which case they may optionally
 	 * submit a request without an attached bio.
 	 */
-	if (req->bio) {
+	if (req->bio && !(req->cmd_flags & REQ_FLUSH)) {
 		int ret;
 
 		BUG_ON(!req->nr_phys_segments);
--

From: Darrick J. Wong
Date: Tuesday, January 4, 2011 - 9:27 am

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time.  Typically, these are simple storage systems with write cache
enabled and no battery to save that cache after a power failure.  When we
encounter a system with many I/O threads that write data and then call fsync
after more transactions accumulate, ext4_sync_file performs a data-only flush,
the performance of which is suboptimal because each of those threads issues its
own flush command to the drive instead of trying to coordinate the flush,
thereby wasting execution time.

Instead of each fsync call initiating its own flush, we now try to detect the
situation where multiple threads are issuing flush requests.  The first thread
to enter ext4_sync_file becomes the owner of the flush, and any threads that
enter ext4_sync_file before the flush finishes are queued up to wait for the
next flush.  When that first flush finishes, one of those sleeping threads is
woken up to perform the next flush and wake up the other threads that are
merely asleep waiting for the second flush to finish.

In the single-threaded case, the thread will simply issue the flush and exit.
There is no delay factor as in previous versions of this patch.

This patch probably belongs in the block layer, but now that I've made a
working proof-of-concept in ext4 I thought it expedient to push it out to
lkml/linux-ext4 for comments while I work on moving it to the block layer.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 fs/ext4/ext4.h  |   15 ++++++++
 fs/ext4/fsync.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/super.c |   10 ++++++
 3 files changed, 124 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94ce3d7..2ea4fe6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -36,6 +36,16 @@
 /*
  * The fourth extended filesystem constants/structures
  */
+struct flush_completion_t {
+	struct completion ready;
+	struct completion finish;
+	int ...
Previous thread: Ensure Your Account by WEBMASTER SERVICE on Monday, November 29, 2010 - 2:23 pm. (1 message)

Next thread: [PATCH 0/2] x86, nmi watchdog: Fixes from removing old nmi watchdog by Don Zickus on Monday, November 29, 2010 - 3:07 pm. (10 messages)