Re: [PATCH 1/4] cfq-iosched: Keep track of average think time for the sync-noidle workload.

Previous thread: [PATCH] powerpc: Fix string library functions by Andreas Schwab on Tuesday, May 18, 2010 - 11:15 am. (2 messages)

Next thread: [PATCH] block: remove all rcu head initializations by Paul E. McKenney on Tuesday, May 18, 2010 - 11:36 am. (5 messages)
From: Jeff Moyer
Date: Tuesday, May 18, 2010 - 11:20 am

Hi,

In this, the fourth posting of this patch series, I've addressed the following
issues:
- cfq queue yielding is now done in select_queue instead of the dispatch routine
- minor patch review comments were addressed
- the queue is now yielded to a specific task

For those not familiar with this patch set already, previous discussions
appeared here:
  http://lkml.org/lkml/2010/4/1/344
  http://lkml.org/lkml/2010/4/7/325
  http://lkml.org/lkml/2010/4/14/394

This patch series addresses a performance problem experienced when running
io_zone with small file sizes (from 4KB up to 8MB) and including fsync in
the timings.  A good example of this would be the following command line:
  iozone -s 64 -e -f /mnt/test/iozone.0 -i 0
As the file sizes get larger, the performance improves.  By the time the
file size is 16MB, there is no difference in performance between runs
using CFQ and runs using deadline.  The storage in my testing was a NetApp
array connected via a single fibre channel link.  When testing against a
single SATA disk, the performance difference is not apparent.

fs_mark can also be used to show the performance problem using the following
example command line:
  fs_mark  -S  1  -D  100  -N  1000  -d  /mnt/test/fs_mark  -s  65536  -t  1  -w  4096

Following are some performance numbers from my testing.  The below numbers
represent an average of 5 runs for each configuration when running: 
        iozone -s 64 -e -f /mnt/test/iozone.0 -i 0
Numbers are in KB/s.

            |    SATA      |     %diff      ||      SAN      |     %diff
            |write |rewrite| write |rewrite || write |rewrite| write |rewrite
------------+--------------+----------------++-------------------------------
deadline    | 1452 |  1788 | 1.0   | 1.0    || 35611 | 46260 | 1.0   | 1.0
vanilla cfq | 1323 |  1330 | 0.91  | 0.74   ||  6725 |  7163 | 0.19  | 0.15
patched cfq | 1591 |  1485 | 1.10  | 0.83   || 35555 | 46358 | 1.0   | 1.0


Here are some fs_mark numbers from the same storage ...
From: Jeff Moyer
Date: Tuesday, May 18, 2010 - 11:20 am

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 fs/jbd/journal.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index bd224ee..267108f 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
 #include <linux/poison.h>
 #include <linux/proc_fs.h>
 #include <linux/debugfs.h>
+#include <linux/blkdev.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -549,6 +550,11 @@ int log_wait_commit(journal_t *journal, tid_t tid)
 	while (tid_gt(tid, journal->j_commit_sequence)) {
 		jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
 				  tid, journal->j_commit_sequence);
+		/*
+		 * Give up our I/O scheduler time slice to allow the journal
+		 * thread to issue I/O.
+		 */
+		blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
 		wake_up(&journal->j_wait_commit);
 		spin_unlock(&journal->j_state_lock);
 		wait_event(journal->j_wait_done_commit,
-- 
1.6.5.2

--

From: Jeff Moyer
Date: Tuesday, May 18, 2010 - 11:20 am

This patch implements a blk_yield to allow a process to voluntarily
give up its I/O scheduler time slice.  This is desirable for those processes
which know that they will be blocked on I/O from another process, such as
the file system journal thread.  Following patches will put calls to blk_yield
into jbd and jbd2.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 block/blk-core.c         |   13 +++++
 block/blk-settings.c     |    6 +++
 block/cfq-iosched.c      |  112 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator.c         |    8 +++
 include/linux/blkdev.h   |    4 ++
 include/linux/elevator.h |    3 +
 6 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..b8be6c8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -323,6 +323,18 @@ void blk_unplug(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_unplug);
 
+void generic_yield_iosched(struct request_queue *q, struct task_struct *tsk)
+{
+	elv_yield(q, tsk);
+}
+
+void blk_yield(struct request_queue *q, struct task_struct *tsk)
+{
+	if (q->yield_fn)
+		q->yield_fn(q, tsk);
+}
+EXPORT_SYMBOL(blk_yield);
+
 /**
  * blk_start_queue - restart a previously stopped queue
  * @q:    The &struct request_queue in question
@@ -580,6 +592,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	q->request_fn		= rfn;
 	q->prep_rq_fn		= NULL;
 	q->unplug_fn		= generic_unplug_device;
+	q->yield_fn		= generic_yield_iosched;
 	q->queue_flags		= QUEUE_FLAG_DEFAULT;
 	q->queue_lock		= lock;
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f5ed5a1..fe548c9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -171,6 +171,12 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 }
 EXPORT_SYMBOL(blk_queue_make_request);
 
+void blk_queue_yield(struct request_queue *q, yield_fn *yield)
+{
+	q->yield_fn = yield;
+}
+EXPORT_SYMBOL_GPL(blk_queue_yield);
+
 /**
  * ...
From: Vivek Goyal
Date: Tuesday, May 18, 2010 - 2:07 pm

I think if we just place cfq_cfqq_yield(cfqq), check above it, we don't
need above code modification?

This might result in some group losing fairness in certain scenarios, but
I guess we will tackle it once we face it. For the time being if fsync
thread is giving up cpu, journald commits will come in root group and
there might not be a point in wasting time idling on this group.

Vivek
--

From: Vivek Goyal
Date: Tuesday, May 18, 2010 - 2:44 pm

Why does yielding queue has to be active queue? Could it be possible that
a queue submitted a bunch of IO and did yield. It is possible that
yielding queue is not active queue. Even in that case we will like to mark

Why is it mandatory that target context has already the queue setup. If
there is no queue setup, can't we just yield the slice and let somebody
else run?

Oh, I guess you want to keep the slice and idle till target queue
dispatches some requests and sets up context and a queue? But even that

This is confusing. The yielding queue's stats are already part of service
tree's think time. So if yielding queue has done bunch of IO, then service
tree's mean think time will be low and we will not yield? I guess that's
the reason you are trying to check average number of queues in following

Does cfq_group_busy_queues_wl() give average number of queues. Were you


What happens to cfq_group considerations? So we can yield slices across
groups. That does not seem right. If existing group has requests on other
service trees, these should be processed first.

Can we handle all this logic in select_queue(). I think it might turn out
to be simpler. I mean in this function if we just mark the existing queue
as yielding and also setup who to yield the queue to and let
select_queue() take the decision whether to idle or choose new queue etc.
I guess it should be possible and code might turn out to be simpler to
read.

--

From: Jeff Moyer
Date: Tuesday, June 1, 2010 - 1:01 pm

ok.



The only callers of the yield method are yielding to a kernel thread
that is always going to exist.  I didn't see a need to add any


Sorry for confusing you.  Your deduction is correct.  Is there a way I


yield_to is not a flag.  ;-)  It is not cleared as it is only valid if
the cfq_cfqq_yield flag is set.  If you would find it more sensible, I

I'll give that a shot, thanks for the suggestion.  My concern is that,
when employing cgroups, you may never actually yield the queue depending
on where the file system's journal thread ends up.

Cheers,
Jeff
--

From: Jeff Moyer
Date: Tuesday, May 18, 2010 - 11:20 am

This patch gets CFQ back in line with deadline for iozone runs, especially
those testing small files + fsync timings.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 fs/jbd2/journal.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index c03d4dc..28848a6 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -41,6 +41,7 @@
 #include <linux/hash.h>
 #include <linux/log2.h>
 #include <linux/vmalloc.h>
+#include <linux/blkdev.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/jbd2.h>
@@ -580,6 +581,11 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	while (tid_gt(tid, journal->j_commit_sequence)) {
 		jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
 				  tid, journal->j_commit_sequence);
+		/*
+		 * Give up our I/O scheduler time slice to allow the journal
+		 * thread to issue I/O.
+		 */
+		blk_yield(journal->j_dev->bd_disk->queue, journal->j_task);
 		wake_up(&journal->j_wait_commit);
 		spin_unlock(&journal->j_state_lock);
 		wait_event(journal->j_wait_done_commit,
-- 
1.6.5.2

--

From: Jeff Moyer
Date: Tuesday, May 18, 2010 - 11:20 am

This patch uses an average think time for the entirety of the sync-noidle
workload to determine whether or not to idle on said workload.  This brings
it more in line with the policy for the sync queues in the sync workload.

Testing shows that this provided an overall increase in throughput for
a mixed workload on my hardware RAID array.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 block/cfq-iosched.c |   44 +++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 838834b..46a7fe5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -83,9 +83,14 @@ struct cfq_rb_root {
 	unsigned total_weight;
 	u64 min_vdisktime;
 	struct rb_node *active;
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
 };
 #define CFQ_RB_ROOT	(struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
-			.count = 0, .min_vdisktime = 0, }
+			.count = 0, .min_vdisktime = 0, .last_end_request = 0, \
+			.ttime_total = 0, .ttime_samples = 0, .ttime_mean = 0 }
 
 /*
  * Per process-grouping structure
@@ -962,8 +967,10 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 		goto done;
 
 	cfqg->weight = blkcg->weight;
-	for_each_cfqg_st(cfqg, i, j, st)
+	for_each_cfqg_st(cfqg, i, j, st) {
 		*st = CFQ_RB_ROOT;
+		st->last_end_request = jiffies;
+	}
 	RB_CLEAR_NODE(&cfqg->rb_node);
 
 	/*
@@ -1795,9 +1802,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	/*
 	 * Otherwise, we do only if they are the last ones
-	 * in their service tree.
+	 * in their service tree and the average think time is
+	 * less than the slice length.
 	 */
-	if (service_tree->count == 1 && cfq_cfqq_sync(cfqq))
+	if (service_tree->count == 1 && cfq_cfqq_sync(cfqq) &&
+	    (!sample_valid(service_tree->ttime_samples || 
+	     cfqq->slice_end - jiffies < ...
From: Vivek Goyal
Date: Tuesday, May 18, 2010 - 1:52 pm

^^^
Jeff, 

Are we closing sample_valid() bracket at right place here?

I am wondering where it is helping you. If it is to bring in line with
with sync tree (old implementation), then we should have also compared
the think time with slice_idle?

But comparing that here might not be the best thing as cfq_should_idle()
is used in many contexts.
 
Vivek
--

From: Vivek Goyal
Date: Tuesday, May 18, 2010 - 2:02 pm

This comparision will also might break some logic in select_queue() where
we wait for a queue/group to get busy even if queue's time slice has
expired.

********************************************************************
if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq)) {
                /*
                 * If slice had not expired at the completion of last
                 * request
                 * we might not have turned on wait_busy flag. Don't
                 * expire
                 * the queue yet. Allow the group to get backlogged.
                 *
                 * The very fact that we have used the slice, that means
                 * we
                 * have been idling all along on this queue and it should
                 * be
                 * ok to wait for this request to complete.
                 */
                if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
                    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
                        cfqq = NULL;
                        goto keep_queue;
		}

*************************************************************************

With this change, now above condition will never be true as
cfq_should_idle() will always return false as slice has already expired.
And that will affect group loosing its fair share.

So I guess we can define new functions to check more conditions instead of
putting it in cfq_should_idle()

Vivek
--

From: Jeff Moyer
Date: Tuesday, June 1, 2010 - 12:31 pm

The idea behind the patch is to optimize the idling such that we don't
wait needlessly for more sync-noidle I/O when it likely isn't coming.
As mentioned in the patch description, it aims to bring the sync-noidle
workload, which is treated like a single cfqq, more in-line with the

I'm not sure I follow 100%.  Are you saying we should disable idling for
the sync-noidle workload if the think time is too long?  That sounds


Right, thanks for pointing this out.  Do we have a test case that
exposes this issue?  We really need to start a regression test suite for
CFQ.  Also, I had promised to run some numbers for you with cgroups
enabled and I didn't.  I'll get that data before the next posting.

Thanks for the review, Vivek!

-Jeff
--

From: Jeff Moyer
Date: Wednesday, May 26, 2010 - 8:33 am

Perhaps.  I don't have a debian system handy to test that, though.

Cheers,
Jeff
--

Previous thread: [PATCH] powerpc: Fix string library functions by Andreas Schwab on Tuesday, May 18, 2010 - 11:15 am. (2 messages)

Next thread: [PATCH] block: remove all rcu head initializations by Paul E. McKenney on Tuesday, May 18, 2010 - 11:36 am. (5 messages)