Re: [RFC PATCH 7/11] relay - Remove padding-related code from relay_read()/relay_splice_read() et al.

Previous thread: [RFC PATCH 6/11] relay - Replace relay_reserve/relay_write with non-padded versions. by Tom Zanussi on Sunday, September 28, 2008 - 10:40 pm. (1 message)

Next thread: [RFC PATCH 9/11] relay - Replace subbuf_start and notify_consumers callbacks with new_subbuf. by Tom Zanussi on Sunday, September 28, 2008 - 10:40 pm. (1 message)
From: Tom Zanussi
Date: Sunday, September 28, 2008 - 10:40 pm

Remove padding-related code from relay_read()/relay_splice_read() et al.

Because we no longer write padding, we no longer have to read it or
account for it anywhere else, greatly simplifying the related code.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>

---
 kernel/relay.c |  149
++++++++------------------------------------------------
 1 files changed, 20 insertions(+), 129 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index d382528..b55466d 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -965,72 +965,13 @@ static void relay_file_read_consume(struct
rchan_buf *buf,
 				    size_t bytes_consumed)
 {
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t read_subbuf;
-
-	if (buf->subbufs_produced == buf->subbufs_consumed &&
-	    buf->offset == buf->bytes_consumed)
-		return;
-
-	if (buf->bytes_consumed + bytes_consumed > subbuf_size) {
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
 
 	buf->bytes_consumed += bytes_consumed;
-	if (!read_pos)
-		read_subbuf = buf->subbufs_consumed % n_subbufs;
-	else
-		read_subbuf = read_pos / buf->chan->subbuf_size;
-	if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) {
-		if ((read_subbuf == buf->subbufs_produced % n_subbufs) &&
-		    (buf->offset == subbuf_size))
-			return;
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
-}
 
-/*
- *	relay_file_read_avail - boolean, are there unconsumed bytes
available?
- */
-static int relay_file_read_avail(struct rchan_buf *buf, size_t
read_pos)
-{
-	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t produced = buf->subbufs_produced;
-	size_t consumed = buf->subbufs_consumed;
-
-	relay_file_read_consume(buf, read_pos, 0);
-
-	consumed = buf->subbufs_consumed;
-
-	if (unlikely(buf->offset > subbuf_size)) {
-		if (produced == consumed)
-			return 0;
-		return ...
From: Mathieu Desnoyers
Date: Monday, September 29, 2008 - 9:27 am

Hi Tom,

This question might sound a bit dumb, but I'll ask anyway : why do you
implement a splice_read rather than a splice_write in relay ?

splice_read allows reading information from a file or from a socket to a
pipe, while splice_write does the opposite.

So if you implement a relay splice_read, you probably consider the
relay buffers to be a "file", so you really have to send the information
to a pipe, and then you have to use this pipe to send the data
elsewhere.

My first reaction when looking at the splice implementation is that what
we would really want is a splice_write which would take the data from a
pipe (actually, we would have to write an actor which would make the
relay buffer behave like a pipe) and write it either to disk or to a
socket.

Is there something I am misunderstanding here ?

Thanks,


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Tom Zanussi
Date: Monday, September 29, 2008 - 10:04 pm

Yeah, that's pretty much it.  In the case of splicing from a relay file
to an output file, the path is basically relay->pipe->outfile. To do the
relay->pipe part, do_splice_to() needs a splice_read() implementation,

Yes, I think a relay splice_write implementation would take the data
from a pipe but would write it into a relay buffer i.e.
input->pipe->relay with do_splice_from() calling relay's splice_write()
implementation.  Once the data is in relay, getting it from there to the
disk or socket would follow the same path as the relay->outfile path.
At least that's the way I think of it.

For tracing, it probably doesn't make sense to splice_write file data
into relay, but it would to splice in pages of data from other tracing
sources, such as for example pages of trace data from userspace tracers,
which they could do using vmsplice i.e. userpages->pipe->relay. 

If you used SPLICE_F_MOVE for the relay->outfile part and SPLICE_F_GIFT
for the userpages->relay part, you'd get the trace data directly from
the writer to the destination without any copying.  Unfortunately,
SPLICE_F_MOVE support was removed so this won't really work right now.
Also, pages would have to be kept in a linked list instead of an array
for this.

Also, for this scheme to work, your trace stream would need to be able
to tolerate having pages from different sources inserted basically
anywhere in the stream, which means each page would have to be
self-contained.  If you could atomically insert a set of pages, you
could I guess make that a self-contained unit as well and have some kind
of flag that says the next n pages in the stream are part of the current
event. That might provide a simple way to have multi-page events or to
log blobs of binary data - rather than reserving space for them, you'd
just splice the pages directly into the buffer.

So maybe splice_read and splice_write on a list of pages, some write
functions for tracing into the pages and a userspace post-processor to
demultiplex it all ...
From: Tom Zanussi
Date: Sunday, October 5, 2008 - 10:22 pm

I decided to go another round with this after all...

The patch following this mail contains the full patch; because there
have been so many changes and it's hard to see from just looking at the
patch the end result, I'm including relay.c and relay.h at the end of
this mail.  The full patch also includes the two new files,
relay_pagewriter.c and .h in for anyone interested in seeing what those
look like.

Basically the patch includes the changes from the previous 11 that I
posted and in addition completely separates the reading part of relay
from the writing part.  With the new changes, relay really does become
just what its name says and and nothing more - it accepts pages from
tracers, and relays the data to userspace via read(2) or splice(2) (and
therefore sendfile(2)).  It doesn't allocate any buffer space and
provides no write functions - those are expected to be supplied by some
other component such as possibly the unified ring-buffer or any other
tracer that might want relay pages of trace data to userspace.

One example of such a component would be the original relay write
functions and buffers (the no-vmap page-based versions of the previous
patchset), which have been split out into a new file called
relay_pagewriter.c and provide one means of writing into pages and
feeding them into relay.  blktrace and kvmtrace have been 'ported' over
to using pagewriter instead of relay directly.

I've only tested the new relay lightly via blktrace, which seems to work
fine, and haven't looked at plugging anything else into it, but after
applying the full patch you should be able to use it to stream e.g.
ftrace/unified trace buffer/ring-buffer trace data to disk or over the
network...

Anyway, here's a brief overview of the new API (see code for details):

- relay_open():

Creates a per-cpu relay channel and by default associates debugfs files
with each per-cpu 'buffer'.  No buffer space is allocated for the
'buffers', rather they collect pages added by tracers in a list which ...
From: Tom Zanussi
Date: Sunday, October 5, 2008 - 10:22 pm

The full relay patch.

Basically it includes the changes from the previous 11 that I posted and
in addition completely separates the reading part of relay from the
writing part.  With the new changes, relay really does become just what
its name says and and nothing more - it accepts pages from tracers, and
relays the data to userspace via read(2) or splice(2) (and therefore
sendfile(2)).  It doesn't allocate any buffer space and provides no
write functions - those are expected to be supplied by some other
component such as the unified ring-buffer or any other tracer that might
want relay pages of trace data to userspace.

Includes original relay write functions and buffers (the no-vmap
page-based versions of the previous patchset), which have been split out
into a new file called relay_pagewriter.c and provide one means of
writing into pages and feeding them into relay.  blktrace and kvmtrace
have been 'ported' over to using pagewriter instead of relay directly.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>

diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..8ba7094 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -35,7 +35,7 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 {
 	struct blk_io_trace *t;
 
-	t = relay_reserve(bt->rchan, sizeof(*t) + len);
+	t = kmalloc(sizeof(*t) + len, GFP_KERNEL);
 	if (t) {
 		const int cpu = smp_processor_id();
 
@@ -47,6 +47,8 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 		t->cpu = cpu;
 		t->pdu_len = len;
 		memcpy((void *) t + sizeof(*t), data, len);
+		pagewriter_write(bt->pagewriter, t, sizeof(*t) + len);
+		kfree(t);
 	}
 }
 
@@ -166,7 +168,7 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	if (unlikely(tsk->btrace_seq != blktrace_seq))
 		trace_note_tsk(bt, tsk);
 
-	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+	t = kmalloc(sizeof(*t) + pdu_len, GFP_KERNEL);
 	if (t) {
 		cpu = smp_processor_id();
 		sequence = ...
From: Jens Axboe
Date: Monday, October 6, 2008 - 12:40 am

Ugh, that's no good - it's both way too expensive, and also requires an

Ditto - I don't think this approach is viable at all, sorry!

-- 
Jens Axboe

--

From: Tom Zanussi
Date: Monday, October 6, 2008 - 9:55 pm

This is was only to keep things working until I could add reserve()

The patch below adds reserve() back and changes blktrace back to using
it.  Adding it back also meant adding padding back into the equation,
but now there's a way to write a padding 'event' as part of the event
stream rather than as metadata.  For blktrace, I added a blktrace_notify
'padding message', which I'm sure isn't really what you'd want, but it
seems to do the trick for now, and didn't even require any changes in
blkparse - it happily skips over the padding as intended.

Tom

    Add pagewrite_reserve().
    
    Also added is a callback name write_padding() which can be used to
    write padding events at the end of the page if an event won't fit in
    the remaining space.

diff --git a/block/blktrace.c b/block/blktrace.c
index 8ba7094..f5b745d 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -35,7 +35,7 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 {
 	struct blk_io_trace *t;
 
-	t = kmalloc(sizeof(*t) + len, GFP_KERNEL);
+	t = pagewriter_reserve(bt->pagewriter, sizeof(*t) + len, sizeof(*t));
 	if (t) {
 		const int cpu = smp_processor_id();
 
@@ -47,8 +47,6 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 		t->cpu = cpu;
 		t->pdu_len = len;
 		memcpy((void *) t + sizeof(*t), data, len);
-		pagewriter_write(bt->pagewriter, t, sizeof(*t) + len);
-		kfree(t);
 	}
 }
 
@@ -168,7 +166,8 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	if (unlikely(tsk->btrace_seq != blktrace_seq))
 		trace_note_tsk(bt, tsk);
 
-	t = kmalloc(sizeof(*t) + pdu_len, GFP_KERNEL);
+	t = pagewriter_reserve(bt->pagewriter, sizeof(*t) + pdu_len,
+			       sizeof(*t));
 	if (t) {
 		cpu = smp_processor_id();
 		sequence = per_cpu_ptr(bt->sequence, cpu);
@@ -187,8 +186,6 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 
 		if (pdu_len)
 			memcpy((void *) t + sizeof(*t), pdu_data, ...
From: Jens Axboe
Date: Tuesday, September 30, 2008 - 2:04 am

I don't understand where you are going with this... With the
->splice_read(), we can move relay data into a pipe and pass that to a
user application (or send it off using sendfile() or splice to a
socket). So it's a way to get the data to user space, instead of using
read().

With a ->splice_write(), you could support adding data to relayfs from
userspace. Why? You want the ->splice_write() on the output end, for
your socket or file or wherever you want to send to relay data TO.

So as long as your model is that the kernel produces data and the user
app consumes them, you need the ->splice_read() and not a

I think so :-)

-- 
Jens Axboe

--

Previous thread: [RFC PATCH 6/11] relay - Replace relay_reserve/relay_write with non-padded versions. by Tom Zanussi on Sunday, September 28, 2008 - 10:40 pm. (1 message)

Next thread: [RFC PATCH 9/11] relay - Replace subbuf_start and notify_consumers callbacks with new_subbuf. by Tom Zanussi on Sunday, September 28, 2008 - 10:40 pm. (1 message)