Remove padding-related code from relay_read()/relay_splice_read() et al.
Because we no longer write padding, we no longer have to read it or
account for it anywhere else, greatly simplifying the related code.
Signed-off-by: Tom Zanussi <zanussi@comcast.net>
---
kernel/relay.c | 149
++++++++------------------------------------------------
1 files changed, 20 insertions(+), 129 deletions(-)
diff --git a/kernel/relay.c b/kernel/relay.c
index d382528..b55466d 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -965,72 +965,13 @@ static void relay_file_read_consume(struct
rchan_buf *buf,
size_t bytes_consumed)
{
size_t subbuf_size = buf->chan->subbuf_size;
- size_t n_subbufs = buf->chan->n_subbufs;
- size_t read_subbuf;
-
- if (buf->subbufs_produced == buf->subbufs_consumed &&
- buf->offset == buf->bytes_consumed)
- return;
-
- if (buf->bytes_consumed + bytes_consumed > subbuf_size) {
- relay_subbufs_consumed(buf->chan, buf->cpu, 1);
- buf->bytes_consumed = 0;
- }
buf->bytes_consumed += bytes_consumed;
- if (!read_pos)
- read_subbuf = buf->subbufs_consumed % n_subbufs;
- else
- read_subbuf = read_pos / buf->chan->subbuf_size;
- if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) {
- if ((read_subbuf == buf->subbufs_produced % n_subbufs) &&
- (buf->offset == subbuf_size))
- return;
- relay_subbufs_consumed(buf->chan, buf->cpu, 1);
- buf->bytes_consumed = 0;
- }
-}
-/*
- * relay_file_read_avail - boolean, are there unconsumed bytes
available?
- */
-static int relay_file_read_avail(struct rchan_buf *buf, size_t
read_pos)
-{
- size_t subbuf_size = buf->chan->subbuf_size;
- size_t n_subbufs = buf->chan->n_subbufs;
- size_t produced = buf->subbufs_produced;
- size_t consumed = buf->subbufs_consumed;
-
- relay_file_read_consume(buf, read_pos, 0);
-
- consumed = buf->subbufs_consumed;
-
- if (unlikely(buf->offset > subbuf_size)) {
- if (produced == consumed)
- return 0;
- return ...Hi Tom, This question might sound a bit dumb, but I'll ask anyway : why do you implement a splice_read rather than a splice_write in relay ? splice_read allows reading information from a file or from a socket to a pipe, while splice_write does the opposite. So if you implement a relay splice_read, you probably consider the relay buffers to be a "file", so you really have to send the information to a pipe, and then you have to use this pipe to send the data elsewhere. My first reaction when looking at the splice implementation is that what we would really want is a splice_write which would take the data from a pipe (actually, we would have to write an actor which would make the relay buffer behave like a pipe) and write it either to disk or to a socket. Is there something I am misunderstanding here ? Thanks, -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Yeah, that's pretty much it. In the case of splicing from a relay file to an output file, the path is basically relay->pipe->outfile. To do the relay->pipe part, do_splice_to() needs a splice_read() implementation, Yes, I think a relay splice_write implementation would take the data from a pipe but would write it into a relay buffer i.e. input->pipe->relay with do_splice_from() calling relay's splice_write() implementation. Once the data is in relay, getting it from there to the disk or socket would follow the same path as the relay->outfile path. At least that's the way I think of it. For tracing, it probably doesn't make sense to splice_write file data into relay, but it would to splice in pages of data from other tracing sources, such as for example pages of trace data from userspace tracers, which they could do using vmsplice i.e. userpages->pipe->relay. If you used SPLICE_F_MOVE for the relay->outfile part and SPLICE_F_GIFT for the userpages->relay part, you'd get the trace data directly from the writer to the destination without any copying. Unfortunately, SPLICE_F_MOVE support was removed so this won't really work right now. Also, pages would have to be kept in a linked list instead of an array for this. Also, for this scheme to work, your trace stream would need to be able to tolerate having pages from different sources inserted basically anywhere in the stream, which means each page would have to be self-contained. If you could atomically insert a set of pages, you could I guess make that a self-contained unit as well and have some kind of flag that says the next n pages in the stream are part of the current event. That might provide a simple way to have multi-page events or to log blobs of binary data - rather than reserving space for them, you'd just splice the pages directly into the buffer. So maybe splice_read and splice_write on a list of pages, some write functions for tracing into the pages and a userspace post-processor to demultiplex it all ...
I decided to go another round with this after all... The patch following this mail contains the full patch; because there have been so many changes and it's hard to see from just looking at the patch the end result, I'm including relay.c and relay.h at the end of this mail. The full patch also includes the two new files, relay_pagewriter.c and .h in for anyone interested in seeing what those look like. Basically the patch includes the changes from the previous 11 that I posted and in addition completely separates the reading part of relay from the writing part. With the new changes, relay really does become just what its name says and and nothing more - it accepts pages from tracers, and relays the data to userspace via read(2) or splice(2) (and therefore sendfile(2)). It doesn't allocate any buffer space and provides no write functions - those are expected to be supplied by some other component such as possibly the unified ring-buffer or any other tracer that might want relay pages of trace data to userspace. One example of such a component would be the original relay write functions and buffers (the no-vmap page-based versions of the previous patchset), which have been split out into a new file called relay_pagewriter.c and provide one means of writing into pages and feeding them into relay. blktrace and kvmtrace have been 'ported' over to using pagewriter instead of relay directly. I've only tested the new relay lightly via blktrace, which seems to work fine, and haven't looked at plugging anything else into it, but after applying the full patch you should be able to use it to stream e.g. ftrace/unified trace buffer/ring-buffer trace data to disk or over the network... Anyway, here's a brief overview of the new API (see code for details): - relay_open(): Creates a per-cpu relay channel and by default associates debugfs files with each per-cpu 'buffer'. No buffer space is allocated for the 'buffers', rather they collect pages added by tracers in a list which ...
The full relay patch.
Basically it includes the changes from the previous 11 that I posted and
in addition completely separates the reading part of relay from the
writing part. With the new changes, relay really does become just what
its name says and and nothing more - it accepts pages from tracers, and
relays the data to userspace via read(2) or splice(2) (and therefore
sendfile(2)). It doesn't allocate any buffer space and provides no
write functions - those are expected to be supplied by some other
component such as the unified ring-buffer or any other tracer that might
want relay pages of trace data to userspace.
Includes original relay write functions and buffers (the no-vmap
page-based versions of the previous patchset), which have been split out
into a new file called relay_pagewriter.c and provide one means of
writing into pages and feeding them into relay. blktrace and kvmtrace
have been 'ported' over to using pagewriter instead of relay directly.
Signed-off-by: Tom Zanussi <zanussi@comcast.net>
diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..8ba7094 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -35,7 +35,7 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
{
struct blk_io_trace *t;
- t = relay_reserve(bt->rchan, sizeof(*t) + len);
+ t = kmalloc(sizeof(*t) + len, GFP_KERNEL);
if (t) {
const int cpu = smp_processor_id();
@@ -47,6 +47,8 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
t->cpu = cpu;
t->pdu_len = len;
memcpy((void *) t + sizeof(*t), data, len);
+ pagewriter_write(bt->pagewriter, t, sizeof(*t) + len);
+ kfree(t);
}
}
@@ -166,7 +168,7 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
if (unlikely(tsk->btrace_seq != blktrace_seq))
trace_note_tsk(bt, tsk);
- t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+ t = kmalloc(sizeof(*t) + pdu_len, GFP_KERNEL);
if (t) {
cpu = smp_processor_id();
sequence = ...Ugh, that's no good - it's both way too expensive, and also requires an Ditto - I don't think this approach is viable at all, sorry! -- Jens Axboe --
This is was only to keep things working until I could add reserve()
The patch below adds reserve() back and changes blktrace back to using
it. Adding it back also meant adding padding back into the equation,
but now there's a way to write a padding 'event' as part of the event
stream rather than as metadata. For blktrace, I added a blktrace_notify
'padding message', which I'm sure isn't really what you'd want, but it
seems to do the trick for now, and didn't even require any changes in
blkparse - it happily skips over the padding as intended.
Tom
Add pagewrite_reserve().
Also added is a callback name write_padding() which can be used to
write padding events at the end of the page if an event won't fit in
the remaining space.
diff --git a/block/blktrace.c b/block/blktrace.c
index 8ba7094..f5b745d 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -35,7 +35,7 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
{
struct blk_io_trace *t;
- t = kmalloc(sizeof(*t) + len, GFP_KERNEL);
+ t = pagewriter_reserve(bt->pagewriter, sizeof(*t) + len, sizeof(*t));
if (t) {
const int cpu = smp_processor_id();
@@ -47,8 +47,6 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
t->cpu = cpu;
t->pdu_len = len;
memcpy((void *) t + sizeof(*t), data, len);
- pagewriter_write(bt->pagewriter, t, sizeof(*t) + len);
- kfree(t);
}
}
@@ -168,7 +166,8 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
if (unlikely(tsk->btrace_seq != blktrace_seq))
trace_note_tsk(bt, tsk);
- t = kmalloc(sizeof(*t) + pdu_len, GFP_KERNEL);
+ t = pagewriter_reserve(bt->pagewriter, sizeof(*t) + pdu_len,
+ sizeof(*t));
if (t) {
cpu = smp_processor_id();
sequence = per_cpu_ptr(bt->sequence, cpu);
@@ -187,8 +186,6 @@ void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
if (pdu_len)
memcpy((void *) t + sizeof(*t), pdu_data, ...I don't understand where you are going with this... With the ->splice_read(), we can move relay data into a pipe and pass that to a user application (or send it off using sendfile() or splice to a socket). So it's a way to get the data to user space, instead of using read(). With a ->splice_write(), you could support adding data to relayfs from userspace. Why? You want the ->splice_write() on the output end, for your socket or file or wherever you want to send to relay data TO. So as long as your model is that the kernel produces data and the user app consumes them, you need the ->splice_read() and not a I think so :-) -- Jens Axboe --
