Extracted from:
http://lkml.org/lkml/2010/7/9/368
(executive summary)
* Throughput

  * Flight recorder mode

Ring Buffer Library:       83 ns/entry (512kB sub-buffers, no reader)
                           89 ns/entry (512kB sub-buffers: read 0.3M entries/s)

Ftrace Ring Buffer:       103 ns/entry (no reader)
                          187 ns/entry (read by event: read 0.4M entries/s)

Perf record:              (flight recorder mode unavailable)

  * Discard mode

Ring Buffer Library:       96 ns/entry discarded
                          257 ns/entry written (read: 2.8M entries/s)

Perf Ring Buffer:         423 ns/entry written (read: 2.3M entries/s)

(Note that this number is based on the perf event approximation output (based on
a 24 bytes/entry estimation) rather than on the benchmark module count. The
module count is inaccurate because perf does not let the benchmark module know
about discarded events.)
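To make the approximation concrete: the estimate boils down to dividing the byte count perf reports by 24 bytes/entry, then dividing the elapsed time by that entry count. A minimal sketch in C follows; the function names are mine and purely illustrative, not part of perf or of the benchmark module.

```c
#include <stdint.h>

/* Assumed average record size, per the 24 bytes/entry estimate. */
#define EST_BYTES_PER_ENTRY 24u

/* Estimate how many entries were written from the reported byte count. */
static uint64_t est_entries(uint64_t bytes_written)
{
	return bytes_written / EST_BYTES_PER_ENTRY;
}

/* Average cost per written entry in ns; returns 0 if nothing was written. */
static uint64_t est_ns_per_entry(uint64_t elapsed_ns, uint64_t bytes_written)
{
	uint64_t entries = est_entries(bytes_written);

	return entries ? elapsed_ns / entries : 0;
}
```

Any error in the 24 bytes/entry figure carries over proportionally to the ns/entry result, which is why the perf number should be read as an approximation.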
It is really hard to get a clear picture of the data write overhead with perf,
because you _need_ to consume data. Making perf support flight recorder mode
would really help in getting benchmarks that are easier to compare.
I understand your point about amortized synchronization. However I still don't
see how you can achieve flight recorder mode, efficient seek on multi-GB traces
without reading the whole event stream, and live streaming without sub-buffers
(and, ideally, without too many headaches involved). ;)
If you need to read non-filled pages, then you need to splice pages piece-wise.
This does not fit well with flight recorder tracing, for which the solution
Steven and I have found is to atomically exchange pages (for Ftrace) or
sub-buffers (for the generic ring buffer library) between the reader and writer.
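As a rough illustration of the exchange idea (not the actual Ftrace or ring buffer library code), here is a sketch using C11 atomics. The reader keeps one spare sub-buffer and swaps it into the ring in a single atomic operation, so it ends up owning the sub-buffer it wants to read. All names and sizes are hypothetical, and the sketch deliberately ignores the hard part, which is dealing with a writer that is in the middle of writing to the sub-buffer being taken.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SUBBUF   4		/* hypothetical number of sub-buffers */
#define SUBBUF_SIZE 4096	/* hypothetical sub-buffer size */

struct subbuf {
	size_t used;		/* bytes of valid data */
	char data[SUBBUF_SIZE];
};

struct ring {
	_Atomic(struct subbuf *) slot[NR_SUBBUF];	/* writer side */
	struct subbuf *reader_page;			/* reader's spare */
};

/*
 * Take slot idx out of the ring by swapping in the reader's spare.
 * On return the reader owns the returned sub-buffer exclusively; a
 * writer wrapping around can only overwrite the spare that now sits
 * in the slot.
 */
static struct subbuf *ring_take_subbuf(struct ring *r, unsigned int idx)
{
	struct subbuf *spare = r->reader_page;
	struct subbuf *full;

	spare->used = 0;	/* hand the writer an empty sub-buffer */
	full = atomic_exchange(&r->slot[idx % NR_SUBBUF], spare);
	r->reader_page = full;	/* this becomes the spare for the next swap */
	return full;
}
```

Because ownership changes hands in one step, the reader never has to copy or splice a sub-buffer that the writer may still overwrite.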
The problem Perf has is probably more with flight recorder (overwrite) tracing
support than splice() per se; on this point you are right.
OK, good to know you are open to ABI changes if I present convincing arguments.
How do you plan to read the data without corruption while the writer is
concurrently overwriting it?
OK, now I get a clearer picture of what Frederic is trying to do.
This part of the email is unrelated to sub-buffers.
Given that this buffer is simply used to dump the stack unwind result, I
think my scenario above was simply misguided.
So why the copy? Frederic seems to put the stack unwind in a special temporary
buffer. Why is it not saved directly into the trace buffers ?
Well, now that I understand what you are trying to achieve, I retract my
proposal of using a stack-like ring buffer for this. I think that the stack dump
should simply be saved directly to the ring buffer, without copy. The
dump_stack() functions might have to be extended so they don't just save text
dumbly, but can also be used to save events into the trace in binary format,
perhaps with the continuation cookie Linus was proposing.
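To illustrate what "without copy" could look like, here is a small sketch in C where the unwinder hands each return address to a callback that stores it straight into the trace buffer, in binary form. Everything here is hypothetical: the trace_buf layout, the callback type, and demo_unwind(), which merely stands in for a binary-capable dump_stack(). The sketch ignores concurrency and does not attempt the continuation cookie.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_WORDS 512	/* hypothetical buffer capacity, in words */

struct trace_buf {
	size_t head;			/* index of the next free word */
	uintptr_t word[NR_WORDS];
};

/* Called by the unwinder once per frame; nonzero return stops the walk. */
typedef int (*frame_cb)(void *ctx, uintptr_t addr);

/* Store one return address directly into the trace buffer. */
static int save_frame(void *ctx, uintptr_t addr)
{
	struct trace_buf *b = ctx;

	if (b->head >= NR_WORDS)
		return 1;		/* buffer full: stop unwinding */
	b->word[b->head++] = addr;
	return 0;
}

/*
 * Record one stack event: a count word followed by the addresses.
 * Returns the number of frames saved.
 */
static size_t trace_stack(struct trace_buf *b,
			  void (*unwind)(frame_cb, void *))
{
	size_t hdr;

	if (b->head >= NR_WORDS)
		return 0;
	hdr = b->head++;		/* reserve the count word */
	unwind(save_frame, b);		/* frames land in the buffer directly */
	b->word[hdr] = b->head - hdr - 1;
	return b->word[hdr];
}

/* Stand-in for a real unwinder: reports three made-up addresses. */
static void demo_unwind(frame_cb cb, void *ctx)
{
	static const uintptr_t fake[] = { 0x1000, 0x2000, 0x3000 };
	size_t i;

	for (i = 0; i < sizeof(fake) / sizeof(fake[0]); i++)
		if (cb(ctx, fake[i]))
			break;
}
```

The count word is reserved before the walk and filled in afterwards, so no temporary buffer is needed to learn the stack depth first.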
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com