Rebase to the latest net-next.
CHANGE-LOG since v3:
1) change arguments of softirq tracepoint into original one.
2) remove tracepoint of dev_kfree_skb_irq and skb_free_datagram_locked
and add trace_kfree_skb before __kfree_skb instead of them.
3) add tracepoint to netif_rx and display it by netdev-times script.
These patch-set adds tracepoints to show us a process of packets.
Using these tracepoints and existing points, we can get the time when
packet passes through some points in transmit or receive sequence.
For example, this is an output of perf script which is attached by patch 5/5.
106133.171439sec cpu=0
irq_entry(+0.000msec irq=24:eth4)
|
softirq_entry(+0.006msec)
|
|---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
| |
| skb_copy_datagram_iovec(+0.039msec 10291::10291)
|
napi_poll_exit(+0.022msec eth4)
106134.175634sec cpu=1
irq_entry(+0.000msec irq=28:eth1)
|
|---netif_rx(+0.009msec skb=f3ef0a00)
|
softirq_entry(+0.018msec)
|
|---netif_receive_skb(+0.021msec skb=f3ef0a00 len=84)
| |
| skb_copy_datagram_iovec(+0.033msec 0:swapper)
|
napi_poll_exit(+0.035msec (no_device))
The above is a receive side(eth4 is NAPI. eth1 is non-NAPI). Like this, it can
show receive sequence frominterrupt(irq_entry) to application
(skb_copy_datagram_iovec).
This script shows one NET_RX softirq and events related to it. All relative
time bases on first irq_entry which raise NET_RX softirq.
dev len Qdisc netdevice free
eth4 74 106125.030004sec 0.006msec 0.009msec
eth4 87 106125.041020sec 0.007msec 0.023msec
eth4 66 106125.042291sec 0.003msec 0.012msec
eth4 66 106125.043274sec 0.006msec 0.004msec
eth4 850 106125.044283sec 0.007msec ...From: Lai Jiangshan <laijs@cn.fujitsu.com>
Add a tracepoint for tracing when softirq action is raised.
It and the existed tracepoints complete softirq's tracepoints:
softirq_raise, softirq_entry and softirq_exit.
And when this tracepoint is used in combination with
the softirq_entry tracepoint we can determine
the softirq raise latency.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
[ factorize softirq events with DECLARE_EVENT_CLASS ]
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/linux/interrupt.h | 8 +++++++-
include/trace/events/irq.h | 26 ++++++++++++++++++++++++--
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a0384a4..d3e8e90 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -18,6 +18,7 @@
#include <asm/atomic.h>
#include <asm/ptrace.h>
#include <asm/system.h>
+#include <trace/events/irq.h>
/*
* These correspond to the IORESOURCE_IRQ_* defines in
@@ -407,7 +408,12 @@ asmlinkage void do_softirq(void);
asmlinkage void __do_softirq(void);
extern void open_softirq(int nr, void (*action)(struct softirq_action *));
extern void softirq_init(void);
-#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
+static inline void __raise_softirq_irqoff(unsigned int nr)
+{
+ trace_softirq_raise((struct softirq_action *)&nr, NULL);
+ or_softirq_pending(1UL << nr);
+}
+
extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);
extern void wakeup_softirqd(void);
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 0e4cfb6..3ddda02 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -5,7 +5,9 @@
#define _TRACE_IRQ_H
#include <linux/tracepoint.h>
-#include <linux/interrupt.h>
+
+struct ...It seems that this will break softirq_entry/exit tracepoints. __entry->vec will deref vec->action() for these two, which is not what we want. If you can't have the same tracepoint signature for the three, just split the new one in a seperate TRACE_EVENT(). Thanks. --
But for trace_softirq_entry and trace_softirq_exit, vec will not be It may be a bit of a hack, and questionable about adding another TRACE_EVENT(). There still is a pretty good space savings in using DEFINE_EVENT() over TRACE_EVENT() though. -- Steve --
Oh right... Yeah, let's keep it as is. --
Perhaps doing: trace_softirq_raise((struct softirq_action *)((unsigend long)nr), NULL); __entry->vec = (int)h; Doing the above will at least be a bit safer. -- Steve --
Agreed, I'm going to change that in the patch. Thanks. --
Commit-ID: 2bf2160d8805de64308e2e7c3cd97813cb58ed2f Gitweb: http://git.kernel.org/tip/2bf2160d8805de64308e2e7c3cd97813cb58ed2f Author: Lai Jiangshan <laijs@cn.fujitsu.com> AuthorDate: Mon, 23 Aug 2010 18:42:48 +0900 Committer: Frederic Weisbecker <fweisbec@gmail.com> CommitDate: Tue, 7 Sep 2010 17:49:34 +0200 irq: Add tracepoint to softirq_raise Add a tracepoint for tracing when softirq action is raised. This and the existing tracepoints complete softirq's tracepoints: softirq_raise, softirq_entry and softirq_exit. And when this tracepoint is used in combination with the softirq_entry tracepoint we can determine the softirq raise latency. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: David Miller <davem@davemloft.net> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C724298.4050509@jp.fujitsu.com> [ factorize softirq events with DECLARE_EVENT_CLASS ] Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> --- include/linux/interrupt.h | 8 +++++++- include/trace/events/irq.h | 26 ++++++++++++++++++++++++-- 2 files changed, 31 insertions(+), 3 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index a0384a4..531495d 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -18,6 +18,7 @@ #include <asm/atomic.h> #include <asm/ptrace.h> #include <asm/system.h> +#include <trace/events/irq.h> /* * These correspond to the IORESOURCE_IRQ_* defines in @@ -407,7 +408,12 @@ asmlinkage void do_softirq(void); asmlinkage void ...
FYI, this commit broke the Sparc build:
In file included from /home/mingo/tip/arch/sparc/include/asm/irq_32.h:11,
from /home/mingo/tip/arch/sparc/include/asm/irq.h:6,
from /home/mingo/tip/include/linux/irqnr.h:10,
from /home/mingo/tip/include/linux/irq.h:22,
from /home/mingo/tip/include/asm-generic/hardirq.h:6,
from /home/mingo/tip/arch/sparc/include/asm/hardirq_32.h:11,
from /home/mingo/tip/arch/sparc/include/asm/hardirq.h:6,
from /home/mingo/tip/include/linux/hardirq.h:10,
from /home/mingo/tip/include/linux/ftrace_event.h:7,
from /home/mingo/tip/include/trace/syscall.h:6,
from /home/mingo/tip/include/linux/syscalls.h:76,
from /home/mingo/tip/init/initramfs.c:9:
/home/mingo/tip/include/linux/interrupt.h: In function '__raise_softirq_irqoff':
/home/mingo/tip/include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
/home/mingo/tip/include/linux/interrupt.h:414: error: lvalue required as left operand of assignment
make[2]: *** [init/initramfs.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /home/mingo/tip/arch/sparc/include/asm/irq_32.h:11,
Ingo
--
Yeah, there is a circular dependency.
Does that fixes the issue (and if so, does that look sane)?
Thanks.
---
From fc21eaa02d4a6f0af396af6a106587e61515cd86 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <fweisbec@gmail.com>
Date: Wed, 8 Sep 2010 14:17:31 +0200
Subject: [PATCH] irq: Fix circular headers dependency
asm-generic/hardirq.h needs asm/irq.h which might include
linux/interrupt.h as in the sparc 32 case. At this point
we need irq_cpustat generic definitions, but those are
included later in asm-generic/hardirq.h.
Then delay a bit the inclusion of irq.h from
asm-generic/hardirq.h, it doesn't need to be included early.
This fixes:
In file included from arch/sparc/include/asm/irq_32.h:11,
from arch/sparc/include/asm/irq.h:6,
from include/linux/irqnr.h:10,
from include/linux/irq.h:22,
from include/asm-generic/hardirq.h:6,
from arch/sparc/include/asm/hardirq_32.h:11,
from arch/sparc/include/asm/hardirq.h:6,
from include/linux/hardirq.h:10,
from include/linux/ftrace_event.h:7,
from include/trace/syscall.h:6,
from include/linux/syscalls.h:76,
from init/initramfs.c:9:
include/linux/interrupt.h: In function '__raise_softirq_irqoff':
include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
include/linux/interrupt.h:414: error: lvalue required as left operand of assignment
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/asm-generic/hardirq.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/asm-generic/hardirq.h b/include/asm-generic/hardirq.h
index 62f5908..04d0a97 100644
--- a/include/asm-generic/hardirq.h
+++ ...Commit-ID: 3b8fad3e2f5f69bfd8e42d099ca8582fb2342edf Gitweb: http://git.kernel.org/tip/3b8fad3e2f5f69bfd8e42d099ca8582fb2342edf Author: Frederic Weisbecker <fweisbec@gmail.com> AuthorDate: Wed, 8 Sep 2010 14:26:00 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Thu, 9 Sep 2010 21:28:58 +0200 irq: Fix circular headers dependency asm-generic/hardirq.h needs asm/irq.h which might include linux/interrupt.h as in the sparc 32 case. At this point we need irq_cpustat generic definitions, but those are included later in asm-generic/hardirq.h. Then delay a bit the inclusion of irq.h from asm-generic/hardirq.h, it doesn't need to be included early. This fixes: include/linux/interrupt.h: In function '__raise_softirq_irqoff': include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending' include/linux/interrupt.h:414: error: lvalue required as left operand of assignment Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Cc: mathieu.desnoyers@efficios.com Cc: rostedt@goodmis.org Cc: nhorman@tuxdriver.com Cc: scott.a.mcmillan@intel.com Cc: eric.dumazet@gmail.com Cc: kaneshige.kenji@jp.fujitsu.com Cc: davem@davemloft.net Cc: izumi.taku@jp.fujitsu.com Cc: kosaki.motohiro@jp.fujitsu.com LKML-Reference: <20100908122557.GA5310@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/asm-generic/hardirq.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/asm-generic/hardirq.h b/include/asm-generic/hardirq.h index 62f5908..04d0a97 100644 --- a/include/asm-generic/hardirq.h +++ b/include/asm-generic/hardirq.h @@ -3,13 +3,13 @@ #include <linux/cache.h> #include <linux/threads.h> -#include <linux/irq.h> typedef struct { unsigned int __softirq_pending; } ____cacheline_aligned irq_cpustat_t; #include <linux/irq_cpustat.h> /* Standard ...
I could build sparc64_defconfig, but s390 is broken for me by this... --
the below makes s390 build again, not sure its completely safe for all configs though... Heiko? --- arch/s390/include/asm/hardirq.h | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/s390/include/asm/hardirq.h b/arch/s390/include/asm/hardirq.h index 498bc38..9558a71 100644 --- a/arch/s390/include/asm/hardirq.h +++ b/arch/s390/include/asm/hardirq.h @@ -15,7 +15,6 @@ #include <linux/threads.h> #include <linux/sched.h> #include <linux/cache.h> -#include <linux/interrupt.h> #include <asm/lowcore.h> #define local_softirq_pending() (S390_lowcore.softirq_pending) --
We have a similar patch in our git tree to fix this issue: http://git390.marist.edu/cgi-bin/gitweb.cgi?p=linux-2.6.git;a=commitdiff;h=b722c7e6ce9... --
/me being very grumpy @ Lai.. ia64 is broken too! /me goes revert this shite --
No its not, -tip is still not buildable on s390 and ia64. Please don't ever pull crap like this again. --
With the addition of trace_softirq_raise() the softirq tracepoint got
even more convoluted. Why the tracepoints take two pointers to assign
an integer is beyond my comprehension.
But adding an extra case which treats the first pointer as an unsigned
long when the second pointer is NULL including the back and forth
type casting is just horrible.
Convert the softirq tracepoints to take a single unsigned int argument
for the softirq vector number and fix the call sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/interrupt.h | 2 -
include/trace/events/irq.h | 54 ++++++++++++++++-----------------------------
kernel/softirq.c | 14 ++++++-----
3 files changed, 29 insertions(+), 41 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 531495d..0ac1949 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -410,7 +410,7 @@ extern void open_softirq(int nr, void (*action)(struct softirq_action *));
extern void softirq_init(void);
static inline void __raise_softirq_irqoff(unsigned int nr)
{
- trace_softirq_raise((struct softirq_action *)(unsigned long)nr, NULL);
+ trace_softirq_raise(nr);
or_softirq_pending(1UL << nr);
}
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 6fa7cba..1c09820 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -86,76 +86,62 @@ TRACE_EVENT(irq_handler_exit,
DECLARE_EVENT_CLASS(softirq,
- TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
+ TP_PROTO(unsigned int vec_nr),
- TP_ARGS(h, vec),
+ TP_ARGS(vec_nr),
TP_STRUCT__entry(
- __field( int, vec )
+ __field( unsigned int, vec )
),
TP_fast_assign(
- if (vec)
- __entry->vec = (int)(h - vec);
- else
- __entry->vec = (int)(long)h;
+ __entry->vec = vec_nr;
),
- TP_printk("vec=%d [action=%s]", __entry->vec,
+ TP_printk("vec=%u [action=%s]", __entry->vec,
...Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> A much needed cleanup indeed! --
Well, there was originally a reason for this oddness. The in __do_softirq(), "h - softirq_ve"c computation was not needed outside of the tracepoint handler in the past, but it now seems to be required with the new inlined "kstat_incr_softirqs_this_cpu()". So yes, thanks to this recent change, it now makes sense to pull this computation out of the tracepoints and do it unconditionally in the kernel code. Feel free to put my: Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Dudes, a vector computation is hardly a performance problem in that function and definitely not an excuse for designing such horrible interfaces. Thanks, tglx --
Yes, now we can be a bit more liberal. But when these tracepoints were going in, people were watching to make sure they have practically zero impact when tracing was disabled. Now that people are more use to tracepoints, they are more understanding to have cleaner code over that extra few more lines of machine code in the fast path. -- Steve --
The vector computation is compared to the extra tracing induced jumps probably not even measurable. Stop defending horrible coding with handwavy performance and impact arguments. Thanks, tglx --
From the moment markers and tracepoints infrastructures were merged, the performance overhead target has been assuming we would eventually be merging "asm goto jump labels", which replace the load+test+branch with a no-op. So compared to a 5 bytes no-op added to the fast path, this vector computation can be expected to have a higher performance impact, because skipping a no-op on modern architectures (x86 at least) adds technically zero cycles. Agreed, there is still the impact on I$, extra register pressure, some leaf functions becoming non-leaf, and added function call (which imply external side-effect, thus acting like a barrier()). But saying that all we do is to provide handwavy performance and impact arguments is a bit much. Until the asm goto are more widely deployed and until gcc 4.5 is more widely used, there are some instrumentation sites I am relunctant to consider to instrument with tracepoints (e.g. all system call entry/exit sites). However, we should not use the cost of the current load+test+branch tracepoint behavior as an excuse for adding extra performance impact to kernel code, because when it will be replaced by asm gotos, all that will be left is the performance impact inappropriately justified as insignificant compared to the impact of the old tracepoint scheme. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Can you at one point just stop your tracing lectures and look at the
facts ?
The impact of a sensible tracepoint design on the code in question
before kstat_incr_softirqs_this_cpu() was added would have been a mere
_FIVE_ bytes of text. But the original tracepoint code itself is
_TWENTY_ bytes of text larger.
So we trade horrible code plus 20 bytes text against 5 bytes of text
in the hotpath. And you tell me that these _FIVE_ bytes are impacting
performance so much that it's significant.
Now with kstat_incr_softirqs_this_cpu() the impact is zero, it even
removes code.
And talking about non impact of disabled trace points. The tracepoint
in question which made me look at the code results in deinlining
__raise_softirq_irqsoff() in net/dev/core.c. There goes your theory.
So no, you _cannot_ tell what impact a tracepoint has in reality
except by looking at the assembly output.
And what scares me way more is the size of a single tracepoint in a
code file.
Just adding "trace_softirq_entry(nr);" adds 88 bytes of text. So
that's optimized tracing code ?
All it's supposed to do is:
if (enabled)
trace_foo(nr);
Replace "if (enabled)" with your favourite code patching jump label
whatever magic. The above stupid version takes about 28, but the
"optimized" tracing code makes that 88. Brilliant. That's inlining
utter shite for no good reason. WTF is it necessary to inline all that
gunk ?
Please spare me the "jump label will make this less intrusive"
lecture. I'm not interested at all.
Let's instead look at some more facts:
#include <linux/interrupt.h>
#include <linux/module.h>
#include <trace/events/irq.h>
static struct softirq_action softirq_vec[NR_SOFTIRQS];
void test(struct softirq_action *h)
{
trace_softirq_entry(h - softirq_vec);
h->action(h);
}
Compile this code with GCC 4.5 with and without jump labels (zap the
select HAVE_ARCH_JUMP_LABEL line in arch/x86/Kconfig)
So now the !jumplabel case gives us:
../build/kernel/soft.o: ...Since I don't have your patch yet, I used the original:
void test(struct softirq_action *h)
{
trace_softirq_entry(h, softirq_vec);
h->action(h);
Well, the one jmpq is noped out, and the jmp is non conditional. I've
always thought a non conditional jmp was faster than a conditional one,
since there's no need to go into the branch prediction logic. The CPU
can simply skip to the code to jump next. Of counse, this pollutes the
Because you do the h - softvec in the tracepoint parameter? I got a
different result:
Here's the diff. I did a cut -c10- to get rid of the line numbers so I
have a better diff. There's still differences due to jump locations, but
those are easy to figure out:
I diffed nojump vs jump. The '-' is with nojump, the '+' is with jumps.
--- /tmp/s2 2010-10-19 16:40:19.000000000 -0400
+++ /tmp/s1 2010-10-19 16:40:23.000000000 -0400
@@ -1,38 +1,33 @@
-00026f0 <test>:
+00027a0 <test>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
- 48 83 ec 10 sub $0x10,%rsp
- 48 89 1c 24 mov %rbx,(%rsp)
- 4c 89 64 24 08 mov %r12,0x8(%rsp)
- e8 00 00 00 00 callq 2706 <test+0x16>
+ 41 54 push %r12
+ 53 push %rbx
+ e8 00 00 00 00 callq 27ac <test+0xc>
R_X86_64_PC32 mcount-0x4
- 8b 15 00 00 00 00 mov 0x0(%rip),%edx # 270c <test+0x1c>
-R_X86_64_PC32 __tracepoint_softirq_entry+0x4
48 89 fb mov %rdi,%rbx
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
- 85 d2 test %edx,%edx
- 75 10 jne 2723 <test+0x33>
+ e9 00 00 00 00 jmpq 27b4 <test+0x14>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There's the difference with this code. We replaced a test and jump
conditional with a single jump that will later be nop'd out.
48 89 df mov %rbx,%rdi
ff 13 callq *(%rbx)
- 48 8b 1c 24 mov ...What are you smoking ? In case the trace point is enabled the jmpq is there, so it jumps to 23 and jumps from there to 72. In case the trace point is disabled the jmpq is noped out, so it jumps I always thought, that at least some of the stuff which comes from We might consult Mathieu for further useless blurb on how CPUs work around broken code. Thanks, tglx --
What? Are you saying that conditional jumps are just as fast as non No, when we dynamically enable the tracepoint, it will jump to 25, not 23. That's what the goto part is about. We add the do_trace label to the table, and we make it point to that location. If we did it as you say, then tracepoints would never be enabled. This is not unlike what we do with the function tracer. The original code points to mcount which simply is: mcount: retq The code worked fine before, it just was not very pretty. But it seemed that gcc for you inlined the code in the wrong spot. Perhaps it's not a good idea to have the something like h - softirq_vec in the parameter of the tracepoint. Not saying that your change is not worth it. It is, because h - softirq_vec is used by others now too. -- Steve --
OK, first of all, there are some serious WTFs here:
# define JUMP_LABEL_INITIAL_NOP ".byte 0xe9 \n\t .long 0\n\t"
A jump instruction is one of the worst possible NOPs. Why are we doing
this?
The second thing that I found when implementing static_cpu_has() was
that it is actually better to encapsulate the asm goto in a small inline
which returns bool (true/false) -- gcc will happily optimize out the
variable and only see it as a flow of control thing. I would be very
curious if that wouldn't make gcc generate better code in cases like that.
gcc 4.5.0 has a bug in that there must be a flowthrough case in the asm
goto (you can't have it unconditionally branch one way or the other), so
that should be the likely case and accordingly it should be annotated
likely() so that gcc doesn't reorder. I suspect in the end one ends up
with code like this:
static __always_inline __pure bool __switch_point(...)
{
asm goto("1: " JUMP_LABEL_INITIAL_NOP
/* ... patching stuff */
: : : : t_jump);
return false;
t_jump:
return true;
}
#define SWITCH_POINT(x) unlikely(__switch_point(x))
I *suspect* this will resolve the need for hot/cold labels just fine.
-hpa
--
Good question. Safety? Jason? Interesting, we could try this. Thanks! -- Steve --
Due to not actually having a sane key type the above is not easy to
implement, but I tried:
#define _SWITCH_POINT(x)\
({ \
__label__ jl_enabled; \
bool ret = true; \
JUMP_LABEL(x, jl_enabled); \
ret = false; \
jl_enabled: \
ret; })
#define SWITCH_POINT(x) unlikely(_SWITCH_POINT(x))
#define COND_STMT(key, stmt) \
do { \
if (SWITCH_POINT(key)) { \
stmt; \
} \
} while (0)
and that's still generating these double jumps.
--
I just experimented with it, and the ({...}) construct doesn't work,
because it looks like a merged flow of control to gcc.
Replacing the ({ ... }) with an inline does indeed remove the double
jumps.
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index b67cb18..2ff829d 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -61,12 +61,22 @@ static inline int jump_label_text_reserved(void
*start, void *end)
#endif
+static __always_inline __pure bool _SWITCH_POINT(void *x)
+{
+ asm goto("# SWITCH_POINT %0\n\t"
+ ".byte 0x66,0x66,0x66,0x66,0x90\n"
+ "1:"
+ : : "i" (x) : : jl_enabled);
+ return false;
+jl_enabled:
+ return true;
+}
+
+#define SWITCH_POINT(x) unlikely(_SWITCH_POINT(x))
+
#define COND_STMT(key, stmt) \
do { \
- __label__ jl_enabled; \
- JUMP_LABEL(key, jl_enabled); \
- if (0) { \
-jl_enabled: \
+ if (SWITCH_POINT(key)) { \
stmt; \
} \
} while (0)
The key here seems to be to not use the JUMP_LABEL macro as implemented;
I have utterly failed to make JUMP_LABEL() do the right thing.
-hpa
--
What happens if you remove the do { } while (0) from JUMP_LABEL, since
it now just makes it into a asm()
-- Steve
--
ok, I tried this out for the tracepoint code, but I still seem to be
getting the double jump.
patch:
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 1947a12..7bc2537 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -66,12 +66,22 @@ static inline void jump_label_unlock(void) {}
#endif
+static __always_inline __pure bool _SWITCH_POINT(void *x)
+{
+ asm goto("# SWITCH_POINT %0\n\t"
+ ".byte 0x66,0x66,0x66,0x66,0x90\n"
+ "1:"
+ : : "i" (x) : : jl_enabled);
+ return false;
+jl_enabled:
+ return true;
+}
+
+#define SWITCH_POINT(x) unlikely(_SWITCH_POINT(x))
+
#define COND_STMT(key, stmt) \
do { \
- __label__ jl_enabled; \
- JUMP_LABEL(key, jl_enabled); \
- if (0) { \
-jl_enabled: \
+ if (SWITCH_POINT(key)) { \
stmt; \
} \
} while (0)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a4a90b6..1f8d14f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -146,12 +146,7 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
- JUMP_LABEL(&__tracepoint_##name.state, do_trace); \
- return; \
-do_trace: \
- __DO_TRACE(&__tracepoint_##name, \
- TP_PROTO(data_proto), \
- TP_ARGS(data_args)); \
+ COND_STMT(&__tracepoint_##name.state, __DO_TRACE(&__tracepoint_##name, TP_PROTO(data_proto), TP_ARGS(data_args))); \
} \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
disassemly:
ffffffff810360a6 <set_task_cpu>:
ffffffff810360a6: 55 push %rbp
ffffffff810360a7: 48 89 e5 mov %rsp,%rbp
ffffffff810360aa: 41 55 push %r13
ffffffff810360ac: 41 54 push ...But it makes absolutely no sense to insert an instruction that suboptimal and then convert it. Start out with a reasonable, universally acceptable, instruction, e.g. LEA on 32 bits and NOPL on 64 It of course also have the nice property that it syntactically looks exactly like any other C conditional. -hpa --
This code is dynamically patched at boot time (and module load time) with a Thanks for the hint! We'll make sure to try it out. Having the ability to force gcc to put the tracepoint in an unlikely branch is deeply needed here. I'm a bit curious about the nop vs jump overhead comparison you are referring to. It is an instruction latency benchmark or throughput benchmark ? Intel's manual "Intel 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/Assets/PDF/manual/248966.pdf Page C-33 (or 577 in the pdf) "7. Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero." So it mentions "jcc", but not jmp. Is there any reason for jmp to have a higher latency than jcc ? In this manual, the latency of predicted jcc is therefore 0 cycle, and its throughput is 0.5 cycle/insn. NOP (page C-29) is stated to have a latency of 0.5 to 1 cycle/insn (depending on the exact HW), and throughput of 0.5 cycle/insn. However, I have not found "jmp" explicitly in this listing. So if we were executing tracepoints in a maze of jumps, we could argue that instruction throughput is the most important there. However, if we expect the common case to be surrounded by some non-ALU instructions, latency tends to become the most important criterion. But I feel I might be missing something important that distinguish "jcc" from "jmp". Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
That's just ridiculous... start out with something sane and you at least NOP has a latency of 0.5-1.0 cycle/insns, *but has no consumers*. JMP/Jcc does have a consumer -- the IP -- and actually measuring shows that it is much, much worse than NOP and other dummy instructions. -hpa --
But how does JMP vs Jcc compare? -- Steve --
*As far as I know* they're the same, except of course that a direct JMP never mispredicts. -hpa --
sure. The idea of the 'jmp 0' was simply to be an lcd for x86, if there's a better lcd for x86, I'll update it. But note, that since the 'jmp 0' is patched to a better nop at boot, we wouldn't see much gain. And in the boot path we are using 'text_poke_early()', so avoiding that isn't going to improve things much. I've got a few fixup patches in the queue that I'm going to post first, and then I'll take a look at this change. thanks, -Jason --
One thing to consider here is that some nops are not compatible across all architectures. And it would be safer to use an atomic nops (a single instruction) too. e.g. GENERIC_NOP5 in arch/x86/include/asm/nops.h is really 2 instructions, which can cause problems if a concurrent thread is preempted between the two instructions while we patch. arch_init_ideal_nop5() is actually doing the task of finding the best nop, and it falls-back on a 5-byte nop (just like you do). HPA, do you have any recommendation for a 5-byte single-instruction nop that is efficient enough and will work on all x86 (Intel, AMD and other variants) ? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
It's still a completely unnecessary waste of startup time some potentially significant fraction of the time. Startup time matters, especially as the number of tracepoints grow. -hpa --
We're still waiting for input for the best single-5-byte-instruction nop that will work on all x86 variants. Please note that the GENERIC_NOP5 is actually two instructions one next to each other, which is not appropriate here. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On 64 bits, use P6_NOP5; it seems to not suck on any platform. On 32 bits, 3E 8D 74 26 00 (i.e. DS: + GENERIC_NOP4) seems to at least do okay. I can't say these are the *best* (in fact, they are guaranteed not the best on some significant number of chips), but they haven't sucked on any chips I have been able to measure -- and are way faster than JMP. -hpa --
This is pure conjecture, I have not measured it, but I suspect in fact that we could just change the composite nops in nops.h to use a 3E prefix instead of a separate 90 nop. Some platforms will take a penalty on the prefix, but that would be balanced against handling two instructions. The P5 core and others of the same generation might suffer, as it might have been able to do U+V pipe pairing on two instructions which it wouldn't for prefixes. -hpa --
Cool, thanks for the info! Steven and Jason should probably update their respective infrastructure to use the 32-bit 5-byte nop you propose rather than the 5-byte jump. Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Actually, I was thinking that we could take any 5 byte nop. The alternate code is executed _before_ SMP is enabled. Thus we should not have any cases where something could be executing in midstream. -- Steve --
Nay, absolutely not. See, the goal here is to find a no-op that is good enough to be left there *without* init-time dynamic patching on a range of architectures, so we can diminish the boot-time delay. This imply that we have to select a no-op that can be patched in SMP context, thus it must be a single 5-byte instruction. We could even create a EMBEDDED config that lets specify that the built-in nop should be left there for embedded systems that care about boot time. Moreover, even if it was not the case, I'd be tempted to still use a single instruction 5-byte no-op just in case interrupts or any sorts (standard interrupts, nmis, mce or whatnot) would happen to be enabled earlier than this boot time nop patching. IOW, you'd need a _very_ strong argument to support using the fragile 2 instructions nops there. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Crap, crap, crap. This has nothing to do with the arguments of that trace point, it's a compiler problem and you are just hoping that GCC will do the right thing. That's the complete wrong assumption and as Jason confirmed GCC is not up to it at all. hpa just posted code which does the _RIGHT_ _THING_ independent of any compiler madness and you tracer folks just missed it. Your jump label optimization made code even worse for todays common compilers. Just admit it and fix that mess you created or simply disable it. Thanks, tglx --
Thomas,
Can you try this patch and see if it makes the object code better?
-- Steve
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a4a90b6..6264bd3 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -144,14 +144,19 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
*/
#define __DECLARE_TRACE(name, proto, args, data_proto, data_args) \
extern struct tracepoint __tracepoint_##name; \
- static inline void trace_##name(proto) \
+ static __always_inline int __trace_##name(proto) \
{ \
JUMP_LABEL(&__tracepoint_##name.state, do_trace); \
- return; \
+ return 0; \
do_trace: \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args)); \
+ return 1; \
+ } \
+ static inline void trace_##name(proto) \
+ { \
+ unlikely(__trace_##name(args)); \
} \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
--
Yeah, I figured. Do you have CC_OPTIMIZE_FOR_SIZE set? And if you do, what happens if you disable it? -- Steve --
Hmm. Indeed. That gets rid of the double jump. Thanks, tglx --
-Os unfortunately drops a bunch of optimizations.
With gcc 4.5.1 there is actually a way to guarantee to get rid of double
jumps, which is that you tell gcc that it is branching to one of two
targets:
asm goto("1: .byte 0xe9 ; .long %l[t_no]-2f\n"
"2:\n"
/* patching infrastructure goes here */
: : "i" (bit) : : t_no, t_yes);
__builtin_unreachable();
t_no:
return false;
t_yes:
return true;
[The open-coding of the jump is necessary to force the 5-byte form
instead of the 2-byte form.]
The patching machinery can recognize the case where the jump offset is
zero and patch in a NOP instead.
There does, however, seem to be a couple of problems:
a) gcc 4.5.1 is required due to a bug in previous versions of gcc when
an asm goto doesn't have a fallthrough case.
b) it seems to encourage gcc to actively jump around as it reorders
blocks, since gcc no longer sees a fallthrough case at all.
Not sure I have a good solution for this, at least not with current gcc.
-hpa
--
I guess some serious whacking is due. The compiler adds two jumps when the parameter changes due to (h -softvec) instead of (h, softvec) ???? Dude, you can't be serious. If you would have asked about the compiler version I'm using and told me about the compiler version you are using, then I could take that answer somehow serious. It still would miss the "Uhhhh, your compiler creates crap code" alert, because that double jump is seriously broken and braindead. And I tell you more about this. You are going to piss off a lot of users of distro compilers because they will set CC_HAVE_ASM_GOTO happily and create the code I posted. Which will break the tracer no matter what. So you tracer maniacs happily played with some experimental compiler stuff w/o even testing your crap against something which ships with distros or is the reference 4.5 compiler on kernel.org ? I prefer you sending a patch to disable this, until it's sorted out, unless you want me to add some really outrageous changelog to the patch I'm going to put into tip tomorrow night, ok ? Thanks, tglx --
Heh, gcc has always been of a black magic for what it decided. But, anyway, I'm using a self built version (vanilla from gcc.gnu.org) of Yes, gcc replaced a cmp and conditional jump with two unconditional jumps. One of these jumps on boot up will be converted to a nop. Thus the jump label code just converted a compare and conditional jump with a nop and a non conditional jump. This still sounds like a win to me, although we can do better. I guess those poor sobs using a distro kernel compiled with a distro gcc that has CC_HAVE_ASM_GOTO enabled will still be doing better than if it was doing the if (enable) code. -- Steve --
On 10/19/2010 12:49 PM, Thomas Gleixner wrote: The 'asm goto' construct in GCC-4.5 is deficient in this area. GCC assumes that all exit paths from an 'asm goto' are equally likely, so the tracing (or dynamic printk etc.) code is assumed to be hot and is emitted inline. Since they are inline like this, there are all these jumps around them and they pollute the I-Cache. I was looking at fixing it, but I think a true general purpose fix would require enhancing GCC's grammar to allow specifying of the 'likelyness' of each exit path from 'asm goto'. --
right, the next step is adding support for hot/cold labels, so the tracing code will be annotaed with a 'cold' label. Thus, not adding the 'jmp' above on line '23', and in fact moving the tracing code out-of-line. Maybe I haven't been clear on this. thanks, -Jason --
Ok, so is anybody working on doing that? GCC-4.6 stage 1 (the time when a change like this could be merged) closes in 8 days. It is unfortunate that we have this shiny new feature that can't really be used because the infrastructure is only half baked. David Daney --
Interesting. I thought the driving force for asm goto was for tracepoints, as the documentation seems to reference them. One would think that the default would have been to make it the unlikely case, as That would be nice. Thanks, -- Steve --
right, so the 'jmpq' on boot on x86 gets patched with 5 byte no-op sequence. So in the disabled case we have no-op followed by a jump yes, that is the plan. gcc does not yet support hot/cold labels...once it does the second jump will go away and the entire tracepoint code will be moved to a 'cold' section. It's not quite completely optimal yet, but The 'jmpq' in the enabled case is patched with a jmpq to the body of the yes, this is what the code should look like when we get support for hot/cold labels. I've discussed this support with gcc folk, and its the next step here. So yes, this is exacatly where we are headed. thanks, -Jason --
And that's supposed to be useful ? We do _NOT_ want to jump around disabled stuff. The noped out case should fall through into the non traced code. Otherwise that whole jumplabel thing is completely Then do not advertise it as the brilliant solution for all tracing So and at the same time the whole tracing crowd tells me, that this is already a done deal. See previous advertisments from DrTracing. I'm seriously grumpy about this especially in the context of a patch which fixes one of the worst interfaces I've seen in years. Thanks, tglx --
IOW, We now jump around the jump which jumps around the disabled code. Thanks, tglx --
Do you happen to have CONFIG_CC_OPTIMIZE_FOR_SIZE set? If so, then this is a known issue. We even originally had jump label enabled _only_ if CC_OPTIMIZE_FOR_SIZE was not set, but hpa NAK'd it. http://lkml.org/lkml/2010/9/22/482 http://lkml.org/lkml/2010/9/20/488 http://lkml.org/lkml/2010/9/24/259 -- Steve --
thanks Steve. I was about to say this. When CONFIG_CC_OPTIMIZE_FOR_SIZE is not set we don't get the double 'jmp' and the tracepoint code is moved out of line. It was mentioned that a number of distros ship with CONFIG_CC_OPTIMIZE_FOR_SIZE not set, and as Steve mentioned my original patch set was conditional on !CONFIG_CC_OPTIMIZE_FOR_SIZE. using hot/cold labels gcc can fix the CONFIG_CC_OPTIMIZE_FOR_SIZE case, but its a non-trivial amount of work for gcc. I was hoping that if jump labels are included, we could make the gcc work happen. thanks, -Jason --
That's fair. I think jump labels are still a win even in the double-jump case (especially if the the tracepoint turns into a NOP rather than a JMP.) Code generated with -Os has a bunch of other problems, too. -hpa --
I'm not sure I did, the documentation says that we have nop followed by a jmp: +The new code is a 'nopl' followed by a 'jmp'. Thus: + +nopl - 0f 1f 44 00 00 - 5 bytes +jmp - eb 3e - 2 bytes sorry if I mislead anybody about the current state of of 'jump labels'. But we have the same goal in mind, and a clear path to get there. If you don't agree with the approach - I'm all ears. And you are right - the code is not where it should be yet. thanks, -Jason --
There is no excuse for needing the second jump here, obviously... -hpa --
Now the trick is to tell gcc that. -- Steve --
Yes this was crappy code, I'm not defending it. But this code was from the original tracepoints. I just looked at when this code was added, and it was still in the time TRACE_EVENT() was in a major flux. Heck, the code resided in include/trace/irq.h and not include/trace/events/irq.h. And yes, a lot of decisions back then were put on handwaving performance and impact, and it was not just coming from us. I admit I should have cleaned it up, but I did not want to touch it until it actually broke ;-) -- Steve --
In this specific case, I think you are right. But things are not that trivial, and you know it. We have to consider: - Extra instruction cache footprint - Added register pressure - Added computation overhead of the added substraction - Frequency of code execution for all target architectures when we add tracepoints to performance sensitive code paths. As a general policy, we try to keep these at the lowest possible level, so that all tracepoints will be compiled into distro kernels without perceivable _overall_ performance overhead. It's not something that should be looked at only on a tracepoint-by-tracepoint overhead basis, but rather by looking at the overall system degradation that adding 300 tracepoints would cause. So I agree with you that it's a trade-off between interface cleanness and performance. When they were introduced, Tracepoint handlers were barely seen as citizen of the kernel code base, so all that mattered was to keep the "tracepoints off" case clean and fast. Now that tracepoint handlers seems to be increasingly accepted as part of the kernel code base, I agree that taking into account oddness performed in this handler becomes more important. It ends up being a question of balance between oddness inside the tracepoint handler and performance overhead in the off-case. The increased acceptance of the tracepoint code-base has shifted this balance slightly in favor of cleanness. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Commit-ID: f4bc6bb2d562703eafc895c37e7be20906de139d Gitweb: http://git.kernel.org/tip/f4bc6bb2d562703eafc895c37e7be20906de139d Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Tue, 19 Oct 2010 15:00:13 +0200 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Thu, 21 Oct 2010 16:50:29 +0200 tracing: Cleanup the convoluted softirq tracepoints With the addition of trace_softirq_raise() the softirq tracepoint got even more convoluted. Why the tracepoints take two pointers to assign an integer is beyond my comprehension. But adding an extra case which treats the first pointer as an unsigned long when the second pointer is NULL including the back and forth type casting is just horrible. Convert the softirq tracepoints to take a single unsigned int argument for the softirq vector number and fix the call sites. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <alpine.LFD.2.00.1010191428560.6815@localhost6.localdomain6> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: mathieu.desnoyers@efficios.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> --- include/linux/interrupt.h | 2 +- include/trace/events/irq.h | 54 ++++++++++++++++--------------------------- kernel/softirq.c | 16 +++++++----- 3 files changed, 30 insertions(+), 42 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 531495d..0ac1949 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -410,7 +410,7 @@ extern void open_softirq(int nr, void (*action)(struct softirq_action *)); extern void softirq_init(void); static inline void __raise_softirq_irqoff(unsigned int nr) { - trace_softirq_raise((struct softirq_action *)(unsigned long)nr, NULL); + trace_softirq_raise(nr); or_softirq_pending(1UL << nr); } diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h index 6fa7cba..1c09820 100644 --- ...
From: Neil Horman <nhorman@tuxdriver.com>
This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT to improve
the usability of napi_poll tracepoint.
<idle>-0 [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
<idle>-0 [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1
An original patch is below.
http://marc.info/?l=linux-kernel&m=126021713809450&w=2
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
And add a fix by Steven Rostedt.
http://marc.info/?l=linux-kernel&m=126150506519173&w=2
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/trace/events/napi.h | 25 +++++++++++++++++++++++--
1 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 188deca..8fe1e93 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -6,10 +6,31 @@
#include <linux/netdevice.h>
#include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(napi_poll,
-DECLARE_TRACE(napi_poll,
TP_PROTO(struct napi_struct *napi),
- TP_ARGS(napi));
+
+ TP_ARGS(napi),
+
+ TP_STRUCT__entry(
+ __field( struct napi_struct *, napi)
+ __string( dev_name, napi->dev ? napi->dev->name : NO_DEV)
+ ),
+
+ TP_fast_assign(
+ __entry->napi = napi;
+ __assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+ ),
+
+ TP_printk("napi poll on napi struct %p for device %s",
+ __entry->napi, __get_str(dev_name))
+);
+
+#undef NO_DEV
#endif /* _TRACE_NAPI_H_ */
--
From: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> --
Commit-ID: 3e4b10d7a4d2a78af64f8096dc7cdb3bebd65adb Gitweb: http://git.kernel.org/tip/3e4b10d7a4d2a78af64f8096dc7cdb3bebd65adb Author: Neil Horman <nhorman@tuxdriver.com> AuthorDate: Mon, 23 Aug 2010 18:43:51 +0900 Committer: Frederic Weisbecker <fweisbec@gmail.com> CommitDate: Tue, 7 Sep 2010 17:51:01 +0200 napi: Convert trace_napi_poll to TRACE_EVENT This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT to improve the usability of napi_poll tracepoint. <idle>-0 [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3 <idle>-0 [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1 The original patch is below: http://marc.info/?l=linux-kernel&m=126021713809450&w=2 [ sanagi.koki@jp.fujitsu.com: And add a fix by Steven Rostedt: http://marc.info/?l=linux-kernel&m=126150506519173&w=2 ] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C7242D7.4050009@jp.fujitsu.com> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> --- include/trace/events/napi.h | 25 +++++++++++++++++++++++-- 1 files changed, 23 insertions(+), 2 deletions(-) diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h index 188deca..8fe1e93 100644 --- a/include/trace/events/napi.h +++ b/include/trace/events/napi.h @@ -6,10 +6,31 @@ #include <linux/netdevice.h> #include <linux/tracepoint.h> +#include <linux/ftrace.h> + +#define ...
This patch adds tracepoint to consume_skb and add trace_kfree_skb before
__kfree_skb in skb_free_datagram_locked and net_tx_action.
Combinating with tracepoint on dev_hard_start_xmit, we can check how long it
takes to free transmited packets. And using it, we can calculate how many
packets driver had at that time. It is useful when a drop of transmited packet
is a problem.
sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/trace/events/skb.h | 17 +++++++++++++++++
net/core/datagram.c | 1 +
net/core/dev.c | 2 ++
net/core/skbuff.c | 1 +
4 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..75ce9d5 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb,
__entry->skbaddr, __entry->protocol, __entry->location)
);
+TRACE_EVENT(consume_skb,
+
+ TP_PROTO(struct sk_buff *skb),
+
+ TP_ARGS(skb),
+
+ TP_STRUCT__entry(
+ __field( void *, skbaddr )
+ ),
+
+ TP_fast_assign(
+ __entry->skbaddr = skb;
+ ),
+
+ TP_printk("skbaddr=%p", __entry->skbaddr)
+);
+
TRACE_EVENT(skb_copy_datagram_iovec,
TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 251997a..282806b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -243,6 +243,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
unlock_sock_fast(sk, slow);
/* skb is now orphaned, can be freed outside of locked section */
+ trace_kfree_skb(skb, skb_free_datagram_locked);
__kfree_skb(skb);
}
EXPORT_SYMBOL(skb_free_datagram_locked);
diff --git a/net/core/dev.c b/net/core/dev.c
index c9b026a..48f7977 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
#include <linux/random.h>
#include <trace/events/napi.h>
#include ...From: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> --
Commit-ID: 07dc22e7295f25526f110d704655ff0ea7687420 Gitweb: http://git.kernel.org/tip/07dc22e7295f25526f110d704655ff0ea7687420 Author: Koki Sanagi <sanagi.koki@jp.fujitsu.com> AuthorDate: Mon, 23 Aug 2010 18:46:12 +0900 Committer: Frederic Weisbecker <fweisbec@gmail.com> CommitDate: Tue, 7 Sep 2010 17:51:53 +0200 skb: Add tracepoints to freeing skb This patch adds tracepoint to consume_skb and add trace_kfree_skb before __kfree_skb in skb_free_datagram_locked and net_tx_action. Combinating with tracepoint on dev_hard_start_xmit, we can check how long it takes to free transmitted packets. And using it, we can calculate how many packets driver had at that time. It is useful when a drop of transmitted packet is a problem. sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8 Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C724364.50903@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> --- include/trace/events/skb.h | 17 +++++++++++++++++ net/core/datagram.c | 1 + net/core/dev.c | 2 ++ net/core/skbuff.c | 1 + 4 files changed, 21 insertions(+), 0 deletions(-) diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h index 4b2be6d..75ce9d5 100644 --- a/include/trace/events/skb.h +++ b/include/trace/events/skb.h @@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb, __entry->skbaddr, __entry->protocol, __entry->location) ); +TRACE_EVENT(consume_skb, + + TP_PROTO(struct sk_buff ...
Add a perf script which shows a process of packets and processed time.
It helps us to investigate networking or network device.
If you want to use it, install perf and record perf.data like following.
#perf trace record netdev-times [script]
If you set script, perf gathers records until it ends.
If not, you must Ctrl-C to stop recording.
And if you want a report from record,
#perf trace report netdev-times [options]
If you use some options, you can limit an output.
Option is below.
tx: show only process of tx packets
rx: show only process of rx packets
dev=: show a process specified with this option
debug: work with debug mode. It shows buffer status.
For example, if you want to show a process of received packets associated
with eth4,
#perf trace report netdev-times rx dev=eth4
106133.171439sec cpu=0
irq_entry(+0.000msec irq=24:eth4)
|
softirq_entry(+0.006msec)
|
|---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
| |
| skb_copy_datagram_iovec(+0.039msec 10291::10291)
|
napi_poll_exit(+0.022msec eth4)
This perf script helps us to analyze a process time of transmit/receive
sequence.
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
tools/perf/scripts/python/bin/netdev-times-record | 8 +
tools/perf/scripts/python/bin/netdev-times-report | 5 +
tools/perf/scripts/python/netdev-times.py | 464 +++++++++++++++++++++
3 files changed, 477 insertions(+), 0 deletions(-)
diff --git a/tools/perf/scripts/python/bin/netdev-times-record b/tools/perf/scripts/python/bin/netdev-times-record
new file mode 100644
index 0000000..2b59511
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-record
@@ -0,0 +1,8 @@
+#!/bin/bash
+perf record -c 1 -f -R -a -e net:net_dev_xmit -e net:net_dev_queue \
+ -e net:netif_receive_skb -e net:netif_rx \
+ -e skb:consume_skb -e skb:kfree_skb \
+ -e skb:skb_copy_datagram_iovec -e napi:napi_poll \
+ -e ...From: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> --
Nano-nits: -c 1 and -R are now default settings for tracepoints and -f is not Events already arrive in time order to the scripts. Thnaks! --
Commit-ID: 359d5106a2ff4ffa2ba129ec8f54743c341dabfc Gitweb: http://git.kernel.org/tip/359d5106a2ff4ffa2ba129ec8f54743c341dabfc Author: Koki Sanagi <sanagi.koki@jp.fujitsu.com> AuthorDate: Mon, 23 Aug 2010 18:47:09 +0900 Committer: Frederic Weisbecker <fweisbec@gmail.com> CommitDate: Tue, 7 Sep 2010 18:43:32 +0200 perf: Add a script to show packets processing Add a perf script which shows packets processing and processed time. It helps us to investigate networking or network devices. If you want to use it, install perf and record perf.data like following. If you set script, perf gathers records until it ends. If not, you must Ctrl-C to stop recording. And if you want a report from record, If you use some options, you can limit the output. Option is below. tx: show only tx packets processing rx: show only rx packets processing dev=: show processing on this device debug: work with debug mode. It shows buffer status. For example, if you want to show received packets processing associated with eth4, 106133.171439sec cpu=0 irq_entry(+0.000msec irq=24:eth4) | softirq_entry(+0.006msec) | |---netif_receive_skb(+0.010msec skb=f2d15900 len=100) | | | skb_copy_datagram_iovec(+0.039msec 10291::10291) | napi_poll_exit(+0.022msec eth4) This perf script helps us to analyze the processing time of a transmit/receive sequence. Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Zanussi ...
From: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> --
The entire series: Acked-by: Steven Rostedt <rostedt@goodmis.org> -- Steve --
Thanks many acks. and I have one question. These patches have several component. Patch1 is kernel component, but patch2-5 are netdev component. What tree is good to be included ? If it is not net-next, I must rebase to another tree. Thanks, Koki Sanagi. --
From: Koki Sanagi <sanagi.koki@jp.fujitsu.com> I would prefer it goes into the tracing tree or whatever is the most appropriate for patch #1. --
O.K. I'll rebase to linux-2.6-tip. Thanks, Koki Sanagi. --
No need, they apply very well :) I'll push that to -tip soon. Thanks. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing |
