Re: [PATCH v4 2/5] napi: convert trace_napi_poll to TRACE_EVENT

Previous thread: [PATCH] Fix typo in Kconfig help text by Thomas Weber on Monday, August 23, 2010 - 2:20 am. (2 messages)

Next thread: [GIT] kbuild fixes for 2.6.36-rc2 by Michal Marek on Saturday, August 21, 2010 - 1:13 pm. (2 messages)
From: Koki Sanagi
Date: Monday, August 23, 2010 - 2:41 am

Rebase to the latest net-next.

CHANGE-LOG since v3:
    1) change arguments of softirq tracepoint into original one.
    2) remove tracepoint of dev_kfree_skb_irq and skb_free_datagram_locked
       and add trace_kfree_skb before __kfree_skb instead of them.
    3) add tracepoint to netif_rx and display it by netdev-times script.

These patch-set adds tracepoints to show us a process of packets.
Using these tracepoints and existing points, we can get the time when
packet passes through some points in transmit or receive sequence.
For example, this is an output of perf script which is attached by patch 5/5.

106133.171439sec cpu=0
  irq_entry(+0.000msec irq=24:eth4)
         |
  softirq_entry(+0.006msec)
         |
         |---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
         |            |
         |      skb_copy_datagram_iovec(+0.039msec 10291::10291)
         |
  napi_poll_exit(+0.022msec eth4)

106134.175634sec cpu=1
  irq_entry(+0.000msec irq=28:eth1)
         |
         |---netif_rx(+0.009msec skb=f3ef0a00)
         |
  softirq_entry(+0.018msec)
         |
         |---netif_receive_skb(+0.021msec skb=f3ef0a00 len=84)
         |            |
         |      skb_copy_datagram_iovec(+0.033msec 0:swapper)
         |
  napi_poll_exit(+0.035msec (no_device))

The above is a receive side(eth4 is NAPI. eth1 is non-NAPI). Like this, it can
show receive sequence frominterrupt(irq_entry) to application
(skb_copy_datagram_iovec). 
This script shows one NET_RX softirq and events related to it. All relative
time bases on first irq_entry which raise NET_RX softirq.

   dev    len      Qdisc               netdevice             free
   eth4    74 106125.030004sec        0.006msec             0.009msec
   eth4    87 106125.041020sec        0.007msec             0.023msec
   eth4    66 106125.042291sec        0.003msec             0.012msec
   eth4    66 106125.043274sec        0.006msec             0.004msec
   eth4   850 106125.044283sec        0.007msec       ...
From: Koki Sanagi
Date: Monday, August 23, 2010 - 2:42 am

From: Lai Jiangshan <laijs@cn.fujitsu.com>

Add a tracepoint for tracing when softirq action is raised.

It and the existed tracepoints complete softirq's tracepoints:
softirq_raise, softirq_entry and softirq_exit.

And when this tracepoint is used in combination with
the softirq_entry tracepoint we can determine
the softirq raise latency.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

[ factorize softirq events with DECLARE_EVENT_CLASS ]
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/linux/interrupt.h  |    8 +++++++-
 include/trace/events/irq.h |   26 ++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a0384a4..d3e8e90 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -18,6 +18,7 @@
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
+#include <trace/events/irq.h>
 
 /*
  * These correspond to the IORESOURCE_IRQ_* defines in
@@ -407,7 +408,12 @@ asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
-#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
+static inline void __raise_softirq_irqoff(unsigned int nr)
+{
+	trace_softirq_raise((struct softirq_action *)&nr, NULL);
+	or_softirq_pending(1UL << nr);
+}
+
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 extern void wakeup_softirqd(void);
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 0e4cfb6..3ddda02 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -5,7 +5,9 @@
 #define _TRACE_IRQ_H
 
 #include <linux/tracepoint.h>
-#include <linux/interrupt.h>
+
+struct ...
From: Frederic Weisbecker
Date: Friday, September 3, 2010 - 8:29 am

It seems that this will break softirq_entry/exit tracepoints.
__entry->vec will deref vec->action() for these two, which is not
what we want.

If you can't have the same tracepoint signature for the three, just
split the new one in a seperate TRACE_EVENT().

Thanks.

--

From: Steven Rostedt
Date: Friday, September 3, 2010 - 8:39 am

But for trace_softirq_entry and trace_softirq_exit, vec will not be

It may be a bit of a hack, and questionable about adding another
TRACE_EVENT(). There still is a pretty good space savings in using
DEFINE_EVENT() over TRACE_EVENT() though.

-- Steve



--

From: Frederic Weisbecker
Date: Friday, September 3, 2010 - 8:42 am

Oh right...



Yeah, let's keep it as is.

--

From: Steven Rostedt
Date: Friday, September 3, 2010 - 8:43 am

Perhaps doing:

	trace_softirq_raise((struct softirq_action *)((unsigend long)nr),
				NULL);


			__entry->vec = (int)h;


Doing the above will at least be a bit safer.

-- Steve


--

From: Frederic Weisbecker
Date: Friday, September 3, 2010 - 8:50 am

Agreed, I'm going to change that in the patch.

Thanks.

--

From: Koki Sanagi
Date: Sunday, September 5, 2010 - 6:46 pm

I agree.

Thanks,
Koki Sanagi.

--

From: tip-bot for Lai Jiangshan
Date: Wednesday, September 8, 2010 - 1:33 am

Commit-ID:  2bf2160d8805de64308e2e7c3cd97813cb58ed2f
Gitweb:     http://git.kernel.org/tip/2bf2160d8805de64308e2e7c3cd97813cb58ed2f
Author:     Lai Jiangshan <laijs@cn.fujitsu.com>
AuthorDate: Mon, 23 Aug 2010 18:42:48 +0900
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Tue, 7 Sep 2010 17:49:34 +0200

irq: Add tracepoint to softirq_raise

Add a tracepoint for tracing when softirq action is raised.

This and the existing tracepoints complete softirq's tracepoints:
softirq_raise, softirq_entry and softirq_exit.

And when this tracepoint is used in combination with
the softirq_entry tracepoint we can determine
the softirq raise latency.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: David Miller <davem@davemloft.net>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C724298.4050509@jp.fujitsu.com>
[ factorize softirq events with DECLARE_EVENT_CLASS ]
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/interrupt.h  |    8 +++++++-
 include/trace/events/irq.h |   26 ++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a0384a4..531495d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -18,6 +18,7 @@
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
+#include <trace/events/irq.h>
 
 /*
  * These correspond to the IORESOURCE_IRQ_* defines in
@@ -407,7 +408,12 @@ asmlinkage void do_softirq(void);
 asmlinkage void ...
From: Ingo Molnar
Date: Wednesday, September 8, 2010 - 4:25 am

FYI, this commit broke the Sparc build:

In file included from /home/mingo/tip/arch/sparc/include/asm/irq_32.h:11,
                 from /home/mingo/tip/arch/sparc/include/asm/irq.h:6,
                 from /home/mingo/tip/include/linux/irqnr.h:10,
                 from /home/mingo/tip/include/linux/irq.h:22,
                 from /home/mingo/tip/include/asm-generic/hardirq.h:6,
                 from /home/mingo/tip/arch/sparc/include/asm/hardirq_32.h:11,
                 from /home/mingo/tip/arch/sparc/include/asm/hardirq.h:6,
                 from /home/mingo/tip/include/linux/hardirq.h:10,
                 from /home/mingo/tip/include/linux/ftrace_event.h:7,
                 from /home/mingo/tip/include/trace/syscall.h:6,
                 from /home/mingo/tip/include/linux/syscalls.h:76,
                 from /home/mingo/tip/init/initramfs.c:9:
/home/mingo/tip/include/linux/interrupt.h: In function '__raise_softirq_irqoff':
/home/mingo/tip/include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
/home/mingo/tip/include/linux/interrupt.h:414: error: lvalue required as left operand of assignment
make[2]: *** [init/initramfs.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /home/mingo/tip/arch/sparc/include/asm/irq_32.h:11,

	Ingo
--

From: Frederic Weisbecker
Date: Wednesday, September 8, 2010 - 5:26 am

Yeah, there is a circular dependency.
Does that fixes the issue (and if so, does that look sane)?

Thanks.

---
From fc21eaa02d4a6f0af396af6a106587e61515cd86 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <fweisbec@gmail.com>
Date: Wed, 8 Sep 2010 14:17:31 +0200
Subject: [PATCH] irq: Fix circular headers dependency

asm-generic/hardirq.h needs asm/irq.h which might include
linux/interrupt.h as in the sparc 32 case. At this point
we need irq_cpustat generic definitions, but those are
included later in asm-generic/hardirq.h.

Then delay a bit the inclusion of irq.h from
asm-generic/hardirq.h, it doesn't need to be included early.

This fixes:

In file included from arch/sparc/include/asm/irq_32.h:11,
                  from arch/sparc/include/asm/irq.h:6,
                  from include/linux/irqnr.h:10,
                  from include/linux/irq.h:22,
                  from include/asm-generic/hardirq.h:6,
                  from arch/sparc/include/asm/hardirq_32.h:11,
                  from arch/sparc/include/asm/hardirq.h:6,
                  from include/linux/hardirq.h:10,
                  from include/linux/ftrace_event.h:7,
                  from include/trace/syscall.h:6,
                  from include/linux/syscalls.h:76,
                  from init/initramfs.c:9:
include/linux/interrupt.h: In function '__raise_softirq_irqoff':
include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
include/linux/interrupt.h:414: error: lvalue required as left operand of assignment

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/asm-generic/hardirq.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/hardirq.h b/include/asm-generic/hardirq.h
index 62f5908..04d0a97 100644
--- a/include/asm-generic/hardirq.h
+++ ...
From: tip-bot for Frederic Weisbecker
Date: Thursday, September 9, 2010 - 12:54 pm

Commit-ID:  3b8fad3e2f5f69bfd8e42d099ca8582fb2342edf
Gitweb:     http://git.kernel.org/tip/3b8fad3e2f5f69bfd8e42d099ca8582fb2342edf
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Wed, 8 Sep 2010 14:26:00 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 9 Sep 2010 21:28:58 +0200

irq: Fix circular headers dependency

asm-generic/hardirq.h needs asm/irq.h which might include
linux/interrupt.h as in the sparc 32 case. At this point
we need irq_cpustat generic definitions, but those are
included later in asm-generic/hardirq.h.

Then delay a bit the inclusion of irq.h from
asm-generic/hardirq.h, it doesn't need to be included early.

This fixes:

 include/linux/interrupt.h: In function '__raise_softirq_irqoff':
 include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
 include/linux/interrupt.h:414: error: lvalue required as left operand of assignment

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Cc: mathieu.desnoyers@efficios.com
Cc: rostedt@goodmis.org
Cc: nhorman@tuxdriver.com
Cc: scott.a.mcmillan@intel.com
Cc: eric.dumazet@gmail.com
Cc: kaneshige.kenji@jp.fujitsu.com
Cc: davem@davemloft.net
Cc: izumi.taku@jp.fujitsu.com
Cc: kosaki.motohiro@jp.fujitsu.com
LKML-Reference: <20100908122557.GA5310@nowhere>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-generic/hardirq.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/hardirq.h b/include/asm-generic/hardirq.h
index 62f5908..04d0a97 100644
--- a/include/asm-generic/hardirq.h
+++ b/include/asm-generic/hardirq.h
@@ -3,13 +3,13 @@
 
 #include <linux/cache.h>
 #include <linux/threads.h>
-#include <linux/irq.h>
 
 typedef struct {
 	unsigned int __softirq_pending;
 } ____cacheline_aligned irq_cpustat_t;
 
 #include <linux/irq_cpustat.h>	/* Standard ...
From: Peter Zijlstra
Date: Monday, October 18, 2010 - 2:44 am

I could build sparc64_defconfig, but s390 is broken for me by this...


--

From: Peter Zijlstra
Date: Monday, October 18, 2010 - 3:11 am

the below makes s390 build again, not sure its completely safe for all
configs though...

Heiko?

---
 arch/s390/include/asm/hardirq.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/hardirq.h
b/arch/s390/include/asm/hardirq.h
index 498bc38..9558a71 100644
--- a/arch/s390/include/asm/hardirq.h
+++ b/arch/s390/include/asm/hardirq.h
@@ -15,7 +15,6 @@
 #include <linux/threads.h>
 #include <linux/sched.h>
 #include <linux/cache.h>
-#include <linux/interrupt.h>
 #include <asm/lowcore.h>
 
 #define local_softirq_pending() (S390_lowcore.softirq_pending)

--

From: Heiko Carstens
Date: Monday, October 18, 2010 - 3:26 am

From: Peter Zijlstra
Date: Monday, October 18, 2010 - 3:48 am

/me being very grumpy @ Lai.. ia64 is broken too!

/me goes revert this shite
--

From: Peter Zijlstra
Date: Tuesday, October 19, 2010 - 4:25 am

No its not, -tip is still not buildable on s390 and ia64.

Please don't ever pull crap like this again.
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 6:00 am

With the addition of trace_softirq_raise() the softirq tracepoint got
even more convoluted. Why the tracepoints take two pointers to assign
an integer is beyond my comprehension.

But adding an extra case which treats the first pointer as an unsigned
long when the second pointer is NULL including the back and forth
type casting is just horrible.

Convert the softirq tracepoints to take a single unsigned int argument
for the softirq vector number and fix the call sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/interrupt.h  |    2 -
 include/trace/events/irq.h |   54 ++++++++++++++++-----------------------------
 kernel/softirq.c           |   14 ++++++-----
 3 files changed, 29 insertions(+), 41 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 531495d..0ac1949 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -410,7 +410,7 @@ extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
 static inline void __raise_softirq_irqoff(unsigned int nr)
 {
-	trace_softirq_raise((struct softirq_action *)(unsigned long)nr, NULL);
+	trace_softirq_raise(nr);
 	or_softirq_pending(1UL << nr);
 }
 
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 6fa7cba..1c09820 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -86,76 +86,62 @@ TRACE_EVENT(irq_handler_exit,
 
 DECLARE_EVENT_CLASS(softirq,
 
-	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
+	TP_PROTO(unsigned int vec_nr),
 
-	TP_ARGS(h, vec),
+	TP_ARGS(vec_nr),
 
 	TP_STRUCT__entry(
-		__field(	int,	vec			)
+		__field(	unsigned int,	vec	)
 	),
 
 	TP_fast_assign(
-		if (vec)
-			__entry->vec = (int)(h - vec);
-		else
-			__entry->vec = (int)(long)h;
+		__entry->vec = vec_nr;
 	),
 
-	TP_printk("vec=%d [action=%s]", __entry->vec,
+	TP_printk("vec=%u [action=%s]", __entry->vec,
 		  ...
From: Peter Zijlstra
Date: Tuesday, October 19, 2010 - 6:08 am

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

A much needed cleanup indeed!
--

From: Mathieu Desnoyers
Date: Tuesday, October 19, 2010 - 6:22 am

Well, there was originally a reason for this oddness. The in __do_softirq(),
"h - softirq_ve"c computation was not needed outside of the tracepoint handler
in the past, but it now seems to be required with the new inlined
"kstat_incr_softirqs_this_cpu()".

So yes, thanks to this recent change, it now makes sense to pull this
computation out of the tracepoints and do it unconditionally in the kernel code.

Feel free to put my:
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 6:41 am

Dudes, a vector computation is hardly a performance problem in that
function and definitely not an excuse for designing such horrible
interfaces.

Thanks,

	tglx
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 6:54 am

Yes, now we can be a bit more liberal. But when these tracepoints were
going in, people were watching to make sure they have practically zero
impact when tracing was disabled.

Now that people are more use to tracepoints, they are more understanding
to have cleaner code over that extra few more lines of machine code in
the fast path.

-- Steve


--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 7:07 am

The vector computation is compared to the extra tracing induced jumps
probably not even measurable. Stop defending horrible coding with
handwavy performance and impact arguments.

Thanks,

	tglx
--

From: Mathieu Desnoyers
Date: Tuesday, October 19, 2010 - 7:28 am

From the moment markers and tracepoints infrastructures were merged, the
performance overhead target has been assuming we would eventually be merging
"asm goto jump labels", which replace the load+test+branch with a no-op.

So compared to a 5 bytes no-op added to the fast path, this vector computation
can be expected to have a higher performance impact, because skipping a no-op on
modern architectures (x86 at least) adds technically zero cycles. Agreed, there
is still the impact on I$, extra register pressure, some leaf functions becoming
non-leaf, and added function call (which imply external side-effect, thus acting
like a barrier()). But saying that all we do is to provide handwavy performance
and impact arguments is a bit much.

Until the asm goto are more widely deployed and until gcc 4.5 is more widely
used, there are some instrumentation sites I am relunctant to consider to
instrument with tracepoints (e.g. all system call entry/exit sites). However,
we should not use the cost of the current load+test+branch tracepoint behavior
as an excuse for adding extra performance impact to kernel code, because when it
will be replaced by asm gotos, all that will be left is the performance impact
inappropriately justified as insignificant compared to the impact of the old
tracepoint scheme.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 12:49 pm

Can you at one point just stop your tracing lectures and look at the
facts ?

The impact of a sensible tracepoint design on the code in question
before kstat_incr_softirqs_this_cpu() was added would have been a mere
_FIVE_ bytes of text. But the original tracepoint code itself is
_TWENTY_ bytes of text larger.

So we trade horrible code plus 20 bytes text against 5 bytes of text
in the hotpath. And you tell me that these _FIVE_ bytes are impacting
performance so much that it's significant.

Now with kstat_incr_softirqs_this_cpu() the impact is zero, it even
removes code.

And talking about non impact of disabled trace points. The tracepoint
in question which made me look at the code results in deinlining
__raise_softirq_irqsoff() in net/dev/core.c. There goes your theory.

So no, you _cannot_ tell what impact a tracepoint has in reality
except by looking at the assembly output.

And what scares me way more is the size of a single tracepoint in a
code file.

Just adding "trace_softirq_entry(nr);" adds 88 bytes of text. So
that's optimized tracing code ?

All it's supposed to do is:

    if (enabled)
	trace_foo(nr);

Replace "if (enabled)" with your favourite code patching jump label
whatever magic. The above stupid version takes about 28, but the
"optimized" tracing code makes that 88. Brilliant. That's inlining
utter shite for no good reason. WTF is it necessary to inline all that
gunk ?

Please spare me the "jump label will make this less intrusive"
lecture. I'm not interested at all.

Let's instead look at some more facts:

#include <linux/interrupt.h>
#include <linux/module.h>

#include <trace/events/irq.h>

static struct softirq_action softirq_vec[NR_SOFTIRQS];

void test(struct softirq_action *h)
{
	trace_softirq_entry(h - softirq_vec);

	h->action(h);
}

Compile this code with GCC 4.5 with and without jump labels (zap the
select HAVE_ARCH_JUMP_LABEL line in arch/x86/Kconfig)

So now the !jumplabel case gives us:

../build/kernel/soft.o:    ...
From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 1:55 pm

Since I don't have your patch yet, I used the original:


void test(struct softirq_action *h)
{
        trace_softirq_entry(h, softirq_vec);
        h->action(h);

Well, the one jmpq is noped out, and the jmp is non conditional. I've
always thought a non conditional jmp was faster than a conditional one,
since there's no need to go into the branch prediction logic. The CPU
can simply skip to the code to jump next. Of counse, this pollutes the 

Because you do the h - softvec in the tracepoint parameter? I got a
different result:

Here's the diff. I did a cut -c10- to get rid of the line numbers so I
have a better diff. There's still differences due to jump locations, but
those are easy to figure out:

I diffed nojump vs jump. The '-' is with nojump, the '+' is with jumps.

--- /tmp/s2	2010-10-19 16:40:19.000000000 -0400
+++ /tmp/s1	2010-10-19 16:40:23.000000000 -0400
@@ -1,38 +1,33 @@
-00026f0 <test>:
+00027a0 <test>:
 	55                   	push   %rbp
 	48 89 e5             	mov    %rsp,%rbp
-	48 83 ec 10          	sub    $0x10,%rsp
-	48 89 1c 24          	mov    %rbx,(%rsp)
-	4c 89 64 24 08       	mov    %r12,0x8(%rsp)
-	e8 00 00 00 00       	callq  2706 <test+0x16>
+	41 54                	push   %r12
+	53                   	push   %rbx
+	e8 00 00 00 00       	callq  27ac <test+0xc>
 R_X86_64_PC32	mcount-0x4
-	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # 270c <test+0x1c>
-R_X86_64_PC32	__tracepoint_softirq_entry+0x4
 	48 89 fb             	mov    %rdi,%rbx

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
-	85 d2                	test   %edx,%edx
-	75 10                	jne    2723 <test+0x33>
+	e9 00 00 00 00       	jmpq   27b4 <test+0x14>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There's the difference with this code. We replaced a test and jump
conditional with a single jump that will later be nop'd out.


 	48 89 df             	mov    %rbx,%rdi
 	ff 13                	callq  *(%rbx)
-	48 8b 1c 24          	mov   ...
From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 2:07 pm

What are you smoking ?

In case the trace point is enabled the jmpq is there, so it jumps to
23 and jumps from there to 72.

In case the trace point is disabled the jmpq is noped out, so it jumps

I always thought, that at least some of the stuff which comes from

We might consult Mathieu for further useless blurb on how CPUs work
around broken code.

Thanks,

	tglx
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 2:23 pm

What? Are you saying that conditional jumps are just as fast as non

No, when we dynamically enable the tracepoint, it will jump to 25, not
23. That's what the goto part is about. We add the do_trace label to the
table, and we make it point to that location. If we did it as you say,
then tracepoints would never be enabled.

This is not unlike what we do with the function tracer. The original
code points to mcount which simply is:

	mcount:
		retq




The code worked fine before, it just was not very pretty.

But it seemed that gcc for you inlined the code in the wrong spot.
Perhaps it's not a good idea to have the something like h - softirq_vec
in the parameter of the tracepoint. Not saying that your change is not
worth it. It is, because h - softirq_vec is used by others now too.


-- Steve


--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 2:48 pm

OK, first of all, there are some serious WTFs here:

# define JUMP_LABEL_INITIAL_NOP ".byte 0xe9 \n\t .long 0\n\t"

A jump instruction is one of the worst possible NOPs.  Why are we doing
this?

The second thing that I found when implementing static_cpu_has() was
that it is actually better to encapsulate the asm goto in a small inline
which returns bool (true/false) -- gcc will happily optimize out the
variable and only see it as a flow of control thing.  I would be very
curious if that wouldn't make gcc generate better code in cases like that.

gcc 4.5.0 has a bug in that there must be a flowthrough case in the asm
goto (you can't have it unconditionally branch one way or the other), so
that should be the likely case and accordingly it should be annotated
likely() so that gcc doesn't reorder.  I suspect in the end one ends up
with code like this:

static __always_inline __pure bool __switch_point(...)
{
	asm goto("1: " JUMP_LABEL_INITIAL_NOP
		 /* ... patching stuff */
		: : : : t_jump);
	return false;
t_jump:
	return true;
}

#define SWITCH_POINT(x) unlikely(__switch_point(x))

I *suspect* this will resolve the need for hot/cold labels just fine.

	-hpa

--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 3:23 pm

Good question. Safety?  Jason?


Interesting, we could try this.

Thanks!

-- Steve


--

From: Peter Zijlstra
Date: Tuesday, October 19, 2010 - 3:27 pm

Due to not actually having a sane key type the above is not easy to
implement, but I tried:

#define _SWITCH_POINT(x)\
({                                                              \
        __label__ jl_enabled;                                   \
        bool ret = true;                                        \
        JUMP_LABEL(x, jl_enabled);                              \
        ret = false;                                            \
jl_enabled:                                                     \
        ret;            })

#define SWITCH_POINT(x) unlikely(_SWITCH_POINT(x))

#define COND_STMT(key, stmt)                                    \
do {                                                            \
        if (SWITCH_POINT(key)) {                                \
                stmt;                                           \
        }                                                       \
} while (0)


and that's still generating these double jumps.

--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 4:39 pm

I just experimented with it, and the ({...}) construct doesn't work,
because it looks like a merged flow of control to gcc.

Replacing the ({ ... }) with an inline does indeed remove the double
jumps.

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index b67cb18..2ff829d 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -61,12 +61,22 @@ static inline int jump_label_text_reserved(void
*start, void *end)

 #endif

+static __always_inline __pure bool _SWITCH_POINT(void *x)
+{
+       asm goto("# SWITCH_POINT %0\n\t"
+                ".byte 0x66,0x66,0x66,0x66,0x90\n"
+                "1:"
+                : : "i" (x) : : jl_enabled);
+       return false;
+jl_enabled:
+       return true;
+}
+
+#define SWITCH_POINT(x)        unlikely(_SWITCH_POINT(x))
+
 #define COND_STMT(key, stmt)                                   \
 do {                                                           \
-       __label__ jl_enabled;                                   \
-       JUMP_LABEL(key, jl_enabled);                            \
-       if (0) {                                                \
-jl_enabled:                                                    \
+       if (SWITCH_POINT(key)) {                                \
                stmt;                                           \
        }                                                       \
 } while (0)


The key here seems to be to not use the JUMP_LABEL macro as implemented;
I have utterly failed to make JUMP_LABEL() do the right thing.


	-hpa
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 4:45 pm

What happens if you remove the do { } while (0) from JUMP_LABEL, since
it now just makes it into a asm()

-- Steve


--

From: Jason Baron
Date: Tuesday, October 19, 2010 - 5:43 pm

ok, I tried this out for the tracepoint code, but I still seem to be
getting the double jump.

patch:


diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 1947a12..7bc2537 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -66,12 +66,22 @@ static inline void jump_label_unlock(void) {}
 
 #endif
 
+static __always_inline __pure bool _SWITCH_POINT(void *x)
+{
+	asm goto("# SWITCH_POINT %0\n\t"
+		 ".byte 0x66,0x66,0x66,0x66,0x90\n"
+		 "1:"
+		 : : "i" (x) : : jl_enabled);
+	return false;
+jl_enabled:
+	return true;
+}
+
+#define SWITCH_POINT(x)        unlikely(_SWITCH_POINT(x))
+
 #define COND_STMT(key, stmt)					\
 do {								\
-	__label__ jl_enabled;					\
-	JUMP_LABEL(key, jl_enabled);				\
-	if (0) {						\
-jl_enabled:							\
+	if (SWITCH_POINT(key)) {                                \
 		stmt;						\
 	}							\
 } while (0)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a4a90b6..1f8d14f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -146,12 +146,7 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
 	extern struct tracepoint __tracepoint_##name;			\
 	static inline void trace_##name(proto)				\
 	{								\
-		JUMP_LABEL(&__tracepoint_##name.state, do_trace);	\
-		return;							\
-do_trace:								\
-			__DO_TRACE(&__tracepoint_##name,		\
-				TP_PROTO(data_proto),			\
-				TP_ARGS(data_args));			\
+		COND_STMT(&__tracepoint_##name.state, __DO_TRACE(&__tracepoint_##name, TP_PROTO(data_proto), TP_ARGS(data_args)));						     \
 	}								\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\


disassemly:

ffffffff810360a6 <set_task_cpu>:
ffffffff810360a6:       55                      push   %rbp
ffffffff810360a7:       48 89 e5                mov    %rsp,%rbp
ffffffff810360aa:       41 55                   push   %r13
ffffffff810360ac:       41 54                   push ...
From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 3:26 pm

But it makes absolutely no sense to insert an instruction that
suboptimal and then convert it.  Start out with a reasonable,
universally acceptable, instruction, e.g. LEA on 32 bits and NOPL on 64

It of course also have the nice property that it syntactically looks
exactly like any other C conditional.

	-hpa
--

From: Mathieu Desnoyers
Date: Tuesday, October 19, 2010 - 3:41 pm

This code is dynamically patched at boot time (and module load time) with a

Thanks for the hint! We'll make sure to try it out. Having the ability to force
gcc to put the tracepoint in an unlikely branch is deeply needed here.

I'm a bit curious about the nop vs jump overhead comparison you are referring
to. It is an instruction latency benchmark or throughput benchmark ?

Intel's manual "Intel 64 and IA-32 Architectures Optimization Reference Manual"

http://www.intel.com/Assets/PDF/manual/248966.pdf

Page C-33 (or 577 in the pdf)

"7. Selection of conditional jump instructions should be based on the
    recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to
    improve the predictability of branches. When branches are predicted
    successfully, the latency of jcc is effectively zero."

So it mentions "jcc", but not jmp. Is there any reason for jmp to have a higher
latency than jcc ?

In this manual, the latency of predicted jcc is therefore 0 cycle, and its
throughput is 0.5 cycle/insn.

NOP (page C-29) is stated to have a latency of 0.5 to 1 cycle/insn (depending on
the exact HW), and throughput of 0.5 cycle/insn.

However, I have not found "jmp" explicitly in this listing.

So if we were executing tracepoints in a maze of jumps, we could argue that
instruction throughput is the most important there. However, if we expect the
common case to be surrounded by some non-ALU instructions, latency tends to
become the most important criterion.

But I feel I might be missing something important that distinguish "jcc" from
"jmp".

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 3:49 pm

That's just ridiculous... start out with something sane and you at least

NOP has a latency of 0.5-1.0 cycle/insns, *but has no consumers*.

JMP/Jcc does have a consumer -- the IP -- and actually measuring shows
that it is much, much worse than NOP and other dummy instructions.

	-hpa
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 4:05 pm

But how does JMP vs Jcc compare?

-- Steve


--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 4:09 pm

*As far as I know* they're the same, except of course that a direct JMP
never mispredicts.

	-hpa
--

From: Jason Baron
Date: Wednesday, October 20, 2010 - 8:27 am

sure. The idea of the 'jmp 0' was simply to be an lcd for x86, if
there's a better lcd for x86, I'll update it. But note, that since the
'jmp 0' is patched to a better nop at boot, we wouldn't see much gain.
And in the boot path we are using 'text_poke_early()', so avoiding that
isn't going to improve things much.

I've got a few fixup patches in the queue that I'm going to post first,
and then I'll take a look at this change.

thanks,

-Jason

--

From: Mathieu Desnoyers
Date: Wednesday, October 20, 2010 - 8:41 am

One thing to consider here is that some nops are not compatible across all
architectures. And it would be safer to use an atomic nops (a single
instruction) too. e.g.  GENERIC_NOP5 in arch/x86/include/asm/nops.h is really 2
instructions, which can cause problems if a concurrent thread is preempted
between the two instructions while we patch.

arch_init_ideal_nop5() is actually doing the task of finding the best nop, and
it falls-back on a 5-byte nop (just like you do).

HPA, do you have any recommendation for a 5-byte single-instruction nop that is
efficient enough and will work on all x86 (Intel, AMD and other variants) ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: H. Peter Anvin
Date: Monday, October 25, 2010 - 2:54 pm

It's still a completely unnecessary waste of startup time some
potentially significant fraction of the time.  Startup time matters,
especially as the number of tracepoints grow.

	-hpa
--

From: Mathieu Desnoyers
Date: Monday, October 25, 2010 - 3:01 pm

We're still waiting for input for the best single-5-byte-instruction nop that
will work on all x86 variants. Please note that the GENERIC_NOP5 is actually two
instructions one next to each other, which is not appropriate here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: H. Peter Anvin
Date: Monday, October 25, 2010 - 3:12 pm

On 64 bits, use P6_NOP5; it seems to not suck on any platform.

On 32 bits, 3E 8D 74 26 00 (i.e. DS: + GENERIC_NOP4) seems to at least
do okay.

I can't say these are the *best* (in fact, they are guaranteed not the
best on some significant number of chips), but they haven't sucked on
any chips I have been able to measure -- and are way faster than JMP.

	-hpa
--

From: H. Peter Anvin
Date: Monday, October 25, 2010 - 3:19 pm

This is pure conjecture, I have not measured it, but I suspect in fact
that we could just change the composite nops in nops.h to use a 3E
prefix instead of a separate 90 nop.  Some platforms will take a penalty
on the prefix, but that would be balanced against handling two instructions.

The P5 core and others of the same generation might suffer, as it might
have been able to do U+V pipe pairing on two instructions which it
wouldn't for prefixes.

	-hpa

--

From: Mathieu Desnoyers
Date: Monday, October 25, 2010 - 3:55 pm

Cool, thanks for the info! Steven and Jason should probably update their
respective infrastructure to use the 32-bit 5-byte nop you propose rather than
the 5-byte jump.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Steven Rostedt
Date: Monday, October 25, 2010 - 5:39 pm

Actually, I was thinking that we could take any 5 byte nop. The
alternate code is executed _before_ SMP is enabled. Thus we should not
have any cases where something could be executing in midstream.

-- Steve


--

From: Mathieu Desnoyers
Date: Monday, October 25, 2010 - 6:14 pm

Nay, absolutely not. See, the goal here is to find a no-op that is good enough
to be left there *without* init-time dynamic patching on a range of
architectures, so we can diminish the boot-time delay. This imply that we have
to select a no-op that can be patched in SMP context, thus it must be a single
5-byte instruction. We could even create a EMBEDDED config that lets specify
that the built-in nop should be left there for embedded systems that care about
boot time.

Moreover, even if it was not the case, I'd be tempted to still use a single
instruction 5-byte no-op just in case interrupts or any sorts (standard
interrupts, nmis, mce or whatnot) would happen to be enabled earlier than this
boot time nop patching.

IOW, you'd need a _very_ strong argument to support using the fragile 2
instructions nops there.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 3:04 pm

Crap, crap, crap. This has nothing to do with the arguments of that
trace point, it's a compiler problem and you are just hoping that GCC
will do the right thing.

That's the complete wrong assumption and as Jason confirmed GCC is not
up to it at all.

hpa just posted code which does the _RIGHT_ _THING_ independent of any
compiler madness and you tracer folks just missed it.

Your jump label optimization made code even worse for todays common
compilers. Just admit it and fix that mess you created or simply
disable it.

Thanks,

	tglx
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 3:33 pm

Thomas,

Can you try this patch and see if it makes the object code better?

-- Steve


diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a4a90b6..6264bd3 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -144,14 +144,19 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
  */
 #define __DECLARE_TRACE(name, proto, args, data_proto, data_args)	\
 	extern struct tracepoint __tracepoint_##name;			\
-	static inline void trace_##name(proto)				\
+	static __always_inline int __trace_##name(proto)		\
 	{								\
 		JUMP_LABEL(&__tracepoint_##name.state, do_trace);	\
-		return;							\
+		return 0;						\
 do_trace:								\
 			__DO_TRACE(&__tracepoint_##name,		\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args));			\
+			return 1;					\
+	}								\
+	static inline void trace_##name(proto)				\
+	{								\
+		unlikely(__trace_##name(args));				\
 	}								\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\


--

From: Thomas Gleixner
Date: Thursday, October 21, 2010 - 9:18 am

Nope, same result.

Thanks,

--

From: Steven Rostedt
Date: Thursday, October 21, 2010 - 10:05 am

Yeah, I figured. Do you have CC_OPTIMIZE_FOR_SIZE set? And if you do,
what happens if you disable it?

-- Steve


--

From: Thomas Gleixner
Date: Thursday, October 21, 2010 - 12:56 pm

Hmm. Indeed. That gets rid of the double jump.

Thanks,

	tglx
--

From: H. Peter Anvin
Date: Monday, October 25, 2010 - 3:31 pm

-Os unfortunately drops a bunch of optimizations.

With gcc 4.5.1 there is actually a way to guarantee to get rid of double
jumps, which is that you tell gcc that it is branching to one of two
targets:

                asm goto("1: .byte 0xe9 ; .long %l[t_no]-2f\n"
			 "2:\n"
		         /* patching infrastructure goes here */
                         : : "i" (bit) : : t_no, t_yes);
	        __builtin_unreachable();
t_no:
                return false;
t_yes:
                return true;

[The open-coding of the jump is necessary to force the 5-byte form
instead of the 2-byte form.]

The patching machinery can recognize the case where the jump offset is
zero and patch in a NOP instead.

There does, however, seem to be a couple of problems:

a) gcc 4.5.1 is required due to a bug in previous versions of gcc when
an asm goto doesn't have a fallthrough case.

b) it seems to encourage gcc to actively jump around as it reorders
blocks, since gcc no longer sees a fallthrough case at all.

Not sure I have a good solution for this, at least not with current gcc.

	-hpa
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 2:45 pm

I guess some serious whacking is due.

The compiler adds two jumps when the parameter changes due to
 (h -softvec) instead of (h, softvec) ????

Dude, you can't be serious.

If you would have asked about the compiler version I'm using and told
me about the compiler version you are using, then I could take that
answer somehow serious.

It still would miss the "Uhhhh, your compiler creates crap code"
alert, because that double jump is seriously broken and braindead.

And I tell you more about this. You are going to piss off a lot of
users of distro compilers because they will set CC_HAVE_ASM_GOTO
happily and create the code I posted. Which will break the tracer no
matter what.

So you tracer maniacs happily played with some experimental compiler
stuff w/o even testing your crap against something which ships with
distros or is the reference 4.5 compiler on kernel.org ?

I prefer you sending a patch to disable this, until it's sorted out,
unless you want me to add some really outrageous changelog to the
patch I'm going to put into tip tomorrow night, ok ?

Thanks,

	tglx
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 3:14 pm

Heh, gcc has always been of a black magic for what it decided. But,
anyway, I'm using a self built version (vanilla from gcc.gnu.org) of


Yes, gcc replaced a cmp and conditional jump with two unconditional
jumps. One of these jumps on boot up will be converted to a nop. Thus
the jump label code just converted a compare and conditional jump with a
nop and a non conditional jump.

This still sounds like a win to me, although we can do better. I guess
those poor sobs using a distro kernel compiled with a distro gcc that
has CC_HAVE_ASM_GOTO enabled will still be doing better than if it was
doing the if (enable) code.

-- Steve


--

From: David Daney
Date: Tuesday, October 19, 2010 - 2:16 pm

On 10/19/2010 12:49 PM, Thomas Gleixner wrote:

The 'asm goto' construct in GCC-4.5 is deficient in this area.

GCC assumes that all exit paths from an 'asm goto' are equally likely, 
so the tracing (or dynamic printk etc.) code is assumed to be hot and is 
emitted inline.  Since they are inline like this, there are all these 
jumps around them and they pollute the I-Cache.

I was looking at fixing it, but I think a true general purpose fix would 
require enhancing GCC's grammar to allow specifying of the 'likelyness' 
of each exit path from 'asm goto'.


--

From: Jason Baron
Date: Tuesday, October 19, 2010 - 2:32 pm

right, the next step is adding support for hot/cold labels, so the
tracing code will be annotaed with a 'cold' label. Thus, not adding the
'jmp' above on line '23', and in fact moving the tracing code
out-of-line. Maybe I haven't been clear on this.

thanks,

-Jason
--

From: David Daney
Date: Tuesday, October 19, 2010 - 2:38 pm

Ok, so is anybody working on doing that?  GCC-4.6 stage 1 (the time when 
a change like this could be merged) closes in 8 days.

It is unfortunate that we have this shiny new feature that can't really 
be used because the infrastructure is only half baked.

David Daney
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 2:47 pm

Interesting. I thought the driving force for asm goto was for
tracepoints, as the documentation seems to reference them. One would
think that the default would have been to make it the unlikely case, as

That would be nice.

Thanks,

-- Steve


--

From: Jason Baron
Date: Tuesday, October 19, 2010 - 2:28 pm

right, so the 'jmpq' on boot on x86 gets patched with 5 byte no-op
sequence. So in the disabled case we have no-op followed by a jump

yes, that is the plan. gcc does not yet support hot/cold labels...once
it does the second jump will go away and the entire tracepoint code will
be moved to a 'cold' section. It's not quite completely optimal yet, but

The 'jmpq' in the enabled case is patched with a jmpq to the body of the

yes, this is what the code should look like when we get support for
hot/cold labels. I've discussed this support with gcc folk, and its the
next step here. So yes, this is exacatly where we are headed.

thanks,

-Jason
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 2:55 pm

And that's supposed to be useful ? We do _NOT_ want to jump around
disabled stuff. The noped out case should fall through into the non
traced code. Otherwise that whole jumplabel thing is completely

Then do not advertise it as the brilliant solution for all tracing


So and at the same time the whole tracing crowd tells me, that this is
already a done deal. See previous advertisments from DrTracing. I'm
seriously grumpy about this especially in the context of a patch which
fixes one of the worst interfaces I've seen in years.

Thanks,

	tglx
--

From: Thomas Gleixner
Date: Tuesday, October 19, 2010 - 3:17 pm

IOW, We now jump around the jump which jumps around the disabled code.

Thanks,

	tglx

--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 6:36 pm

Do you happen to have CONFIG_CC_OPTIMIZE_FOR_SIZE set? If so, then this
is a known issue. We even originally had jump label enabled _only_ if
CC_OPTIMIZE_FOR_SIZE was not set, but hpa NAK'd it.

http://lkml.org/lkml/2010/9/22/482

http://lkml.org/lkml/2010/9/20/488

http://lkml.org/lkml/2010/9/24/259

-- Steve


--

From: Jason Baron
Date: Tuesday, October 19, 2010 - 6:52 pm

thanks Steve. I was about to say this. When CONFIG_CC_OPTIMIZE_FOR_SIZE
is not set we don't get the double 'jmp' and the tracepoint code is
moved out of line. It was mentioned that a number of distros ship with
CONFIG_CC_OPTIMIZE_FOR_SIZE not set, and as Steve mentioned my original
patch set was conditional on !CONFIG_CC_OPTIMIZE_FOR_SIZE.

using hot/cold labels gcc can fix the CONFIG_CC_OPTIMIZE_FOR_SIZE case,
but its a non-trivial amount of work for gcc. I was hoping that if jump
labels are included, we could make the gcc work happen.

thanks,

-Jason
--

From: H. Peter Anvin
Date: Monday, October 25, 2010 - 3:32 pm

That's fair.  I think jump labels are still a win even in the
double-jump case (especially if the the tracepoint turns into a NOP
rather than a JMP.)

Code generated with -Os has a bunch of other problems, too.

	-hpa
--

From: Jason Baron
Date: Tuesday, October 19, 2010 - 3:38 pm

I'm not sure I did, the documentation says that we have nop followed by
a jmp:

+The new code is a 'nopl' followed by a 'jmp'. Thus:
+
+nopl - 0f 1f 44 00 00 - 5 bytes
+jmp  - eb 3e          - 2 bytes



sorry if I mislead anybody about the current state of of 'jump labels'.
But we have the same goal in mind, and a clear path to get there. If you
don't agree with the approach - I'm all ears. And you are right - the code is
not where it should be yet.

thanks,

-Jason
--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 3:44 pm

There is no excuse for needing the second jump here, obviously...

	-hpa
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 3:56 pm

Now the trick is to tell gcc that.

-- Steve


--

From: H. Peter Anvin
Date: Tuesday, October 19, 2010 - 3:57 pm

Yes.

	-hpa
--

From: Steven Rostedt
Date: Tuesday, October 19, 2010 - 7:46 am

Yes this was crappy code, I'm not defending it. But this code was from
the original tracepoints. I just looked at when this code was added, and
it was still in the time TRACE_EVENT() was in a major flux. Heck, the
code resided in include/trace/irq.h and not include/trace/events/irq.h.
And yes, a lot of decisions back then were put on handwaving performance
and impact, and it was not just coming from us.

I admit I should have cleaned it up, but I did not want to touch it
until it actually broke ;-)

-- Steve


--

From: Mathieu Desnoyers
Date: Tuesday, October 19, 2010 - 7:00 am

In this specific case, I think you are right. But things are not that trivial,
and you know it. We have to consider:

- Extra instruction cache footprint
- Added register pressure
- Added computation overhead of the added substraction
- Frequency of code execution

for all target architectures when we add tracepoints to performance sensitive
code paths. As a general policy, we try to keep these at the lowest possible
level, so that all tracepoints will be compiled into distro kernels without
perceivable _overall_ performance overhead. It's not something that should be
looked at only on a tracepoint-by-tracepoint overhead basis, but rather by
looking at the overall system degradation that adding 300 tracepoints would
cause.

So I agree with you that it's a trade-off between interface cleanness and
performance. When they were introduced, Tracepoint handlers were barely seen as
citizen of the kernel code base, so all that mattered was to keep the
"tracepoints off" case clean and fast. Now that tracepoint handlers seems to be
increasingly accepted as part of the kernel code base, I agree that taking into
account oddness performed in this handler becomes more important. It ends up
being a question of balance between oddness inside the tracepoint handler and
performance overhead in the off-case. The increased acceptance of the tracepoint
code-base has shifted this balance slightly in favor of cleanness.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: tip-bot for Thomas Gleixner
Date: Thursday, October 21, 2010 - 7:52 am

Commit-ID:  f4bc6bb2d562703eafc895c37e7be20906de139d
Gitweb:     http://git.kernel.org/tip/f4bc6bb2d562703eafc895c37e7be20906de139d
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 19 Oct 2010 15:00:13 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 21 Oct 2010 16:50:29 +0200

tracing: Cleanup the convoluted softirq tracepoints

With the addition of trace_softirq_raise() the softirq tracepoint got
even more convoluted. Why the tracepoints take two pointers to assign
an integer is beyond my comprehension.

But adding an extra case which treats the first pointer as an unsigned
long when the second pointer is NULL including the back and forth
type casting is just horrible.

Convert the softirq tracepoints to take a single unsigned int argument
for the softirq vector number and fix the call sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <alpine.LFD.2.00.1010191428560.6815@localhost6.localdomain6>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: mathieu.desnoyers@efficios.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>

---
 include/linux/interrupt.h  |    2 +-
 include/trace/events/irq.h |   54 ++++++++++++++++---------------------------
 kernel/softirq.c           |   16 +++++++-----
 3 files changed, 30 insertions(+), 42 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 531495d..0ac1949 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -410,7 +410,7 @@ extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
 static inline void __raise_softirq_irqoff(unsigned int nr)
 {
-	trace_softirq_raise((struct softirq_action *)(unsigned long)nr, NULL);
+	trace_softirq_raise(nr);
 	or_softirq_pending(1UL << nr);
 }
 
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 6fa7cba..1c09820 100644
--- ...
From: Koki Sanagi
Date: Monday, August 23, 2010 - 2:43 am

From: Neil Horman <nhorman@tuxdriver.com>

This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT to improve
the usability of napi_poll tracepoint.

          <idle>-0     [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
          <idle>-0     [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1

An original patch is below.
http://marc.info/?l=linux-kernel&m=126021713809450&w=2
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

And add a fix by Steven Rostedt.
http://marc.info/?l=linux-kernel&m=126150506519173&w=2

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/napi.h |   25 +++++++++++++++++++++++--
 1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 188deca..8fe1e93 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -6,10 +6,31 @@
 
 #include <linux/netdevice.h>
 #include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(napi_poll,
 
-DECLARE_TRACE(napi_poll,
 	TP_PROTO(struct napi_struct *napi),
-	TP_ARGS(napi));
+
+	TP_ARGS(napi),
+
+	TP_STRUCT__entry(
+		__field(	struct napi_struct *,	napi)
+		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
+	),
+
+	TP_fast_assign(
+		__entry->napi = napi;
+		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+	),
+
+	TP_printk("napi poll on napi struct %p for device %s",
+		__entry->napi, __get_str(dev_name))
+);
+
+#undef NO_DEV
 
 #endif /* _TRACE_NAPI_H_ */
 

--

From: David Miller
Date: Monday, August 23, 2010 - 8:52 pm

From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>

Acked-by: David S. Miller <davem@davemloft.net>
--

From: tip-bot for Neil Horman
Date: Wednesday, September 8, 2010 - 1:34 am

Commit-ID:  3e4b10d7a4d2a78af64f8096dc7cdb3bebd65adb
Gitweb:     http://git.kernel.org/tip/3e4b10d7a4d2a78af64f8096dc7cdb3bebd65adb
Author:     Neil Horman <nhorman@tuxdriver.com>
AuthorDate: Mon, 23 Aug 2010 18:43:51 +0900
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Tue, 7 Sep 2010 17:51:01 +0200

napi: Convert trace_napi_poll to TRACE_EVENT

This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT
to improve the usability of napi_poll tracepoint.

          <idle>-0     [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
          <idle>-0     [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1

The original patch is below:
http://marc.info/?l=linux-kernel&m=126021713809450&w=2

[ sanagi.koki@jp.fujitsu.com: And add a fix by Steven Rostedt:
http://marc.info/?l=linux-kernel&m=126150506519173&w=2 ]

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C7242D7.4050009@jp.fujitsu.com>
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/trace/events/napi.h |   25 +++++++++++++++++++++++--
 1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 188deca..8fe1e93 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -6,10 +6,31 @@
 
 #include <linux/netdevice.h>
 #include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+#define ...
From: Koki Sanagi
Date: Monday, August 23, 2010 - 2:46 am

This patch adds tracepoint to consume_skb and add trace_kfree_skb before
__kfree_skb in skb_free_datagram_locked and net_tx_action.
Combinating with tracepoint on dev_hard_start_xmit, we can check how long it
takes to free transmited packets. And using it, we can calculate how many
packets driver had at that time. It is useful when a drop of transmited packet
is a problem.

            sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/skb.h |   17 +++++++++++++++++
 net/core/datagram.c        |    1 +
 net/core/dev.c             |    2 ++
 net/core/skbuff.c          |    1 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..75ce9d5 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb,
 		__entry->skbaddr, __entry->protocol, __entry->location)
 );
 
+TRACE_EVENT(consume_skb,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,	skbaddr	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p", __entry->skbaddr)
+);
+
 TRACE_EVENT(skb_copy_datagram_iovec,
 
 	TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 251997a..282806b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -243,6 +243,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
+	trace_kfree_skb(skb, skb_free_datagram_locked);
 	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
diff --git a/net/core/dev.c b/net/core/dev.c
index c9b026a..48f7977 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include ...
From: David Miller
Date: Monday, August 23, 2010 - 8:53 pm

From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>

Acked-by: David S. Miller <davem@davemloft.net>
--

From: tip-bot for Koki Sanagi
Date: Wednesday, September 8, 2010 - 1:35 am

Commit-ID:  07dc22e7295f25526f110d704655ff0ea7687420
Gitweb:     http://git.kernel.org/tip/07dc22e7295f25526f110d704655ff0ea7687420
Author:     Koki Sanagi <sanagi.koki@jp.fujitsu.com>
AuthorDate: Mon, 23 Aug 2010 18:46:12 +0900
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Tue, 7 Sep 2010 17:51:53 +0200

skb: Add tracepoints to freeing skb

This patch adds tracepoint to consume_skb and add trace_kfree_skb
before __kfree_skb in skb_free_datagram_locked and net_tx_action.
Combinating with tracepoint on dev_hard_start_xmit, we can check
how long it takes to free transmitted packets. And using it, we can
calculate how many packets driver had at that time. It is useful when
a drop of transmitted packet is a problem.

            sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C724364.50903@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/trace/events/skb.h |   17 +++++++++++++++++
 net/core/datagram.c        |    1 +
 net/core/dev.c             |    2 ++
 net/core/skbuff.c          |    1 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..75ce9d5 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb,
 		__entry->skbaddr, __entry->protocol, __entry->location)
 );
 
+TRACE_EVENT(consume_skb,
+
+	TP_PROTO(struct sk_buff ...
From: Koki Sanagi
Date: Monday, August 23, 2010 - 2:47 am

Add a perf script which shows a process of packets and processed time.
It helps us to investigate networking or network device.

If you want to use it, install perf and record perf.data like following.

#perf trace record netdev-times [script]

If you set script, perf gathers records until it ends.
If not, you must Ctrl-C to stop recording.

And if you want a report from record,

#perf trace report netdev-times [options]

If you use some options, you can limit an output.
Option is below.

tx: show only process of tx packets
rx: show only process of rx packets
dev=: show a process specified with this option
debug: work with debug mode. It shows buffer status.

For example, if you want to show a process of received packets associated
with eth4,

#perf trace report netdev-times rx dev=eth4
106133.171439sec cpu=0
  irq_entry(+0.000msec irq=24:eth4)
         |
  softirq_entry(+0.006msec)
         |
         |---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
         |            |
         |      skb_copy_datagram_iovec(+0.039msec 10291::10291)
         |
  napi_poll_exit(+0.022msec eth4)

This perf script helps us to analyze a process time of transmit/receive
sequence.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 tools/perf/scripts/python/bin/netdev-times-record |    8 +
 tools/perf/scripts/python/bin/netdev-times-report |    5 +
 tools/perf/scripts/python/netdev-times.py         |  464 +++++++++++++++++++++
 3 files changed, 477 insertions(+), 0 deletions(-)

diff --git a/tools/perf/scripts/python/bin/netdev-times-record b/tools/perf/scripts/python/bin/netdev-times-record
new file mode 100644
index 0000000..2b59511
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-record
@@ -0,0 +1,8 @@
+#!/bin/bash
+perf record -c 1 -f -R -a -e net:net_dev_xmit -e net:net_dev_queue	\
+		-e net:netif_receive_skb -e net:netif_rx		\
+		-e skb:consume_skb -e skb:kfree_skb			\
+		-e skb:skb_copy_datagram_iovec -e napi:napi_poll	\
+		-e ...
From: David Miller
Date: Monday, August 23, 2010 - 8:53 pm

From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>

Acked-by: David S. Miller <davem@davemloft.net>
--

From: Frederic Weisbecker
Date: Tuesday, September 7, 2010 - 9:57 am

Nano-nits:

-c 1 and -R are now default settings for tracepoints and -f is not






Events already arrive in time order to the scripts.

Thnaks!

--

From: tip-bot for Koki Sanagi
Date: Wednesday, September 8, 2010 - 1:35 am

Commit-ID:  359d5106a2ff4ffa2ba129ec8f54743c341dabfc
Gitweb:     http://git.kernel.org/tip/359d5106a2ff4ffa2ba129ec8f54743c341dabfc
Author:     Koki Sanagi <sanagi.koki@jp.fujitsu.com>
AuthorDate: Mon, 23 Aug 2010 18:47:09 +0900
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Tue, 7 Sep 2010 18:43:32 +0200

perf: Add a script to show packets processing

Add a perf script which shows packets processing and processed
time. It helps us to investigate networking or network devices.

If you want to use it, install perf and record perf.data like
following.

If you set script, perf gathers records until it ends.
If not, you must Ctrl-C to stop recording.

And if you want a report from record,

If you use some options, you can limit the output.
Option is below.

tx: show only tx packets processing
rx: show only rx packets processing
dev=: show processing on this device
debug: work with debug mode. It shows buffer status.

For example, if you want to show received packets processing
associated with eth4,

106133.171439sec cpu=0
  irq_entry(+0.000msec irq=24:eth4)
         |
  softirq_entry(+0.006msec)
         |
         |---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
         |            |
         |      skb_copy_datagram_iovec(+0.039msec 10291::10291)
         |
  napi_poll_exit(+0.022msec eth4)

This perf script helps us to analyze the processing time of a
transmit/receive sequence.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Zanussi ...
From: David Miller
Date: Monday, August 23, 2010 - 8:53 pm

From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>

Acked-by: David S. Miller <davem@davemloft.net>
--

From: Steven Rostedt
Date: Monday, August 30, 2010 - 4:50 pm

The entire series:

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


--

From: Koki Sanagi
Date: Thursday, September 2, 2010 - 7:10 pm

Thanks many acks. and I have one question.

These patches have several component.

Patch1 is kernel component, but patch2-5 are netdev component.
What tree is good to be included ?
If it is not net-next, I must rebase to another tree.

Thanks,
Koki Sanagi.

--

From: David Miller
Date: Thursday, September 2, 2010 - 7:17 pm

From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>

I would prefer it goes into the tracing tree or whatever is the most appropriate
for patch #1.
--

From: Koki Sanagi
Date: Thursday, September 2, 2010 - 7:55 pm

O.K. I'll rebase to linux-2.6-tip.

Thanks,
Koki Sanagi.

--

From: Frederic Weisbecker
Date: Thursday, September 2, 2010 - 9:46 pm

No need, they apply very well :)

I'll push that to -tip soon.

Thanks.

--

From: Koki Sanagi
Date: Thursday, September 2, 2010 - 10:12 pm

O.K. Thanks!

Koki Sanagi.


--

Previous thread: [PATCH] Fix typo in Kconfig help text by Thomas Weber on Monday, August 23, 2010 - 2:20 am. (2 messages)

Next thread: [GIT] kbuild fixes for 2.6.36-rc2 by Michal Marek on Saturday, August 21, 2010 - 1:13 pm. (2 messages)