The transactional API patch between the generic and model-specific code introduced several important bugs with event scheduling, at least on X86. If you had pinned events, e.g., watchdog, and were over-committing the PMU, you would get bogus counts. The bug was showing up on Intel CPU because events would move around more often that on AMD. But the problem also existed on AMD, though harder to expose. The issues were: - group_sched_in() was missing a cancel_txn() in the error path - cpuc->n_added was not properly maintained, leading to missing actions in hw_perf_enable(), i.e., n_running being 0. You cannot update n_added until you know the transaction has succeeded. In case of failed transaction n_added was not adjusted back. - in case of failed transactions, event_sched_out() was called and eventually invoked x86_disable_event() to touch the HW reg. But with transactions, on X86, event_sched_in() does not touch HW registers, it simply collects events into a list. Thus, you could end up calling x86_disable_event() on a counter which did not correspond to the current event when idx != -1. The patch modifies the generic and X86 code to avoid all those problems. First, we keep track of the number of events added last. In case the transaction fails, we substract them from n_added. This approach is necessary (as opposed to delaying updates to n_added) because not all event updates use the transaction API, e.g., single events. Second, we encapsulate the event_sched_in() and event_sched_out() in group_sched_in() inside the transaction. That makes the operations symmetrical and you can also detect that you are inside a transaction and skip the HW reg access by checking cpuc->group_flag. With this patch, you can now overcommit the PMU even with pinned system-wide events present and still get valid counts. Signed-off-by: Stephane Eranian <eranian@google.com> diff --git a/arch/x86/kernel/cpu/perf_event.c ...
Does this patch differ from the one you send earlier? --
Yes, it does. It changes the way n_added is updated. The first version was buggy (perf top would crash the machine). You cannot delay updating n_added until commit, because there are paths where you don't go through transactions. This version instead keeps the number of events last added and speculatively updates n_added (assuming success). If the transaction succeeds, then no correction is done (subtracting 0), if it fails, then the number of events just added to n_added is subtracted. --
OK, just to see if I got it right, I wrote a similar patch from just the
changelog.
Does the below look good?
---
arch/x86/kernel/cpu/perf_event.c | 13 +++++++++++++
kernel/perf_event.c | 11 +++++++----
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 856aeeb..be84944 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -106,6 +106,7 @@ struct cpu_hw_events {
int n_events;
int n_added;
+ int n_txn;
int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
u64 tags[X86_PMC_IDX_MAX];
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
@@ -983,6 +984,7 @@ static int x86_pmu_enable(struct perf_event *event)
out:
cpuc->n_events = n;
cpuc->n_added += n - n0;
+ cpuc->n_txn += n - n0;
return 0;
}
@@ -1089,6 +1091,14 @@ static void x86_pmu_disable(struct perf_event *event)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
int i;
+ /*
+ * If we're called during a txn, we don't need to do anything.
+ * The events never got scheduled and ->cancel_txn will truncate
+ * the event_list.
+ */
+ if (cpuc->group_flags & PERF_EVENT_TXN_STARTED)
+ return;
+
x86_pmu_stop(event);
for (i = 0; i < cpuc->n_events; i++) {
@@ -1379,6 +1389,7 @@ static void x86_pmu_start_txn(const struct pmu *pmu)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
cpuc->group_flag |= PERF_EVENT_TXN_STARTED;
+ cpuc->n_txn = 0;
}
/*
@@ -1391,6 +1402,8 @@ static void x86_pmu_cancel_txn(const struct pmu *pmu)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
cpuc->group_flag &= ~PERF_EVENT_TXN_STARTED;
+ cpuc->n_added -= cpuc->n_txn;
+ cpuc->n_events -= cpuc->n_txn;
}
/*
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e099650..ca79f2a 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -652,8 +652,11 @@ ...Ok, let me check. I think it is almost right. -- Stephane Eranian | EMEA Software Engineering Google France | 38 avenue de l'Opéra | 75002 Paris Tel : +33 (0) 1 42 68 53 00 This email may be confidential or privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it went to the wrong person. Thanks --
Ok, the patch look good expect it needs:
static int x86_pmu_commit_txn(const struct pmu *pmu)
{
......
/*
* copy new assignment, now we know it is possible
* will be used by hw_perf_enable()
*/
memcpy(cpuc->assign, assign, n*sizeof(int));
cpuc->n_txn = 0;
return 0;
}
Because you always call cancel_txn() even when commit()
succeeds. I don't really understand why. I think it could be
avoided by clearing the group_flag in commit_txn() if it
succeeds. It would also make the logical flow more natural. Why
cancel something that has succeeded. You cancel when you fail/abort.
--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks
--
Gah, I forgot about that. I think I suggested to Lin to do that and then promptly forgot. Let me add that and at least push this patch fwd, we can try and clean up that detail later on. --
How about something like the below?
---
Subject: perf: Cleanup {start,commit,cancel}_txn details
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue May 25 17:49:05 CEST 2010
Clarify some of the transactional group scheduling API details
and change it so that a successfull ->commit_txn also closes
the transaction.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
arch/powerpc/kernel/perf_event.c | 7 ++++---
arch/sparc/kernel/perf_event.c | 7 ++++---
arch/x86/kernel/cpu/perf_event.c | 14 +++++---------
include/linux/perf_event.h | 27 ++++++++++++++++++++++-----
kernel/perf_event.c | 8 +-------
5 files changed, 36 insertions(+), 27 deletions(-)
Index: linux-2.6/arch/powerpc/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_event.c
+++ linux-2.6/arch/powerpc/kernel/perf_event.c
@@ -754,7 +754,7 @@ static int power_pmu_enable(struct perf_
* skip the schedulability test here, it will be peformed
* at commit time(->commit_txn) as a whole
*/
- if (cpuhw->group_flag & PERF_EVENT_TXN_STARTED)
+ if (cpuhw->group_flag & PERF_EVENT_TXN)
goto nocheck;
if (check_excludes(cpuhw->event, cpuhw->flags, n0, 1))
@@ -858,7 +858,7 @@ void power_pmu_start_txn(const struct pm
{
struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
- cpuhw->group_flag |= PERF_EVENT_TXN_STARTED;
+ cpuhw->group_flag |= PERF_EVENT_TXN;
cpuhw->n_txn_start = cpuhw->n_events;
}
@@ -871,7 +871,7 @@ void power_pmu_cancel_txn(const struct p
{
struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
- cpuhw->group_flag &= ~PERF_EVENT_TXN_STARTED;
+ cpuhw->group_flag &= ~PERF_EVENT_TXN;
}
/*
@@ -897,6 +897,7 @@ int power_pmu_commit_txn(const struct pm
for (i = cpuhw->n_txn_start; i < n; ++i)
cpuhw->event[i]->hw.config = cpuhw->events[i];
+ cpuhw->group_flag &= ~PERF_EVENT_TXN;
...Can't we still get in the group_error: branch with either scenario? !txn will get there if ->enable() of a group sibling fails, txn will get there if either ->enable() or ->commit_txn() fails. (Although typically ->enable() would not fail for txn) --
You're right. We must keep it because of failure in the siblings' loop. --
Commit-ID: 8d2cacbbb8deadfae78aa16e4e1ee619bdd7019e Gitweb: http://git.kernel.org/tip/8d2cacbbb8deadfae78aa16e4e1ee619bdd7019e Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Tue, 25 May 2010 17:49:05 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Wed, 9 Jun 2010 11:12:34 +0200 perf: Cleanup {start,commit,cancel}_txn details Clarify some of the transactional group scheduling API details and change it so that a successfull ->commit_txn also closes the transaction. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1274803086.5882.1752.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/powerpc/kernel/perf_event.c | 7 ++++--- arch/sparc/kernel/perf_event.c | 7 ++++--- arch/x86/kernel/cpu/perf_event.c | 14 +++++--------- include/linux/perf_event.h | 27 ++++++++++++++++++++++----- kernel/perf_event.c | 9 +-------- 5 files changed, 36 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c index 43b83c3..ac2a8c2 100644 --- a/arch/powerpc/kernel/perf_event.c +++ b/arch/powerpc/kernel/perf_event.c @@ -754,7 +754,7 @@ static int power_pmu_enable(struct perf_event *event) * skip the schedulability test here, it will be peformed * at commit time(->commit_txn) as a whole */ - if (cpuhw->group_flag & PERF_EVENT_TXN_STARTED) + if (cpuhw->group_flag & PERF_EVENT_TXN) goto nocheck; if (check_excludes(cpuhw->event, cpuhw->flags, n0, 1)) @@ -858,7 +858,7 @@ void power_pmu_start_txn(const struct pmu *pmu) { struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); - cpuhw->group_flag |= PERF_EVENT_TXN_STARTED; + cpuhw->group_flag |= ...
cancel_txn() clears the transaction flag, so it is needed after both success and fail transaction, although the function name is a bit misleading. Peter's patch adds the clear of transaction flag into each implementation of ->commit_txn. So cancel_txn() is only called after fail transaction now. Thanks, --
Yes and I think it is less prone to confusion now. --
Commit-ID: 90151c35b19633e0cab5a6c80f1ba4a51e7c913b Gitweb: http://git.kernel.org/tip/90151c35b19633e0cab5a6c80f1ba4a51e7c913b Author: Stephane Eranian <eranian@google.com> AuthorDate: Tue, 25 May 2010 16:23:10 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Mon, 31 May 2010 08:46:10 +0200 perf_events: Fix event scheduling issues introduced by transactional API The transactional API patch between the generic and model-specific code introduced several important bugs with event scheduling, at least on X86. If you had pinned events, e.g., watchdog, and were over-committing the PMU, you would get bogus counts. The bug was showing up on Intel CPU because events would move around more often that on AMD. But the problem also existed on AMD, though harder to expose. The issues were: - group_sched_in() was missing a cancel_txn() in the error path - cpuc->n_added was not properly maintained, leading to missing actions in hw_perf_enable(), i.e., n_running being 0. You cannot update n_added until you know the transaction has succeeded. In case of failed transaction n_added was not adjusted back. - in case of failed transactions, event_sched_out() was called and eventually invoked x86_disable_event() to touch the HW reg. But with transactions, on X86, event_sched_in() does not touch HW registers, it simply collects events into a list. Thus, you could end up calling x86_disable_event() on a counter which did not correspond to the current event when idx != -1. The patch modifies the generic and X86 code to avoid all those problems. First, we keep track of the number of events added last. In case the transaction fails, we substract them from n_added. This approach is necessary (as opposed to delaying updates to n_added) because not all event updates use the transaction API, e.g., single events. Second, we encapsulate the event_sched_in() and event_sched_out() in group_sched_in() inside the transaction. That makes the ...
