[PATCH v2 5/11] X86 details for user space breakpoint assistance.

From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:51 am

Uprobes Patches

Changelog:
 - Added trace_event interface for uprobes.
 - Addressed comments from Andrew Morton and Randy Dunlap.

For the previous posting, please refer to http://lkml.org/lkml/2010/3/20/107

This patchset implements Uprobes which enables you to dynamically break
into any routine in a user space application and collect information
non-disruptively.

This patchset is a rework based on suggestions from discussions on lkml
in January this year (http://lkml.org/lkml/2010/1/11/92 and
http://lkml.org/lkml/2010/1/27/19).  This implementation of uprobes
doesn't depend on utrace.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction
with a breakpoint instruction. (Uprobes uses a background page
replacement mechanism and ensures that the breakpoint affects only that
process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of
the trap and finds the associated uprobe. It then executes the
associated handler. Uprobes single-steps its copy of the probed
instruction and resumes execution of the probed process at the
instruction following the probepoint. Instruction copies to be
single-stepped are stored in a per-process "execution out of line (XOL)
area". Currently the XOL area is allocated as a one-page vma.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace based mechanisms, uprobes tracing does not
involve signals, stopping threads or context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace based mechanisms use single stepping inline, i.e. they
copy back the original instruction on hitting a breakpoint.  In such
mechanisms tracers have to stop all the threads on a breakpoint hit or
tracers will not be able to handle all hits to the location of
interest. Uprobes uses execution out of line, where the instruction to
be traced is analysed at the time of breakpoint ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:51 am

Move Macro W to asm/insn.h

Macro W is used to know whether instructions are valid for
user space/kernel space.  This macro is used by both kprobes and
user_bkpt (i.e. the user space breakpoint assistance layer), so move it
to a common header file, asm/insn.h.

TODO: replace macro W with bits in inat table.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/include/asm/insn.h |    7 +++++++
 arch/x86/kernel/kprobes.c   |    7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)


diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 96c2e0a..8586820 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -23,6 +23,13 @@
 /* insn_attr_t is defined in inat.h */
 #include <asm/inat.h>
 
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
 struct insn_field {
 	union {
 		insn_value_t value;
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b43bbae..4379b40 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -66,12 +66,6 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
 #define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
 
-#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
-	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
-	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
-	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
-	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
-	 << (row % 32))
 	/*
 	 * Undefined/reserved opcodes, conditional jump, Opcode Extension
 	 * ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:51 am

Move replace_page() to mm/memory.c

Move replace_page from mm/ksm.c to mm/memory.c.
User bkpt will use background page replacement approach to insert/delete
breakpoints. Background page replacement approach will be based on
replace_page.  Now replace_page() loses its static attribute.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 include/linux/mm.h |    2 ++
 mm/ksm.c           |   59 ----------------------------------------------------
 mm/memory.c        |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 59 deletions(-)


diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..0f43355 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -854,6 +854,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+		struct page *kpage, pte_t orig_pte);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..83fb4fb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -766,65 +766,6 @@ out:
 	return err;
 }
 
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma:      vma that holds the pte pointing to page
- * @page:     the page we are replacing by kpage
- * @kpage:    the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
-			struct page *kpage, pte_t orig_pte)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *ptep;
-	spinlock_t *ptl;
-	unsigned long addr;
-	int err = -EFAULT;
-
-	addr ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:51 am

Enhance replace_page() to support pagecache

Currently replace_page() works only for anonymous pages.
This patch enhances replace_page() to work for pagecache pages.

This enhancement is useful for user_bkpt's replace_page based
background page replacement for insertion and removal of breakpoints.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 mm/memory.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)


diff --git a/mm/memory.c b/mm/memory.c
index 66a3632..5acf686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2617,7 +2617,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	if (PageAnon(kpage))
+		page_add_anon_rmap(kpage, vma, addr);
+	else
+		page_add_file_rmap(kpage);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
--

From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:51 am

User Space Breakpoint Assistance Layer (USER_BKPT)

Changelog:
	Use k(un)map_atomic instead of k(un)map.
	Remove BUG_ON.
	Few parameter changes to be more consistent with sparse.
	Added kernel-doc comments wherever necessary.
	Introduce a check to detect if post_xol can sleep.

Currently there is no mechanism in the kernel to insert/remove
breakpoints.

This patch implements the user space breakpoint assistance layer, which
provides kernel subsystems with an architecture independent interface to
establish breakpoints in user applications. This patch provides the core
implementation of user_bkpt and also wrappers for architecture dependent
methods.

USER_BKPT currently supports both single stepping inline and execution
out of line strategies. Two different probepoints in the same process
can have two different strategies. It handles pre-processing and
post-processing of singlestep after a breakpoint hit.

Single stepping inline strategy is the traditional method where original
instructions replace the breakpointed instructions on a breakpoint hit.
This method works well with single threaded applications. However, it is
racy with multithreaded applications.

Execution out of line strategy single steps on a copy of the
instruction. This method works well for both single-threaded and
multithreaded applications.

There could be other strategies like emulating an instruction. However
they are currently not implemented.

Insertion and removal of breakpoints is by "Background page
replacement", i.e. make a copy of the page, modify its contents, set
the pagetable and flush the TLBs. This uses the enhanced replace_page()
to COW the page. The modified page is only reflected for the interested
process. Others sharing the page will still see the old copy.

You need to follow this up with the USER_BKPT patch for your
architecture.

Uprobes uses this facility to insert/remove breakpoints.

TODO: Merge user_bkpt layer with uprobes.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

x86 support for user breakpoint Infrastructure

Changelog:
	set USER_BKPT_FIX_SLEEPY if post_xol might sleep.

This patch provides x86 specific userspace breakpoint assistance
implementation details. It uses the "x86: instruction decoder API" patch
to validate and analyze the instructions. This analysis is used at
the time of post-processing of a breakpoint hit to do the necessary
fix-ups.

Almost all instructions are handled for the traditional strategy and
the execution out of line strategy. Instructions handled include the
RIP relative instructions.

This patch requires "x86: instruction decoder API" patch.
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/Kconfig                 |    1 
 arch/x86/include/asm/user_bkpt.h |   43 +++
 arch/x86/kernel/Makefile         |    2 
 arch/x86/kernel/user_bkpt.c      |  572 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 618 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/user_bkpt.h
 create mode 100644 arch/x86/kernel/user_bkpt.c


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..851cedc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,6 +53,7 @@ config X86
 	select HAVE_KERNEL_LZMA
 	select HAVE_KERNEL_LZO
 	select HAVE_HW_BREAKPOINT
+	select HAVE_USER_BKPT
 	select PERF_EVENTS
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/include/asm/user_bkpt.h b/arch/x86/include/asm/user_bkpt.h
new file mode 100644
index 0000000..df8a4a0
--- /dev/null
+++ b/arch/x86/include/asm/user_bkpt.h
@@ -0,0 +1,43 @@
+#ifndef _ASM_USER_BKPT_H
+#define _ASM_USER_BKPT_H
+/*
+ * User-space BreakPoint support (user_bkpt) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

Uprobes Implementation

Changelog:
	- If fixup might sleep, then do the post singlestep
	   processing in task context.

The uprobes infrastructure enables a user to dynamically establish
probepoints in user applications and collect information by executing a
handler function when a probepoint is hit.

The user specifies the virtual address and the pid of the process of
interest along with the action to be performed (handler). The handle
Uprobes is implemented on the user-space breakpoint assistance layer
and uses the execution out of line strategy. Uprobes follows lazy slot
allocation, i.e. on the first probe hit for that process, a new vma (to
hold the probed instructions for execution out of line) is allocated.
Once allocated, this vma remains for the life of the process, and is
reused as needed for subsequent probes.  A slot in the vma is allocated
for a probepoint when it is first hit.

A slot is marked for reuse when the probe gets unregistered and no
threads are using that slot.

In a multithreaded process, a probepoint once registered is active for
all threads of a process. If a thread specific action for a probepoint
is required then the handler should be implemented to do the same.

If a breakpoint already exists at a particular address (irrespective of
who inserted the breakpoint including uprobes), uprobes will refuse to
register any more probes at that address.

You need to follow this up with the uprobes patch for your
architecture.

For more information, please refer to Documentation/uprobes.txt

TODO:
1. Perf/trace events interface for uprobes.
2. Allow multiple probes at a probepoint.
3. Booster probes.
4. Allow probes to be inherited across fork.
5. probing function returns.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---

 arch/Kconfig              |   13 +
 include/linux/sched.h     |    3 
 include/linux/tracehook.h |   18 +
 include/linux/uprobes.h   |  187 ++++++++++
 ...
From: Oleg Nesterov
Date: Tuesday, April 13, 2010 - 11:35 am

Looks like, this doesn't need get/put task_struct, you could just

This doesn't look right. We can't trust ->thread_group list even under
rcu_read_lock(). The task can exit and __exit_signal() can remove it

This is called by create_uprocess(). Who will free t->utask if t has

not sure I understand this check. Somehow we should prevent the races
with tracehook_report_exit/tracehook_report_exec, but PF_EXITING can't


can't we race with clone(CLONE_THREAD) and miss the new thread? Probably
I missed something, but afaics we need some barriers to ensure that either
tracehook_report_clone() sees current->utask != NULL or find_next_thread()



OK, uproc and p can't go away. But why it is safe to use pid_task(p) ?

I am looking at 6th patch http://marc.info/?l=linux-kernel&m=127005086102256
and xol_validate_vaddr() calls pid_task() without rcu and doesn't check
the result is not NULL.


This looks a bit strange. Why do we need "ctask" at all? It is not used,
you could just do

	if (likely(!child->utask))
		add_utask(child, uproc);


OK, iiuc this should restore the original instruction, right?

But what about clone(CLONE_VM)? In this case this child shares ->mm with
parent.

Oleg.

--

From: Srikar Dronamraju
Date: Thursday, April 15, 2010 - 2:35 am

Oh Okay, I get that the thread could be exiting from the time we
allocated the utask to the time we are cleaning up here and hence we
could be leaking utask.

Would it be okay if we explicitly (instead of the using
tracehook_report_exit) call uprobe_free_utask() just after we set
PF_EXITING. We could take the task_lock to synchronize with the

Okay. 
Would it work if we 

static struct uprobe_task *add_utask(struct task_struct *t,
					struct uprobe_process *uproc)
{
	struct uprobe_task *utask;

	if (!t)
		return NULL;

	utask = kzalloc(sizeof *utask, GFP_USER);
	if (unlikely(utask == NULL))
		return ERR_PTR(-ENOMEM);

	task_lock(t);
	if (unlikely(t->flags & PF_EXITING)) {
		/* The task is exiting; don't attach a utask to it. */
		task_unlock(t);
		kfree(utask);
		return NULL;
	}
	t->utask = utask;
	task_unlock(t);
	utask->uproc = uproc;
	utask->active_ppt = NULL;
	atomic_inc(&uproc->refcount);
	return utask;
}


NORET_TYPE void do_exit(long code)
{
...
	exit_irq_thread();

	exit_signals(tsk);  /* sets PF_EXITING */
	task_lock(tsk);
	if (tsk->utask)
		uprobe_free_utask();
	task_unlock(tsk);

	...
}


Yeah, PF_EXITING doesn't seem to help us here. But will this be an issue once we

The tracehook_report_clone is called after the element gets added to the
thread_group list in copy_process().
Looking at three cases where current thread could be cloning a new thread.

a) current thread has a utask and tracehook_report_clone is not yet
called.  
	utask for the new thread will be created by either
tracehook_report_clone or the find_next_thread whichever comes first. 

b) current thread has no utask and tracehook_report_clone is already called.
	new thread is in the thread_group list; so a ...
From: Oleg Nesterov
Date: Monday, April 19, 2010 - 12:31 pm

Srikar, sorry for delay...


Not sure I understand....

I meant, it is not safe to use next_thread(tsk) if tsk was already


Yes, but my point was, we probably need mb's on both sides. Of course,
this is only theoretical problem, but tracehook_report_clone() can read
current->utask == NULL before the result of copy_process()->list_add_tail()


Oh, I don't know. You are going to change this code anyway, I can't see
in advance.


I tried to read the next 8/11 patch, and I have a couple more random questions.

	- uprobe_process->tg_leader is not really used ?

	- looks like, 7/11 can't be compiled without the next 8/11 ?
	  say, the next patch defines arch_uprobe_disable_sstep() but
	  it is used by 7/11

	- I don't understand why do we need uprobe_{en,dis}able_interrupts
	  helpers. pre_ssout() could just do local_irq_enable(). This path
	  leads to get_signal_to_deliver() which enables irqs anyway, it is
	  always safe to do this earlier and I don't think we need to disable
	  irqs again later. In any case, I don't understand why these helpers
	  use native_irq_xxx().

	- pre_ssout() does .xol_vaddr = xol_get_insn_slot(). This looks a
	  bit confusing, xol_get_insn_slot() should set .xol_vaddr correctly
	  under lock.

	- pre_ssout() does user_bkpt_set_ip() after user_bkpt_pre_sstep().
	  Why? Shouldn't user_bkpt_pre_sstep() always set regs->ip ?
	  Otherwise uprobe_bkpt_notifier()->user_bkpt_pre_sstep() is not
	  right.

	- I don't really understand why ->handler_in_interrupt is really
	  useful, but never mind.

	- However, handler_in_interrupt && !uses_xol_strategy() doesn't
	  look right. uprobe_bkpt_notifier() is called with irqs disabled,

Yes, I think CLONE_VM without CLONE_THREAD needs utask too, but do we need
the new uproc? OK, please forget about this for the moment.

Suppose that register_uprobe() succeeds and does set_bkpt(). What if another
process (not sub-thread) with the same ->mm hits this bp? uprobe_bkpt_notifier()
will see ->utask == NULL ...
From: Srikar Dronamraju
Date: Tuesday, April 20, 2010 - 5:43 am

Okay, cleanup_process() gets called if and only if add_utask() fails
to allocate the utask struct.  Based on your inputs I will synchronize
exit_signals() and uprobe_free_utask(). However it still can happen that
uprobe calls cleanup_uprocess() with a reference to a task struct which
has just called __unhash_process(). Is there a way out of this?



Can you please let me know when nsproxy is set to NULL? If we are sure
that register/unregister will be called with nsproxy set, then I am


Currently we have a reference to the pid struct from the time we create
a uprobe_process to the time we free the uprobe process.  So are you
suggesting that we don't have a reference to the pid structure or is it
that we don't need to cache the pid struct and access it thro


On i686 (unlike x86_64), do_notify_resume() gets called with irqs
disabled.  I had tried local_irq_enable a couple of times but that
didn't help, probably because CONFIG_PARAVIRT is set in my .config and
hence raw_local_irq_enable resolves to

static inline void raw_local_irq_enable(void)
{
	PVOP_VCALLEE0(pv_irq_ops.irq_enable);
}

What we need is the "sti" instruction.  It looks like local_irq_enable
actually doesn't do "sti".  So I had to go back to using
native_irq_enable().

Do you have any ideas on how to force local_irq_enable to do a "sti"?
Or am I missing something?

Since I wasn't sure why do_notify_resume() was called under irqs_disabled
only for x86, I disabled irqs again just to be sure that I am not


Right, user_bkpt_set_ip is redundant as user_bkpt_pre_sstep sets
	
Uprobes can run handlers either in interrupt context or in task context.
If the user is sure that his handler is not going to sleep, then he can
set handler_in_interrupt flag while registering the probe.

There is a small overhead when running the handlers in task context.
Here is a brief benchmark on an x86_64 machine.


========================================================================
Results when running a kernel without lockdep and other ...
From: Oleg Nesterov
Date: Tuesday, April 20, 2010 - 8:30 am

Yes, but afaics we have the same issues in find_next_thread() called

In this particular case, probably we can rely on uprobe_mutex. Currently
cleanup_uprocess() is called with start == cur_t. Instead, we should use
the last task on which add_utask() succeeded, it can't exit (assuming we
fix other discussed races with exit) because uprobe_free_utask() takes
this mutex too.

However, perhaps it is better to rework this all. Say, can't we move
uprobe_free_utask() into __put_task_struct() ? Afaics, this can greatly
simplify things. If we add mm_struct->uproc, then utask doesn't need




I must have missed something. But I do not see where do we use
uprobe_process->tg_leader. We never read it, apart from

Hmm. No, I can't explain this, I know nothing about paravirt. But this
doesn't look right to me. Probably this should be discussed with paravirt

pre_ssout() does

	if (!user_bkpt.xol_vaddr)
		user_bkpt.xol_vaddr = xol_get_insn_slot();

but it could just do

	if (!user_bkpt.xol_vaddr)
		xol_get_insn_slot();

because xol_get_insn_slot() populates user_bkpt.xol_vaddr.

Btw. Why do we have the !CONFIG_USER_BKPT_XOL code in
include/linux/user_bkpt_xol.h? CONFIG_UPROBES depends on CONFIG_USER_BKPT_XOL.

Also the declarations don't look nice... Probably I missed something,
but why the code uses "void *" instead of "user_bkpt_xol_area *" for
xol_area everywhere?

OK, even if "void *" makes sense for uproc->uprobe_process, why


this overhead looks very minor. To me, it is better to simplify the
code, at least in the first version.

That said, this is up to you, I am not asking you to remove this


Yes, I was thinking about mm_struct->uproc too.


Well, we could add the list of uprobe_task's into uprobe_process, it
represents the tasks "inside" the probe hit. But yes, this is not easy,



Agreed. Although we need the new TIF_ bit for tracehook_notify_resume(),
it can't trust "if (current->utask...)" checks.


Alternatively, without the "on demand" ...
From: Srikar Dronamraju
Date: Tuesday, April 20, 2010 - 11:59 pm

Okay. I will use mm_struct->uproc, dynamic allocation of utask on probe


static int free_uprocess(struct uprobe_process *uproc)
{
	....
	put_pid(uproc->tg_leader);
        uproc->tg_leader = NULL;
	




user_bkpt_xol_area isn't exposed. This provides flexibility in changing
the algorithm for more efficient slot allocation. Currently we allocate
slots from just one page. Later on we could end-up having to allocate
from more than contiguous pages. There was some discussion about
allocating slots from TLS.  So there is more than one reason that
user_bkpt_xol can change. We could expose the struct and not access the





But do we need a new TIF bit? Can we just reuse the TIF_NOTIFY_RESUME

Okay I will try the on demand allocations in the next iteration.

Thanks again for your detailed explanations and suggestions.

--
Thanks and Regards
Srikar

--

From: Oleg Nesterov
Date: Wednesday, April 21, 2010 - 9:05 am

Yes, yes, I see it does get/put pid. But where do we actually use
                                         ^^^^^^^^^^^^^^^^^^^^^

Still can't understand... Yes, we shouldn't expose the details, but we
can just add "struct user_bkpt_xol_area;" into include file.


Probably not... But somehow tracehook_notify_resume/uprobe_notify_resume
should know we hit the bp and we need to allocate utask. Yes,
tracehook_notify_resume() can always call uprobe_notify_resume()
unconditionally, and uprobe_notify_resume() can notice the
"find_probept() && !current->utask" case, but probably it is better to
make this more explicit. And of course, the new bit should be set along
with TIF_NOTIFY_RESUME.

Or. Instead of TIF_ bit, we can use something like

	#define UTASK_PLEASE_ALLOCATE_ME ((struct uprobe_task *)1)

uprobe_bkpt_notifier() sets current->utask = UTASK_PLEASE_ALLOCATE_ME,
then tracehook_notify_resume/uprobe_notify_resume check this case.

I dunno, please do what you think right.


OK, the last questions:

1. Can't multiple write_opcode()'s race with each other?

   Say, pre_ssout() calls remove_bkpt() locklessly. Can't it race
   with register_uprobe() which may write to the same page?

   And, without uses_xol_strategy() there are more racy callers
   of write_opcode()... Probably something else.

2. Can't write_opcode() conflict with ksm doing replace_page() ?

3. mprotect(). write_opcode() checks !VM_WRITE. This is correct,
   otherwise we can race with the user-space writing to the same
   page.

   But suppose that the application does mprotect(PROT_WRITE) after
   register_uprobe() installs the bp, now unregister_uprobe/etc can't
   restore the original insn?

4. mremap(). What if the application does mremap() and moves the
   memory? After that vaddr of user_bkpt/uprobe no longer matches
   the virtual address of bp. This breaks uprobe_bkpt_notifier(),
   unregister_uprobe(), etc.

   Even worse. Say, unregister_uprobe() calls remove_bkpt().
   mremap()+mmap() ...
From: Srikar Dronamraju
Date: Thursday, April 22, 2010 - 6:31 am

uproc->tg_leader was used to validate looked up uproc belongs to the
process.  It was used to check if the uproc belonged to the process for
which we are currently trying to register/unregister uprobes.

Since we want to share the uproc with process that share the same mm, I

Okay, I will add the forward declaration in the include file and update

Yeah, since tracehook_notify_resume() is in the fast path, it's worth adding

All callers of write_opcode() should have taken uproc->mutex.
If there are other users of write_opcode, we will have to add a way to

That's a bug, I will fix it. remove_bkpt() clearly says it needs to be


I don't think so.
If uprobes runs on hosts, it would be calling replace_page() on text
pages. KSM for now works on anonymous pages. Even the replaced page we
add still belongs to the text VMA.

If uprobes runs on guest, KSM should be taking care of cases where

I still need to verify this. I shall get back to you on this.

I don't think we handle this case now. I think even munmap of the region
where there are probes inserted can have the same problem.

Are there ways to handle this?
I think taking a write lock on mmap_sem instead of the read lock could
handle this problem.

I am copying Mel Gorman and Andrea Arcangeli so that they can provide
their inputs on VM and KSM related issues.

--
Thanks and Regards
--

From: Oleg Nesterov
Date: Thursday, April 22, 2010 - 8:40 am

Well, I think the kernel should assume that the user-space can do
anything.

Hmm. And if this vma is VM_SHARED, then this bp could be actually
written to vm_file after mprotect().

But I think this doesn't really matter. When I actually look at

Yes. We need vm experts here, I am not. Still, I'd like to share my
concerns. I also added Rik and Hugh.


So, 3/11 does

	@@ -2617,7 +2617,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
		}
	 
		get_page(kpage);
	-	page_add_anon_rmap(kpage, vma, addr);
	+	if (PageAnon(kpage))
	+		page_add_anon_rmap(kpage, vma, addr);
	+	else
	+		page_add_file_rmap(kpage);
	 
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);

I see no point in this patch, please see below.

The next 4/11 patch introduces write_opcode() which roughly does:

	int write_opcode(unsigned long vaddr, user_bkpt_opcode_t opcode)
	{
		get_user_pages(write => false, &old_page);

		new_page = alloc_page_vma(...);

		... insert the bp into the new_page ...

		new_page->mapping = old_page->mapping;
		new_page->index = old_page->index;

		replace_page(old_page, new_page);
	}

This doesn't look right at all to me.

	IF PageAnon(old_page):

		in this case replace_page() calls page_add_anon_rmap() which
		needs the locked page.

	ELSE:

		I don't think the new page should ever preserve the mapping,
		this looks just wrong. It should be always anonymous.


And in fact, I do not understand why write_opcode() needs replace_page().
It could just use get_user_pages(FOLL_WRITE | FOLL_FORCE), no? It should
create the anonymous page correctly.

Either way, I think register_uprobe() should disallow the probes in
VM_SHARED/VM_MAYWRITE vmas.

Oleg.

--

From: Srikar Dronamraju
Date: Friday, April 23, 2010 - 7:58 am

When I look through the load_.*_binary and load_.*_library functions,
they seem to map the text regions MAP_PRIVATE|MAP_DENY_WRITE. (Few
exceptions like load_som_binary that seem to map text regions with
MAP_PRIVATE only).

Also if vmas are marked VM_SHARED and bps are inserted through ptrace,
i.e. (access_process_vm/get_user_pages), then we would still be writing
to vm_file after mprotect?


I did verify that page_add_file_rmap gets called from replace_page when
we insert or remove a probe.
This should be because uprobes doesn't do an anon_vma_prepare() before
the alloc_page_vma().

We were earlier doing access_process_vm that would in turn call
get_user_pages to COW the page. However that required the threads of
the target process to be stopped.

In the access_process_vm method:
1. we get a copy of the page,
2. flush the TLBs,
3. modify the page.

The concern was if the threads were executing in the vicinity.
Hence we were stopping all threads while inserting/deleting breakpoints.


Background page replacement was suggested by Linus and Peter.
In this method:
1. we get a copy of the page,
2. modify the page,
3. flush the TLBs.

This method is supposed to be atomic enough that we don't need to stop the

Yes, we certainly could add that check. 

--
Thanks and Regards
Srikar
--

From: Oleg Nesterov
Date: Friday, April 23, 2010 - 11:53 am

Again, I didn't mean they should. But they can.

Not only VM_SHARED, the application can create the anonymous PROT_EXEC region,


Of course! but see above, PageAnon() case is possible too. I think the
code should handle this case correctly anyway, but it seems it doesn't.
Not only page_add_anon_rmap() needs the locked page, I am not sure
page_add_anon_rmap() is fine for write_opcode() which allocates the new
page. LRU? SetPageSwapBacked?

And you seem to miss my point. I think page_add_file_rmap() is always wrong.
I mean, no matter what is the page_mapping(old_page), the new page should be



OK.

I must admit, I don't understand the usage of the lockless get_pte() in
write_opcode(). replace_page() checks orig_pte, yes. But how this check
can help write_opcode and why it is needed? I do not think it can prevent
any race, pte can be changed even before write_opcode() calls get_pte().
I guess this is only done because replace_page() requires this argument?

Oleg.

--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 1:47 pm

Well, they typically are not, I could imagine some JITs maybe using it,
but those would most probably be shared anonymous.

--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 1:44 pm

I don't think we should allow breakpoints on VM_SHARED maps.

--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 1:45 pm

VM_SHARED, fully agreed, MAYWRITE not so sure, MAP_PRIVATE has MAYWRITE
iirc and it's perfectly fine to poke at those.

--

From: Srikar Dronamraju
Date: Wednesday, May 12, 2010 - 3:31 am

Okay, I will put a check to disallow probes in VM_SHARED vmas.

--
Thanks and Regards
Srikar
--

From: Oleg Nesterov
Date: Thursday, May 13, 2010 - 12:40 pm

Yes, sorry for confusion. Not sure where this VM_MAYWRITE came from.

But I still think this doesn't actually matter, replace_page() shouldn't
preserve the mapping, it should always install the anonymous page. I can
be wrong, of course.

(I didn't read the next version yet)

Oleg.

--

From: Linus Torvalds
Date: Thursday, May 13, 2010 - 12:59 pm

Well, if I read the patches right, uprobes will use "copy_to_user()" for 
the self-probing case. So that would definitely just modify a shared 
mapping.

Of course, arguably, who really cares? As long as it's not a security 
issue (and it isn't - since the person could just have written to the 
thing directly instead), I guess it doesn't much matter. But it's a bit 
sad when a probing feature either

 - changes a global mapping that may be executed by other non-related 
   processes that the prober isn't even _aware_ of.

 - changes semantics by creating a non-coherent private page

so arguably it would be good to just make the rule be that you cannot 
probe a shared mapping. Because whatever you do, it's always the wrong 
thing.

			Linus
--

From: Andi Kleen
Date: Thursday, May 13, 2010 - 3:12 pm

But isn't text usually shared?  I don't see how you could set any
break points or jump probes on text pages with that restriction.

BTW there were old patches for NUMA text duplication, maybe they could
be resurrected for that too.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Linus Torvalds
Date: Thursday, May 13, 2010 - 3:25 pm

Text is usually private, and read-only. Not generally MAP_SHARED. The 
pages end up getting shared because nobody writes to them, but that's 
almost accidental.

If you write to them, you get a nice clean COW fault, and you are 
_supposed_ to get a nice clean COW fault. It's not changing any semantics: 
the write is not visible to outside users, and those "get a private page" 
semantics were what the mmap() was all about.

In contrast, if it's a MAP_SHARED mapping and writable, the write would 
actually be _visible_ outside the process. And that's clearly totally 
wrong on all levels. Tracing a process should _never_ cause visible damage 
outside that process (you'd hope it wouldn't be all that visible to the 
tracee either, but that's still secondary).

The alternative, ie a MAP_SHARED but read-only mapping (which looks very 
much like a private mapping) if you use get_user_pages(.force=1), the 
kernel will actually end up forcing a COW break, because making the write 
visible outside would be a security issue (you don't even have the right 
to write to the thing).

Notice how the MAP_SHARED case - writable or not - ends up doing the wrong 
thing. Arguably it does the _even_worse_ thing in the writable case, but 
in either case it's not good. 
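
The difference between the two writable cases can be seen from user space. The following is a minimal sketch for a POSIX system: it writes one byte through a MAP_PRIVATE and through a MAP_SHARED mapping of a scratch file and reads the file back. The function name and the /tmp path are illustrative assumptions, not anything from the patches.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write 'B' through a mapping (MAP_PRIVATE or MAP_SHARED) of a scratch
 * file that initially holds 'A' bytes, then return the first byte of the
 * file as read back with pread(). Returns -1 on any error.
 */
int first_file_byte_after_write(int map_flags)
{
	char path[] = "/tmp/cow_demo_XXXXXX";
	char buf[4] = { 'A', 'A', 'A', 'A' };
	char back = 0;
	int result = -1;
	char *map;
	int fd;

	assert(map_flags == MAP_PRIVATE || map_flags == MAP_SHARED);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives only as long as fd */
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto out;

	map = mmap(NULL, sizeof(buf), PROT_READ | PROT_WRITE, map_flags, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	map[0] = 'B';		/* private: COW fault; shared: hits the file */
	msync(map, sizeof(buf), MS_SYNC);
	munmap(map, sizeof(buf));

	if (pread(fd, &back, 1, 0) == 1)
		result = back;
out:
	close(fd);
	return result;
}
```

With MAP_PRIVATE the file still reads 'A' afterwards; with MAP_SHARED it reads 'B', which is the "visible outside the process" effect described above.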

			Linus
--

From: Roland McGrath
Date: Thursday, May 13, 2010 - 5:56 pm

Agreed.  Or, if you do, it's doing something entirely different and should
be in an interface where you're explicitly attaching it generically to the
file (what's being shared) without regard to any individual process.  But,
as you mentioned, shared, executable mappings are well outside the normal
case and there is no reason to think that a first (or fourth) version of
anything needs to support them at all.


Thanks,
Roland
--

From: Srikar Dronamraju
Date: Thursday, May 13, 2010 - 10:42 pm

Uprobes uses copy_to_user() to write data/stack and never to write to
instruction addresses.

To write an instruction uprobes either used access_process_vm or the
replace_page() based background page replacement method. This is true
even if the process was probing itself.

Soon to be posted v4 will revert to background page replacement method


Yes, I will be adding a check to discard probing if the vma has
VM_SHARED flag set. I have already committed to Oleg on this issue.
I didn't include this check in v3 patchset, because uprobes was using
access_process_vm in v3 patchset and I thought access_process_vm would
do the right thing even if VM_SHARED is set.
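
A check of this kind could look roughly like the sketch below. It is not code from any posted version of the patchset: the helper name is hypothetical, the flag values are assumed to mirror include/linux/mm.h and are repeated only so the snippet builds standalone, and requiring VM_EXEC is an extra assumption based on the "MAP_PRIVATE|PROT_EXEC file maps" remark elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's vm_flags bits; illustrative only. */
#define DEMO_VM_EXEC		0x00000004UL
#define DEMO_VM_SHARED		0x00000008UL
#define DEMO_VM_MAYWRITE	0x00000020UL

static_assert((DEMO_VM_EXEC & DEMO_VM_SHARED) == 0, "flags must be distinct");

/*
 * Hypothetical helper: refuse to probe shared mappings, allow private
 * executable ones. VM_MAYWRITE is deliberately not tested, since
 * MAP_PRIVATE mappings carry it and are fine to probe.
 */
bool demo_vma_probeable(unsigned long vm_flags)
{
	if (vm_flags & DEMO_VM_SHARED)
		return false;
	return (vm_flags & DEMO_VM_EXEC) != 0;
}
```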

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 1:43 pm

KSM only does anonymous pages, and I thought uprobes was limited to
MAP_PRIVATE|PROT_EXEC file maps.

We can't hold mmap_sem (for either read or write -- read would be
sufficient to serialize against mmap/mremap/munmap) from atomic uprobe
context, what we can do is validate that there is an INT3 on that
particular address, a mremap/munmap/munmap+mmap will either end not
having a pte entry for the address, or not have the INT3.
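
The validation described here boils down to two outcomes, which a user-space sketch can model. The names are hypothetical and a NULL buffer is used as a stand-in for "no pte entry for the address"; real code would have to read the tracee's memory without sleeping.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_INT3 0xCC	/* x86 breakpoint opcode */

enum demo_bkpt_state {
	DEMO_BKPT_PRESENT,	/* the INT3 is still there: safe to proceed */
	DEMO_BKPT_GONE,		/* unmapped or remapped: give up on this hit */
};

/*
 * 'text' models the mapped bytes of the probed region, or NULL when the
 * address is no longer mapped. 'off' is the probe offset within it.
 */
enum demo_bkpt_state demo_check_bkpt(const unsigned char *text, size_t len,
				     size_t off)
{
	if (!text || off >= len)
		return DEMO_BKPT_GONE;		/* no pte entry */
	if (text[off] != DEMO_INT3)
		return DEMO_BKPT_GONE;		/* mapping changed under us */
	return DEMO_BKPT_PRESENT;
}
```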

That said, you shouldn't be executing code on maps you're changing, much
fun can happen if you try, so I don't think we should expend too much
effort as long as the race will only result in the app crashing and not
the kernel.

--

From: Srikar Dronamraju
Date: Wednesday, May 12, 2010 - 3:41 am

Did you mean "We can hold mmap_sem?" Else I am not sure if we can
traverse the vma. In fact, alloc_page_vma() needs mmap_sem to be acquired.

Okay.

--
Thanks and Regards
Srikar
--

From: Peter Zijlstra
Date: Wednesday, May 12, 2010 - 4:12 am

OK, so maybe I misunderstood, this is from the INT3 trap handler, right?

We can _not_ take a sleeping lock from trap context. Why would you want
the vma anyway?

--

From: Srikar Dronamraju
Date: Wednesday, May 12, 2010 - 7:24 am

If I am right, the initial comment was both from the unregister_uprobe()
-> write_opcode() context  and uprobe_bkpt_notifier context.

[ snipping relevant part of Oleg's mail from where the conversation started ]
---------------------------------------------------------------------

But yes, if the mmap/mremap/munmap can happen between validating the
int3 and removal of the breakpoint in the unregister_uprobe path, then
it can as well happen between the breakpoint hit and the time uprobes
does the fixups to continue execution after running the handler and
single-stepping. 

I agree with you that we shouldn't bother about mmap/mremap/munmap of the

Yes, we don't look at the vma in trap context at all. If we need to allocate a
slot in the xol_vma then we set TIF_UPROBE and do the work in task
context.

--
Thanks and Regards
--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 1:32 pm

Right so what I've suggested several times is to simply call the same
handler in both contexts. If it returns -EFAULT, set TIF_UPROBE or
whatever and try again from task context.
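
The control flow of that suggestion can be modelled in plain C. This is a sketch under stated assumptions, not the uprobes implementation: the struct, the may_sleep parameter and all demo_* names are invented for illustration, and tif_uprobe is a plain bool standing in for the thread flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_task {
	bool tif_uprobe;	/* stand-in for the TIF_UPROBE thread flag */
	bool page_resident;	/* is the data the handler needs in memory? */
	int hits;		/* result the handler commits on success */
};

/* One handler for both contexts; may_sleep == false models trap context. */
int demo_handler(struct demo_task *t, bool may_sleep)
{
	if (!t->page_resident) {
		if (!may_sleep)
			return -EFAULT;		/* cannot fault the page in here */
		t->page_resident = true;	/* task context: fault it in */
	}
	t->hits++;	/* commit only once nothing can fail any more */
	return 0;
}

/* Trap context: try the handler, defer to task context on -EFAULT. */
void demo_trap(struct demo_task *t)
{
	if (demo_handler(t, false) == -EFAULT)
		t->tif_uprobe = true;
}

/* Task context, on the way back to user space: retry if deferred. */
void demo_return_to_user(struct demo_task *t)
{
	if (t->tif_uprobe) {
		int ret;

		t->tif_uprobe = false;
		ret = demo_handler(t, true);
		assert(ret == 0);
		(void)ret;
	}
}
```

Note that the handler in this sketch touches its result only after the step that can fail, so a retry never sees partial state.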



--

From: Frank Ch. Eigler
Date: Tuesday, May 11, 2010 - 1:57 pm

Hi -


That could work, but random partial execution & restart of the handler
will make it tricky to write a single handler that reliably produces
results.  It would likely need a flag to indicate that it failed
previously so as to throw away partial results.

- FChE
--

From: Peter Zijlstra
Date: Tuesday, May 11, 2010 - 2:01 pm

Or it shouldn't leave half-assed state around to begin with.

--

From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

X86 support for Uprobes

This patch provides x86 specific details for uprobes.
This includes interrupt notifier for uprobes, enabling/disabling
singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---

 arch/x86/Kconfig          |    1 +
 arch/x86/kernel/Makefile  |    1 +
 arch/x86/kernel/uprobes.c |   87 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/uprobes.c


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 851cedc..a860a9b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select HAVE_KERNEL_LZO
 	select HAVE_HW_BREAKPOINT
 	select HAVE_USER_BKPT
+	select HAVE_UPROBES
 	select PERF_EVENTS
 	select ANON_INODES
 	select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 98c74b4..bfa48f0 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 
 obj-$(CONFIG_USER_BKPT)			+= user_bkpt.o
+obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..1acce22
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,87 @@
+/*
+ *  Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

Uprobes documentation.

Changelog:
	Addressed comments from Randy Dunlap.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 Documentation/uprobes.txt |  244 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 244 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/uprobes.txt


diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt
new file mode 100644
index 0000000..d68dcdb
--- /dev/null
+++ b/Documentation/uprobes.txt
@@ -0,0 +1,244 @@
+Title	: User-Space Probes (Uprobes)
+Authors	: Jim Keniston <jkenisto@us.ibm.com>
+	: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. Configuring Uprobes
+4. API Reference
+5. Uprobes Features and Limitations
+6. Probe Overhead
+7. TODO
+8. Uprobes Team
+9. Uprobes Example
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space.  The registration function register_uprobe()
+specifies which process is to be probed, where the probe is to be
+inserted, and what handler is to be called when the probe is hit.
+
+Uprobes-based instrumentation can be packaged as a kernel
+module.  In the simplest case, the module's init function installs
+("registers") one or more probes, and the exit function unregisters
+them.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

Uprobes Samples

This provides an example uprobes module in the samples directory.

To run this module, run (as root):
 insmod uprobe_example.ko vaddr=<vaddr> pid=<pid>
	 where <vaddr> is the address at which to place the probe and
		<pid> is the pid of the process to probe.

 example: -
# cd samples/uprobes

[get the virtual address to place the probe.]
# vaddr=0x$(objdump -T /bin/bash |awk '/echo_builtin/ {print $1}')

[Run a bash shell in the background; have it echo 4 lines.]
# (sleep 10; echo 1; echo 2; echo 3; echo 4) &
[Probe calls echo_builtin() in the background bash process.]

# insmod uprobe_example.ko vaddr=$vaddr pid=$!
# sleep 10
# rmmod uprobe_example
# dmesg | tail -n 3
Registering uprobe on pid 10875, vaddr 0x45aa30
Unregistering uprobe on pid 10875, vaddr 0x45aa30
Probepoint was hit 4 times
#
[ Output shows that echo_builtin function was hit 4 times. ]

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 samples/Kconfig                  |    7 +++
 samples/uprobes/Makefile         |   17 ++++++++
 samples/uprobes/uprobe_example.c |   83 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 samples/uprobes/Makefile
 create mode 100644 samples/uprobes/uprobe_example.c


diff --git a/samples/Kconfig b/samples/Kconfig
index 8924f72..50b8b1c 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -44,4 +44,11 @@ config SAMPLE_HW_BREAKPOINT
 	help
 	  This builds kernel hardware breakpoint example modules.
 
+config SAMPLE_UPROBES
+	tristate "Build uprobes example -- loadable module only"
+	depends on UPROBES && m
+	help
+	  This builds uprobes example module.
+
+
 endif # SAMPLES
diff --git a/samples/uprobes/Makefile b/samples/uprobes/Makefile
new file mode 100644
index 0000000..f535f6f
--- /dev/null
+++ b/samples/uprobes/Makefile
@@ -0,0 +1,17 @@
+# builds the uprobes example kernel modules;
+# then to use one (as root):
+# insmod ...
From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:53 am

Uprobes Trace_events interface 

The following patch implements trace_event support for uprobes. In its
current form it can be used to put probes at a specified text address
in a process and dump the required registers when the code flow reaches
the probed address.

This is based on trace_events for kprobes to the extent that it may
resemble that file on 2.6.34-rc3.

The following example shows how to dump the instruction pointer and the %ax
register at the probed text address.

Start a process to trace. Get the address to trace.
  [Here pid is assumed to be 3548]
  [Address to trace is 0x0000000000446420]
  [Registers to be dumped are %ip and %ax]

# cd /sys/kernel/debug/tracing/
# echo 'p 3548:0x0000000000446420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_3548_0x0000000000446420 3548:0x0000000000446420 %ip=%ip %ax=%ax
# cat events/uprobes/p_3548_0x0000000000446420/enable
0
[enable the event]
# echo 1 > events/uprobes/p_3548_0x0000000000446420/enable
# cat events/uprobes/p_3548_0x0000000000446420/enable
1
# #### do some activity on the program so that it hits the breakpoint
# cat uprobe_profile
  3548 p_3548_0x0000000000446420                                234
# head trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-3548  [001]   294.285812: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
             zsh-3548  [001]   294.285884: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
             zsh-3548  [001]   294.285894: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
             zsh-3548  [001]   294.285903: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
             zsh-3548  [001]   294.285912: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
             zsh-3548  [001]   294.285922: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
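
The probe definition written to uprobe_events above has the shape "p PID:ADDR REG...", and the event name shown by `cat uprobe_events` looks like it is built from the pid and the zero-padded address. The sketch below parses the first part and rebuilds such a name in user space. It is inferred from the sample output only; it is not the parser from the patch, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the leading "p PID:ADDR" of a probe definition. Anything after
 * the address (the registers to fetch) is ignored by this sketch.
 * Returns 0 on success, -1 on malformed input.
 */
int demo_parse_probe_spec(const char *spec, int *pid,
			  unsigned long long *vaddr)
{
	assert(spec && pid && vaddr);
	if (sscanf(spec, "p %d:%llx", pid, vaddr) != 2)
		return -1;
	return 0;
}

/*
 * Build a name shaped like the one in the sample output, for example
 * p_3548_0x0000000000446420. Returns what snprintf() returns.
 */
int demo_event_name(char *buf, size_t len, int pid, unsigned long long vaddr)
{
	return snprintf(buf, len, "p_%d_0x%016llx", pid, vaddr);
}
```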

TODO: Documentation/trace/uprobetrace.txt

Signed-off-by: Srikar Dronamraju ...
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 2:24 pm

Note, you want to really add this to trace_entries.h instead:

FTRACE_ENTRY(uprobe, uprobe_trace_entry,

	TRACE_GRAPH_ENT,

	F_STRUCT(
		__field(	unsigned long,	ip	)
		__field(	int,		nargs	)
		__dynamic_array(unsigned long,	args	)
	),

	F_printk("%lx nrargs:%u", __entry->ip, __entry->nargs)
);


This will put this event into the events/ftrace directory. Don't worry
about the printk format, we can write a plugin for it to override it if
need be.

By adding the above, other tools can know what it encountered instead of

If you added the event to trace_entries.h then this should be done


Or is it because of this special logic that you could not use the
trace_entries.h?



--

From: Masami Hiramatsu
Date: Wednesday, March 31, 2010 - 9:16 pm

Hi Steven,



Hmm, interesting idea. But this dynamic event definition allows us
to filter events based on each argument value.


each argument can have unique name. Therefore user can write a filter
by using these names.

Moreover, dynamic events (at least kprobe-tracer) are going to support
'types' for each argument. this means that the arg[] in *probe_trace_entry
will be no longer an unsigned long array.

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--

From: Frederic Weisbecker
Date: Wednesday, May 12, 2010 - 7:57 am

Yeah, I don't think we should FTRACE_ENTRY for that. The format files for
[k|u]probes events are created dynamically on top of what the user requested,
which is a very nice feature.

--

From: Frederic Weisbecker
Date: Wednesday, May 12, 2010 - 4:02 am

That doesn't explain much what it does. Please explain its goal of


This can be shared with kprobes in a new kernel/trace/dyn_probes.h

Thanks.

--

From: Srikar Dronamraju
Date: Wednesday, May 12, 2010 - 7:34 am

Agree, the unregister_trace_uprobe() has to be called after locking 


-- 
Thanks and Regards
Srikar
--

From: Frederic Weisbecker
Date: Wednesday, May 12, 2010 - 8:15 am

You can use the non-nowake version I think. nowake is for events that might
occur when we hold the rq lock, hence when it's too dangerous to wake up.
It doesn't seem to be the case since we came here after a trap in userspace.

--

From: Srikar Dronamraju
Date: Wednesday, March 31, 2010 - 8:52 am

Slot allocation for Execution out of line strategy (XOL)

This patch provides slot allocation mechanism for execution out of
line strategy for use with user space breakpoint infrastructure.

The traditional method of replacing the original instruction on a breakpoint
hit is racy when used on multithreaded applications.

Alternatives for the traditional method include:
	- Emulating the breakpointed instruction.
	- Execution out of line.

Emulating the instruction:
	This approach would use an in-kernel instruction emulator to
emulate the breakpointed instruction. This approach could be looked
into at a later point in time.

Execution out of line:
	In the execution out of line strategy, a new vma is injected into
the target process and a copy of each breakpointed instruction is
stored in one of its slots. On a breakpoint hit, the copy of the
instruction is single-stepped, leaving the breakpoint instruction as is.
This method is architecture independent.

This method is useful while handling multithreaded processes.

This patch allocates one page per process for slots to be used to copy the
breakpointed instructions.

Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint. Each slot is big
enough to accommodate the biggest instruction for that architecture (16
bytes for x86).
2. We currently allocate only one page for slots. Hence the number of
slots is limited to active breakpoint hits on that process.
3. Bitmap to track used slots.
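
A bitmap allocator of this shape can be sketched in a few lines of user-space C. This is an illustration only, not the code in kernel/user_bkpt_xol.c: it assumes a 4096-byte page (so 256 slots of 16 bytes), does no locking, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE		4096u	/* assumed page size */
#define DEMO_SLOT_BYTES		16u	/* slot size given above for x86 */
#define DEMO_NR_SLOTS		(DEMO_PAGE_SIZE / DEMO_SLOT_BYTES)
#define DEMO_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_WORDS \
	((DEMO_NR_SLOTS + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

struct demo_xol_area {
	unsigned long bitmap[DEMO_NR_WORDS];	/* bit set = slot in use */
};

/* Claim the lowest free slot; returns its index, or -1 if the page is full. */
int demo_xol_take_slot(struct demo_xol_area *area)
{
	size_t slot;

	assert(area);
	for (slot = 0; slot < DEMO_NR_SLOTS; slot++) {
		unsigned long mask = 1UL << (slot % DEMO_BITS_PER_WORD);
		unsigned long *word = &area->bitmap[slot / DEMO_BITS_PER_WORD];

		if (!(*word & mask)) {
			*word |= mask;
			return (int)slot;
		}
	}
	return -1;
}

/* Release a slot previously returned by demo_xol_take_slot(). */
void demo_xol_free_slot(struct demo_xol_area *area, int slot)
{
	size_t s = (size_t)slot;

	assert(area && slot >= 0 && s < DEMO_NR_SLOTS);
	area->bitmap[s / DEMO_BITS_PER_WORD] &=
		~(1UL << (s % DEMO_BITS_PER_WORD));
}

/* Byte offset of a slot within the XOL page. */
size_t demo_xol_slot_offset(int slot)
{
	return (size_t)slot * DEMO_SLOT_BYTES;
}
```

The slot's address in the tracee would then be the start of the XOL vma plus demo_xol_slot_offset(slot).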

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/Kconfig                  |    4 +
 include/linux/user_bkpt_xol.h |   61 +++++++++
 kernel/Makefile               |    1 
 kernel/user_bkpt_xol.c        |  289 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 355 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/user_bkpt_xol.h
 create mode 100644 kernel/user_bkpt_xol.c


diff --git a/arch/Kconfig b/arch/Kconfig
index ...