Uprobes Patches

Changelog:
- Added trace_event interface for uprobes.
- Addressed comments from Andrew Morton and Randy Dunlap.

For the previous posting, please refer to: http://lkml.org/lkml/2010/3/20/107

This patchset implements Uprobes, which enables you to dynamically break into any routine in a user space application and collect information non-disruptively.

This patchset is a rework based on suggestions from discussions on lkml in January this year (http://lkml.org/lkml/2010/1/11/92 and http://lkml.org/lkml/2010/1/27/19). This implementation of uprobes doesn't depend on utrace.

When a uprobe is registered, Uprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction. (Uprobes uses a background page replacement mechanism and ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of the trap and finds the associated uprobe. It then executes the associated handler. Uprobes single-steps its copy of the probed instruction and resumes execution of the probed process at the instruction following the probepoint. Instruction copies to be single-stepped are stored in a per-process "execution out of line (XOL) area". Currently the XOL area is allocated as a one page vma.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive. Unlike current ptrace based mechanisms, uprobes tracing wouldn't involve signals, stopping threads and context switching between the tracer and tracee.

2. Much better handling of multithreaded programs because of XOL. Current ptrace based mechanisms use single stepping inline, i.e. they copy back the original instruction on hitting a breakpoint. In such mechanisms tracers have to stop all the threads on a breakpoint hit or tracers will not be able to handle all hits to the location of interest. Uprobes uses execution out of line, where the instruction to be traced is analysed at the time of breakpoint ...
Move Macro W to asm/insn.h
Macro W is used to know whether instructions are valid for
user space or kernel space. This macro is used by kprobes and
user_bkpt (i.e. the user space breakpoint assistance layer), so move it
to a common header file, asm/insn.h.
TODO: replace macro W with bits in inat table.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
arch/x86/include/asm/insn.h | 7 +++++++
arch/x86/kernel/kprobes.c | 7 -------
2 files changed, 7 insertions(+), 7 deletions(-)
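
For readers new to the macro: W() packs sixteen 0/1 flags, one row of an opcode map, into a bitmask, so a 256-entry table fits in eight 32-bit words. The following is only a user-space sketch of how such a table can be built and queried. The table contents and the names example_opcode_map, opcode_is_marked() and example_opcode_map_selftest() are made up for illustration; they are not the kprobes tables.

```c
#include <assert.h>
#include <stdint.h>

/* Same macro as in the patch: 16 flags for opcodes row+0x0 .. row+0xf. */
#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
	 << (row % 32))

/* Hypothetical table: only opcodes 0x02, 0x03 and 0x11 are marked. */
static const uint32_t example_opcode_map[256 / 32] = {
	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f        */
	W(0x00, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 00 */
	W(0x10, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  /* 10 */
	/* the remaining words are zero-initialized: nothing marked */
};

/* Two rows share one 32-bit word, hence the "/ 32" and "% 32". */
static int opcode_is_marked(uint8_t opcode)
{
	return (example_opcode_map[opcode / 32] >> (opcode % 32)) & 1;
}

/* Small usage example. */
void example_opcode_map_selftest(void)
{
	assert(opcode_is_marked(0x02));
	assert(opcode_is_marked(0x11));
	assert(!opcode_is_marked(0x04));
}
```

The shift-and-mask lookup is a plain C stand-in for a bit test on the table; the point is only that row 0x10 lands in the upper half of word 0.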
diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 96c2e0a..8586820 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -23,6 +23,13 @@
/* insn_attr_t is defined in inat.h */
#include <asm/inat.h>
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
struct insn_field {
union {
insn_value_t value;
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b43bbae..4379b40 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -66,12 +66,6 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
-#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
- (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
- (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
- (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
- (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
- << (row % 32))
/*
* Undefined/reserved opcodes, conditional jump, Opcode Extension
* ...

Move replace_page() to mm/memory.c
Move replace_page from mm/ksm.c to mm/memory.c.
User bkpt will use background page replacement approach to insert/delete
breakpoints. Background page replacement approach will be based on
replace_page. Now replace_page() loses its static attribute.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
include/linux/mm.h | 2 ++
mm/ksm.c | 59 ----------------------------------------------------
mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 59 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..0f43355 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -854,6 +854,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte);
extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..83fb4fb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -766,65 +766,6 @@ out:
return err;
}
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
-
- addr ...

Enhance replace_page() to support pagecache

Currently replace_page would work only for anonymous pages.
This patch enhances replace_page() to work for pagecache pages.

This enhancement is useful for user_bkpt's replace_page based background
page replacement for insertion and removal of breakpoints.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
mm/memory.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 66a3632..5acf686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2617,7 +2617,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
}
get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
+ if (PageAnon(kpage))
+ page_add_anon_rmap(kpage, vma, addr);
+ else
+ page_add_file_rmap(kpage);
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
--
User Space Breakpoint Assistance Layer (USER_BKPT)

Changelog:
- Use k(un)map_atomic instead of k(un)map.
- Remove BUG_ON.
- Few parameter changes to be more consistent with sparse.
- Added kernel-doc comments wherever necessary.
- Introduce a check to detect if post_xol can sleep.

Currently there is no mechanism in the kernel to insert/remove breakpoints.

This patch implements the user space breakpoint assistance layer, which provides kernel subsystems with an architecture independent interface to establish breakpoints in user applications. This patch provides the core implementation of user_bkpt and also wrappers for architecture dependent methods.

USER_BKPT currently supports both single stepping inline and execution out of line strategies. Two different probepoints in the same process can have two different strategies. It handles pre-processing and post-processing of singlestep after a breakpoint hit.

Single stepping inline strategy is the traditional method where original instructions replace the breakpointed instructions on a breakpoint hit. This method works well with single threaded applications. However it is racy with multithreaded applications.

Execution out of line strategy single steps on a copy of the instruction. This method works well for both single-threaded and multithreaded applications.

There could be other strategies like emulating an instruction. However they are currently not implemented.

Insertion and removal of breakpoints is by "background page replacement", i.e. make a copy of the page, modify its contents, set the pagetable and flush the TLBs. This uses the enhanced replace_page() to COW the page. The modified page is only reflected for the interested process. Others sharing the page will still see the old copy.

You need to follow this up with the USER_BKPT patch for your architecture.

Uprobes uses this facility to insert/remove breakpoints.

TODO: Merge user_bkpt layer with uprobes.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar ...
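
The "background page replacement" ordering described above (copy the page, modify the copy, then switch the mapping) can be modelled in user space. The sketch below is not kernel code: struct toy_mm, toy_insert_bkpt() and the single-pointer "page table" are hypothetical stand-ins, and the real path goes through replace_page() and a TLB flush. It only illustrates that the probed process sees the breakpoint while other sharers of the page keep seeing the old contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_BKPT_INSN 0xcc	/* x86 int3 opcode */

/* Toy address space: a single "pte" pointing at the page that is mapped. */
struct toy_mm {
	unsigned char *page;
};

/*
 * Insert a breakpoint at @offset for @mm only:
 * 1. copy the currently mapped page,
 * 2. patch the copy,
 * 3. point this mm's mapping at the copy (a kernel would also flush TLBs).
 * Returns 0 on success, -1 on a bad offset or allocation failure.
 */
static int toy_insert_bkpt(struct toy_mm *mm, size_t offset)
{
	unsigned char *copy;

	assert(mm && mm->page);
	if (offset >= TOY_PAGE_SIZE)
		return -1;
	copy = malloc(TOY_PAGE_SIZE);
	if (!copy)
		return -1;
	memcpy(copy, mm->page, TOY_PAGE_SIZE);
	copy[offset] = TOY_BKPT_INSN;
	mm->page = copy;	/* the old page stays intact for other sharers */
	return 0;
}

/* Usage example: returns 1 if only the probed mm sees the breakpoint. */
int toy_page_replacement_demo(void)
{
	static unsigned char shared_text[TOY_PAGE_SIZE];
	struct toy_mm probed = { shared_text };
	struct toy_mm other = { shared_text };
	int ok;

	shared_text[16] = 0x55;	/* pretend an instruction starts here */
	if (toy_insert_bkpt(&probed, 16))
		return 0;
	ok = probed.page[16] == TOY_BKPT_INSN && other.page[16] == 0x55;
	free(probed.page);
	return ok;
}
```

Because the copy is fully patched before the mapping is switched, a thread running in the probed process sees either the old page or the complete new one, which is the property the description relies on.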
x86 support for user breakpoint Infrastructure

Changelog:
- set USER_BKPT_FIX_SLEEPY if post_xol might sleep.

This patch provides x86 specific userspace breakpoint assistance implementation details. It uses the "x86: instruction decoder API" patch to validate and analyze the instructions. This analysis is used at the time of post-processing of a breakpoint hit to do the necessary fix-ups. Almost all instructions are handled for the traditional strategy and the execution out of line strategy. Instructions handled include the RIP relative instructions.

This patch requires the "x86: instruction decoder API" patch.
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/user_bkpt.h | 43 +++
arch/x86/kernel/Makefile | 2
arch/x86/kernel/user_bkpt.c | 572 ++++++++++++++++++++++++++++++++++++++
4 files changed, 618 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/user_bkpt.h
create mode 100644 arch/x86/kernel/user_bkpt.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..851cedc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,6 +53,7 @@ config X86
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
+ select HAVE_USER_BKPT
select PERF_EVENTS
select ANON_INODES
select HAVE_ARCH_KMEMCHECK
diff --git a/arch/x86/include/asm/user_bkpt.h b/arch/x86/include/asm/user_bkpt.h
new file mode 100644
index 0000000..df8a4a0
--- /dev/null
+++ b/arch/x86/include/asm/user_bkpt.h
@@ -0,0 +1,43 @@
+#ifndef _ASM_USER_BKPT_H
+#define _ASM_USER_BKPT_H
+/*
+ * User-space BreakPoint support (user_bkpt) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at ...
Uprobes Implementation

Changelog:
- If fixup might sleep; then do the post singlestep processing in task context.

The uprobes infrastructure enables a user to dynamically establish probepoints in user applications and collect information by executing a handler function when a probepoint is hit.

The user specifies the virtual address and the pid of the process of interest along with the action to be performed (handler). The handle

Uprobes is implemented on the user-space breakpoint assistance layer and uses the execution out of line strategy. Uprobes follows lazy slot allocation, i.e. on the first probe hit for that process, a new vma (to hold the probed instructions for execution out of line) is allocated. Once allocated, this vma remains for the life of the process, and is reused as needed for subsequent probes. A slot in the vma is allocated for a probepoint when it is first hit. A slot is marked for reuse when the probe gets unregistered and no threads are using that slot.

In a multithreaded process, a probepoint once registered is active for all threads of a process. If a thread specific action for a probepoint is required then the handler should be implemented to do the same.

If a breakpoint already exists at a particular address (irrespective of who inserted the breakpoint including uprobes), uprobes will refuse to register any more probes at that address.

You need to follow this up with the uprobes patch for your architecture.

For more information: please refer to Documentation/uprobes.txt

TODO:
1. Perf/trace events interface for uprobes.
2. Allow multiple probes at a probepoint.
3. Booster probes.
4. Allow probes to be inherited across fork.
5. probing function returns.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
---
arch/Kconfig | 13 +
include/linux/sched.h | 3
include/linux/tracehook.h | 18 +
include/linux/uprobes.h | 187 ++++++++++
...
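
The lazy slot allocation described above can be pictured with a small user-space model. Everything here is an assumption made for illustration: the toy_ names, the 16-byte slot size and the linear scan are not taken from the patch; the only facts borrowed from the description are the single-page area, its creation on the first hit, and slot reuse after unregistration.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_AREA_SIZE	4096	/* one page, as in the description */
#define TOY_SLOT_SIZE	16	/* assumed slot size, illustration only */
#define TOY_NR_SLOTS	(TOY_AREA_SIZE / TOY_SLOT_SIZE)

struct toy_xol_area {
	unsigned char used[TOY_NR_SLOTS];	/* 1 if the slot holds an insn copy */
	unsigned char mem[TOY_AREA_SIZE];	/* where the copies would live */
};

struct toy_process {
	struct toy_xol_area *area;	/* NULL until the first probe hit */
};

/* Return a slot index for a probepoint hit for the first time, or -1. */
static int toy_xol_get_slot(struct toy_process *p)
{
	int i;

	if (!p->area) {		/* lazy: create the area on the first hit */
		p->area = calloc(1, sizeof(*p->area));
		if (!p->area)
			return -1;
	}
	for (i = 0; i < TOY_NR_SLOTS; i++) {
		if (!p->area->used[i]) {
			p->area->used[i] = 1;
			return i;
		}
	}
	return -1;		/* area full */
}

/* Mark a slot reusable, e.g. once its probe is unregistered and idle. */
static void toy_xol_put_slot(struct toy_process *p, int slot)
{
	assert(p->area && slot >= 0 && slot < TOY_NR_SLOTS);
	p->area->used[slot] = 0;
}

/* Usage example: returns 1 if a freed slot is handed out again. */
int toy_xol_demo(void)
{
	struct toy_process p = { NULL };
	int a, b, c;

	a = toy_xol_get_slot(&p);	/* first hit creates the area */
	b = toy_xol_get_slot(&p);
	if (a < 0 || b < 0) {
		free(p.area);
		return 0;
	}
	toy_xol_put_slot(&p, a);	/* probe owning slot a is unregistered */
	c = toy_xol_get_slot(&p);	/* the freed slot is reused */
	free(p.area);
	return a == 0 && b == 1 && c == 0;
}
```

The model ignores the "no threads are using that slot" condition; a real implementation has to track in-flight single-steps before a slot can be recycled.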
Looks like, this doesn't need get/put task_struct, you could just

This doesn't look right. We can't trust ->thread_group list even under rcu_read_lock(). The task can exit and __exit_signal() can remove it

This is called by create_uprocess(). Who will free t->utask if t has

not sure I understand this check. Somehow we should prevent the races with tracehook_report_exit/tracehook_report_exec, but PF_EXITING can't

can't we race with clone(CLONE_THREAD) and miss the new thread? Probably I missed something, but afaics we need some barriers to ensure that either tracehook_report_clone() sees current->utask != NULL or find_next_thread()

OK, uproc and p can't go away. But why it is safe to use pid_task(p) ?

I am looking at 6th patch http://marc.info/?l=linux-kernel&m=127005086102256 and xol_validate_vaddr() calls pid_task() without rcu and doesn't check the result is not NULL.

This looks a bit strange. Why do we need "ctask" at all? It is not used, you could just do

	if (likely(!child->utask))
		add_utask(child, uproc);

OK, iiuc this should restore the original instruction, right? But what about clone(CLONE_VM)? In this case this child shares ->mm with parent.

Oleg.
--
Oh Okay, I get that the thread could be exiting from the time we
allocated the utask to the time we are cleaning up here and hence we
could be leaking utask.
Would it be okay if we explicitly (instead of the using
tracehook_report_exit) call uprobe_free_utask() just after we set
PF_EXITING. We could take the task_lock to synchronize with the
Okay.
Would it work if we
static struct uprobe_task *add_utask(struct task_struct *t,
					struct uprobe_process *uproc)
{
	struct uprobe_task *utask;

	if (!t)
		return NULL;
	utask = kzalloc(sizeof *utask, GFP_USER);
	if (unlikely(utask == NULL))
		return ERR_PTR(-ENOMEM);
	task_lock(t);
	if (unlikely(t->flags & PF_EXITING)) {
		/* The task is already exiting; do not attach a utask. */
		task_unlock(t);
		kfree(utask);
		return NULL;
	}
	t->utask = utask;
	task_unlock(t);
	utask->uproc = uproc;
	utask->active_ppt = NULL;
	atomic_inc(&uproc->refcount);
	return utask;
}
NORET_TYPE void do_exit(long code)
{
	...
	exit_irq_thread();
	exit_signals(tsk); /* sets PF_EXITING */
	task_lock(tsk);
	if (tsk->utask)
		uprobe_free_utask();
	task_unlock(tsk);
	...
}
yeah, PF_EXITING doesn't seem to help us here. But will this be an issue once we
The tracehook_report_clone is called after the element gets added to the
thread_group list in copy_process().
Looking at three cases where current thread could be cloning a new thread.
a) current thread has a utask and tracehook_report_clone is not yet
called.
utask for the new thread will be created by either
tracehook_report_clone or the find_next_thread whichever comes first.
b) current thread has no utask and tracehook_report_clone is already called.
new thread is in the thread_group list; so a ...

Srikar, sorry for delay...
Not sure I understand....
I meant, it is not safe to use next_thread(tsk) if tsk was already
Yes, but my point was, we probably need mb's on both sides. Of course,
this is only theoretical problem, but tracehook_report_clone() can read
current->utask == NULL before the result of copy_process()->list_add_tail()
Oh, I don't know. You are going to change this code anyway, I can't see
in advance.
I tried to read the next 8/11 patch, and I have a couple more random questions.
- uprobe_process->tg_leader is not really used ?
- looks like, 7/11 can't be compiled without the next 8/11 ?
say, the next patch defines arch_uprobe_disable_sstep() but
it is used by 7/11
- I don't understand why do we need uprobe_{en,dis}able_interrupts
helpers. pre_ssout() could just do local_irq_enable(). This path
leads to get_signal_to_deliver() which enables irqs anyway, it is
always safe to do this earlier and I don't think we need to disable
irqs again later. In any case, I don't understand why these helpers
use native_irq_xxx().
- pre_ssout() does .xol_vaddr = xol_get_insn_slot(). This looks a
bit confusing, xol_get_insn_slot() should set .xol_vaddr correctly
under lock.
- pre_ssout() does user_bkpt_set_ip() after user_bkpt_pre_sstep().
Why? Shouldn't user_bkpt_pre_sstep() always set regs->ip ?
Otherwise uprobe_bkpt_notifier()->user_bkpt_pre_sstep() is not
right.
- I don't really understand why ->handler_in_interrupt is really
useful, but never mind.
- However, handler_in_interrupt && !uses_xol_strategy() doesn't
look right. uprobe_bkpt_notifier() is called with irqs disabled,
Yes, I think CLONE_VM without CLONE_THREAD needs utask too, but do we need
the new uproc? OK, please forget about this for the moment.
Suppose that register_uprobe() succeeds and does set_bkpt(). What if another
process (not sub-thread) with the same ->mm hits this bp? uprobe_bkpt_notifier()
will see ->utask == NULL ...

Okay, cleanup_uprocess() gets called if and only if add_utask() fails
to allocate the utask struct. Based on your inputs I will synchronize
exit_signals() and uprobe_free_utask(). However it still can happen that
uprobe calls cleanup_uprocess() with reference to task struct which has just
called __unhash_process(). Is there a way out of this?
Can you please let me know when nsproxy is set to NULL? If we are sure
that register/unregister will be called with nsproxy set, then I am
Currently we have a reference to pid struct from the time we created a
uprobe_process to the time we free the uprobe process. So are you
suggesting that we don't have a reference to the pid structure or is that
we don't need to cache the pid struct and access it thro
On i686, (unlike x86_64), do_notify_resume() gets called with irqs
disabled. I had tried local_irq_enable a couple of times but that didn't
help probably because CONFIG_PARAVIRT is set in my .config and hence
raw_local_irq_enable resolves to
static inline void raw_local_irq_enable(void)
{
PVOP_VCALLEE0(pv_irq_ops.irq_enable);
}
What we need is the "sti" instruction. It looks like local_irq_enable
actually doesn't do "sti". So I had to go back to using
native_irq_enable().
Do you have any ideas how to force local_irq_enable to do a "sti"?
Or am I missing something?
Since I wasn't sure why do_notify_resume() was called under irqs_disabled
only for x86, I disabled irqs again just to be sure that I am not
Right, user_bkpt_set_ip is redundant as user_bkpt_pre_sstep sets
Uprobes can run handlers either in interrupt context or in task context.
If the user is sure that his handler is not going to sleep, then he can
set handler_in_interrupt flag while registering the probe.
There is a small overhead when running the handlers in task context.
Here is a brief benchmark on a x86_64 machine.
========================================================================
Results when running a kernel without lockdep and other ...

Yes, but afaics we have the same issues in find_next_thread() called

In this particular case, probably we can rely on uprobe_mutex. Currently cleanup_uprocess() is called with start == cur_t. Instead, we should use the last task on which add_utask() succeeded, it can't exit (assuming we fix other discussed races with exit) because uprobe_free_utask() takes this mutex too.

However, perhaps it is better to rework this all. Say, can't we move uprobe_free_utask() into __put_task_struct() ? Afaics, this can greatly simplify things.

If we add mm_struct->uproc, then utask doesn't need

I must have missed something. But I do not see where do we use uprobe_process->tg_leader. We never read it, apart from

Hmm. No, I can't explain this, I know nothing about paravirt. But this doesn't look right to me. Probably this should be discussed with paravirt

pre_ssout() does

	if (!user_bkpt.xol_vaddr)
		user_bkpt.xol_vaddr = xol_get_insn_slot();

but it could just do

	if (!user_bkpt.xol_vaddr)
		xol_get_insn_slot();

because xol_get_insn_slot() populates user_bkpt.xol_vaddr.

Btw. Why do we have the !CONFIG_USER_BKPT_XOL code in include/linux/user_bkpt_xol.h? CONFIG_UPROBES depends on CONFIG_USER_BKPT_XOL. Also the declarations don't look nice...

Probably I missed something, but why the code uses "void *" instead of "user_bkpt_xol_area *" for xol_area everywhere? OK, even if "void *" makes sense for uproc->uprobe_process, why

this overhead looks very minor. To me, it is better to simplify the code, at least in the first version. That said, this is up to you, I am not asking you to remove this

Yes, I was thinking about mm_struct->uproc too.

Well, we could add the list of uprobe_task's into uprobe_process, it represents the tasks "inside" the probe hit. But yes, this is not easy,

Agreed. Although we need the new TIF_ bit for tracehook_notify_resume(), it can't trust "if (current->utask...)" checks.
Alternatively, without the "on demand" ...
Okay. I will use mm_struct->uproc, dynamic allocation of utask on probe
static int free_uprocess(struct uprobe_process *uproc)
{
....
put_pid(uproc->tg_leader);
uproc->tg_leader = NULL;
user_bkpt_xol_area isn't exposed. This provides flexibility in changing
the algorithm for more efficient slot allocation. Currently we allocate
slots from just one page. Later on we could end-up having to allocate
from more than contiguous pages. There was some discussion about
allocating slots from TLS. So there is more than one reason that
user_bkpt_xol can change. We could expose the struct and not access the
But do we need a new TIF bit? Can we just reuse the TIF_NOTIFY_RESUME
Okay I will try the on demand allocations in the next iteration.
Thanks again for your detailed explanations and suggestions.
--
Thanks and Regards
Srikar
--
Yes, yes, I see it does get/put pid. But where do we actually use
^^^^^^^^^^^^^^^^^^^^^
Still can't understand... Yes, we shouldn't expose the details, but we
can just add "struct user_bkpt_xol_area;" into include file.
Probably not... But somehow tracehook_notify_resume/uprobe_notify_resume
should know we hit the bp and we need to allocate utask. Yes,
tracehook_notify_resume() can always call uprobe_notify_resume()
unconditionally, and uprobe_notify_resume() can notice the
"find_probept() && !current->utask" case, but probably it is better to
make this more explicit. And of course, the new bit should be set along
with TIF_NOTIFY_RESUME.
Or. Instead of TIF_ bit, we can use something like
#define UTASK_PLEASE_ALLOCATE_ME ((struct uprobe_task *)1)
uprobe_bkpt_notifier() sets current->utask = UTASK_PLEASE_ALLOCATE_ME,
then tracehook_notify_resume/uprobe_notify_resume check this case.
I dunno, please do what you think right.
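
The sentinel idea above can be sketched in plain C. This is a user-space model only: struct toy_task, struct toy_utask and both functions are hypothetical, and the macro is renamed with a TOY_ prefix to make clear it is not the proposed kernel definition. The shape is what matters: the breakpoint path only marks the task, and the resume path, which may sleep, does the allocation.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_utask {
	int allocated_on_demand;
};

/* A non-NULL value that is never a valid pointer, as in the proposal. */
#define TOY_UTASK_PLEASE_ALLOCATE_ME ((struct toy_utask *)1)

struct toy_task {
	struct toy_utask *utask;
};

/* Breakpoint-hit path (atomic context in the kernel): must not allocate. */
static void toy_bkpt_notifier(struct toy_task *t)
{
	if (!t->utask)
		t->utask = TOY_UTASK_PLEASE_ALLOCATE_ME;
}

/*
 * Resume path (task context): allowed to sleep, so allocate here.
 * Returns 1 if a utask was allocated, 0 if there was nothing to do,
 * -1 on allocation failure (the sentinel is left in place).
 */
static int toy_notify_resume(struct toy_task *t)
{
	struct toy_utask *utask;

	if (t->utask != TOY_UTASK_PLEASE_ALLOCATE_ME)
		return 0;
	utask = calloc(1, sizeof(*utask));
	if (!utask)
		return -1;
	utask->allocated_on_demand = 1;
	t->utask = utask;
	return 1;
}

/* Usage example: returns 1 if allocation happens exactly once. */
int toy_sentinel_demo(void)
{
	struct toy_task t = { NULL };
	int first, second;

	toy_bkpt_notifier(&t);		/* first hit: only marks the task */
	assert(t.utask == TOY_UTASK_PLEASE_ALLOCATE_ME);
	first = toy_notify_resume(&t);	/* allocates */
	if (first != 1)
		return 0;
	toy_bkpt_notifier(&t);		/* later hit: utask already there */
	second = toy_notify_resume(&t);	/* nothing to allocate */
	free(t.utask);
	return second == 0;
}
```

Any code that dereferences ->utask has to check for the sentinel first, which is the cost of this variant compared with a separate TIF_ bit.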
OK, the last questions:
1. Can't multiple write_opcode()'s race with each other?
Say, pre_ssout() calls remove_bkpt() lockless. can't it race
with register_uprobe() which may write to the same page?
And, without uses_xol_strategy() there are more racy callers
of write_opcode()... Probably something else.
2. Can't write_opcode() conflict with ksm doing replace_page() ?
3. mprotect(). write_opcode() checks !VM_WRITE. This is correct,
otherwise we can race with the user-space writing to the same
page.
But suppose that the application does mprotect(PROT_WRITE) after
register_uprobe() installs the bp, now unregister_uprobe/etc can't
restore the original insn?
4. mremap(). What if the application does mremap() and moves the
memory? After that vaddr of user_bkpt/uprobe no longer matches
the virtual address of bp. This breaks uprobe_bkpt_notifier(),
unregister_uprobe(), etc.
Even worse. Say, unregister_uprobe() calls remove_bkpt().
mremap()+mmap() ...

uproc->tg_leader was used to validate that the looked up uproc belongs to the process. It was used to check if the uproc belonged to the process for which we are currently trying to register/unregister uprobes. Since we want to share the uproc with processes that share the same mm, I

Okay, I will add the forward declaration in the include file and update

Yeah, since tracehook_notify_resume() is in the fast path, it's worth adding

All callers of write_opcode() should have taken uproc->mutex. If there are other users of write_opcode, we will have to add a way to

That's a bug, I will fix it. remove_bkpt() clearly says it needs to be

I don't think so. If uprobes runs on hosts, it would be calling replace_page() on text pages. KSM for now works on anonymous pages. Even the replaced page we add still belongs to the text VMA. If uprobes runs on a guest, KSM should be taking care of cases where

I still need to verify this. I shall get back to you on this.

I don't think we handle this case now. I think even munmap of the region where there are probes inserted can have the same problem. Are there ways to handle this? I think taking a write lock on mmap_sem instead of the read lock could handle this problem.

I am copying Mel Gorman and Andrea Arcangeli so that they can provide their inputs on VM and KSM related issues.

--
Thanks and Regards
--
Well, I think the kernel should assume that the user-space can do
anything.
Hmm. And if this vma is VM_SHARED, then this bp could be actually
written to vm_file after mprotect().
But I think this doesn't really matter. When I actually look at
Yes. We need vm experts here, I am not. Still, I'd like to share my
concerns. I also added Rik and Hugh.
So, 3/11 does
@@ -2617,7 +2617,10 @@ int replace_page(struct vm_area_struct *vma, struct page *page,
}
get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
+ if (PageAnon(kpage))
+ page_add_anon_rmap(kpage, vma, addr);
+ else
+ page_add_file_rmap(kpage);
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
I see no point in this patch, please see below.
The next 4/11 patch introduces write_opcode() which roughly does:
int write_opcode(unsigned long vaddr, user_bkpt_opcode_t opcode)
{
get_user_pages(write => false, &old_page);
new_page = alloc_page_vma(...);
... insert the bp into the new_page ...
new_page->mapping = old_page->mapping;
new_page->index = old_page->index;
replace_page(old_page, new_page);
}
This doesn't look right at all to me.
IF PageAnon(old_page):
in this case replace_page() calls page_add_anon_rmap() which
needs the locked page.
ELSE:
I don't think the new page should ever preserve the mapping,
this looks just wrong. It should be always anonymous.
And in fact, I do not understand why write_opcode() needs replace_page().
It could just use get_user_pages(FOLL_WRITE | FOLL_FORCE), no? It should
create the anonymous page correctly.
Either way, I think register_uprobe() should disallow the probes in
VM_SHARED/VM_MAYWRITE vmas.
Oleg.
--
When I look through the load_.*_binary and load_.*_library functions, they seem to map the text regions MAP_PRIVATE|MAP_DENY_WRITE. (A few exceptions like load_som_binary seem to map text regions with MAP_PRIVATE only.)

Also if vmas are marked VM_SHARED and breakpoints are inserted through ptrace, i.e. (access_process_vm/get_user_pages), then we would still be writing to vm_file after mprotect?

I did verify that page_add_file_rmap gets called from replace_page when we insert or remove a probe. This should be because uprobes doesn't do an anon_vma_prepare() before the alloc_page_vma().

We were earlier doing access_process_vm, which would in turn call get_user_pages to COW the page. However, that needed the threads of the target process to be stopped.

In the access_process_vm method:
1. we get a copy of the page,
2. flush the TLBs,
3. modify the page.

The concern was if the threads were executing in the vicinity. Hence we were stopping all threads while inserting/deleting breakpoints.

Background page replacement was suggested by Linus and Peter. In this method:
1. we get a copy of the page,
2. modify the page,
3. flush the TLBs.

This method is supposed to be atomic enough that we don't need to stop the

Yes, we certainly could add that check.

--
Thanks and Regards
Srikar
--
Again, I didn't mean they should. But they can.

Not only VM_SHARED, the application can create the anonymous PROT_EXEC region,

Of course! but see above, PageAnon() case is possible too. I think the code should handle this case correctly anyway, but it seems it doesn't. Not only page_add_anon_rmap() needs the locked page, I am not sure page_add_anon_rmap() is fine for write_opcode() which allocates the new page. LRU? SetPageSwapBacked?

And you seem to miss my point. I think page_add_file_rmap() is always wrong. I mean, no matter what is the page_mapping(old_page), the new page should be OK.

I must admit, I don't understand the usage of the lockless get_pte() in write_opcode(). replace_page() checks orig_pte, yes. But how can this check help write_opcode() and why is it needed? I do not think it can prevent any race, pte can be changed even before write_opcode() calls get_pte(). I guess this is only done because replace_page() requires this argument?

Oleg.
--
Well, they typically are not, I could imagine some JITs maybe using it, but those would most probably be shared anonymous. --
I don't think we should allow breakpoints on VM_SHARED maps. --
VM_SHARED, fully agreed, MAYWRITE not so sure, MAP_PRIVATE has MAYWRITE iirc and it's perfectly fine to poke at those. --
Okay, I will put a check to disallow probes in VM_SHARED vmas. -- Thanks and Regards Srikar --
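
As a rough picture of the check agreed on here: the sketch below rejects shared mappings and, following the remark about MAYWRITE, still accepts private mappings that may become writable. It is a user-space model, not the eventual patch. The TOY_VM_ flag values are placeholders, the function name is hypothetical, and returning -EINVAL is an assumption about what the error code might be.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder flag bits so the sketch builds on its own. */
#define TOY_VM_EXEC	0x00000004UL
#define TOY_VM_SHARED	0x00000008UL
#define TOY_VM_MAYWRITE	0x00000020UL

/*
 * Decide whether a probe may be placed in a mapping with @vm_flags.
 * Returns 0 if allowed, -EINVAL (assumed error code) if not.
 */
static int toy_vma_probe_allowed(unsigned long vm_flags)
{
	/* A write to a shared mapping could be visible outside the process. */
	if (vm_flags & TOY_VM_SHARED)
		return -EINVAL;
	/* Private mappings get a COW copy, even if they may become writable. */
	return 0;
}

/* Usage example. */
void toy_vma_check_demo(void)
{
	assert(toy_vma_probe_allowed(TOY_VM_EXEC | TOY_VM_SHARED) == -EINVAL);
	assert(toy_vma_probe_allowed(TOY_VM_EXEC | TOY_VM_MAYWRITE) == 0);
}
```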
Yes, sorry for confusion. Not sure where this VM_MAYWRITE came from. But I still think this doesn't actually matter, replace_page() shouldn't preserve the mapping, it should always install the anonymous page. I can be wrong, of course. (I didn't read the next version yet) Oleg. --
Well, if I read the patches right, uprobes will use "copy_to_user()" for the self-probing case. So that would definitely just modify a shared mapping.

Of course, arguably, who really cares? As long as it's not a security issue (and it isn't - since the person could just have written to the thing directly instead), I guess it doesn't much matter. But it's a bit sad when a probing feature either

- changes a global mapping that may be executed by other non-related processes that the prober isn't even _aware_ of.
- changes semantics by creating a non-coherent private page

so arguably it would be good to just make the rule be that you cannot probe a shared mapping. Because whatever you do, it's always the wrong thing.

Linus
--
But isn't text usually shared? I don't see how you could set any break points or jump probes on text pages with that restriction. BTW there were old patches for NUMA text duplication, maybe they could be resurrected for that too. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Text is usually private, and read-only. Not generally MAP_SHARED. The pages end up getting shared because nobody writes to them, but that's almost accidental. If you write to them, you get a nice clean COW fault, and you are _supposed_ to get a nice clean COW fault. It's not changing any semantics: the write is not visible to outside users, and those "get a private page" semantics were what the mmap() was all about.

In contrast, if it's a MAP_SHARED mapping and writable, the write would actually be _visible_ outside the process. And that's clearly totally wrong on all levels. Tracing a process should _never_ cause visible damage outside that process (you'd hope it wouldn't be all that visible to the tracee either, but that's still secondary).

The alternative, ie a MAP_SHARED but read-only mapping (which looks very much like a private mapping) if you use get_user_pages(.force=1), the kernel will actually end up forcing a COW break, because making the write visible outside would be a security issue (you don't even have the right to write to the thing).

Notice how the MAP_SHARED case - writable or not - ends up doing the wrong thing. Arguably it does the _even_worse_ thing in the writable case, but in either case it's not good.

Linus
--
Agreed. Or, if you do, it's doing something entirely different and should be in an interface where you're explicitly attaching it generically to the file (what's being shared) without regard to any individual process. But, as you mentioned, shared, executable mappings are well outside the normal case and there is no reason to think that a first (or fourth) version of anything needs to support them at all. Thanks, Roland --
Uprobes uses copy_to_user() to write data/stack and never to write to instruction addresses. To write an instruction uprobes either used access_process_vm or the replace_page() based background page replacement method. This is true even if the process was probing itself. The soon-to-be-posted v4 will revert to the background page replacement method. Yes, I will be adding a check to discard probing if the vma has the VM_SHARED flag set. I have already committed to Oleg on this issue. I didn't include this check in the v3 patchset, because uprobes was using access_process_vm in v3 and I thought access_process_vm would do the right thing even if VM_SHARED is set. -- Thanks and Regards Srikar --
KSM only does anonymous pages, and I thought uprobes was limited to MAP_PRIVATE|PROT_EXEC file maps. We can't hold mmap_sem (for either read or write -- read would be sufficient to serialize against mmap/mremap/munmap) from atomic uprobe context, what we can do is validate that there is a INT3 on that particular address, a mremap/munmap/munmap+mmap will either end not having a pte entry for the address, or not have the INT3. That said, you shouldn't be executing code on maps you're changing, much fun can happen if you try, so I don't think we should expend too much effort as long as the race will only result in the app crashing and not the kernel. --
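The eligibility rule from the last two messages (reject VM_SHARED; probe only private, executable file maps) has a rough user-space analogue in the permission column of /proc/<pid>/maps. The sketch below is only an illustration under that assumption: maps_line_probeable() is a made-up name, and the real check would be done in the kernel against the vma flags, not by parsing text.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-space approximation (not kernel code).
 * Given one line of /proc/<pid>/maps and a virtual address, return 1 if
 * the address falls in a private ('p'), executable ('x'), file-backed
 * (inode != 0) mapping, else 0.
 */
int maps_line_probeable(const char *line, unsigned long vaddr)
{
	unsigned long start, end, inode;
	char perms[5];

	/* Format: start-end perms offset dev inode [path] */
	if (sscanf(line, "%lx-%lx %4s %*x %*s %lu",
		   &start, &end, perms, &inode) != 4)
		return 0;
	if (strlen(perms) != 4)
		return 0;
	if (vaddr < start || vaddr >= end)
		return 0;

	/* perms is "rwx" followed by 'p' (private) or 's' (shared) */
	return perms[2] == 'x' && perms[3] == 'p' && inode != 0;
}
```

A line such as "00400000-004e0000 r-xp 00000000 08:01 131085 /bin/bash" would pass for an address inside the range; the same line with "r-xs" would be rejected.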
Did you mean "We can hold mmap_sem?" Else I am not sure if we can traverse the vma. In fact alloc_page_vma() needs mmap_sem to be acquired. Okay. -- Thanks and Regards Srikar --
OK, so maybe I misunderstood, this is from the INT3 trap handler, right? We can _not_ take a sleeping lock from trap context. Why would you want the vma anyway? --
If I am right, the initial comment was both from the unregister_uprobe() -> write_opcode() context and uprobe_bkpt_notifier context. [ snipping relevant part of Oleg's mail from where the conversation started ] --------------------------------------------------------------------- But yes, if the mmap/mremap/munmap can happen between validating the int3 and removal of the breakpoint in the unregister_uprobe path, then it can as well happen between the breakpoint hit and the time uprobes does the fixups to continue execution after running the handler and single-stepping. I agree with you that we shouldn't bother about mmap/mremap/munmap of the Yes, we don't look at the vma in trap context at all. If we need to allocate a slot in the xol_vma then we set TIF_UPROBE and do the work in task context. -- Thanks and Regards --
Right so what I've suggested several times is to simply call the same handler in both contexts. If it returns -EFAULT, set TIF_UPROBE or whatever and try again from task context. --
Hi - That could work, but random partial execution & restart of the handler will make it tricky to write a single handler that reliably produces results. It would likely need a flag to indicate that it failed previously so as to throw away partial results. - FChE --
Or it shouldn't leave half-assed state around to begin with. --
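The retry scheme argued over in this subthread (one handler, called first from trap context, re-run from task context on -EFAULT, and leaving no partial state behind on failure) can be sketched in user space roughly as follows. This is a toy model, not the patchset's code: struct probe_ctx, probe_handler(), on_trap() and on_return_to_user() are hypothetical names, and the "deferred" field only stands in for a TIF_UPROBE-style thread flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names come from the patches. */
struct probe_ctx {
	bool page_resident;	/* can user memory be read without faulting? */
	bool deferred;		/* stands in for a TIF_UPROBE-like flag */
	int hits;		/* result, committed only on success */
};

/*
 * Single handler used in both contexts. It returns -EFAULT before
 * touching any result state, so a restart never sees partial output.
 */
int probe_handler(struct probe_ctx *ctx, bool can_sleep)
{
	if (!ctx->page_resident) {
		if (!can_sleep)
			return -EFAULT;		/* atomic context: give up */
		ctx->page_resident = true;	/* task context: fault it in */
	}
	ctx->hits++;				/* commit */
	return 0;
}

/* Trap (atomic) context: try once, defer on -EFAULT. */
void on_trap(struct probe_ctx *ctx)
{
	if (probe_handler(ctx, false) == -EFAULT)
		ctx->deferred = true;
}

/* Task context, e.g. on the way back to user space: retry if deferred. */
void on_return_to_user(struct probe_ctx *ctx)
{
	if (ctx->deferred) {
		ctx->deferred = false;
		probe_handler(ctx, true);
	}
}
```

The design choice illustrated here is the one from the last reply: the handler checks for the failure condition before it commits anything, so no "failed previously" flag is needed to discard partial results.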
X86 support for Uprobes This patch provides x86 specific details for uprobes. This includes interrupt notifier for uprobes, enabling/disabling singlestep. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> --- arch/x86/Kconfig | 1 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/uprobes.c | 87 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 89 insertions(+), 0 deletions(-) create mode 100644 arch/x86/kernel/uprobes.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 851cedc..a860a9b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -54,6 +54,7 @@ config X86 select HAVE_KERNEL_LZO select HAVE_HW_BREAKPOINT select HAVE_USER_BKPT + select HAVE_UPROBES select PERF_EVENTS select ANON_INODES select HAVE_ARCH_KMEMCHECK diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 98c74b4..bfa48f0 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -118,6 +118,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o obj-$(CONFIG_USER_BKPT) += user_bkpt.o +obj-$(CONFIG_UPROBES) += uprobes.o ### # 64 bit specific files diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c new file mode 100644 index 0000000..1acce22 --- /dev/null +++ b/arch/x86/kernel/uprobes.c @@ -0,0 +1,87 @@ +/* + * Userspace Probes (UProbes) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You ...
Uprobes documentation. Changelog: Addressed comments from Randy Dunlap. Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- Documentation/uprobes.txt | 244 +++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 244 insertions(+), 0 deletions(-) create mode 100644 Documentation/uprobes.txt diff --git a/Documentation/uprobes.txt b/Documentation/uprobes.txt new file mode 100644 index 0000000..d68dcdb --- /dev/null +++ b/Documentation/uprobes.txt @@ -0,0 +1,244 @@ +Title : User-Space Probes (Uprobes) +Authors : Jim Keniston <jkenisto@us.ibm.com> + : Srikar Dronamraju <srikar@linux.vnet.ibm.com> + +CONTENTS + +1. Concepts: Uprobes +2. Architectures Supported +3. Configuring Uprobes +4. API Reference +5. Uprobes Features and Limitations +6. Probe Overhead +7. TODO +8. Uprobes Team +9. Uprobes Example + +1. Concepts: Uprobes + +Uprobes enables you to dynamically break into any routine in a +user application and collect debugging and performance information +non-disruptively. You can trap at any code address, specifying a +kernel handler routine to be invoked when the breakpoint is hit. + +A uprobe can be inserted on any instruction in the application's +virtual address space. The registration function register_uprobe() +specifies which process is to be probed, where the probe is to be +inserted, and what handler is to be called when the probe is hit. + +Uprobes-based instrumentation can be packaged as a kernel +module. In the simplest case, the module's init function installs +("registers") one or more probes, and the exit function unregisters +them. + +1.1 How Does a Uprobe Work? + +When a uprobe is registered, Uprobes makes a copy of the probed +instruction, stops the probed application, replaces the first byte(s) +of the probed instruction with a breakpoint instruction (e.g., int3 +on i386 and x86_64), and allows the probed application to continue. +(When inserting the ...
Uprobes Samples
This provides an example uprobes module in the samples directory.
To run this module, run (as root):
insmod uprobe_example.ko vaddr=<vaddr> pid=<pid>
where <vaddr> is the address at which to place the probe and
<pid> is the pid of the process to be probed.
Example:
# cd samples/uprobes
[get the virtual address to place the probe.]
# vaddr=0x$(objdump -T /bin/bash |awk '/echo_builtin/ {print $1}')
[Run a bash shell in the background; have it echo 4 lines.]
# (sleep 10; echo 1; echo 2; echo 3; echo 4) &
[Probe calls echo_builtin() in the background bash process.]
# insmod uprobe_example.ko vaddr=$vaddr pid=$!
# sleep 10
# rmmod uprobe_example
# dmesg | tail -n 3
Registering uprobe on pid 10875, vaddr 0x45aa30
Unregistering uprobe on pid 10875, vaddr 0x45aa30
Probepoint was hit 4 times
#
[ Output shows that echo_builtin function was hit 4 times. ]
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
samples/Kconfig | 7 +++
samples/uprobes/Makefile | 17 ++++++++
samples/uprobes/uprobe_example.c | 83 ++++++++++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+), 0 deletions(-)
create mode 100644 samples/uprobes/Makefile
create mode 100644 samples/uprobes/uprobe_example.c
diff --git a/samples/Kconfig b/samples/Kconfig
index 8924f72..50b8b1c 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -44,4 +44,11 @@ config SAMPLE_HW_BREAKPOINT
help
This builds kernel hardware breakpoint example modules.
+config SAMPLE_UPROBES
+ tristate "Build uprobes example -- loadable module only"
+ depends on UPROBES && m
+ help
+ This builds uprobes example module.
+
+
endif # SAMPLES
diff --git a/samples/uprobes/Makefile b/samples/uprobes/Makefile
new file mode 100644
index 0000000..f535f6f
--- /dev/null
+++ b/samples/uprobes/Makefile
@@ -0,0 +1,17 @@
+# builds the uprobes example kernel modules;
+# then to use one (as root):
+# insmod ...
Uprobes Trace_events interface
The following patch implements trace_event support for uprobes. In its
current form it can be used to put probes at a specified text address
in a process and dump the required registers when the code flow reaches
the probed address.
This is based on trace_events for kprobes to the extent that it may
resemble that file on 2.6.34-rc3.
The following example shows how to dump the instruction pointer and the %ax
register at the probed text address.
Start a process to trace. Get the address to trace.
[Here pid is assumed as 3548]
[Address to trace is 0x0000000000446420]
[Registers to be dumped are %ip and %ax]
# cd /sys/kernel/debug/tracing/
# echo 'p 3548:0x0000000000446420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_3548_0x0000000000446420 3548:0x0000000000446420 %ip=%ip %ax=%ax
# cat events/uprobes/p_3548_0x0000000000446420/enable
0
[enable the event]
# echo 1 > events/uprobes/p_3548_0x0000000000446420/enable
# cat events/uprobes/p_3548_0x0000000000446420/enable
1
# #### do some activity on the program so that it hits the breakpoint
# cat uprobe_profile
3548 p_3548_0x0000000000446420 234
# head trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-3548 [001] 294.285812: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
zsh-3548 [001] 294.285884: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
zsh-3548 [001] 294.285894: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
zsh-3548 [001] 294.285903: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
zsh-3548 [001] 294.285912: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
zsh-3548 [001] 294.285922: p_3548_0x0000000000446420: (0x446420) %ip=446421 %ax=1
TODO: Documentation/trace/uprobetrace.txt
Signed-off-by: Srikar Dronamraju ...
Note, you want to really add this to trace_entries.h instead:

FTRACE_ENTRY(uprobe, uprobe_trace_entry,
	TRACE_GRAPH_ENT,
	F_STRUCT(
		__field(	unsigned long,	ip	)
		__field(	int,		nargs	)
		__dynamic_array(unsigned long,	args	)
	),
	F_printk("%lx nrargs:%u", __entry->ip, __entry->nargs)
);

This will put this event into the events/ftrace directory. Don't worry about the printk format, we can write a plugin for it to override it if need be.
By adding the above, other tools can know what it encountered instead of
If you added the event to trace_entries.h then this should be done
Or is it because of this special logic that you could not use the trace_entries.h? --
Hi Steven, Hmm, interesting idea. But this dynamic event definition allows us to filter events based on each argument value. Each argument can have a unique name, so the user can write a filter using these names. Moreover, dynamic events (at least kprobe-tracer) are going to support 'types' for each argument. This means that the arg[] in *probe_trace_entry will no longer be an unsigned long array. Thank you, -- Masami Hiramatsu e-mail: mhiramat@redhat.com --
Yeah, I don't think we should use FTRACE_ENTRY for that. The format files for [k|u]probes events are created dynamically on top of what the user requested, which is a very nice feature. --
That doesn't explain much what it does. Please explain its goal of This can be shared with kprobes in a new kernel/trace/dyn_probes.h Thanks. --
Agree, the unregister_trace_uprobe() has to be called after locking -- Thanks and Regards Srikar --
You can use the non-nowake version I think. nowake is for events that might occur when we hold the rq lock, hence when it's too dangerous to wake up. It doesn't seem to be the case since we came here after a trap in userspace. --
Slot allocation for Execution out of line strategy(XOL) This patch provides slot allocation mechanism for execution out of line strategy for use with user space breakpoint infrastructure. Traditional method of replacing the original instructions on breakpoint hit are racy when used on multithreaded applications. Alternatives for the traditional method include: - Emulating the breakpointed instruction. - Execution out of line. Emulating the instruction: This approach would use an in-kernel instruction emulator to emulate the breakpointed instruction. This approach could be looked in at a later point of time. Execution out of line: In execution out of line strategy, a new vma is injected into the target process, a copy of the instructions which are breakpointed is stored in one of the slots. On breakpoint hit, the copy of the instruction is single-stepped leaving the breakpoint instruction as is. This method is architecture independent. This method is useful while handling multithreaded processes. This patch allocates one page per process for slots to be used to copy the breakpointed instructions. Current slot allocation mechanism: 1. Allocate one dedicated slot per user breakpoint. Each slot is big enough to accommodate the biggest instruction for that architecture. (16 bytes for x86). 2. We currently allocate only one page for slots. Hence the number of slots is limited to active breakpoint hits on that process. 3. Bitmap to track used slots. Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- arch/Kconfig | 4 + include/linux/user_bkpt_xol.h | 61 +++++++++ kernel/Makefile | 1 kernel/user_bkpt_xol.c | 289 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 355 insertions(+), 0 deletions(-) create mode 100644 include/linux/user_bkpt_xol.h create mode 100644 kernel/user_bkpt_xol.c diff --git a/arch/Kconfig b/arch/Kconfig index ...
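The slot allocation scheme in the patch description above (one page, fixed 16-byte slots, a bitmap of used slots) can be illustrated with a small user-space sketch. It is not the code from kernel/user_bkpt_xol.c: the names xol_area_sketch, xol_init(), xol_take_slot() and xol_free_slot() are invented for this example, and a 4096-byte page is assumed, which gives 4096 / 16 = 256 slots.

```c
#include <limits.h>
#include <string.h>

#define XOL_AREA_SIZE	4096	/* assumed page size */
#define XOL_SLOT_SIZE	16	/* biggest x86 instruction, per the description */
#define XOL_NSLOTS	(XOL_AREA_SIZE / XOL_SLOT_SIZE)
#define XOL_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define XOL_NWORDS	((XOL_NSLOTS + XOL_WORD_BITS - 1) / XOL_WORD_BITS)

/* Hypothetical bookkeeping for one per-process XOL page. */
struct xol_area_sketch {
	unsigned long bitmap[XOL_NWORDS];	/* bit i set => slot i in use */
};

void xol_init(struct xol_area_sketch *area)
{
	memset(area, 0, sizeof(*area));
}

/* First-fit: returns the slot's byte offset within the page, or -1 if full. */
long xol_take_slot(struct xol_area_sketch *area)
{
	int i;

	for (i = 0; i < XOL_NSLOTS; i++) {
		unsigned long *word = &area->bitmap[i / XOL_WORD_BITS];
		unsigned long mask = 1UL << (i % XOL_WORD_BITS);

		if (!(*word & mask)) {
			*word |= mask;
			return (long)i * XOL_SLOT_SIZE;
		}
	}
	return -1;
}

/* Releases the slot at the given offset; out-of-range offsets are ignored. */
void xol_free_slot(struct xol_area_sketch *area, long offset)
{
	long i = offset / XOL_SLOT_SIZE;

	if (offset < 0 || i >= XOL_NSLOTS)
		return;
	area->bitmap[i / XOL_WORD_BITS] &= ~(1UL << (i % XOL_WORD_BITS));
}
```

A real implementation would also need locking around the bitmap and a policy for what to do when all slots are taken; both are left out of this sketch.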
