[Ltrace-devel] Tracing multi-threaded processes

Tue Jun 21 16:49:32 UTC 2011

On Tue, Jun 21, 2011 at 12:10 PM,  <pmachata at redhat.com> wrote:
> Hi there,
>
> there is some support for multi-threaded processes in ltrace, but so far
> it was incomplete.  Everything works if the threads stay away from each
> other, but as soon as they end up in the same area of code, it all
> breaks.
>
> The problem is due to return breakpoints.  When two threads take the
> same function call, ltrace places two breakpoints over each other,
> because it has no concept of shared address space.  There are many
> problems with this, and ltrace ends up seeing unexpected breakpoints,
> and SIGSEGVing the process.
>
> To solve this, ltrace must first learn that there is such a thing as a
> task and a thread group.  Then it needs to store all the
> breakpoints in the structure shared by all the tasks in the thread
> group.  To prevent races, before any breakpoint is temporarily disabled
> (for re-enablement, namely continue_after_breakpoint), all tasks in the
> thread group must be stopped.
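
Stopping every task in a group comes down to signalling one thread ID at a time. Below is a minimal sketch of that step, assuming Linux and the raw tkill system call (which glibc does not wrap). The names sketch_task_kill and sketch_stop_tasks are hypothetical and are not taken from the branch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: send sig to a single task (thread ID) rather
 * than to the whole thread group.  Returns 0 on success, -1 on error
 * with errno set, as syscall() does. */
int sketch_task_kill(pid_t tid, int sig)
{
    return (int)syscall(SYS_tkill, tid, sig);
}

/* Hypothetical helper: send SIGSTOP to each listed task and return how
 * many sends succeeded.  It only sends the signals; it does not wait
 * for the tasks to actually stop. */
size_t sketch_stop_tasks(const pid_t *tids, size_t n)
{
    size_t sent = 0;
    for (size_t i = 0; i < n; ++i) {
        if (sketch_task_kill(tids[i], SIGSTOP) == 0)
            ++sent;
    }
    return sent;
}
```

A tracer would still have to wait for each resulting stop event before touching the shared breakpoint, since sending the signal does not mean the task has stopped yet.
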
>
> There is code on the branch pmachata/threads that implements this.
> Here's what the branch roughly does:
>
>  - Process * leader; was added to struct Process.  This points to the
>   leader of the thread group that the process belongs to.
>
>  - proper interfaces were added for handling the set of processes and
>   their tasks (add_process, remove_process, each_process, each_task).
>   The iteration interfaces (each_*) use call-backs to do the real work.
>
>  - interfaces were added for accessing the information about the
>   processes (process_leader, process_tasks, process_stopped,
>   process_status).
>
>  - a new interface task_kill is a wrapper for the SYS_tkill system call
>   that is not wrapped by glibc.  We use this to stop or continue a
>   single task.
>
>  - when we need to stop tasks for breakpoint re-enablement, we send
>   SIGSTOP.  This SIGSTOP has to be caught and sunk.  While we wait for
>   the signal to be delivered, we pump all incoming events to an event
>   queue that was created for this purpose (each_qd_event, enque_event).
>   The interface next_event takes events from the queue if there are
>   any.
>
>  - all this, the event interception, sinking of SIGSTOP etc., is very
>   platform specific.  So a thread group can now have a registered event
>   handler (install_event_handler, destroy_event_handler).  If present,
>   this is called at the beginning of handle_event.  The registered
>   handler can do whatever it wishes with the event in question, and
>   return either NULL (if the event was handled or sunk) or the original
>   (possibly modified) event that is then handled by the default handler
>   as usual.
>
>  - there have also been some small cleanups.
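
The handler contract described in the list (return NULL when the event was handled or sunk, otherwise return the event for default handling) can be sketched as follows. This is an illustration under assumptions, not the branch's code: the struct layouts, the sketch_ names, and the pending-stop counter are all hypothetical.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical event record; the real ltrace event carries more state. */
enum sketch_event_type { SKETCH_EVENT_SIGNAL, SKETCH_EVENT_BREAKPOINT };

struct sketch_event {
    enum sketch_event_type type;
    int pid;     /* task that produced the event */
    int signum;  /* meaningful only for SKETCH_EVENT_SIGNAL */
};

/* A handler returns NULL if it consumed the event, or the (possibly
 * modified) event that should go on to the default handler. */
typedef struct sketch_event *(*sketch_handler_fn)(struct sketch_event *ev,
                                                  void *data);

struct sketch_group {
    sketch_handler_fn handler;  /* NULL when no handler is installed */
    void *handler_data;
    int default_handled;        /* events that reached the default path */
};

void sketch_install_handler(struct sketch_group *g, sketch_handler_fn fn,
                            void *data)
{
    g->handler = fn;
    g->handler_data = data;
}

void sketch_destroy_handler(struct sketch_group *g)
{
    g->handler = NULL;
    g->handler_data = NULL;
}

/* Example handler: sink a SIGSTOP that we sent ourselves.  data points
 * to a count of stops that are still expected. */
struct sketch_event *sketch_sink_sigstop(struct sketch_event *ev, void *data)
{
    int *pending = data;
    if (ev->type == SKETCH_EVENT_SIGNAL && ev->signum == SIGSTOP
        && *pending > 0) {
        --*pending;
        return NULL;  /* sunk: the default handler never sees it */
    }
    return ev;
}

/* Returns 1 if the event reached the default handler, 0 if it was sunk. */
int sketch_dispatch(struct sketch_group *g, struct sketch_event *ev)
{
    if (g->handler != NULL) {
        ev = g->handler(ev, g->handler_data);
        if (ev == NULL)
            return 0;
    }
    g->default_handled++;  /* stand-in for the default handling */
    return 1;
}
```

Keeping the platform-specific logic behind a per-group callback leaves the default path unchanged whenever no handler is installed.
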
>
> For some reason, attaching to a running multi-threaded task doesn't work
> (this was one of the first things that I fixed, but apparently it got
> broken in the meantime), so that's what I'll be doing next.
>
> Then comes cleaning it all up and making the git history of my branch a
> bit less messy, at which point I'd ask some of you to review the (rather
> large) patch.  I also need to verify that it works on non-x86
> architectures; so far I have only worked with x86_64.  I'll keep you
> posted as my work progresses.

Sounds great, I look forward to taking a look at the code when it is ready.