[PATCH 5/9] fbsd-nat: Defer any ineligible events reported by wait.

John Baldwin jhb@FreeBSD.org
Mon Mar 27 20:49:45 GMT 2023


On 3/20/23 12:09 PM, Simon Marchi wrote:
> On 2/28/23 13:18, John Baldwin wrote:
>> If wait_1 finds an event for a thread or process that does not match
>> the set of threads and processes previously resumed, defer the event.
>> If the event is for a specific thread, suspend the thread and continue
>> the associated process before waiting for another event.
> 
> I do not understand this, because it doesn't match my mental model of
> how things should work (which is derived from working with linux-nat).
> In that model, a thread with a pending event should never be
> ptrace-resumed.  So the last sentence doesn't make sense, it implies
> that a thread has a pending and is ptrace-resumed (since we suspend it).

The difference here might be that on FreeBSD ptrace can only PT_CONTINUE
and wait() for entire processes.  When a thread contains multiple threads,
PT_CONTINUE resumes all threads belonging to that process.  Individual
threads can be controlled by using PT_SUSPEND on a specific LWP.  Those
LWPs will stay asleep after PT_CONTINUE.  If any thread encounters a
stop event (breakpoint, fork event, etc.) the entire process stops (the
kernel signals threads in userland if needed to get them back into the
kernel to stop).  Once it is stopped it reports an event via wait().

Currently on FreeBSD wait() only reports a single event at once.  So
if multiple threads hit a breakpoint, wait() returns for the first one,
PT_LWPINFO is used on the pid to find out what specific event happened
(and which LWP it happened for).  The other thread will keep its event
pending in the kernel, and after PT_CONTINUE it will immediately trigger
another stop and an event reported from wait().

>> One specific example of such an event is if a thread is created while
>> another thread in the same process hits a breakpoint.  If the second
>> thread's event is reported first, the target resume method does not
>> yet "know" about the new thread and will not suspend it via
>> PT_SUSPEND.
> 
> I was surprised to read that the resume method suspends some threads.  I
> don't think it should, but I do see the current code.  The current code
> appears to interpret the resume request as: "ensure the state of thread
> resumption looks exactly like this", so it suspends threads as needed,
> to match the requested resumption state.  I don't think it's supposed to
> work like that.  The resume method should only resume some threads, and
> if the core of GDB wants some threads stopped, it will call the stop
> method.

See above.  I can't individually suspend/resume threads.  I can only
resume processes, but decide which set of threads within a process
to keep resumed when resuming the entire process.

> Anyhow, just to be sure I understand the problem you describe: when
> fbsd-nat reports the breakpoint hit from "another thread", is the
> "thread created" event still in the kernel, not consumed by GDB?

Yes.  What happens is that GDB tries to single-step a thread over a
breakpoint, so it does a resume() of a single ptid_t.  To implement
this, fbsd-nat's resume has to explicitly PT_SUSPEND all the other
LWPs in the process before doing a PT_CONTINUE so that only the
stepped thread will execute.  However, this logic doesn't yet know
about the new thread (and it hasn't started executing in userland
yet).  However, once the new thread starts executing, it immediately
reports a ptrace event.  This confuses GDB currently becuase that
ptrace event can be reported right after the PT_CONTINUE.  I think
in part this is because unlike on Linux, when thread A creates thread
B, thread A doesn't report anything.  Instead, only thread B reports
an event when it starts.  On Linux thread creation is more like fork
where thread A reports an event saying it created B.
  
>> When wait is called, it will probably return the event
>> from the first thread before the result of the step from second
>> thread.  This is the case reported in PR 21497.
> 
>> ---
>>   gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++-
>>   1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
>> index ca278b871ef..3f7278c6ea0 100644
>> --- a/gdb/fbsd-nat.c
>> +++ b/gdb/fbsd-nat.c
>> @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus,
>>     if (is_async_p ())
>>       async_file_flush ();
>>   
>> -  wptid = wait_1 (ptid, ourstatus, target_options);
>> +  while (1)
>> +    {
>> +      wptid = wait_1 (ptid, ourstatus, target_options);
>> +
>> +      /* If no event was found, just return.  */
>> +      if (ourstatus->kind () == TARGET_WAITKIND_IGNORE
>> +	  || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED)
>> +	break;
>> +
>> +      /* If an event is reported for a thread or process while
>> +	 stepping some other thread, suspend the thread reporting the
>> +	 event and defer the event until it can be reported to the
>> +	 core.  */
>> +      if (!wptid.matches (resume_ptid))
>> +	{
>> +	  add_pending_event (wptid, *ourstatus);
>> +	  fbsd_nat_debug_printf ("deferring event [%s], [%s]",
>> +				 target_pid_to_str (wptid).c_str (),
>> +				 ourstatus->to_string ().c_str ());
>> +	  if (wptid.pid () == resume_ptid.pid ())
>> +	    {
>> +	      fbsd_nat_debug_printf ("suspending thread [%s]",
>> +				     target_pid_to_str (wptid).c_str ());
>> +	      if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1)
>> +		perror_with_name (("ptrace (PT_SUSPEND)"));
> 
> 
> This is the part I don't understand.  After wait_1 returned an event for
> a specific thread, I would expect this thread to be ptrace-stopped.  So,
> I don't see the need to suspend it.  You'd just leave it in the "resumed
> from the
> until.
> 
>> +	      if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1)
>> +		perror_with_name (("ptrace (PT_CONTINUE)"));
> 
> I don't get why you continue `wptid.pid ()`.  What assures you that this
> particular thread was stopped previously?  Perhaps I don't understand
> because I assume somethings about pids/lwps and ptrace that are only
> true on Linux.

In this specific case, here is the sequence of events:

- process P1 contains two threads, T1, and T2
- T1 creates a new thread T3, but T3 isn't yet scheduled to execute
- T2 hits a breakpoint so the process stops
- wait() returns P1
- PT_LWPINFO reports the breakpoint event for T2
- GDB undoes the breakpoint and tries to single-step T2 over the
   breakpoint
- GDB calls resume with step=1 and scope_ptid=ptid(P1, T2)
- fbsd_nat::resume iterates over the threads it knows about, but
   it only knows about T1 and T2
   - PT_SUSPEND T1
   - PT_RESUME T2
- the P1 process is now resumed via PT_CONTINUE
- T3 starts executing and reports its "born" event
- wait() returns P1
- PT_LWPINFO reports the thread birth event for T3

Previously at this point fbsd_nat::wait() returned an event for T3
which triggers the assertion failure in the PR.  With this patch
what happens now is:

- fbsd_nat::wait() realizes T3 isn't "resumed" from the core's
   perspective so explicitly suspends it via PT_SUSPEND
- the P1 process is continued via PT_CONTINUE
- T2 completes its step and reports the event
- wait() returns P1
- PT_LWPINFO reports the step-complete event for P2
- the event for T2 is returned from fbsd_nat::wait() to the core

-- 
John Baldwin



More information about the Gdb-patches mailing list