This is the mail archive of the
gdb-patches@sourceware.org
mailing list for the GDB project.
Re: Fix a couple of multiexec races
- From: Pedro Alves <pedro at codesourcery dot com>
- To: gdb-patches at sourceware dot org
- Cc: Vladimir Prus <vladimir at codesourcery dot com>
- Date: Mon, 8 Feb 2010 15:54:31 +0000
- Subject: Re: Fix a couple of multiexec races
- References: <201001131811.30403.vladimir@codesourcery.com>
On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote:
>
> The attached patch fixes a couple of glitches preventing (upcoming)
> '-exec-run --all' from working.
>
> First, consider this code inside linux_nat_wait_1:
>
> if (lp
> && ptid_is_pid (ptid)
> && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
> {
>
> if (debug_linux_nat)
> fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
> ptid_get_lwp (lp->ptid), status);
>
> if (WIFSTOPPED (lp->status))
> {
> if (WSTOPSIG (lp->status) != SIGSTOP)
> {
> stop_callback (lp, NULL);
>
> /* Resume in order to collect the sigstop. */
> ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
>
> stop_wait_callback (lp, NULL);
> }
>
> Because lp->status is naturally not NULL when calling stop_callback, and because
> stop_callback has:
>
> gdb_assert (lp->status == 0);
>
> if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be
> just removed.
I thought about this some more, and I think this deserves more
extensive fixing. See patch (and comments in it) below, which
I've just applied. I don't like relying on the moribund breakpoints
heuristic if we can, because it _can_ fail in practice. In this
case, I think if we -exec-run --all enough inferiors, it might
happens that we have too many events handled in the new
processes (due to target_wait(PID) before this pending breakpoint
is finally pulled to infrun (target_wait(-1)), possibly too late.
In the progress on working on this patch, I noticed that I had
done something weird to the lwp->resumed flag in non-stop
mode a while ago. We certainly shouldn't be always
tagging _all_ threads as resumed just because we resumed one
thread in linux_nat_resume!
>
> Second change is to address the issue I have reported earlier. I reproduce the original
> email below.
>
>
> I have read the comment inside cancel_breakpoints_callback, but I am not sure
> I fully understand it. The situation being handled is when GDB expects an
> event in thread A, and thread B hits a breakpoint. For all-stop mode, this
> makes sense -- after all we don't support reporting more than one event, and
> when user examines event in A he might indeed delete a breakpoint. And if he
> does not deletes a breakpoint, 'continue' resumes all threads by default,
> so we'll hit breakpoint in B again.
>
> But is this reasonable for non-stop. In non-stop, we get can further stop
> events while user examines event in A. Further, 'continue' resumes only
> the current thread. So, if we don't report event in B, user does not have
> any idea it has to be resumed.
Aaaaahhh, you meant _not_ reasonable for non-stop. That changes the
whole perspective compared to what I thought you were saying last
time I read this. :-) Yes, I agree.
>
> What I observe, specifically, is when I use '-exec-run --all' (a local patch),
> one inferior is not resumed. Here's the relevant bit of 'debug lin-lwp' output:
>
> *running,thread-id="2"
> &"linux_nat_wait: [process 4659]\n"
> &"LLW: waitpid 4656 received Trace/breakpoint trap (stopped)\n"
> LWP 4656 got an event 00057f, leaving pending.
> &"LLW: waitpid 4659 received Trace/breakpoint trap (stopped)\n"
> &"LLW: Candidate event Trace/breakpoint trap (stopped) in process 4659.\n"
> &"CB: Push back breakpoint for process 4656\n"
I've confirmed, using your pending MI patch (which I'll review next),
that `-exec-run --all' in non-stop, that this problem is now gone.
>
> And there's no further mention of this pid in the log; it remains stopped. The
> attached patch appears to improve things. Pedro, what do you think?
> Note that this patch also comments out the code that tries to stop a thread if
> it got != SIGSTOP. As discussed offlist, that code fires assertion inside
> stop_callback.
Yeah, that bit was obviously busted.
--
Pedro Alves
2010-02-08 Pedro Alves <pedro@codesourcery.com>
gdb/
* linux-nat.c (linux_nat_resume): In non-stop, also only tag
resumed LWPs as resumed.
(linux_nat_wait_1): If there's no resumed LWP in the set of LWPs
we're waiting for, bail out with TARGET_WAITKIND_IGNORE, instead
of throwing an internal error. If an LWP of a process we're not
waiting for reports a signal, don't force collecting a SIGSTOP,
and if it was breakpoint hit in non-stop mode, cancel it. Don't
go through all LWPs cancelling breakpoints in non-stop mode.
(resume_stopped_resumed_lwps): New.
(linux_nat_wait): Use it.
---
gdb/linux-nat.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 105 insertions(+), 21 deletions(-)
Index: src/gdb/linux-nat.c
===================================================================
--- src.orig/gdb/linux-nat.c 2010-02-08 15:13:45.000000000 +0000
+++ src/gdb/linux-nat.c 2010-02-08 15:16:42.000000000 +0000
@@ -1912,14 +1912,8 @@ linux_nat_resume (struct target_ops *ops
resume_many = (ptid_equal (minus_one_ptid, ptid)
|| ptid_is_pid (ptid));
- if (!non_stop)
- {
- /* Mark the lwps we're resuming as resumed. */
- iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
- iterate_over_lwps (ptid, resume_set_callback, NULL);
- }
- else
- iterate_over_lwps (minus_one_ptid, resume_set_callback, NULL);
+ /* Mark the lwps we're resuming as resumed. */
+ iterate_over_lwps (ptid, resume_set_callback, NULL);
/* See if it's the current inferior that should be handled
specially. */
@@ -3262,8 +3256,20 @@ retry:
lp = NULL;
status = 0;
- /* Make sure there is at least one LWP that has been resumed. */
- gdb_assert (iterate_over_lwps (ptid, resumed_callback, NULL));
+ /* Make sure that of those LWPs we want to get an event from, there
+ is at least one LWP that has been resumed. If there's none, just
+ bail out. The core may just be flushing asynchronously all
+ events. */
+ if (iterate_over_lwps (ptid, resumed_callback, NULL) == NULL)
+ {
+ ourstatus->kind = TARGET_WAITKIND_IGNORE;
+
+ if (debug_linux_nat_async)
+ fprintf_unfiltered (gdb_stdlog, "LLW: exit (no resumed LWP)\n");
+
+ restore_child_signals_mask (&prev_mask);
+ return minus_one_ptid;
+ }
/* First check if there is a LWP with a wait status pending. */
if (pid == -1)
@@ -3394,6 +3400,8 @@ retry:
&& ptid_is_pid (ptid)
&& ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
{
+ gdb_assert (lp->resumed);
+
if (debug_linux_nat)
fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
ptid_get_lwp (lp->ptid), status);
@@ -3402,12 +3410,30 @@ retry:
{
if (WSTOPSIG (lp->status) != SIGSTOP)
{
- stop_callback (lp, NULL);
-
- /* Resume in order to collect the sigstop. */
- ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
-
- stop_wait_callback (lp, NULL);
+ /* Cancel breakpoint hits. The breakpoint may
+ be removed before we fetch events from this
+ process to report to the core. It is best
+ not to assume the moribund breakpoints
+ heuristic always handles these cases --- it
+ could be too many events go through to the
+ core before this one is handled. All-stop
+ always cancels breakpoint hits in all
+ threads. */
+ if (non_stop
+ && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE
+ && WSTOPSIG (lp->status) == SIGTRAP
+ && cancel_breakpoint (lp))
+ {
+ /* Throw away the SIGTRAP. */
+ lp->status = 0;
+
+ if (debug_linux_nat)
+ fprintf (stderr,
+ "LLW: LWP %ld hit a breakpoint while waiting "
+ "for another process; cancelled it\n",
+ ptid_get_lwp (lp->ptid));
+ }
+ lp->stopped = 1;
}
else
{
@@ -3597,12 +3623,19 @@ retry:
starvation. */
if (pid == -1)
select_event_lwp (ptid, &lp, &status);
- }
- /* Now that we've selected our final event LWP, cancel any
- breakpoints in other LWPs that have hit a GDB breakpoint. See
- the comment in cancel_breakpoints_callback to find out why. */
- iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+ /* Now that we've selected our final event LWP, cancel any
+ breakpoints in other LWPs that have hit a GDB breakpoint.
+ See the comment in cancel_breakpoints_callback to find out
+ why. */
+ iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+
+ /* In all-stop, from the core's perspective, all LWPs are now
+ stopped until a new resume action is sent over. */
+ iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
+ }
+ else
+ lp->resumed = 0;
if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
{
@@ -3628,6 +3661,47 @@ retry:
return lp->ptid;
}
+/* Resume LWPs that are currently stopped without any pending status
+ to report, but are resumed from the core's perspective. */
+
+static int
+resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
+{
+ ptid_t *wait_ptid_p = data;
+
+ if (lp->stopped
+ && lp->resumed
+ && lp->status == 0
+ && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE)
+ {
+ gdb_assert (is_executing (lp->ptid));
+
+ /* Don't bother if there's a breakpoint at PC that we'd hit
+ immediately, and we're not waiting for this LWP. */
+ if (!ptid_match (lp->ptid, *wait_ptid_p))
+ {
+ struct regcache *regcache = get_thread_regcache (lp->ptid);
+ CORE_ADDR pc = regcache_read_pc (regcache);
+
+ if (breakpoint_inserted_here_p (get_regcache_aspace (regcache), pc))
+ return 0;
+ }
+
+ if (debug_linux_nat)
+ fprintf_unfiltered (gdb_stdlog,
+ "RSRL: resuming stopped-resumed LWP %s\n",
+ target_pid_to_str (lp->ptid));
+
+ linux_ops->to_resume (linux_ops, pid_to_ptid (GET_LWP (lp->ptid)),
+ lp->step, TARGET_SIGNAL_0);
+ lp->stopped = 0;
+ memset (&lp->siginfo, 0, sizeof (lp->siginfo));
+ lp->stopped_by_watchpoint = 0;
+ }
+
+ return 0;
+}
+
static ptid_t
linux_nat_wait (struct target_ops *ops,
ptid_t ptid, struct target_waitstatus *ourstatus,
@@ -3642,6 +3716,16 @@ linux_nat_wait (struct target_ops *ops,
if (target_can_async_p ())
async_file_flush ();
+ /* Resume LWPs that are currently stopped without any pending status
+ to report, but are resumed from the core's perspective. LWPs get
+ in this state if we find them stopping at a time we're not
+ interested in reporting the event (target_wait(specific_process),
+ for example, see linux_nat_wait_1), and meanwhile the event
+ became uninteresting. Don't bother resuming LWPs we're not going
+ to wait for if they'd stop immediately. */
+ if (non_stop)
+ iterate_over_lwps (minus_one_ptid, resume_stopped_resumed_lwps, &ptid);
+
event_ptid = linux_nat_wait_1 (ops, ptid, ourstatus, target_options);
/* If we requested any event, and something came out, assume there