This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] RFC: All stepping threads need the same checks and cleanup as the currently stepping thread


On Wed, Jan 22, 2014 at 6:03 AM, Pedro Alves <palves@redhat.com> wrote:
> On 01/21/2014 10:17 PM, Sterling Augustine wrote:
>
>> But the problem is that the thread out of the step range is not the
>> currently stepped thread.
>
> Then I'm quite confused.  Each thread has its own step range.  How come
> the thread that is not stepped has a step range at all then?

It is stepped. We are switching back to it after a complicated set of
events, see below.

>> Process_event_stop_test calls
>> switch_back_to_stepped_thread, which, in turn, calls resume, bypassing
>> the extra logic in process_event_stop_test to fix up the step range,
>> and leading to the assertion error.
>
> The issue seems to me, as previously discussed, not really about missing
> the "fix up" of the step range, but rather that we overstep the thread
> by mistake.

That is incorrect. The thread's stepping range looks something like
this, in source code:

x = 1; foo(); x = 2;

With the step range equivalent to the single line.

But the stepped-thread gets stopped in foo--that's what all the extra
logic in process_event_stop_test does to fix up the step range.

So the thread is not past the step range at all and will hit it
eventually, but it is outside it. There is logic in
process_event_stop_test to handle this exact case.

> Running the thread through process_event_stop_test makes us
> detect that the step finished (before we ever get to fix up the step
> range).

The step didn't finish. The thread stopped deeper in the stack.

> The thing missing is a testcase clearly showing that that's indeed
> the issue in question.  I spent a few days trying to write one from
> scratch a while ago, but failed, because linux-nat.c always gives
> preference to reporting the stepping/SIGTRAP thread if there are multiple
> simultaneous events, and it seemed like another signal needs to be
> involved to trigger this.

Even if I could release this app (which I can't), it is several
gigabytes big, and it takes a while to hit the case--it is obviously a
race that is unusual extremely uncommon.

I have spent a solid week trying to write a case to hit this
independently, but I can't.

> Perhaps we could confirm all this already in a log produced by
> your extra debug outputs ran against your big app?

I have attached a severely trimmed log to the bug here:

https://sourceware.org/bugzilla/show_bug.cgi?id=16292

The relevant lines in the log as far as I can tell are:

157292: LWP 28899 Takes a trace/breakpoint trap, and is inside its step range.
157293: LWP 29437 has a breakpoint pushed back.
157307: LWP 28899 is resumed and prepared to step (still inside step range).
159355: GDB switches contexts from LWP 28899 to LWP 29437
159808: LWP 29437 Takes a trace/breakpoint trap, and GDB sends SIGSTOP
to all other threads
161427: LWP 28899 Takes a Profiling timer expired trap, instead of a
SIGSTOP. outside of its step range.
161466: switch_to_stepped_thread decides to restart LWP 28899
(switch_to_stepped_thread).
161488: assertion failure.

> I'm not convinced the first two branches in switch_back_to_stepped_thread
> should be changed at all.  So without those that reduces to exactly the
> original patch I had shown you originally:
>
>  https://github.com/palves/gdb/commit/b6b55ba610f8db5d89ec7405c93013a10d9a1c20
>
> Does that alone fix things for you?

Yes, by itself it does work, but I thought you had rejected that for
some reason. The second patch you made breaks things by bypassing the
step-range fix up in process_event_stop_test.

> In that branch, I then later rewrote that fix differently:
>
>  https://github.com/palves/gdb/commit/1d56ddf439b6f7e7fa9759cf1f8e02106eea6af5
>
> The idea of that "better fix" was to handle the case mentioned in
> this comment:
>
> +       There might be some cases where this loses signal
> +       information, if a signal has arrived at exactly the same
> +       time that the PC changed, but this is the best we can do
> +       with the information available.
>
> by setting a breakpoint at the current PC, and re-resuming the thread.
> That means that if there was indeed some other signal/event pending,
> we'd collect it first.  But that's unfinished, and breaks hardware-step
> targets again in the process, for it only handles software-step targets.
> The thing preventing moving this forward is a testcase (or a log showing
> clearly that the problem is what I say above it is, which should show
> the steps needed to construct a testcase).

This doesn't seem to address the case at hand. In any case, the second
patch does not fix it.

I would be fine with your first patch though.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]