This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.



Locating incorrectly handled system calls


Hi,

My question is, what is the best way to locate incorrectly handled
system calls which cause gdb to not stop correctly when debugging a
large multithreaded application?

I am attempting to fix a problem where one thread in a multithreaded
application does not stop when a breakpoint is hit.

I am using gdbserver (from gdb 7.3.1) on Linux (2.6.31.5).

To reproduce the problem, I set a breakpoint on a frequently called
function (hundreds of times per second) and enter the command
"continue N", where N is a largish number.

My understanding is that this problem occurs because a system call,
such as sleep, returns early when it is interrupted and the application
does not re-enter it, and that the solution is to locate the bad system
call and make it handle the early return correctly by calling the
function again. See section 5.5.5, "Interrupted System Calls", in the
current online documentation, where it is recommended that:

sleep(10);

be replaced by

int unslept = 10;
while (unslept > 0)
    unslept = sleep(unslept);
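
For reference, the retry idea can be wrapped in small helpers so that call sites do not have to repeat the loop. This is only a sketch: the names sleep_fully and nanosleep_fully are made up for illustration, and it assumes that restarting the sleep with the remaining time is the desired behaviour.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep for the full number of seconds.
   sleep() returns the number of seconds left if a signal cut it short,
   so keep calling it with the remainder until nothing is left. */
static void sleep_fully(unsigned int seconds)
{
    unsigned int unslept = seconds;
    while (unslept > 0)
        unslept = sleep(unslept);
}

/* Hypothetical helper: the same idea with nanosleep(), which fails with
   EINTR and stores the remaining time in rem when interrupted.
   Returns 0 once the whole interval has elapsed, -1 on any other error. */
static int nanosleep_fully(struct timespec req)
{
    struct timespec rem;
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;      /* a real error, e.g. EINVAL */
        req = rem;          /* resume with the time still owed */
    }
    return 0;
}
```

A call such as sleep(10) would then become sleep_fully(10).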

However my application is very large, with millions of lines of code.
When the error happens, it is possible to locate the wild thread with
the ps command (ps -p PID -l -L) and cause it to stop (with kill -5
thread-pid; other signals also work). When the thread stops, it is
obvious that the badly handled call occurred some time before the
point at which the thread stopped.

I have attempted to locate the error point by periodically checking
the program status in /proc/<PID>/status and by invoking sigpending.
However, the results are confusing and do not lead to a definite
location for the bad system call. Instead, the stoppage appears to be
random, roughly in proportion to CPU time.
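
A rough sketch of the kind of check described above, done per thread instead of per process: on Linux each thread has its own status file under /proc/<PID>/task/<TID>/status, with fields such as State, SigPnd (signals pending for that thread) and ShdPnd (signals pending for the whole process). The helper name thread_status_field is made up for illustration, and whether these fields are enough to pin down the bad call is an open assumption.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the value of one field (e.g. "State" or
   "SigPnd") from /proc/<pid>/task/<tid>/status into out.
   outlen must be at least 1. Returns 0 if found, -1 otherwise. */
static int thread_status_field(pid_t pid, pid_t tid, const char *field,
                               char *out, size_t outlen)
{
    char path[64], line[256];
    size_t flen = strlen(field);
    int found = -1;

    snprintf(path, sizeof path, "/proc/%d/task/%d/status",
             (int)pid, (int)tid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "State:\tR (running)". */
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            const char *v = line + flen + 1;
            while (*v == ' ' || *v == '\t')
                v++;
            snprintf(out, outlen, "%s", v);
            out[strcspn(out, "\n")] = '\0';   /* drop trailing newline */
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

For example, thread_status_field(getpid(), tid, "State", buf, sizeof buf) would fill buf with something like "t (tracing stop)" for a thread the debugger has stopped, and "R (running)" for one that is still running.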

Is it possible for a thread to check that it is supposed to be stopped
and to abort, so as to locate the problem code?

Is it certain that the badly behaved thread is the one that actually
executed the incorrectly written code?

Hopefully I have asked the correct questions here.
Thanks in advance,

