This is the mail archive of the
gdb-patches@sources.redhat.com
mailing list for the GDB project.
Re: RFC: nptl threading patch for linux
- From: Michael Snyder <msnyder at redhat dot com>
- To: "J. Johnston" <jjohnstn at redhat dot com>
- Cc: gdb-patches at sources dot redhat dot com
- Date: Mon, 02 Jun 2003 11:49:46 -0700
- Subject: Re: RFC: nptl threading patch for linux
- Organization: Red Hat, Inc.
- References: <3EA84E74.3010101@redhat.com>
After reviewing the on-list discussion, it sounds like this patch
should go in. Jeff, would you do the honors?
Michael
"J. Johnston" wrote:
>
> The following is the last part of my revised nptl patch that has
> been broken up per Daniel J.'s suggestion. There are no generated
> files included in the patch.
>
> A bit of background is needed. First of all, in the nptl model,
> lwps and pids are distinct entities. If you perform a ps on a
> multithreaded application, you will not see its child threads show
> up. This differs from the linuxthreads model whereby an lwp was
> actually a pid. In the nptl model, you cannot issue a kill to
> a child thread via its lwp. If you want to do this, you need to use
> the tkill syscall instead. The action of sending a signal to a specific
> lwp is used commonly in lin-lwp.c. The first part of the nptl change is to
> determine at configuration time if we have the tkill syscall and then at run-time,
> if we can use it. A new function kill_lwp has been added which either calls the
> tkill syscall or the regular kill function as appropriate.
>
> Another key difference in behavior between nptl and linuxthreads is what happens
> when a thread exits. In the linuxthreads model, each thread that exits causes
> an exit event to occur. The main thread can exit before all of its children.
> In the nptl model, only the main thread generates the exit event and it only
> does so after all child threads have exited. So, to determine when an lwp
> has exited, we have to constantly check its status when we are able. When
> we get an exit event we have to determine if we are exiting the program or
> just one of the threads. Some additional logic has been added to the exit
> checking code such that if we have (lwp == pid), then we stop all active
> threads and check if they have already exited. If they have not exited,
> we resume these threads, otherwise, we delete them.
>
> Daniel J. brought up a good point regarding my previous attempt at this logic
> whereby I stopped all threads and resumed all threads. That old logic is wrong
> when scheduler_locking is set on. The new logic addresses this because
> it only stops/resumes threads which are not already stopped. Currently, if we lock
> into a linuxthreads thread and we run it until exit, we will cause a gdb_assert() to trigger
> because we will field the exit event and end up with no running threads. Under nptl, we
> never get notified of the thread exiting so for this scenario, we end up hung.
> In this instance, there is nothing we can do without a kernel change. Daniel J.
> is looking into a kernel change. For user threads, we could activate the death event
> in thread-db.c and make a low-level call to notify the lwp layer, but this does not
> handle lwps created directly by the end-user. This issue needs to be discussed
> further outside of this patch because I want to propose changing the assertion to
> allow the user to resume the other threads. I have tested the change with such a
> scenario (locking a thread and running until exit). For linuxthreads we trigger the
> assertion as before. For nptl, we hang as expected as we never get the exit event.
> A bug can be opened for this if the patch is accepted.
>
> The last piece of logic has to do with a new interface needed by the nptl libthread_db
> library to get the thread area. I have grouped this together with the lin-lwp changes
> because the lin-lwp changes cannot work without this crucial change.
>
> This patch has been tested for both linuxthreads and with an nptl kernel.
>
> Ok to commit?
>
> -- Jeff J.
>
> 2003-04-24 Jeff Johnston <jjohnstn@redhat.com>
>
> * acconfig.h: Add HAVE_TKILL_SYSCALL definition check.
> * config.in: Regenerated.
> * configure.in: Add test for syscall function and check for
> __NR_tkill macro in <syscall.h> to set HAVE_TKILL_SYSCALL.
> * configure: Regenerated.
> * thread-db.c (check_event): For create/death event breakpoints,
> loop through all messages to ensure that we read the message
> corresponding to the breakpoint we are at.
> * lin-lwp.c [HAVE_TKILL_SYSCALL]: Include <unistd.h> and
> <sys/syscall.h>.
> (kill_lwp): New function that uses tkill syscall or
> uses kill, depending on whether threading model is nptl or not.
> All callers of kill() changed to use kill_lwp().
> (lin_lwp_wait): Make special check when WIFEXITED occurs to
> see if all threads have already exited in the nptl model.
> (stop_and_resume_callback): New callback function used by the
> lin_lwp_wait thread exit handling code.
> (stop_wait_callback): Check for threads already having exited and
> delete such threads fromt the lwp list when discovered.
> (stop_callback): Don't assert retcode of kill call.
>
> Roland McGrath <roland@redhat.com>
> * i386-linux-nat.c (ps_get_thread_area): New function needed by
> nptl libthread_db.
>
> -------------------------------------------------------------------------------
> Index: lin-lwp.c
> ===================================================================
> RCS file: /cvs/src/src/gdb/lin-lwp.c,v
> retrieving revision 1.43
> diff -u -r1.43 lin-lwp.c
> --- lin-lwp.c 28 Mar 2003 21:42:41 -0000 1.43
> +++ lin-lwp.c 23 Apr 2003 22:52:44 -0000
> @@ -24,6 +24,10 @@
> #include "gdb_string.h"
> #include <errno.h>
> #include <signal.h>
> +#ifdef HAVE_TKILL_SYSCALL
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#endif
> #include <sys/ptrace.h>
> #include "gdb_wait.h"
>
> @@ -156,6 +160,7 @@
>
> /* Prototypes for local functions. */
> static int stop_wait_callback (struct lwp_info *lp, void *data);
> +static int lin_lwp_thread_alive (ptid_t ptid);
>
> /* Convert wait status STATUS to a string. Used for printing debug
> messages only. */
> @@ -627,6 +632,32 @@
> }
>
>
> +/* Issue kill to specified lwp. */
> +
> +static int tkill_failed;
> +
> +static int
> +kill_lwp (int lwpid, int signo)
> +{
> + errno = 0;
> +
> +/* Use tkill, if possible, in case we are using nptl threads. If tkill
> + fails, then we are not using nptl threads and we should be using kill. */
> +
> +#ifdef HAVE_TKILL_SYSCALL
> + if (!tkill_failed)
> + {
> + int ret = syscall (__NR_tkill, lwpid, signo);
> + if (errno != ENOSYS)
> + return ret;
> + errno = 0;
> + tkill_failed = 1;
> + }
> +#endif
> +
> + return kill (lwpid, signo);
> +}
> +
> /* Send a SIGSTOP to LP. */
>
> static int
> @@ -642,8 +673,15 @@
> "SC: kill %s **<SIGSTOP>**\n",
> target_pid_to_str (lp->ptid));
> }
> - ret = kill (GET_LWP (lp->ptid), SIGSTOP);
> - gdb_assert (ret == 0);
> + errno = 0;
> + ret = kill_lwp (GET_LWP (lp->ptid), SIGSTOP);
> + if (debug_lin_lwp)
> + {
> + fprintf_unfiltered (gdb_stdlog,
> + "SC: lwp kill %d %s\n",
> + ret,
> + errno ? safe_strerror (errno) : "ERRNO-OK");
> + }
>
> lp->signalled = 1;
> gdb_assert (lp->status == 0);
> @@ -667,11 +705,23 @@
>
> gdb_assert (lp->status == 0);
>
> - pid = waitpid (GET_LWP (lp->ptid), &status, lp->cloned ? __WCLONE : 0);
> + pid = waitpid (GET_LWP (lp->ptid), &status, 0);
> if (pid == -1 && errno == ECHILD)
> - /* OK, the proccess has disappeared. We'll catch the actual
> - exit event in lin_lwp_wait. */
> - return 0;
> + {
> + pid = waitpid (GET_LWP (lp->ptid), &status, __WCLONE);
> + if (pid == -1 && errno == ECHILD)
> + {
> + /* The thread has previously exited. We need to delete it now
> + because in the case of nptl threads, there won't be an
> + exit event unless it is the main thread. */
> + if (debug_lin_lwp)
> + fprintf_unfiltered (gdb_stdlog,
> + "SWC: %s exited.\n",
> + target_pid_to_str (lp->ptid));
> + delete_lwp (lp->ptid);
> + return 0;
> + }
> + }
>
> gdb_assert (pid == GET_LWP (lp->ptid));
>
> @@ -683,6 +733,7 @@
> status_to_str (status));
> }
>
> + /* Check if the thread has exited. */
> if (WIFEXITED (status) || WIFSIGNALED (status))
> {
> gdb_assert (num_lwps > 1);
> @@ -697,7 +748,31 @@
> target_pid_to_str (lp->ptid));
> }
> if (debug_lin_lwp)
> - fprintf_unfiltered (gdb_stdlog, "SWC: %s exited.\n",
> + fprintf_unfiltered (gdb_stdlog,
> + "SWC: %s exited.\n",
> + target_pid_to_str (lp->ptid));
> +
> + delete_lwp (lp->ptid);
> + return 0;
> + }
> +
> + /* Check if the current LWP has previously exited. For nptl threads,
> + there is no exit signal issued for LWPs that are not the
> + main thread so we should check whenever the thread is stopped. */
> + if (!lin_lwp_thread_alive (lp->ptid))
> + {
> + if (in_thread_list (lp->ptid))
> + {
> + /* Core GDB cannot deal with us deleting the current
> + thread. */
> + if (!ptid_equal (lp->ptid, inferior_ptid))
> + delete_thread (lp->ptid);
> + printf_unfiltered ("[%s exited]\n",
> + target_pid_to_str (lp->ptid));
> + }
> + if (debug_lin_lwp)
> + fprintf_unfiltered (gdb_stdlog,
> + "SWC: %s already exited.\n",
> target_pid_to_str (lp->ptid));
>
> delete_lwp (lp->ptid);
> @@ -756,7 +831,14 @@
> /* If there's another event, throw it back into the queue. */
> if (lp->status)
> {
> - kill (GET_LWP (lp->ptid), WSTOPSIG (lp->status));
> + if (debug_lin_lwp)
> + {
> + fprintf_unfiltered (gdb_stdlog,
> + "SWC: kill %s, %s\n",
> + target_pid_to_str (lp->ptid),
> + status_to_str ((int) status));
> + }
> + kill_lwp (GET_LWP (lp->ptid), WSTOPSIG (lp->status));
> }
> /* Save the sigtrap event. */
> lp->status = status;
> @@ -800,7 +882,7 @@
> target_pid_to_str (lp->ptid),
> status_to_str ((int) status));
> }
> - kill (GET_LWP (lp->ptid), WSTOPSIG (status));
> + kill_lwp (GET_LWP (lp->ptid), WSTOPSIG (status));
> }
> return 0;
> }
> @@ -1049,6 +1131,25 @@
>
> #endif
>
> +/* Stop an active thread, verify it still exists, then resume it. */
> +
> +static int
> +stop_and_resume_callback (struct lwp_info *lp, void *data)
> +{
> + struct lwp_info *ptr;
> +
> + if (!lp->stopped && !lp->signalled)
> + {
> + stop_callback (lp, NULL);
> + stop_wait_callback (lp, NULL);
> + /* Resume if the lwp still exists. */
> + for (ptr = lwp_list; ptr; ptr = ptr->next)
> + if (lp == ptr)
> + resume_callback (lp, NULL);
> + }
> + return 0;
> +}
> +
> static ptid_t
> lin_lwp_wait (ptid_t ptid, struct target_waitstatus *ourstatus)
> {
> @@ -1206,10 +1307,61 @@
> }
> }
>
> - /* Make sure we don't report a TARGET_WAITKIND_EXITED or
> - TARGET_WAITKIND_SIGNALLED event if there are still LWP's
> - left in the process. */
> + /* Check if the thread has exited. */
> if ((WIFEXITED (status) || WIFSIGNALED (status)) && num_lwps > 1)
> + {
> + if (in_thread_list (lp->ptid))
> + {
> + /* Core GDB cannot deal with us deleting the current
> + thread. */
> + if (!ptid_equal (lp->ptid, inferior_ptid))
> + delete_thread (lp->ptid);
> + printf_unfiltered ("[%s exited]\n",
> + target_pid_to_str (lp->ptid));
> + }
> +
> + /* If this is the main thread, we must stop all threads and
> + verify if they are still alive. This is because in the nptl
> + thread model, there is no signal issued for exiting LWPs
> + other than the main thread. We only get the main thread
> + exit signal once all child threads have already exited.
> + If we stop all the threads and use the stop_wait_callback
> + to check if they have exited we can determine whether this
> + signal should be ignored or whether it means the end of the
> + debugged application, regardless of which threading model
> + is being used. */
> + if (GET_PID (lp->ptid) == GET_LWP (lp->ptid))
> + {
> + lp->stopped = 1;
> + iterate_over_lwps (stop_and_resume_callback, NULL);
> + }
> +
> + if (debug_lin_lwp)
> + fprintf_unfiltered (gdb_stdlog,
> + "LLW: %s exited.\n",
> + target_pid_to_str (lp->ptid));
> +
> + delete_lwp (lp->ptid);
> +
> + /* If there is at least one more LWP, then the exit signal
> + was not the end of the debugged application and should be
> + ignored. */
> + if (num_lwps > 0)
> + {
> + /* Make sure there is at least one thread running. */
> + gdb_assert (iterate_over_lwps (running_callback, NULL));
> +
> + /* Discard the event. */
> + status = 0;
> + continue;
> + }
> + }
> +
> + /* Check if the current LWP has previously exited. In the nptl
> + thread model, LWPs other than the main thread do not issue
> + signals when they exit so we must check whenever the thread
> + has stopped. A similar check is made in stop_wait_callback(). */
> + if (num_lwps > 1 && !lin_lwp_thread_alive (lp->ptid))
> {
> if (in_thread_list (lp->ptid))
> {
> Index: acconfig.h
> ===================================================================
> RCS file: /cvs/src/src/gdb/acconfig.h,v
> retrieving revision 1.24
> diff -u -r1.24 acconfig.h
> --- acconfig.h 4 Jan 2003 00:34:42 -0000 1.24
> +++ acconfig.h 23 Apr 2003 22:52:44 -0000
> @@ -95,6 +95,9 @@
> /* Define if using Solaris thread debugging. */
> #undef HAVE_THREAD_DB_LIB
>
> +/* Define if you support the tkill syscall. */
> +#undef HAVE_TKILL_SYSCALL
> +
> /* Define on a GNU/Linux system to work around problems in sys/procfs.h. */
> #undef START_INFERIOR_TRAPS_EXPECTED
> #undef sys_quotactl
> Index: configure.in
> ===================================================================
> RCS file: /cvs/src/src/gdb/configure.in,v
> retrieving revision 1.126
> diff -u -r1.126 configure.in
> --- configure.in 26 Feb 2003 15:10:47 -0000 1.126
> +++ configure.in 23 Apr 2003 22:52:44 -0000
> @@ -384,6 +384,7 @@
> AC_CHECK_FUNCS(setpgid setpgrp)
> AC_CHECK_FUNCS(sigaction sigprocmask sigsetmask)
> AC_CHECK_FUNCS(socketpair)
> +AC_CHECK_FUNCS(syscall)
>
> dnl AC_FUNC_SETPGRP does not work when cross compiling
> dnl Instead, assume we will have a prototype for setpgrp if cross compiling.
> @@ -909,6 +910,24 @@
> if test "x$gdb_cv_thread_db_h_has_td_notalloc" = "xyes"; then
> AC_DEFINE(THREAD_DB_HAS_TD_NOTALLOC, 1,
> [Define if <thread_db.h> has the TD_NOTALLOC error code.])
> +fi
> +
> +dnl See if we have a sys/syscall header file that has __NR_tkill.
> +if test "x$ac_cv_header_sys_syscall_h" = "xyes"; then
> + AC_CACHE_CHECK([whether <sys/syscall.h> has __NR_tkill],
> + gdb_cv_sys_syscall_h_has_tkill,
> + AC_TRY_COMPILE(
> + [#include <sys/syscall.h>],
> + [int i = __NR_tkill;],
> + gdb_cv_sys_syscall_h_has_tkill=yes,
> + gdb_cv_sys_syscall_h_has_tkill=no
> + )
> + )
> +fi
> +dnl See if we can issue tkill syscall.
> +if test "x$gdb_cv_sys_syscall_h_has_tkill" = "xyes" && test "x$ac_cv_func_syscall" = "xyes"; then
> + AC_DEFINE(HAVE_TKILL_SYSCALL, 1,
> + [Define if we can use the tkill syscall.])
> fi
>
> dnl Handle optional features that can be enabled.
> Index: i386-linux-nat.c
> ===================================================================
> RCS file: /cvs/src/src/gdb/i386-linux-nat.c,v
> retrieving revision 1.44
> diff -u -r1.44 i386-linux-nat.c
> --- i386-linux-nat.c 16 Apr 2003 15:22:02 -0000 1.44
> +++ i386-linux-nat.c 23 Apr 2003 22:52:44 -0000
> @@ -70,6 +70,9 @@
> /* Defines I386_LINUX_ORIG_EAX_REGNUM. */
> #include "i386-linux-tdep.h"
>
> +/* Defines ps_err_e, struct ps_prochandle. */
> +#include "gdb_proc_service.h"
> +
> /* Prototypes for local functions. */
> static void dummy_sse_values (void);
>
> @@ -682,6 +685,21 @@
> offsetof (struct user, u_debugreg[regnum]), value);
> if (errno != 0)
> perror_with_name ("Couldn't write debug register");
> +}
> +
> +extern ps_err_e
> +ps_get_thread_area(const struct ps_prochandle *ph,
> + lwpid_t lwpid, int idx, void **base)
> +{
> + unsigned long int desc[3];
> +#define PTRACE_GET_THREAD_AREA 25
> +
> + if (ptrace (PTRACE_GET_THREAD_AREA,
> + lwpid, (void *) idx, (unsigned long) &desc) < 0)
> + return PS_ERR;
> +
> + *(int *)base = desc[1];
> + return PS_OK;
> }
>
> void