This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



[Bug runtime/15982] New: process.end probes broken on RHEL7


https://sourceware.org/bugzilla/show_bug.cgi?id=15982

            Bug ID: 15982
           Summary: process.end probes broken on RHEL7
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: dsmith at redhat dot com

On RHEL7, the UTRACE_P5_01_cmd subtest of the utrace_p5.exp testcase fails
intermittently. The test does roughly the following:

====
# stap -e 'probe process.end { printf("end\n") }' -c whoami
dsmith
====

After lots of debugging, I believe we've got a timing issue.

Here's my current theory. When the process ends, the kernel sends the
process's parent a SIGCHLD to let it know one of its children has died (via a
call to do_notify_parent()). Only after that does the kernel call the utrace
hook to let systemtap know the process has died (via a call to
tracehook_report_death()). Here's the relevant code from exit_notify(), in
kernel/exit.c.

====
        signal = tracehook_notify_death(tsk, &cookie, group_dead);
        if (signal >= 0)                                                        
                signal = do_notify_parent(tsk, signal);                         

        tsk->exit_state = signal == DEATH_REAP ? EXIT_DEAD : EXIT_ZOMBIE;       

        /* mt-exec, de_thread() is waiting for us */                            
        if (thread_group_leader(tsk) &&                                         
            tsk->signal->group_exit_task &&                                     
            tsk->signal->notify_count < 0)                                      
                wake_up_process(tsk->signal->group_exit_task);                  

        write_unlock_irq(&tasklist_lock);                                       

        tracehook_report_death(tsk, signal, cookie, group_dead);                
====

The userspace portion of systemtap, stapio, when it gets the SIGCHLD
immediately turns around and tells the module to quit. Here's the code from
staprun/mainloop.c:

====
  pid_t pid = waitpid(-1, &chld_stat, WNOHANG);                                 
  if (pid != target_pid) {                                                      
    return;                                                                     
  }                                                                             

  if (chld_stat) {                                                              
    // our child exited with a non-zero status                                  
    if (WIFSIGNALED(chld_stat)) {                                               
      warn(_("Child process exited with signal %d (%s)\n"),                     
          WTERMSIG(chld_stat), strsignal(WTERMSIG(chld_stat)));                 
      target_pid_failed_p = 1;                                                  
    }                                                                           
    if (WIFEXITED(chld_stat) && WEXITSTATUS(chld_stat)) {                       
      warn(_("Child process exited with status %d\n"),                          
          WEXITSTATUS(chld_stat));                                              
      target_pid_failed_p = 1;                                                  
    }                                                                           
  }                                                                             

  dbug(2, "sending STP_EXIT\n");                                                
  rc = write(control_channel, &btype, sizeof(btype)); // send STP_EXIT          
====

What I believe is happening is that the module is exiting before the
process.end probe gets a chance to hit. Or, more likely, the module's session
state is no longer STAP_SESSION_RUNNING by the time the process.end probe gets
hit, so the probe gets skipped.
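
To make the suspected ordering problem concrete, here is a minimal sketch in
plain C that models the two possible orderings. It is not systemtap or kernel
code: the struct, the SESSION_* values and all function names are hypothetical
stand-ins, and the only thing taken from the report is the idea that a probe
is skipped once the session is no longer in its running state.

```c
#include <assert.h>

/* Hypothetical stand-ins for the module's session states. Only the
 * "running" state (STAP_SESSION_RUNNING) is named in the report. */
enum session_state { SESSION_RUNNING, SESSION_STOPPING };

struct session {
    enum session_state state;
    int probes_hit;      /* process.end handlers that actually ran */
    int probes_skipped;  /* handlers dropped because the session left RUNNING */
};

/* Models stapio reacting to SIGCHLD by sending STP_EXIT to the module. */
static void handle_stp_exit(struct session *s)
{
    s->state = SESSION_STOPPING;
}

/* Models the death report arriving and the process.end probe firing. */
static void report_death(struct session *s)
{
    if (s->state != SESSION_RUNNING) {
        s->probes_skipped++;
        return;
    }
    s->probes_hit++;
}

/* Runs one ordering. A non-zero exit_first models stapio winning the race,
 * i.e. STP_EXIT is processed before the death report is delivered. */
static struct session simulate(int exit_first)
{
    struct session s = { SESSION_RUNNING, 0, 0 };

    if (exit_first) {
        handle_stp_exit(&s);
        report_death(&s);
    } else {
        report_death(&s);
        handle_stp_exit(&s);
    }

    /* Exactly one outcome per run: the probe either ran or was skipped. */
    assert(s.probes_hit + s.probes_skipped == 1);
    return s;
}
```

Calling simulate(0) gives a session with one probe hit, while simulate(1)
gives one skipped probe, which matches the intermittent missing "end" output
if the theory above is right.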

I'm not sure of the best way to fix this. For a system where I have
consistently seen the problem, a 1/4 second sleep in stapio between getting the
signal and sending the STP_EXIT command works around the problem. But that is
an *ugly* workaround...
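
For reference, the sleep workaround could look roughly like the sketch below.
It is an illustration only, not the actual change tried in stapio: sleep_ms()
and send_exit_after_delay() are hypothetical helper names, and the file
descriptor and message stand in for control_channel and btype from the
excerpt above. A delay like this only narrows the window, it does not close
the race.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Sleep for at least ms milliseconds, resuming if a signal interrupts the
 * sleep. Returns 0 on success, -1 on any error other than EINTR. */
static int sleep_ms(long ms)
{
    struct timespec req;

    assert(ms >= 0);
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    /* On EINTR, nanosleep() stores the remaining time back into req. */
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: wait delay_ms, then write the exit message to fd.
 * Returns the result of write(), or -1 if the sleep itself failed. */
static ssize_t send_exit_after_delay(int fd, const void *msg, size_t len,
                                     long delay_ms)
{
    if (sleep_ms(delay_ms) != 0)
        return -1;
    return write(fd, msg, len);
}
```

With a 250 ms delay, a call such as
send_exit_after_delay(control_channel, &btype, sizeof(btype), 250) would give
the death report a quarter of a second to arrive before the module is told to
quit.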


