Multi Threaded programs deadlock doing simple I/O operations

Mark Pizzolato list-cygwin@subscriptions.pizzolato.net
Mon Jun 20 07:09:00 GMT 2005


On Sunday, June 12, 2005 at 5:37 PM, Mark Pizzolato wrote:
> On Friday, June 10, 2005 at 3:44 PM, Mark Pizzolato wrote:
>> > On Thursday, June 09, 2005 at 6:12 PM, Mark Pizzolato wrote:
>> >> On Thursday, June 09, 2005 at 3:35 PM, Christopher Faylor wrote:
>> >> > On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
>> >> > >There is a serious problem for multi threaded programs doing simple
>> >> > >I/O operations in cygwin (open, dup, fdopen, fclose, and close).
>> >> > >
>> >> > >The attached 81 line test program clearly demonstrates the issue
>> >> > >(by hanging and no longer consuming CPU or performing any I/O
>> >> > >operations).
>> >> >
>> >> > Thanks for the relatively small test case.  That was enough to track
>> >> > the problem down.  I'm generating a new snapshot with a fix for this
>> >> > problem.
>> >>
>> >> The snapshot looks good!
>> >>
>> >> This fixes the stability problems with clamav's clamd that I've been
>> >> chasing for a long time.
>> >
>> > Some more follow up here...I'm running with the 20050609 snapshot dll.
>> >
>> > clamav's clamd now runs better than it has ever for me on cygwin.....
>> >
>> >           until "it doesn't",
>> >
>> > once it starts to run poorly it won't run cleanly again until I reboot
>> > the system (I haven't actually tried after merely exiting all
>> > processes ..)
>
> Well, I spoke too soon here.  There may be some interaction with many
> recently closed tcp sessions sitting in TIME_WAIT.  I'm not sure, but
> after some time, I can restart and experience apparently good behavior and
> then things get "poor" as described.
>
> If I run with the 20050607 snapshot, the new "poor" behavior doesn't 
> happen, while the test program I provided earlier in this thread hangs as 
> described. So, the fix to the original problem and the new "poor" behavior 
> are clearly related to changes between the 20050607 and the 20050609 
> snapshots.
>
>> > To be more specific about the "poor" behavior:
>> >
>> >
>> > - pthread_mutex_unlock fails leaving errno with a value of 90.  This is
>> > in a place where there is only one path through about a dozen lines of
>> > code and the mutex is definitely locked.  There may have been a call to
>> > pthread_create, and a definite call to pthread_cond_signal.
>> > - once the above error happens, calls (by the same thread) to accept()
>> > fail using a file descriptor which we've been successfully using all
>> > along and only close when the program exits.
>> >
>> > So some change introduced recently (since 1.5.17-1), and possibly in
>> > 20050609, fixes the dup() issue but now mutex operations are failing in
>> > strange ways.
>> >
>> > Sorry not to have a simple isolated test case for this.  The good news
>> > is that once it breaks it won't run correctly again until a reboot.
>
> I'm working on a test program to recreate this behavior.

Well...  The problem wasn't in cygwin.

As it happens, in clamav's clamd there were several pthread_mutex_t objects
which weren't initialized to reasonable values (i.e. left to be zero instead
of PTHREAD_MUTEX_INITIALIZER).  Calls to pthread_mutex_lock and
pthread_mutex_unlock on the uninitialized objects, depending on timing and
sequence, apparently confused some aspect of mutex processing, causing
other calls to pthread_mutex_lock and pthread_mutex_unlock to fail in
strange ways.
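
For anyone hitting something similar, here is a minimal sketch of the two
usual ways to initialize a mutex before first use.  This is not the clamav
patch; the names (counter_lock, struct job_queue, and the helper functions)
are made up for illustration.  It also checks the return value of each
call, since POSIX specifies that the pthread_mutex_* functions return an
error number directly.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* All names below are hypothetical; this is not clamd's code. */

/* Statically allocated mutex: use the initializer macro instead of
 * relying on zero-filled storage. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* Mutex embedded in an object that is set up at run time. */
struct job_queue {
    pthread_mutex_t lock;
    int pending;
};

/* Returns 0 on success, or the error number from pthread_mutex_init. */
int job_queue_init(struct job_queue *q)
{
    q->pending = 0;
    return pthread_mutex_init(&q->lock, NULL);  /* default attributes */
}

/* Returns 0 on success, or the error number from pthread_mutex_destroy. */
int job_queue_destroy(struct job_queue *q)
{
    return pthread_mutex_destroy(&q->lock);
}

/* Adds one pending job under the queue's lock; returns 0 or an error number. */
int job_queue_add(struct job_queue *q)
{
    int rc = pthread_mutex_lock(&q->lock);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(rc));
        return rc;
    }
    q->pending++;
    rc = pthread_mutex_unlock(&q->lock);
    if (rc != 0)
        fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(rc));
    return rc;
}

/* Same pattern for the statically initialized mutex. */
int counter_increment(void)
{
    int rc = pthread_mutex_lock(&counter_lock);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(rc));
        return rc;
    }
    counter++;
    rc = pthread_mutex_unlock(&counter_lock);
    if (rc != 0)
        fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(rc));
    return rc;
}
```

A caller would run job_queue_init() once before any thread touches the
queue and job_queue_destroy() after the last use.  Checking the return
value at every lock and unlock is what makes a failure like the one
described above show up at the call that actually went wrong.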

Appropriate patches have been submitted to the clamav team.

- Mark Pizzolato 


