AF_UNIX/SOCK_DGRAM is dropping messages

Ken Brown kbrown@cornell.edu
Wed Apr 14 15:53:56 GMT 2021


On 4/13/2021 6:43 PM, Ken Brown via Cygwin wrote:
> On 4/13/2021 10:47 AM, Ken Brown via Cygwin wrote:
>> On 4/13/2021 10:06 AM, sten.kristian.ivarsson@gmail.com wrote:
>>> Hi Ken
>>>
>>>>>>>>>>>> Using AF_UNIX/SOCK_DGRAM with current version (3.2.0) seems to
>>>>>>>>>>>> drop messages or at least they are not received in the same
>>>>>>>>>>>> order they are sent
>>>>>>>>
>>>>>>>> [snip]
>>>>>>>>
>>>>>>>>> Thanks for the test case.  I can confirm the problem.  I'm not
>>>>>>>>> familiar enough with the current AF_UNIX implementation to debug
>>>>>>>>> this easily.  I'd rather spend my time on the new implementation
>>>>>>>>> (on the topic/af_unix branch).  It turns out that your test case
>>>>>>>>> fails there too, but in a completely different way, due to a bug
>>>>>>>>> in sendto for datagrams.  I'll see if I can fix that bug and then try 
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> Ken
>>>>>>>>
>>>>>>>> Ok, too bad it wasn't our own code base but good that the "mystery"
>>>>>>>> is verified
>>>>>>>>
>>>>>>>> I finally succeeded in building topic/af_unix (after finding out what
>>>>>>>> version of zlib was needed), but not with -D__WITH_AF_UNIX added to
>>>>>>>> CXXFLAGS, so I haven't tested it yet
>>>>>>>>
>>>>>>>> Is it sufficient to add the define to the "main" Makefile or do you
>>>>>>>> have to add it to all the Makefiles? I guess I can find out
>>>>>>>> though
>>>>>>>
>>>>>>> I do it on the configure line, like this:
>>>>>>>
>>>>>>>     ../af_unix/configure CXXFLAGS="-g -O0 -D__WITH_AF_UNIX" --prefix=...
>>>>>>>
>>>>>>>> Is topic/af_unix fairly up to date with master branch ?
>>>>>>>
>>>>>>> Yes, I periodically cherry-pick commits from master to topic/af_unix.
>>>>>>> I'll do that again right now.
>>>>>>>
>>>>>>>> Either way, I'll be glad to help out testing topic/af_unix
>>>>>>>
>>>>>>> Thanks!
>>>>>>
>>>>>> I've now pushed a fix for that sendto bug, and your test case runs
>>>>>> without error on the topic/af_unix branch.
>>>>>
>>>>> It seems like the test case does work now with topic/af_unix in blocking
>>>>> mode, but when using non-blocking (with MSG_DONTWAIT) there are some
>>>>> issues I think
>>>>>
>>>>> 1. When the queue is empty with non-blocking recv(), errno is set to
>>>>> EPIPE but I think it should be EAGAIN (or maybe the pipe is getting
>>>>> broken for real for some reason?)
>>>>>
>>>>> 2. When using non-blocking recv() and no message is written at all, it
>>>>> seems like recv() blocks forever
>>>>>
>>>>> 3. Using non-blocking recv() where the "client" sends fewer than
>>>>> "count" messages, sometimes recv() blocks forever (as well)
>>>>>
>>>>>
>>>>> My naïve analysis of this is that for the first issue (if any) the
>>>>> wrong errno is set and for the second issue it blocks if no sendto()
>>>>> is done after the first recv(), i.e. nothing kicks the "reader thread"
>>>>> in the butt to realise the queue is empty. It is not super clear
>>>>> though what POSIX says about creating blocking descriptors and then
>>>>> using non-blocking flags with recv(), but this works on Linux anyway
>>>>
>>>> The explanation is actually much simpler.  In the recv code where a bound
>>>> datagram socket waits for a remote socket to connect to the pipe, I simply
>>>> forgot to handle MSG_DONTWAIT.  I've pushed a fix.  Please retest.
>>>>
>>>> I should add that in all my work so far on the topic/af_unix branch, I've
>>>> thought mainly about stream sockets.  So there may still be things remaining
>>>> to be implemented for the datagram case.
>>>
>>> I finally got some time to test topic/af_unix in our "real" 
>>> cygwin-application (casual) and unfortunately very few of our unittests pass
>>>
>>> The symptoms are that there's unexpected eternal blocking, sometimes there's 
>>> unexpected EADDRNOTAVAIL, sometimes it looks like some memory corruption (and
>>> core-dumps)
>>>
>>> Of course the memory corruption etc. could be our own fault and the core-dumps
>>> might be because of uncaught exceptions
>>>
>>> Needless to say, all unittests pass on Linux, but of course
>>> cygwin-topic/af_unix could act according to the POSIX standard and the behaviour
>>> could be due to our own misinterpretation of how POSIX works
>>
>> More likely it's due to bugs in the topic/af_unix branch.  This is still very 
>> much a work in progress.
>>
>>> I will try to narrow down the quite complex logic and reproduce the problems
>>
>> That would be ideal.
>>
>>> If you for some reason want to try it with casual, I'd be glad to help you out
>>> (it should be easier now than last time (but there might be some
>>> documentation missing for Cygwin still))
>>>
>>> https://bitbucket.org/casualcore/
>>
>> I'm going on vacation in a few days, but I might do this when I get back.
>>
>> Thanks for your testing.
> 
> By the way, if your code is using datagram sockets, then there are very serious 
> problems with our implementation (even aside from the performance issue that 
> we've already discussed).  For example, I don't know of any reasonable way for 
> select to test whether such a socket is ready for writing.  We'll need to solve 
> that somehow.

I'm going to follow up on the cygwin-developers list.

Ken
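
For reference, here is a minimal sketch of the check behind issue 1 in the
quoted report: a bound, blocking AF_UNIX datagram socket on which recv() is
called with MSG_DONTWAIT before anything has been sent.  POSIX allows EAGAIN
or EWOULDBLOCK in that situation.  The function name and the socket path
passed to it are hypothetical; this is not the original test case, which is
snipped above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind a blocking AF_UNIX datagram socket at `path`,
   call recv() once with MSG_DONTWAIT while the queue is empty, and return
   the errno that recv() reported.  Returns 0 if recv() unexpectedly
   succeeded and -1 if the socket could not be set up. */
int errno_of_empty_nonblocking_recv(const char *path)
{
    struct sockaddr_un addr;
    char buf[64];
    ssize_t n;
    int fd, err;

    /* The path must fit in sun_path including the terminating NUL. */
    assert(path != NULL && strlen(path) < sizeof addr.sun_path);

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

    unlink(path);               /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }

    /* The descriptor itself is blocking; only this call is non-blocking. */
    errno = 0;
    n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
    err = (n < 0) ? errno : 0;

    close(fd);
    unlink(path);
    return err;
}
```

Calling errno_of_empty_nonblocking_recv() with a writable path should return
promptly rather than block, and the value can then be compared against
EAGAIN, EWOULDBLOCK and EPIPE.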

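To make the select question concrete, here is a sketch of what an
application typically does when it asks whether a datagram socket is ready
for writing.  It uses socketpair() to get a connected AF_UNIX datagram pair
so that it is self-contained; the function name is hypothetical.  It only
shows the caller's side and says nothing about how the implementation
should compute the answer.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if select() reports one end of a connected
   AF_UNIX datagram pair as writable within timeout_ms milliseconds,
   0 on timeout, -1 on error. */
int dgram_ready_for_writing(int timeout_ms)
{
    int sv[2];
    fd_set wfds;
    struct timeval tv;
    int rc;

    assert(timeout_ms >= 0);

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Only the write set is of interest here. */
    FD_ZERO(&wfds);
    FD_SET(sv[0], &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(sv[0] + 1, NULL, &wfds, NULL, &tv);
    if (rc > 0)
        rc = FD_ISSET(sv[0], &wfds) ? 1 : 0;

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With nothing queued, I would expect a call such as
dgram_ready_for_writing(1000) to report the socket as writable on Linux.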
