AF_UNIX/SOCK_DGRAM is dropping messages

Ken Brown kbrown@cornell.edu
Wed Apr 14 21:58:08 GMT 2021


On 4/14/2021 1:14 PM, sten.kristian.ivarsson@gmail.com wrote:
>>>> Hi Ken
>>>>
>>>>>>>>>>>>> Using AF_UNIX/SOCK_DGRAM with current version (3.2.0) seems to
>>>>>>>>>>>>> drop messages or at least they are not received in the same
>>>>>>>>>>>>> order they are sent
>>>>>>>>>
>>>>>>>>> [snip]
>>>>>>>>>
>>>>>>>>>> Thanks for the test case.  I can confirm the problem.  I'm not
>>>>>>>>>> familiar enough with the current AF_UNIX implementation to
>>>>>>>>>> debug this easily.  I'd rather spend my time on the new
>>>>>>>>>> implementation (on the topic/af_unix branch).  It turns out
>>>>>>>>>> that your test case fails there too, but in a completely
>>>>>>>>>> different way, due to a bug in sendto for datagrams.  I'll see
>>>>>>>>>> if I can fix that bug and then try again.
>>>>>>>>>>
>>>>>>>>>> Ken
>>>>>>>>>
>>>>>>>>> Ok, too bad it wasn't our own code base but good that the
>>>>>>>>> "mystery" is verified
>>>>>>>>>
>>>>>>>>> I finally succeeded in building topic/af_unix (after finding out
>>>>>>>>> what version of zlib was needed), but without -D__WITH_AF_UNIX
>>>>>>>>> in CXXFLAGS, and thus I haven't tested it yet
>>>>>>>>>
>>>>>>>>> Is it sufficient to add the define to the "main" Makefile or do
>>>>>>>>> you have to add it to all the Makefile:s ? I guess I can find
>>>>>>>>> out though
>>>>>>>>
>>>>>>>> I do it on the configure line, like this:
>>>>>>>>
>>>>>>>>      ../af_unix/configure CXXFLAGS="-g -O0 -D__WITH_AF_UNIX" --prefix=...
>>>>>>>>
>>>>>>>>> Is topic/af_unix fairly up to date with master branch ?
>>>>>>>>
>>>>>>>> Yes, I periodically cherry-pick commits from master to topic/af_unix.
>>>>>>>> I'll do that again right now.
>>>>>>>>
>>>>>>>>> Either way, I'll be glad to help out testing topic/af_unix
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>
>>>>>>> I've now pushed a fix for that sendto bug, and your test case runs
>>>>>>> without error on the topic/af_unix branch.
>>>>>>
>>>>>> It seems like the test case does work now with topic/af_unix in
>>>>>> blocking mode, but when using non-blocking (with MSG_DONTWAIT)
>>>>>> there are some issues, I think
>>>>>>
>>>>>> 1. When the queue is empty with non-blocking recv(), errno is set
>>>>>> to EPIPE but I think it should be EAGAIN (or maybe the pipe is
>>>>>> getting broken for real for some reason ?)
>>>>>>
>>>>>> 2. When using non-blocking recv() and no message is written at all,
>>>>>> it seems like recv() blocks forever
>>>>>>
>>>>>> 3. Using non-blocking recv() where the "client" sends fewer than
>>>>>> "count" messages, sometimes recv() blocks forever (as well)
>>>>>>
>>>>>>
>>>>>> My naïve analysis of this is that for the first issue (if any) the
>>>>>> wrong errno is set and for the second issue it blocks if no
>>>>>> sendto() is done after the first recv(), i.e. nothing kicks the
>>>>>> "reader thread" in the butt to realise the queue is empty. It is
>>>>>> not super clear though what POSIX says about creating blocking
>>>>>> descriptors and then using non-blocking flags with recv(), but
>>>>>> this works in Linux anyway
>>>>>
>>>>> The explanation is actually much simpler.  In the recv code where a
>>>>> bound datagram socket waits for a remote socket to connect to the
>>>>> pipe, I simply forgot to handle MSG_DONTWAIT.  I've pushed a fix.
>>>>> Please retest.
>>>>>
>>>>> I should add that in all my work so far on the topic/af_unix branch,
>>>>> I've thought mainly about stream sockets.  So there may still be
>>>>> things remaining to be implemented for the datagram case.
>>>>
>>>> I finally got some time to test topic/af_unix in our "real"
>>>> cygwin-application (casual) and unfortunately very few of our
>>>> unittests pass
>>>>
>>>> The symptoms are that there's unexpected eternal blocking, sometimes
>>>> there's unexpected EADDRNOTAVAIL, sometimes it looks like some
>>>> memory corruption (and core-dumps)
>>>>
>>>> Of course the memory corruption etc could be our own fault and the
>>>> core-dumps might be because of uncaught exceptions
>>>>
>>>> Needless to say, all unittests pass on Linux, but of course
>>>> cygwin-topic/af_unix could act according to the POSIX standard and the
>>>> behaviour could be due to our own misinterpretation of how POSIX works
>>>
>>> More likely it's due to bugs in the topic/af_unix branch.  This is
>>> still very much a work in progress.
>>>
>>>> I will try to narrow down the quite complex logic and reproduce the
>>>> problems
>>>
>>> That would be ideal.
>>>
>>>> If you for some reason want to try it with casual, I'd be glad to help
>>>> you out (it should be easier now than last time (but there might be
>>>> some documentation missing for Cygwin still))
>>>>
>>>> https://bitbucket.org/casualcore/
>>>
>>> I'm going on vacation in a few days, but I might do this when I get back.
>>>
>>> Thanks for your testing.
>>
>> By the way, if your code is using datagram sockets, then there are very serious
>> problems with our implementation (even aside from the performance issue
>> that we've already discussed).  For example, I don't know of any reasonable
>> way for select to test whether such a socket is ready for writing.  We'll need to
>> solve that somehow.
> 
> If by that you mean whether we're using SOCK_DGRAM, the answer is yes
> 
> I tried SOCK_STREAM (and SOCK_SEQPACKET I think) for CYGWIN 3.2.0 but that didn't work at all
> 
> As far as I understand, all socket types on pretty much all implementations preserve message ordering though
> 
> I haven't tried SOCK_STREAM and/or SOCK_SEQPACKET with the topic/af_unix-branch. Is that worth a try ?

SOCK_STREAM is definitely worth a try.  The implementation of that should be 
much more reliable than the implementation of SOCK_DGRAM at the moment.  We 
don't implement SOCK_SEQPACKET.
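
One thing to keep in mind when moving a test from SOCK_DGRAM to SOCK_STREAM is
that a stream socket does not preserve message boundaries, so the application
has to frame its own messages.  Below is a minimal sketch of length-prefixed
framing over an AF_UNIX stream socketpair.  The helper names (write_all,
read_all, send_msg, recv_msg, run_framing_demo) and the 4-byte host-order
length prefix are assumptions made for illustration; they are not taken from
the test case or from casual.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send exactly len bytes, retrying on short sends. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: receive exactly len bytes; -1 on error or early EOF. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* peer closed before the message was complete */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one message as a 4-byte length (host byte order, which is fine for a
   local AF_UNIX socket) followed by the payload. */
static int send_msg(int fd, const void *data, uint32_t len)
{
    if (write_all(fd, &len, sizeof len) < 0)
        return -1;
    return write_all(fd, data, len);
}

/* Receive one framed message into buf; returns its length or -1. */
static ssize_t recv_msg(int fd, void *buf, size_t cap)
{
    uint32_t len;
    assert(buf != NULL);
    if (read_all(fd, &len, sizeof len) < 0)
        return -1;
    if (len > cap) {
        errno = EMSGSIZE;
        return -1;
    }
    if (read_all(fd, buf, len) < 0)
        return -1;
    return (ssize_t)len;
}

/* Sends two messages back to back and checks that they come out as two
   separate messages in the order sent.  Returns 0 on success, -1 otherwise. */
int run_framing_demo(void)
{
    int sv[2];
    char buf[64];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (send_msg(sv[0], "first", 5) == 0 && send_msg(sv[0], "second", 6) == 0) {
        ssize_t n1 = recv_msg(sv[1], buf, sizeof buf);
        int ok1 = (n1 == 5 && memcmp(buf, "first", 5) == 0);
        ssize_t n2 = recv_msg(sv[1], buf, sizeof buf);
        int ok2 = (n2 == 6 && memcmp(buf, "second", 6) == 0);
        if (ok1 && ok2)
            rc = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Calling run_framing_demo() from a small main() and checking for a 0 return
would be one quick way to see whether stream sockets behave as expected on a
given build, independently of the datagram code path.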

Ken

