AF_UNIX/SOCK_DGRAM is dropping messages

Ken Brown kbrown@cornell.edu
Thu Apr 8 21:02:10 GMT 2021


On 4/8/2021 3:47 PM, sten.kristian.ivarsson@gmail.com wrote:
>>>>>>>>>>> Using AF_UNIX/SOCK_DGRAM with current version (3.2.0) seems to
>>>>>>>>>>> drop messages or at least they are not received in the same
>>>>>>>>>>> order they are sent
>>>>>>>
>>>>>>> [snip]
>>>>>>>
>>>>>>>> Thanks for the test case.  I can confirm the problem.  I'm not
>>>>>>>> familiar enough with the current AF_UNIX implementation to debug
>>>>>>>> this easily.  I'd rather spend my time on the new implementation
>>>>>>>> (on the topic/af_unix branch).  It turns out that your test case
>>>>>>>> fails there too, but in a completely different way, due to a bug
>>>>>>>> in sendto for datagrams.  I'll see if I can fix that bug and then
>>>>>>>> try again.
>>>>>>>>
>>>>>>>> Ken
>>>>>>>
>>>>>>> Ok, too bad it wasn't our own code base but good that the "mystery"
>>>>>>> is verified
>>>>>>>
>>>>>>> I finally succeeded in building topic/af_unix (after finding out
>>>>>>> what version of zlib was needed), but not with -D__WITH_AF_UNIX
>>>>>>> added to CXXFLAGS, so I haven't tested it yet
>>>>>>>
>>>>>>> Is it sufficient to add the define to the "main" Makefile or do
>>>>>>> you have to add it to all the Makefiles? I guess I can find out
>>>>>>> though
>>>>>>
>>>>>> I do it on the configure line, like this:
>>>>>>
>>>>>>     ../af_unix/configure CXXFLAGS="-g -O0 -D__WITH_AF_UNIX" --prefix=...
>>>>>>
>>>>>>> Is topic/af_unix fairly up to date with the master branch?
>>>>>>
>>>>>> Yes, I periodically cherry-pick commits from master to topic/af_unix.
>>>>>> I'll do that again right now.
>>>>>>
>>>>>>> Either way, I'll be glad to help out testing topic/af_unix
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> I've now pushed a fix for that sendto bug, and your test case runs
>>>>> without error on the topic/af_unix branch.
>>>>
>>>> It seems like the test case does work now with topic/af_unix in
>>>> blocking mode, but when using non-blocking (with MSG_DONTWAIT) there
>>>> are some issues, I think
>>>>
>>>> 1. When the queue is empty with non-blocking recv(), errno is set to
>>>> EPIPE but I think it should be EAGAIN (or maybe the pipe is getting
>>>> broken for real for some reason?)
>>>>
>>>> 2. When using non-blocking recv() and no message is written at all,
>>>> it seems like recv() blocks forever
>>>>
>>>> 3. Using non-blocking recv() where the "client" sends fewer than
>>>> "count" messages, sometimes recv() blocks forever as well
>>>>
>>>>
>>>> My naïve analysis of this is that for the first issue (if any) the
>>>> wrong errno is set, and for the second issue it blocks if no sendto()
>>>> is done after the first recv(), i.e. nothing kicks the "reader thread"
>>>> in the butt to realise the queue is empty. It is not super clear
>>>> though what POSIX says about creating blocking descriptors and then
>>>> using non-blocking flags with recv(), but this works on Linux
>>>> anyway
>>>
>>> The explanation is actually much simpler.  In the recv code where a
>>> bound datagram socket waits for a remote socket to connect to the
>>> pipe, I simply forgot to handle MSG_DONTWAIT.  I've pushed a fix.
>>> Please retest.
>>
>> I tested it and now it seems like we get EAGAIN when there's no message on
>> the queue, but it seems like the client is blocked as well and cannot write
>> any more messages until the pending one is consumed by the server, so the
>> af_unix.cpp test client ends prematurely
>>
>> When sendto() is also called with MSG_DONTWAIT, it fails with EAGAIN, but
>> the socket itself is not a non-blocking socket; it is just the recv() that
>> is done in a non-blocking fashion
>>
>> As I said earlier, it's a bit fuzzy (at least for me) what POSIX means by
>> non/blocking descriptors combined with non/blocking operations, but as far
>> as I understand, it should be possible to use blocking sendto() and have
>> messages written (as long as some buffer is not filled) at the same time
>> as someone is doing non-blocking recv()
>>
>> What is your take on this?
> 
> I was thinking about this again and came to the conclusion that the fix is probably semantically fine
> 
> It was just me who didn't realise that only one message can be on the queue at a time, even in blocking mode
> 
> The problem is not functional but merely a performance issue, which I guess you have already realised. You mentioned it in a previous message, but I thought it was about some other issue
> 
> 
> So, I guess the fix works ok (I haven't done any more tests than with the sample program), but from a throughput perspective it would be a good idea to let more messages be written to the queue before the first one is consumed (I guess you already have some thoughts about this?)

I have some thoughts, but nothing definitive yet.  I'll keep thinking.
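
For anyone following along, the two behaviours discussed above can be
probed with a minimal sketch like the one below.  It is not the test case
from this thread: it assumes a POSIX system, uses socketpair() instead of
bound sockets, and both helper names are made up for illustration.  The
first helper reports the errno left by recv() with MSG_DONTWAIT on an empty
queue (the descriptors themselves stay blocking); the second counts how many
datagrams a sender can queue before the receiver consumes anything.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: errno left by recv(MSG_DONTWAIT) on an empty queue.
   Returns 0 if recv() unexpectedly succeeded, -1 if socketpair() failed. */
int errno_on_empty_recv(void)
{
    int fds[2];
    char buf[64];
    int result;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
        return -1;

    /* The descriptors are blocking; only this one call is non-blocking. */
    result = (recv(fds[0], buf, sizeof buf, MSG_DONTWAIT) < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return result;
}

/* Hypothetical helper: number of datagrams (out of `attempts`) the sender
   can queue before the receiver reads anything.  MSG_DONTWAIT on send()
   keeps the probe from hanging once the queue is full.
   Returns -1 if socketpair() failed. */
int datagrams_queued_before_recv(int attempts)
{
    int fds[2];
    int queued = 0;
    const char msg[] = "ping";

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
        return -1;

    while (queued < attempts &&
           send(fds[1], msg, sizeof msg, MSG_DONTWAIT) == (ssize_t)sizeof msg)
        queued++;

    close(fds[0]);
    close(fds[1]);
    return queued;
}
```

errno_on_empty_recv() should come back as EAGAIN (or EWOULDBLOCK), not
EPIPE.  A result of 1 from datagrams_queued_before_recv(4) would point to a
queue that holds only one datagram at a time.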

Ken

