Unix Domain Socket Limitation?

Sat Dec 5 23:52:47 GMT 2020

On 12/4/2020 8:51 AM, Norton Allen wrote:
> On 12/3/2020 8:11 PM, Ken Brown wrote:
>> On 12/2/2020 12:30 PM, Norton Allen wrote:
>>> On 11/30/2020 9:22 PM, Norton Allen wrote:
>>>> Yeah, so now the example no longer blocks for me. Unfortunately these bugs 
>>>> are not present in my application, so I will need to keep working on this.
>>>>
>>>
>>> After paring the main application down and back up, I finally narrowed in on 
>>> the condition that was causing this blocking behavior. The issue arises when 
>>> a client connect()s twice to the same server with non-blocking unix-domain 
>>> sockets before calling select().
>>>
>>> There are a few pieces to this. With the client configured to connect() just 
>>> once, I can see that the server's select() returns as soon as the client 
>>> calls connect(), but then the server's accept() blocks until the client calls 
>>> select(). That is not proper non-blocking behavior, but it appears that the 
>>> implementation under Cygwin does require that client and server both be 
>>> communicating synchronously to accomplish the connect() operation.
>>>
>>> I tried running this under Ubuntu 16.04 and found that connect() succeeded 
>>> immediately, so no subsequent select() is required, and there does not appear 
>>> to be a possibility for this collision. That proves to hold true even if the 
>>> server is not waiting in select() to process the connect() with accept().
>>>
>>> A workaround for this issue may be to keep the socket blocking until after 
>>> connect().
>>>
>>> I have pushed the new minimal example program, 'rapid_connects', to 
>>> https://github.com/nthallen/cygwin_unix
>>>
>>> The server is run like before as:
>>>
>>>     $ ./rapid_connects server
>>>
>>> The client can be run in two different modes. To connect with just one socket:
>>>
>>>     $ ./rapid_connects client1
>>>
>>> To connect with two:
>>>
>>>     $ ./rapid_connects client2
>>>
>>> My immediate strategy will be to develop a workaround for my project. Having 
>>> spent a day inside cygwin1.dll, I can see that I have a steep learning curve 
>>> to make much of a contribution there.
>>
>> I'm traveling at the moment and unable to do any testing, but I wonder if 
>> you're bumping into an issue that was just discussed on the cygwin-developers 
>> list:
>>
>> https://cygwin.com/pipermail/cygwin-developers/2020-December/012015.html
>>
>> A different workaround is described there.
>>
>> If it's the same issue, then I don't think it will happen with the new AF_UNIX 
>> implementation.  More in a few days.
>>
> It does seem related.
> 
> A workaround that is working for me is to do a blocking connect() and switch to 
> non-blocking when that completes. In my application, the connect() generally 
> occurs once at the beginning of a run, so blocking for a few milliseconds does 
> not impact responsiveness.
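
For anyone finding this thread later, the workaround quoted above could look
roughly like the following. This is only a sketch under assumptions: the helper
name connect_then_nonblock() is hypothetical, and it is not taken from the
rapid_connects example or from Norton's application.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: connect to the unix-domain socket at `path` while the
 * descriptor is still blocking, then switch it to non-blocking mode.
 * Returns the connected fd, or -1 with errno set. */
int connect_then_nonblock(const char *path)
{
    struct sockaddr_un addr;
    int fd, flags, saved;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    /* Blocking connect(): returns only once the connection is established
     * (or has failed), so no later select() is needed to complete it. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    /* Only now make the socket non-blocking for normal I/O. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    return fd;

fail:
    saved = errno;          /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return -1;
}
```

The cost is that the caller can stall in connect() for as long as the server
takes to accept, which is acceptable only when connects are rare, as in the
case described above.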

For the record, I can confirm that (a) the problem occurs with the current 
AF_UNIX implementation and (b) it does not occur with the new implementation (on 
the topic/af_unix branch).  With both client1 and client2, I see "connect() 
apparently succeeded immediately" using the new implementation.
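
As background for that message, the usual client-side pattern for a
non-blocking connect() is sketched below. This is a generic illustration, not
the code in rapid_connects; the function name nonblocking_connect() and the
timeout parameter are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: non-blocking connect to the unix-domain socket at
 * `path`, waiting up to `timeout_sec` seconds if it does not complete at once.
 * Returns the connected fd, or -1 with errno set. */
int nonblocking_connect(const char *path, int timeout_sec)
{
    struct sockaddr_un addr;
    struct timeval tv;
    fd_set wfds;
    socklen_t len;
    int fd, flags, rc, err = 0, saved;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc == 0)
        return fd;              /* connect() succeeded immediately */
    if (errno != EINPROGRESS)
        goto fail;              /* a real error */

    /* Connection is pending: wait until the socket becomes writable. */
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    rc = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (rc == 0)
        errno = ETIMEDOUT;
    if (rc <= 0)
        goto fail;

    /* Writable does not mean connected: fetch the final status. */
    len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        goto fail;
    if (err != 0) {
        errno = err;
        goto fail;
    }
    return fd;

fail:
    saved = errno;              /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return -1;
}
```

When connect() returns 0 right away, the select() step is skipped entirely,
which corresponds to the "succeeded immediately" case reported above.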

The new implementation is not yet ready for prime time, but with any luck it 
might be ready within a few months.

Ken