ssh.exe on cygwin: Write error
Tue Jul 16 08:26:00 GMT 2013
Dear Cygwin list,
So I've made some progress on the ssh problem I originally set out to solve... unfortunately, it has led me into select.cc in Cygwin.
Basically, ssh.exe operates as follows:
ssh sets up a connection and starts client_loop.
client_loop monitors (in the debugging case) a single channel. It checks whether there is input to read (from stdin in this case), whether there is data waiting in the output buffer, and whether select() reports the outbound connection as writable. In the debugging case, the network connection from ssh.exe to the server is on fd 3.
If there's data to read, it reads it into a buffer.
If there's data in the output buffer AND select() says that fd 3 is writable, it calls packet_write_poll, which calls roaming_write, which does a write() on the fd. If write() fails, packet_write_poll checks the error: EAGAIN, EINTR, and EWOULDBLOCK (the same as EAGAIN on Cygwin) are non-fatal; any other error is fatal.
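To make that control flow concrete, here is a minimal C sketch of the write side of one loop pass. It is not the OpenSSH source: `write_error_is_fatal()` and `try_flush()` are hypothetical helpers that only mirror the behavior described above (call write() only when select() reports the fd writable; treat EAGAIN, EWOULDBLOCK and EINTR as retryable and anything else as fatal).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper mirroring the rule described for packet_write_poll:
   EAGAIN, EWOULDBLOCK (equal to EAGAIN on Cygwin) and EINTR are retryable;
   every other errno is treated as fatal. */
static bool write_error_is_fatal(int err)
{
    return !(err == EAGAIN || err == EWOULDBLOCK || err == EINTR);
}

/* Hypothetical helper: one pass of the write side of a client loop.
   Only calls write() when select() reports net_fd as writable.
   Returns bytes written (0 means "nothing sent, retry on the next pass"),
   or -1 on a fatal error, with errno left set. */
static ssize_t try_flush(int net_fd, const char *buf, size_t len)
{
    fd_set wfds;
    struct timeval tv = {.tv_sec = 0, .tv_usec = 0}; /* poll, don't block */
    int rc;

    if (len == 0)
        return 0;                          /* output buffer is empty */

    FD_ZERO(&wfds);
    FD_SET(net_fd, &wfds);
    rc = select(net_fd + 1, NULL, &wfds, NULL, &tv);
    if (rc < 0)
        return errno == EINTR ? 0 : -1;    /* interrupted: retry; else fatal */
    if (rc == 0 || !FD_ISSET(net_fd, &wfds))
        return 0;                          /* not writable yet */

    ssize_t n = write(net_fd, buf, len);
    if (n < 0)
        return write_error_is_fatal(errno) ? -1 : 0;
    return n;
}
```

In the failure described below, select() reports fd 3 writable, write() then returns EAGAIN (so a helper like this would return 0 and retry), and select() never reports the socket writable again until the connection is reset.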
In debugging, client_loop processes data just fine at first. As it happens, it is reading more data from stdin than it can write out. It happily writes data to the outbound socket using write(), as called by roaming_write, as called by packet_write_poll. At some point, something "bad" occurs:
1. select() says that fd 3 (the outbound network connection) is writable.
2. write() attempts the write but gets error 11 (EAGAIN).
3. Many (probably 50-100) subsequent calls to select() say that the socket is not writable, and a packet trace on the server side confirms that the flow of packets has completely stopped. In the strace, peek_socket() in select.cc reports 'peek_socket: read_ready: 0, write_ready: 0, except_ready: 0'.
4. After some time (30 seconds), select() on fd 3 returns both readable and writable. ssh tries to read from fd 3 but gets error 104 (ECONNRESET). It then tries to write to the socket and also gets error 104 (ECONNRESET).
5. Since write() failed, the error is returned to roaming_write, which returns it to packet_write_poll, which prints the fatal error "Write failed: connection reset by peer".
6. Interestingly, the server side has not sent a TCP RST. In fact, from the server's perspective, the TCP connection simply stalled (right at the error 11). The server side doesn't shut the connection down until some time later.
7. The connection definitely does get "backed up", so to speak: I'm pushing more data than the Internet connection can carry without blocking, and I would expect select() and/or write() to report not-ready while waiting for the network to clear some buffers. That said, it's almost as if the socket dies, or needs to be reset, after the error 11 (EAGAIN).
8. I don't see any signals or timeouts occurring. I've also retested with Cygwin 1.7.21, with no improvement.
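One way to take ssh out of the picture might be a stand-alone test of the same pattern: put a socket in non-blocking mode, write until write() returns EAGAIN, then ask select() whether the descriptor ever becomes writable again. This is a minimal sketch under assumptions; `set_nonblocking()`, `fill_until_eagain()` and `wait_writable()` are hypothetical helpers, and to exercise peek_socket() in select.cc the fd would have to be a TCP socket connected over the same slow link, not a pipe or a local socket.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd in non-blocking mode. 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: write 4 KiB chunks until write() reports EAGAIN or
   `limit` bytes have been queued. Returns bytes queued, or -1 on any other
   error (for example ECONNRESET). */
static long fill_until_eagain(int fd, long limit)
{
    char chunk[4096];
    long total = 0;

    memset(chunk, 'x', sizeof chunk);
    while (total < limit) {
        ssize_t n = write(fd, chunk, sizeof chunk);
        if (n >= 0)
            total += n;
        else if (errno == EINTR)
            continue;                      /* interrupted: just retry */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                  /* send buffer is full */
        else
            return -1;                     /* real failure */
    }
    return total;
}

/* Hypothetical helper: wait up to `seconds` for fd to become writable.
   Returns 1 if writable, 0 on timeout, -1 on select() error. */
static int wait_writable(int fd, int seconds)
{
    fd_set wfds;
    struct timeval tv;
    int rc;

    do {
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = seconds;
        tv.tv_usec = 0;
        rc = select(fd + 1, NULL, &wfds, NULL, &tv);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 && FD_ISSET(fd, &wfds);
}
```

Against a real TCP socket, one would call set_nonblocking(), then fill_until_eagain(), then wait_writable() repeatedly while the peer keeps reading. On a healthy stack, the last call should eventually return 1 once the peer has drained data; going by the behavior described above, on the affected Cygwin setup it would presumably keep returning 0 until the connection is reset.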
I'm going to keep looking, but any thoughts given the new information?