very poor cygwin scp performance in some situations

Thu Apr 27 01:18:00 GMT 2006

I've investigated the reports of poor network performance on cygwin
and I've reached some conclusions. The main conclusion is that there are
serious problems which have nothing at all to do with ssh. Even a
straightforward TCP streaming write, as used by, e.g., netcat or ttcp,
can be hit by this. It's not a small effect. I was seeing 40kbytes/sec
to my openbsd box, over a 100Mbit fast-ethernet link (directly
attached with a crossover cable) versus the 11.3Mbyte/sec
theoretically available. So there are potential speedups of roughly
30000% to be had here. When people complain Cygwin performance is too
low, mostly I think it's just whining, but in this case I'm willing to
accept this one as a valid criticism :-) Note that cygwin doesn't have
this trouble receiving data over TCP, only sending. But this might well
be expected, because 90% of the complexity in TCP sits at the sender's
end, and the receiver has a much simpler job to do. It's just much
easier to get receiving right than sending. Skip to the end for my
patches, keep reading for the (long-winded) rationale....

Sniffing packets on the segment in question, it is clear that the
culprit is windows' network stack waiting, under certain circumstances,
for all the data it has sent to be acked before putting more data on
the wire, rather than allowing a window of unacked data as it normally
should. OpenBSD has a 200ms delayed-ack timeout, so there are a lot of
200ms delays. This explains a number of things:
1) Why it particularly occurs with BSD at the other end: Linux has some
rather nifty heuristics to force undelayed acks when that might be
good for performance (cf quickack and pingpong in the sources), and
when delaying acks it cleverly delays the minimum amount it has
measured it can get away with, whereas BSD just does the simple delay
in all cases.
2) Why it is even worse for OpenBSD than FreeBSD: OpenBSD does a 200ms
delay, FreeBSD does 100ms.
3) Why it sometimes matters which party opened the connection:
a. by default Windows will honour requests for TCP timestamps during
the TCP handshake, but will not ask for them when windows is the
initiator; b. only every other ack is allowed to be delayed, so it's
only a problem if an odd number of packets are sent at a time; c. the
TCP timestamp option adds bytes to the headers, so reduces the segment
size, so what before may have fit in an even number of packets now
requires an odd number.
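
To make the timestamp point concrete, here is a minimal sketch of the
packet-count arithmetic. It assumes a 1500-byte MTU, 40 bytes of plain
IPv4+TCP headers, and 12 bytes for the timestamp option (10 bytes padded
to 12); the function names are made up for illustration, and the "odd
means the last ack may be delayed" rule is the simplification described
above, not an exact model of any particular stack.

```cpp
// Sketch: how the TCP timestamp option can flip the parity of the
// number of segments needed for one send.
constexpr int kMtu = 1500;              // assumed ethernet MTU
constexpr int kBaseHeaders = 20 + 20;   // IPv4 + TCP, no options
constexpr int kTimestampOption = 12;    // 10 bytes padded to 12

// Segment payload size with or without the timestamp option.
constexpr int segment_size(bool timestamps) {
  return kMtu - kBaseHeaders - (timestamps ? kTimestampOption : 0);
}

// Number of segments needed to carry `len` bytes (rounded up).
constexpr int segments_for(int len, bool timestamps) {
  return (len + segment_size(timestamps) - 1) / segment_size(timestamps);
}

// True when the send occupies an odd number of segments, i.e. when the
// ack for its last segment is the one that may be delayed.
constexpr bool last_ack_may_be_delayed(int len, bool timestamps) {
  return segments_for(len, timestamps) % 2 == 1;
}
```

For example, a 2920-byte send fits in two 1460-byte segments without
timestamps, but needs three segments once the segment size drops to 1448.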

The next discovery is an MS KB article describing (in an evasive
manner) the cause of the problem: KB823764 states that the problem
occurs when you are using nonblocking sockets AND your sends are at
least as large as the socket's send buffer. In relation to this there
are 2 points to note: 1) cygwin implements every socket send as
nonblocking, for reasons of signal handling; 2) recent versions of
windows, running on modern hardware (at least 32MB of RAM), default the
send buffer to 8192 bytes, and OpenSSH also has a send window of 8192
bytes, which is exactly the right size to trigger this problem. As
Corinna said:
> Note that the performance suffers again, if the socket buffer is
> smaller than the application buffer.
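
As a rough sanity check on these numbers: if every 8192-byte send has to
wait out one delayed-ack timeout before the next send can go, throughput
is about one send per timeout. The sketch below is only a back-of-envelope
model (it ignores the time the data spends on the wire), and the function
name is made up for illustration.

```cpp
// Back-of-envelope model: the sender puts one whole send on the wire,
// then stalls until the receiver's delayed-ack timer fires.
// Returns the resulting throughput in bytes per second.
constexpr long stalled_throughput(long send_bytes, long ack_delay_ms) {
  return send_bytes * 1000 / ack_delay_ms;
}

// 8192-byte sends against a 200ms delayed-ack timer (OpenBSD).
constexpr long kAgainst200ms = stalled_throughput(8192, 200);  // 40960 bytes/s

// The same sends against a 100ms delayed-ack timer (FreeBSD).
constexpr long kAgainst100ms = stalled_throughput(8192, 100);  // 81920 bytes/s
```

The 200ms case works out to 40960 bytes/s, which is in line with the
40kbytes/sec observed against OpenBSD.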


Further investigation reveals that it's not just cygwin that hits this
problem. It looks like Microsoft's own software has been hit by it:
1) Internet Explorer: KB329781 "HTTP File Upload Operation Takes a Long
Time to Complete".
2) The printing subsystem: KB816627 "TCP/IP port printing may be slow in
Windows 2000".

In the first KB, microsoft amusingly states, in relation to the fact
that upload speed is limited to 80kbps no matter what the network
bandwidth, that "this behaviour is by design". The mind boggles....
In the latter KB, they hint that it's not windows' fault, but rather a
problem with the receiving TCP/IP stack (in context, the stack running
on the printer) not conforming to RFC 1122. They say that acks for
packets with the push bit set should not be delayed, and, looking at
the sources, BSD pays no attention whatsoever to the push bit of
incoming packets. Maybe we should look at the RFCs in question.

As far as I could tell, the only RFC that they might be thinking of is
RFC813 "window and acknowledgement strategy in tcp", an RFC whose
status is unclear to me, but which certainly doesn't seem to be defining
any standards (in particular it lacks the BCP14/RFC2119 "MUST", "SHOULD",
etc. language). Anyway, in section 5 it says: "So, if a segment
arrives, postpone sending an acknowledgement if both of the
following conditions hold. First, the push bit is not set in
the segment, since it is a reasonable assumption that there is more
data coming in a subsequent segment. ...."

So it would seem that BSD is violating the RFC. Trouble is, it seems
*everybody* violates this RFC, and I include Microsoft's own stack in
this statement. Closer inspection shows that {Net,Open}BSD have an
interesting sysctl net.inet.tcp.ackonpush, by default set to 0, and,
sure enough, setting it to 1 gives 4.5Mbyte/sec, a 10000% increase,
and putting us within a factor of 3 of the capacity of the link. That
was a nice improvement for a 1-line fix, but there's still that factor
of 3 to try to pick up, plus it's not always convenient to make
root-level configuration changes to every peer cygwin talks to....

So I looked at {Open,Net}BSD and tried to see why that sysctl is 0 by
default. It seems that originally there was no sysctl, and the
behaviour was as if the sysctl were 0. Then in netbsd cvs
sys/netinet/tcp_input.c rev 1.8 (1994), the behaviour changed to be
like having tcp.ackonpush=1. Then this patch was reversed in rev 1.47
(Mar 1998) with a changelog "Per discussion with several members of
the TCPIMPL and TCPSAT IETF working groups." and finally the patch was
reintroduced in rev 1.55 (May 1998) this time conditional on the
tcp.ack_on_push sysctl. The situation is very similar on OpenBSD. I
wasn't there but I can imagine the story: They implement a change
which improves tcp performance. Then the IETF working group people say
"but any app which is improved by this is a fundamentally *broken*
app. If it can't handle a 200ms delayed ack, how can it expect to
handle the (indistinguishable) case where the acks are delayed due to
a 200ms RTT (which is typical for an intercontinental connection)". So
they back out the change, only to discover that there really are an awful
lot of *broken* apps out there, and maybe it'd be nice to be able to
talk to them at a decent speed sometimes, even if they are undeniably
broken, so back in it goes, this time with an obscure sysctl to
control it.

Because TCP provides reliable service, any implementation has to keep
data in its send buffer waiting until acks have been received. The
problem with the winsock implementation is that the granularity of
removing data from the buffer is one whole send, rather than something
smaller, like one packet. It seems to be because, at lower levels, the
windows TCP implementation is consistent with other windows IO and
works in terms of overlapped IO directly out of user buffers. There
are advantages to this system, but it is very different to the unix
sockets interface, so winsock is built as a separate layer on top of
the transport, implemented in AFD.SYS and MSAFD.DLL. (The structure
of the microsoft stack is shown in figure 1 at
http://snipurl.com/akjj ).

My reverse-engineered logic for how AFD does send buffer management
goes like this:

sock_send (userbuf)
  if (bufferedbytes < so_sndbuf)
        /* still within quota: buffer the whole send and return at once */
        allocate tmpBuffer of size len (userbuf)
        copy userbuf to tmpBuffer
        initiate overlapped send on tmpBuffer
        bufferedbytes += len (userbuf)
        return len (userbuf)
  else if (overlapped)
        initiate overlapped send directly on userbuf
        return INPROGRESS
  else if (blocking)
        allocate tmpBuffer of size len (userbuf)
        copy userbuf to tmpBuffer
        initiate overlapped send on tmpBuffer
        bufferedbytes += len (userbuf)
        block until (only 1 overlapped send outstanding OR
                 bufferedbytes < so_sndbuf)
        return len (userbuf)
  else /* nonblocking and nonoverlapped */
        return EAGAIN
end

overlapped_send_completion_handler (buffer)
   /* runs only once the whole send has been acked */
   reduce bufferedbytes by length of completed buffer
end

The winsock documentation notwithstanding, I have never seen a send()
return anything other than the full amount passed to it, or an error -- no
partial sends.
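
To see why the send size matters so much, here is a toy model of the
accounting in the pseudocode above, restricted to the nonblocking,
non-overlapped case. This is not Winsock code: the struct and function
names are invented for illustration, and the model only counts how many
sends are accepted before the first EAGAIN.

```cpp
// Toy model of the send-buffer accounting for a nonblocking socket.
struct AfdModel {
  long so_sndbuf;      // SO_SNDBUF quota
  long bufferedbytes;  // bytes accepted but not yet completed (acked)
  int accepted;        // number of sends accepted so far
};

// One send() attempt: accepted whole if the socket is still under quota,
// otherwise refused (the EAGAIN / WSAEWOULDBLOCK case) and nothing changes.
constexpr AfdModel try_send(AfdModel m, long len) {
  return m.bufferedbytes < m.so_sndbuf
             ? AfdModel{m.so_sndbuf, m.bufferedbytes + len, m.accepted + 1}
             : m;
}

// How many back-to-back sends of `len` bytes are accepted before the
// first refusal, starting from an empty buffer: ceil(so_sndbuf / len).
constexpr int sends_in_flight(long so_sndbuf, long len) {
  return len <= 0 ? 0 : static_cast<int>((so_sndbuf + len - 1) / len);
}
```

With so_sndbuf=8192, an 8192-byte send leaves room for only one send in
flight, so the sender sits idle until that whole send has been acked. A
7300-byte send allows two in flight, so there is still data queued behind
the send whose last ack is being delayed.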

The above is similar to what KB214397 says happens, but subtly
different (if the KB were correct, then the problems wouldn't occur).
KB214397 says:

> Winsock uses the following rules to indicate a send completion to the
> application (depending on how the send is invoked, the completion notification
> could be the function returning from a blocking call, signaling an event or
> calling a notification function, and so forth):
> *) If the socket is still within SO_SNDBUF quota, Winsock copies the data
> from the application send and indicates the send completion to the application.
> *) If the socket is beyond SO_SNDBUF quota and there is only one previously
> buffered send still in the stack kernel buffer, Winsock copies the data from the
> application send and indicates the send completion to the application.
> *) If the socket is beyond SO_SNDBUF quota and there is more than one
> previously buffered send in the stack kernel buffer, Winsock copies the data
> from the application send. Winsock does not indicate the send completion to
> the application until the stack completes enough sends to put the socket
> back within SO_SNDBUF quota or only one outstanding send condition.

There are various possible workarounds for this situation. One way
might be to disable winsock's buffering (setsockopt so_sndbuf to 0)
and do our own buffering, somehow keeping multiple sends pending. We'd
probably end up reimplementing all the hairy buffer preallocation
stuff that AFD already deals with so I didn't feel like going this
way. I still think it's worth investigating, because it might provide
a route to improving cygwin's linux emulation, by emulating
draft-minshall-nagle style nagling (as used by linux 2.4 onwards) and
Linux/FreeBSD TCP_CORK/MSG_MORE and so on. I'm not sure whether these
things would make a noticeable difference to any apps people are
likely to be running under cygwin, though (apache?). The route I
instead chose to follow was to split any large sends into chunks such
that the size of the chunk:
1) is as large as possible (to minimize system calls)
2) is always less than so_sndbuf (to avoid the delayed acks problem)
3) is always at least 1 MSS large, if the original send was at least 1
MSS large (to avoid Nagle-related delays)
4) is an integer multiple of MSS (except for the final chunk) (to
maximize network efficiency)
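
For the common case where so_sndbuf is comfortably larger than MSS, the
constraints above boil down to "the largest whole multiple of MSS that is
strictly smaller than so_sndbuf". A minimal sketch follows; the function
name is made up, but the expression mirrors the one used in update_chunk()
in the attached patch.

```cpp
// Sketch of the chunk-size rule: the largest whole multiple of MSS that
// is strictly smaller than SO_SNDBUF. Returns 0 when no such multiple
// exists (SO_SNDBUF <= MSS) or the inputs are invalid, which this sketch
// treats as "do not chunk".
constexpr long chunk_size(long so_sndbuf, long mss) {
  return (so_sndbuf <= 0 || mss <= 0) ? 0 : mss * ((so_sndbuf - 1) / mss);
}
```

With the winsock default so_sndbuf=8192 and ethernet MSS=1460 this gives
7300 (five full segments).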

These constraints can't be simultaneously satisfied unless so_sndbuf
>= 2*MSS (ok for ethernet MSS=1460 and winsock default
so_sndbuf=8192), so I relax them one at a time as so_sndbuf shrinks
relative to MSS. The only disadvantage of this chunking (that I can
think of, and assuming so_sndbuf>=2*MSS) is that it can cause cygwin
to put out a lot more packets with the TCP PUSH bit set. I think this
will have very little impact, since BSD ignores the push bit and Linux
uses it only to select packets to consider for MSS estimation. One
platform which does pay attention to received push bits, though, is
windows: unless the IgnorePushBitOnReceives registry entry is set,
windows uses push bits to decide when to return from recv(), so this
chunking may cause slightly increased CPU usage on the remote peer, if
it has a microsoft stack (but this should be no worse than setting
IgnorePushBitOnReceives on the remote peer).

Another potential problem with this chunking is that the MSS, needed
in (3) and (4) above, seems to be impossible to retrieve on windows.
Winsock does not implement the getsockopt TCP_MAXSEG or IP_MTU options
nor the linux-specific TCP_INFO. Nor does it implement the TCP_CORK or
TCP_NOPUSH options nor the MSG_MORE flag, which really would be The
Right Way(TM) to deal with this (and many other problems). So we're
stuck with taking the interface MTU (which we can only get reliably on
recent Windows versions) and guessing which IP and TCP options are in
effect, effectively assuming that path MTU discovery changes nothing
and that the remote MTU is at least as large as our MTU. Our guess will
probably be correct in many cases, and when we guess wrong, so long as
there is enough data in the send buffer, the lower layers of the stack
will coalesce sends to make full size segments anyway (on my hardware,
this seems to start to happen reliably once there is about 32kB in the
send buffer, but it probably depends on network cards, drivers, etc). Even
if the packets don't get coalesced, it's only the last packet of each
chunk which is less than full size, so in a bad case (with the default
send buffer of 8192, and the maximum usual internet MSS of 1460), we
could be putting out 20% of short packets -- with the TCP/IP and
ethernet overheads, this works out to at worst a 1.5% penalty on the
wire, so I'm not too worried, but if necessary it can always be
improved by increasing the send buffer, which you should be doing
anyway if you care about efficiency.
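
A rough sketch of what a wrong MSS guess costs, under stated assumptions:
the chunk was sized for the guessed MSS, the real MSS is smaller (for
example 1448 when timestamps are in use), and no coalescing happens, so
each chunk ends in one short segment. The 40 and 38 byte overheads are the
same header and ethernet framing figures used elsewhere in this mail; the
function names are invented for illustration.

```cpp
// Per-packet overhead: 40 bytes of TCP/IP headers plus 38 bytes of
// ethernet framing.
constexpr long kPerPacketOverhead = 40 + 38;

// MSS guess from the interface MTU, assuming no IP or TCP options.
constexpr long guess_mss(long mtu) { return mtu - 40; }

// Segments needed per chunk with the real MSS (rounded up).
constexpr long segments_per_chunk(long chunk, long real_mss) {
  return (chunk + real_mss - 1) / real_mss;
}

// Approximate extra wire bytes per chunk compared with a sender that
// never emits a short segment: at most one additional packet's overhead.
constexpr long extra_wire_bytes(long chunk, long real_mss) {
  return chunk % real_mss == 0 ? 0 : kPerPacketOverhead;
}
```

For a 7300-byte chunk and a real MSS of 1448, that is 6 segments instead
of 5 and about 78 extra bytes on the wire per 7300 bytes of payload, i.e.
on the order of 1%.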

Here are the results (variance is about +/- 0.1MByte/s), between cygwin
and openbsd on a single segment of 100Base-T fast ethernet with a
crossover cable:
Theoretical Maximum: 100*10**6 * (1500-40)/(1500+38)/1024/1024/8 == 11.3MByte/s.
Best from raw winsock, every possible combination of settings I tried
(including overlapped IO, etc): 11.1MByte/s.
ttcp -tsn 30000 -l 7300 (==1460*5): 11.1MByte/s
ttcp -tsn 30000 -l 2920 (==1460*2): 10.9MByte/s
ttcp -tsn 30000 -b 32768: 11.0MByte/s (This represents the performance
you should see by default if you set the TcpWindowSize to 32768)
ttcp, default settings: 9.3MByte/s (This represents the performance
you should see by default if you don't change any registry settings,
and your app doesn't change socket settings)
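
For reference, the theoretical maximum can be worked through as below,
taking the link speed as 100Mbit/s. The 40 bytes are IPv4+TCP headers
without options; the 38 bytes are taken here to be the per-frame ethernet
overhead (14 header + 4 FCS + 8 preamble + 12 inter-frame gap). The
function name is made up for illustration.

```cpp
// Theoretical TCP payload rate in MByte/s (1 MByte = 1024*1024 bytes):
// link rate scaled by payload bytes over total wire bytes per frame.
constexpr double max_payload_mbyte_per_sec(double link_bits_per_sec,
                                           double mtu,
                                           double header_bytes,
                                           double framing_bytes) {
  return link_bits_per_sec * (mtu - header_bytes) / (mtu + framing_bytes)
         / 8.0 / 1024.0 / 1024.0;
}

// 100Mbit fast ethernet, 1500-byte MTU: comes out just over 11.3 MByte/s.
constexpr double kFastEthernetMax =
    max_payload_mbyte_per_sec(100e6, 1500, 40, 38);
```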

So there's maybe 20% overhead left in the case that the user is using
default settings and an application which doesn't play with socket
options. As far as I can tell, this remaining overhead is due to
(scheduling?) delays waiting for the thread to wake up after getting
WSAEWOULDBLOCK and calling waitformultipleevents (if I change
fhandler_socket::sendmsg() to spin without calling wait() then I get an
extra 0.7MByte/s, and also playing with the process priorities has
some effect). This isn't completely satisfying, but, still, 20% away
from the best raw winsock can do isn't all bad, for an emulation
environment (and is a 28000% increase over the situation I started
with), and as I said, increasing the window size reduces this overhead
to negligible.

Finally, let's look at the original reported problem and test with scp.
Copying from cygwin -> openbsd, initiated on cygwin I get 6.8 MB/s,
apparently cpu-limited. I'm relatively happy with that. Copying
openbsd -> cygwin, initiated on cygwin I get 1.5MB/s. Not so hot. Why
can this be? Hmm. CPU usage is only 30% so it's not that.... <sounds
of frenzied hairy debugging, gradually whittling down the entirety of
scp/ssh to a 1-page testcase. ugh>.... Turns out it's due to the 10ms
Sleep() in thread_pipe(). I made a patch to reduce the impact of this
sleep. Hmm... I notice that Corinna checked in a patch to CVS a couple
of days ago which will also help with this exact problem. Perhaps
combining the two approaches will be better than either alone?
Corinna's patch should help when the pipe being select()ed on
eventually terminates the select() and mine should help in the case
that some other event terminates the select(), which is the case for
scp (incoming data on the network socket terminates the select() ). I
haven't tested this combination yet. I don't have sshd configured on
my cygwin, so I haven't tested doing copies initiated from the openbsd
side....

Anyway, with my patches to sockets and to thread_pipe(), I can scp
from openbsd -> cygwin, initiated on cygwin, at 6.8MB/s, which is the
same speed as in the other direction, and about 4.5 times faster than
without the patch.

So... my patches (against cygwin-1.5.19-4) are attached. I hope they
are useful to someone. I have only tested them on one machine, running
windows xp sp2, so there are probably problems on other windows
versions. The select.cc patch should probably do more error checking.
The fhandler_socket patch probably has some minor race conditions on
the new class-variables (mss, mtu, chunk, sndbuf, sndbuf_old) but
since this would at worst cause only a performance problem (I think)
and because there seem to be very few POSIX guarantees on what happens
when multiple threads read on a socket in parallel anyway (other than
it won't crash), I decided not to introduce the overhead of
synchronization structures, but maybe someone has comments?

To get maximum benefit, it's advised, but not required, to increase
HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize(DWORD)
to an appropriate value for your network. For my hardware, I needed
0x7fff to get segments coalescing fully, and for long fat pipes it
should be calculated from the bandwidth-delay-product, as described in
the accompanying materials for the HSN patches described earlier in
the thread.
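
As a reminder of the bandwidth-delay-product arithmetic, here is a minimal
sketch. The function name and the example figures are illustrative only;
it simply computes how many bytes must be in flight to keep a link busy
for one round trip.

```cpp
// Bandwidth-delay product: bytes that must be in flight to keep a link
// of the given speed busy over the given round-trip time.
constexpr long long bdp_bytes(long long bits_per_sec, long long rtt_ms) {
  return bits_per_sec / 8 * rtt_ms / 1000;
}

// 100Mbit/s over a 200ms intercontinental round trip.
constexpr long long kLongFatPipe = bdp_bytes(100000000LL, 200);

// 100Mbit/s over a 1ms LAN round trip.
constexpr long long kLan = bdp_bytes(100000000LL, 1);
```

For a 100Mbit/s path with a 200ms RTT this gives 2500000 bytes, against
only 12500 bytes for the same link with a 1ms RTT.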

I made an honest effort to format this patch correctly. I hope I did
everything right.

Lev
PS: If the maintainers want to incorporate this patch, and consider it
"significant" enough to require a copyright assignment, I'll be happy
to submit one.

ChangeLog:
2006-04-25  Lev Bishop  <Lev.Bishop+cygwin@gmail.com>

    * fhandler.h (fhandler_socket::mtu, fhandler_socket::mss)
    (fhandler_socket::chunk, fhandler_socket::sndbuf)
    (fhandler_socket::sndbuf_old): New class members. Keep track of
    socket params to allow sensible chunking of sends.
    * fhandler_socket.cc (nextchunk): New static function. Takes the next
    chunk from the user's buffers.
    (fhandler_socket::update_mtu, fhandler_socket::update_mss)
    (fhandler_socket::update_chunk, fhandler_socket::get_chunk)
    (fhandler_socket::update_sndbuf): New methods. Access functions for
    new variables.
    (fhandler_socket::sendmsg): Break sends into chunks.
    * net.cc (cygwin_setsockopt): Keep track of so_sndbuf.

    * select.cc (pipeinf::wakeup): New element.
    (thread_pipe, start_thread_pipe, pipe_cleanup): Use pipeinf::wakeup
    event to allow timely termination of pipe-select thread.

diff -Naurp cygwin/fhandler.h cygwin.patched/fhandler.h

--- cygwin/fhandler.h   2006-01-16 12:14:35.001000000 -0500
+++ cygwin.patched/fhandler.h   2006-04-25 04:23:23.536720700 -0400
@@ -408,6 +408,19 @@ class fhandler_socket: public fhandler_b
 void af_local_set_sockpair_cred ();

 private:
+  ssize_t mtu;
+  void update_mtu ();
+  ssize_t mss;
+  void update_mss ();
+  ssize_t chunk;
+  void update_chunk ();
+  ssize_t get_chunk (ssize_t tosend=0);
+  ssize_t sndbuf;
+  ssize_t sndbuf_old; // So we can tell if it changed, to update chunk
+ public:
+  void update_sndbuf ();
+
+ private:
 struct _WSAPROTOCOL_INFOA *prot_info_ptr;
 char *sun_path;
 struct status_flags
diff -Naurp cygwin/fhandler_socket.cc cygwin.patched/fhandler_socket.cc
--- cygwin/fhandler_socket.cc   2006-01-20 10:54:29.001000000 -0500
+++ cygwin.patched/fhandler_socket.cc   2006-04-25 04:23:23.566763900 -0400
@@ -126,6 +126,11 @@ get_inet_addr (const struct sockaddr *in

 fhandler_socket::fhandler_socket () :
 fhandler_base (),
+  mtu(0),
+  mss(0),
+  chunk(0),
+  sndbuf(0),
+  sndbuf_old(-1), // Different from sndbuf, to force update of chunk
 sun_path (NULL),
 status ()
 {
@@ -1242,6 +1247,163 @@ fhandler_socket::sendto (const void *ptr
 return res;
 }

+static int
+nextchunk (WSABUF *wsabuf, const struct msghdr *msg, ssize_t chunksize,
+          DWORD *bufs, DWORD *cur)
+{
+  int bytes=0;
+
+  const struct iovec *iov = msg->msg_iov + *cur;
+  const size_t iovcnt = msg->msg_iovlen;
+
+  u_long offset=0;
+  if (*bufs)
+    offset = wsabuf[*bufs-1].buf - (char *)iov->iov_base + wsabuf[*bufs-1].len;
+  for (*bufs=0; *cur != iovcnt; ++*cur)
+    {
+      if (offset!=iov->iov_len)
+       {
+         wsabuf[*bufs].buf = (char *)iov->iov_base + offset;
+         bytes += (wsabuf[*bufs].len = iov->iov_len - offset);
+         ++*bufs;
+         ++iov;
+         if (chunksize && bytes>=chunksize)
+           {
+             wsabuf[*bufs-1].len -= bytes-chunksize;
+             return bytes;
+           }
+       }
+      offset = 0;
+    }
+  return bytes;
+}
+
+void
+fhandler_socket::update_sndbuf()
+{
+  int newsndbuf;
+  int optlen = sizeof(newsndbuf);
+  if (getsockopt (get_socket (), SOL_SOCKET, SO_SNDBUF,
+                 (char *)&newsndbuf, &optlen))
+    {
+      debug_printf ("unable to get so_sndbuf");
+      newsndbuf = 8192; // Default for windows with >32Mb or >64 Mb RAM
+    }
+  if (newsndbuf != sndbuf)
+    {
+      debug_printf ("Sendbuf: %d",newsndbuf);
+      sndbuf = newsndbuf;
+    }
+}
+
+void
+fhandler_socket::update_mtu ()
+{
+  mtu = 1500; // Ethernet mtu
+  struct sockaddr_in addr;
+  int addrlen = sizeof(addr);
+  if (!::getsockname (get_socket (), (sockaddr *)&addr, &addrlen))
+    {
+      struct ifconf ifc;
+      struct ifreq ifr[20]; // XXX hardcoded limit
+      ifc.ifc_req = ifr;
+      ifc.ifc_len = sizeof(ifr);
+      extern int get_ifconf (struct ifconf *ifc, int what); /* net.cc */
+      if (!get_ifconf (&ifc, SIOCGIFCONF))
+       for (struct ifreq *ifrp = ifc.ifc_req;
+            (caddr_t) ifrp < ifc.ifc_buf + ifc.ifc_len;
+            ++ifrp)
+         if ( ((sockaddr_in*)(&ifrp->ifr_addr))->sin_addr.S_un.S_addr
+              == addr.sin_addr.S_un.S_addr)
+           {
+             /* Check the name to avoid some races. Still a possibility
+                for races if, eg, two eth adapters change places, because
+                get_ifconf will baptise the first it finds eth0, the next
+                eth1, etc. Could avoid by coding directly to IP Helper
+                library but is it worth it? XXX
+             */
+
+             char name[strlen (ifrp->ifr_name)+1];
+             strcpy (name, ifrp->ifr_name);
+             if (!get_ifconf (&ifc, SIOCGIFMTU)
+                 && (caddr_t)ifrp < ifc.ifc_buf + ifc.ifc_len
+                 && !strcmp (name, ifrp->ifr_name))
+               {
+                 mtu = ifrp->ifr_mtu;
+                 debug_printf ("Using interface %s, mtu %d", name, mtu);
+               }
+             break;
+           }
+    }
+  if (mtu<576)
+    {
+      debug_printf ("Forcing minimum IP mtu of 576! (was %d)", mtu);
+      mtu = 576;
+    }
+}
+
+void
+fhandler_socket::update_mss ()
+{
+  /* Finding MSS is hard because winsock doesn't implement TCP_MAXSEG,
+   * TCP_INFO, IP_MTU, etc
+   *
+   * We try to estimate MSS, using the interface MTU, but.....
+   *   1) we have to assume minimal headers (timestamps? options?)
+   *   2) the interface MTU is only an upper bound on the path MTU, which
+   *      changes dynamically with PMTU-D (default enabled on windows)
+   *   3) get_ifconf() only gives accurate MTU on recent windows versions
+   * So at best we can say we have an upper bound on the MTU. At least
+   * this allows us to avoid bad nagle interactions.
+   */
+  const int hdr = 20 + 20; // XXX Assume default TCP/IP hdrs, no options
+  if (!mtu) update_mtu ();
+  mss = mtu - hdr;
+}
+
+ssize_t
+fhandler_socket::get_chunk (ssize_t tosend)
+{
+  if (sndbuf_old != sndbuf) update_chunk ();
+  if (!chunk) return 0;
+  if (!tosend) return chunk;
+
+  ssize_t newchunk = chunk;
+  if (tosend<sndbuf) newchunk = 0;
+  else if (tosend-chunk<mss && chunk>mss) newchunk -= mss;
+#if 0 /* Slows things down */
+  if (newchunk != chunk)
+    debug_printf ("Force chunk size:%d (was: %d tosend:%d)",
+                 newchunk, chunk, tosend);
+#endif
+  return newchunk;
+}
+
+void
+fhandler_socket::update_chunk ()
+{
+  ssize_t newchunk;
+  if (get_addr_family () != AF_INET || get_socket_type () != SOCK_STREAM)
+    {
+      chunk = 0; // Not TCP, don't chunk
+      return;
+    }
+  /* for TCP, chunk size should be, in decreasing order of importance:
+       1) < SO_SNDBUF (to allow some bytes unacked on the channel)
+       2) >= MSS  (to avoid having nagle on every send)
+       3) integer*MSS (to avoid short segments), or else as large as
+            possible, to reduce their frequency
+     Relax these requirements if SO_SNDBUF is too small to allow
+     satisfying them all simultaneously. */
+  if (!mss) update_mss ();
+  if (!sndbuf) update_sndbuf ();
+  newchunk = mss*((sndbuf-1)/mss);
+  sndbuf_old = sndbuf;
+  if (newchunk == chunk) return;
+  chunk = newchunk;
+  debug_printf ("Using chunk size:%d (mss:%d)", chunk, mss);
+ }
+
 int
 fhandler_socket::sendmsg (const struct msghdr *msg, int flags, ssize_t tot)
 {
@@ -1254,28 +1416,16 @@ fhandler_socket::sendmsg (const struct m
   /*TODO*/
 }

-  struct iovec *const iov = msg->msg_iov;
-  const int iovcnt = msg->msg_iovlen;
-
 int res = SOCKET_ERROR;

-  WSABUF wsabuf[iovcnt];
-
-  const struct iovec *iovptr = iov + iovcnt;
-  WSABUF *wsaptr = wsabuf + iovcnt;
-  do
-    {
-      iovptr -= 1;
-      wsaptr -= 1;
-      wsaptr->len = iovptr->iov_len;
-      wsaptr->buf = (char *) iovptr->iov_base;
-    }
-  while (wsaptr != wsabuf);
-
+  WSABUF wsabuf[msg->msg_iovlen];
+  DWORD bufcnt = 0, nxtbuf = 0;
+  ssize_t thischunk = get_chunk (tot);
+  nextchunk (wsabuf, msg, thischunk, &bufcnt, &nxtbuf);
 DWORD ret = 0;

 if (is_nonblocking () || closed () || async_io ())
-    res = WSASendTo (get_socket (), wsabuf, iovcnt, &ret,
+    res = WSASendTo (get_socket (), wsabuf, bufcnt, &ret,
                 flags & MSG_WINMASK, (struct sockaddr *) msg->msg_name,
                 msg->msg_namelen, NULL, NULL);
 else
@@ -1283,21 +1433,29 @@ fhandler_socket::sendmsg (const struct m
   HANDLE evt;
   if (prepare (evt, FD_CLOSE | FD_WRITE | (owner () ? FD_OOB : 0)))
    {
+         DWORD partialret = 0;
      do
        {
-             res = WSASendTo (get_socket (), wsabuf, iovcnt,
-                              &ret, flags & MSG_WINMASK,
-                              (struct sockaddr *) msg->msg_name,
-                              msg->msg_namelen, NULL, NULL);
+             do
+               {
+                 res = WSASendTo (get_socket (), wsabuf, bufcnt,
+                                  &partialret, flags & MSG_WINMASK,
+                                  (struct sockaddr *) msg->msg_name,
+                                  msg->msg_namelen, NULL, NULL);
+               }
+             while (res == SOCKET_ERROR
+                    && WSAGetLastError () == WSAEWOULDBLOCK
+                    && !(res = wait (evt, 0))
+                    && !closed ());
+             if (res == SOCKET_ERROR) break;
+             ret += partialret;
        }
-         while (res == SOCKET_ERROR
-                && WSAGetLastError () == WSAEWOULDBLOCK
-                && !(res = wait (evt, 0))
-                && !closed ());
-         release (evt);
+         while (partialret == thischunk
+                && nextchunk (wsabuf, msg, thischunk = get_chunk (tot-ret),
+                              &bufcnt, &nxtbuf));
    }
+      release (evt);
 }
-
 if (res == SOCKET_ERROR)
 set_winsock_errno ();
 else
diff -Naurp cygwin/net.cc cygwin.patched/net.cc
--- cygwin/net.cc       2006-01-20 10:54:29.001000000 -0500
+++ cygwin.patched/net.cc       2006-04-25 04:23:23.606821500 -0400
@@ -686,6 +686,9 @@ cygwin_setsockopt (int fd, int level, in
   res = setsockopt (fh->get_socket (), level, optname,
                    (const char *) optval, optlen);

+      if (SO_SNDBUF==optname && SOL_SOCKET==level)
+       fh->update_sndbuf ();
+
   if (optlen == 4)
    syscall_printf ("setsockopt optval=%x", *(long *) optval);

diff -Naurp cygwin/select.cc cygwin.patched/select.cc
--- cygwin/select.cc    2006-01-16 12:14:36.001000000 -0500
+++ cygwin.patched/select.cc    2006-04-25 04:23:23.646879100 -0400
@@ -615,6 +615,7 @@ struct pipeinf
 cygthread *thread;
 bool stop_thread_pipe;
 select_record *start;
+    HANDLE wakeup;
 };

 static DWORD WINAPI
@@ -645,7 +646,7 @@ thread_pipe (void *arg)
    }
   if (gotone)
    break;
-      Sleep (10);
+      WaitForSingleObject(pi->wakeup,10);
 }
 out:
 return 0;
@@ -662,6 +663,7 @@ start_thread_pipe (select_record *me, se
 pipeinf *pi = new pipeinf;
 pi->start = &stuff->start;
 pi->stop_thread_pipe = false;
+  pi->wakeup=CreateEvent(0,0,0,0);
 pi->thread = new cygthread (thread_pipe, 0, pi, "select_pipe");
 me->h = *pi->thread;
 if (!me->h)
@@ -677,7 +679,9 @@ pipe_cleanup (select_record *, select_st
 if (pi && pi->thread)
 {
   pi->stop_thread_pipe = true;
+      SetEvent(pi->wakeup);
   pi->thread->detach ();
+      CloseHandle(pi->wakeup);
   delete pi;
   stuff->device_specific_pipe = NULL;
 }