This is the mail archive of the cygwin@cygwin.com mailing list for the Cygwin project.



Interactions between rshd and programs using sockets?


I am experimenting with the MPI (Message Passing Interface).  For those not
familiar with MPI, it is a software specification that enables a master
program to communicate with slave programs on the same or remote machines so
that processes such as large numerical computations can be split over
several machines.  I'm using the Windows NT implementation publicly
available from Argonne National Laboratory.

Starting remote processes in the UNIX world is easy enough via rsh et al.,
but NT doesn't come with them, so the NT implementations usually come with
programs especially tailored for starting these slave processes; one runs
these programs as services on the slave machines.  Unfortunately, I would
have to extensively remodel my application to take advantage of their
launching tools.  Therefore, after reading the documentation that comes with
the Argonne package, I now understand the environment that the master and
slave programs require.  I have been using Cygwin's rsh/rshd tools
successfully, but I've run into a problem that I don't know how to resolve.

In this implementation of MPI, programs get some of their parameters via
environment variables.  Programs on different machines communicate via
sockets, and the port number is one of those parameters.  I have master and
slave programs that operate properly when I manually set up the environment
regardless of whether the slave runs on the same machine as the master or on
a remote machine.  Then, I wrote short bash scripts that set up the
environment and execute the master and slave programs.  When I manually
invoke these scripts, again, everything works fine regardless of whether the
slave is on the same machine as the master.

The problem arises when I start the master program on a machine (NT1) and
then try to start the slave on a second machine (NT2) via rsh.

Here's the script:

MPICH_ROOT="roadrunner:54545"; export MPICH_ROOT
MPICH_JOBID=roadrunner.123; export MPICH_JOBID
MPICH_IPROC=$2; export MPICH_IPROC
MPICH_NPROC=$3; export MPICH_NPROC
echo Starting slave $MPICH_IPROC of $MPICH_NPROC
$1

Here's what happens when I try to start the slave on NT2:

$ rsh NT2 runslave.sh cpi.exe 1 2
starting slave 1 of 2
Error 10106, process 1
   ComPortThread: NT_Tcp_create_bind_select failed

$

The failure seems to happen almost immediately, so it is not a timeout
problem such as the slave waiting to talk to the master.
Interestingly, if I rlogin to NT2 and execute "runslave.sh cpi.exe 1 2"
everything works fine.  So, the question is: What is it about running the
runslave.sh script via rsh that causes the port bind to fail?  One
difference I note is that rlogin asks me for a password while rsh does not.
Is this a security issue?  I updated my installation with the latest Cygwin
release yesterday, but that didn't change the results.  I think the
/etc/passwd and /etc/group files on both machines are properly built, and
I've carefully tried to follow the instructions in the
inetutils-1.3.2.README.
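
Winsock error 10106 is WSAEPROVIDERFAILEDINIT (the service provider failed
to initialize).  One possible cause, offered here as an assumption rather
than a confirmed diagnosis, is that a process started by rshd inherits a
smaller environment than an rlogin shell does, for example one without
SYSTEMROOT.  The sketch below is a hypothetical helper, missing_vars, that
lists the variable names present in one "env" dump but absent from another;
the file names in the usage comments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cygwin or MPICH): print the names of
# variables that appear in the first "env" dump but not in the second.
#   $1 = dump taken in the session that works  (e.g. via rlogin)
#   $2 = dump taken in the session that fails  (e.g. via rsh)
missing_vars() {
    # Extract NAME from each NAME=value line, ignoring anything that is
    # not a plain shell-style variable name.
    sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u |
    while read -r name; do
        # Report the name if the second dump has no NAME= line for it.
        grep -q "^${name}=" "$2" || echo "$name"
    done
}

# Possible usage (assumed file names):
#   on NT2, inside an rlogin session:   env > env.rlogin
#   from NT1:                           rsh NT2 env > env.rsh
#   then, with both files on one host:  missing_vars env.rlogin env.rsh
```

If a Windows variable such as SYSTEMROOT shows up in the list, exporting it
in runslave.sh before starting the program would be a cheap experiment.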

Regards,
David L. Humphrey
Manager, Software Development
Bell Geospace, Inc
