[Pkg-openmpi-maintainers] Bug#524553: Bug#524553: openmpi-bin: mpiexec seems to be resolving names on server instead of each node

Manuel Prinz manuel at debian.org
Wed Apr 22 13:48:46 UTC 2009


Hi Micha!

I'm sorry for the late reply! I was on holiday.

Your description sounds reasonable, but I have no way to run tests of my
own at the moment. I have CC'ed Jeff (upstream); maybe he can comment on
the issue.

BTW, did you also try the 1.3 series of Open MPI?
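
In case it is useful for checking, something like this should show which
version is actually installed on a node (just a suggestion):

    ompi_info | grep "Open MPI:"
    dpkg -l openmpi-bin libopenmpi1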

Best regards
Manuel


On Saturday, 18.04.2009 at 01:49 +0300, Micha Feigin wrote:
> Package: openmpi-bin
> Version: 1.2.8-3
> Severity: important
> 
> 
> As far as I understand the error, mpiexec resolves names to addresses on the server
> it is run on instead of on each host separately. This works in an environment where
> each hostname resolves to the same address on every host (a cluster connected via a
> switch) but fails where it resolves to different addresses (ring/star setups, for
> example, where each computer is connected directly to all/some of the others).
> 
> I'm not 100% sure that this is the problem, as I'm seeing success in a single
> case where this should probably fail, but it is my best guess from the error message.
> 
> Version 1.2.8 worked fine for the same simple program (a simple hello world that
> just communicated the computer name of each process).
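>
> For reference, the test program is essentially the following sketch (not the exact
> source, just the same idea: every rank sends its host name to rank 0 with MPI_Send,
> which is the call that fails on leela in the output below; built with mpicc):
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, len, i;
>     char name[MPI_MAX_PROCESSOR_NAME];
>     char peer[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(name, &len);
>
>     if (rank == 0) {
>         /* rank 0 prints its own host name, then collects the others */
>         printf("Hello MPI from the server process of %d on %s!\n", size, name);
>         for (i = 1; i < size; i++) {
>             MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             printf("Hello MPI from process %d of %d on %s!\n", i, size, peer);
>         }
>     } else {
>         /* all other ranks send their host name to rank 0 */
>         MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>     }
>
>     MPI_Finalize();
>     return 0;
> }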
> 
> An example output:
> 
> mpiexec is run on the master node hubert and is set to run the processes on two nodes,
> fry and leela. As far as I understand the error messages, leela tries to connect to
> fry at address 192.168.1.2, which is its address as seen from hubert but not from leela
> (where it is 192.168.4.1).
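>
> For illustration, the difference shows up when checking how the name resolves on each
> node (not captured output, just the addresses described above):
>
>     hubert$ getent hosts fry
>     192.168.1.2     fry
>
>     leela$ getent hosts fry
>     192.168.4.1     fry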
> 
> This is a four-node cluster, all nodes interconnected:
> 
>     192.168.1.1      192.168.1.2
> hubert ------------------------ fry
>   |    \                    /    | 192.168.4.1
>   |       \              /       |
>   |          \        /          |
>   |             \  /             |
>   |             /  \             |
>   |          /        \          |
>   |       /              \       |
>   |    /                     \   | 192.168.4.2
> hermes ----------------------- leela
> 
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
> 
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
> 
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
> 
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
> 
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> =================================================================
> 
> This seems to be a directional issue, as running the program with -H fry,leela fails
> where -H leela,fry works. The same behaviour holds for all scenarios except those that
> include the master node (hubert), where the name resolves to the external IP (from an
> external DNS) instead of the internal IP (from the hosts file). Thus one direction
> fails (there is no external connection at the moment for all nodes but the master)
> and the other causes a lockup.
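>
> For reference only: the MCA parameters that restrict which interfaces Open MPI's TCP
> components may use would be set roughly like this; eth1/eth2 are placeholder interface
> names for the internal links, and this is just a sketch, not something I have verified
> to help here:
>
>     mpiexec --mca btl_tcp_if_include eth1,eth2 \
>             --mca oob_tcp_if_include eth1,eth2 \
>             -np 8 -H fry,leela test_mpi
>
> This only limits which interfaces the TCP components consider; as far as I know it
> does not change where the host names themselves are resolved.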
> 
> I hope the explanation is not too convoluted.
> 
> -- System Information:
> Debian Release: squeeze/sid
>   APT prefers unstable
>   APT policy: (500, 'unstable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
> 
> Kernel: Linux 2.6.28.8 (SMP w/4 CPU cores)
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/bash
> 
> Versions of packages openmpi-bin depends on:
> ii  libc6                         2.9-7      GNU C Library: Shared libraries
> ii  libgcc1                       1:4.3.3-7  GCC support library
> ii  libopenmpi1                   1.2.8-3    high performance message passing l
> ii  libstdc++6                    4.3.3-7    The GNU Standard C++ Library v3
> ii  openmpi-common                1.2.8-3    high performance message passing l
> 
> openmpi-bin recommends no packages.
> 
> Versions of packages openmpi-bin suggests:
> ii  gfortran                      4:4.3.3-2  The GNU Fortran 95 compiler
> 
> -- no debconf information
