[Pkg-openmpi-maintainers] Bug#524553: Bug#524553: openmpi-bin: mpiexec seems to be resolving names on server instead of each node
Manuel Prinz
manuel at debian.org
Wed Apr 22 13:48:46 UTC 2009
Hi Micha!
I'm sorry for the late reply; I was on holiday.
Your description sounds reasonable, but I have no way to run tests
of my own at the moment. I've CC'ed Jeff (upstream); maybe he can comment
on the issue.
BTW, did you also try the 1.3 series of Open MPI?
Best regards
Manuel
On Saturday, 18.04.2009 at 01:49 +0300, Micha Feigin wrote:
> Package: openmpi-bin
> Version: 1.2.8-3
> Severity: important
>
>
> As far as I understand the error, mpiexec resolves name -> address mappings on the server
> it is run on instead of on each host separately. This works in an environment where
> each hostname resolves to the same address on every host (a cluster connected via a
> switch) but fails where it resolves to different addresses (ring/star setups, for
> example, where each computer is connected directly to all/some of the others).
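The suspected failure mode can be sketched with the addresses from the report. This is an illustration only, not Open MPI's actual resolution code; the subnet layout is a hypothetical reading of the diagram below:

```python
import ipaddress

# The name "fry" resolves differently depending on which node does the
# lookup, because each link is a separate point-to-point subnet
# (addresses taken from the bug report).
fry_seen_from = {
    "hubert": ipaddress.ip_address("192.168.1.2"),  # hubert <-> fry link
    "leela":  ipaddress.ip_address("192.168.4.1"),  # leela  <-> fry link
}

# Networks leela is directly attached to (hypothetical; leela has no
# route to the hubert-fry link while external connectivity is down).
leela_networks = [ipaddress.ip_network("192.168.4.0/24")]

def reachable(addr, networks):
    """True if addr lies on one of the node's directly attached networks."""
    return any(addr in net for net in networks)

# If mpiexec resolves "fry" on hubert and hands that address to leela,
# leela's connect attempt fails ("Network is unreachable"):
assert not reachable(fry_seen_from["hubert"], leela_networks)

# Resolving on leela itself would pick the reachable address:
assert reachable(fry_seen_from["leela"], leela_networks)
```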
>
> I'm not 100% sure that this is the problem, as I'm seeing success in a single
> case where this should probably fail, but it is my best bet given the error message.
>
> Version 1.2.8 worked fine for the same simple program (a simple hello world that
> just communicates the computer name for each process).
>
> An example output:
>
> mpiexec is run on the master node hubert and is set to run the processes on two nodes,
> fry and leela. As understood from the error messages, leela tries to connect to
> fry at address 192.168.1.2, which is its address as seen from hubert but not from leela
> (where it is 192.168.4.1).
>
> This is a four-node cluster, fully interconnected:
>
> 192.168.1.1 192.168.1.2
> hubert ------------------------ fry
> | \ / | 192.168.4.1
> | \ / |
> | \ / |
> | \ / |
> | / \ |
> | / \ |
> | / \ |
> | / \ | 192.168.4.2
> hermes ----------------------- leela
>
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> =================================================================
>
> This seems to be a directional issue: running the program with -H fry,leela fails
> where -H leela,fry works. The same behaviour holds for all scenarios except those that
> include the master node (hubert), where it resolves the external IP (from an external
> DNS) instead of the internal IP (from the hosts file). Thus one direction fails (there
> is no external connection at the moment for all nodes but the master) and the other
> causes a lockup.
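If the problem is which addresses and interfaces Open MPI picks rather than mpiexec itself, a common mitigation is to restrict Open MPI's TCP traffic to the cluster-internal interfaces via MCA parameters. A sketch, assuming the internal link is named eth1 on every node (the interface name is a placeholder; substitute the real one per node):

```
# ~/.openmpi/mca-params.conf
# Restrict both the byte-transfer layer and the out-of-band channel
# to the cluster-internal interface.
btl_tcp_if_include = eth1
oob_tcp_if_include = eth1
```

The same parameters can be passed on the command line with `--mca btl_tcp_if_include eth1` etc., which is convenient for testing before committing them to the config file.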
>
> I hope the explanation is not too convoluted.
>
> -- System Information:
> Debian Release: squeeze/sid
> APT prefers unstable
> APT policy: (500, 'unstable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
>
> Kernel: Linux 2.6.28.8 (SMP w/4 CPU cores)
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/bash
>
> Versions of packages openmpi-bin depends on:
> ii libc6 2.9-7 GNU C Library: Shared libraries
> ii libgcc1 1:4.3.3-7 GCC support library
> ii libopenmpi1 1.2.8-3 high performance message passing l
> ii libstdc++6 4.3.3-7 The GNU Standard C++ Library v3
> ii openmpi-common 1.2.8-3 high performance message passing l
>
> openmpi-bin recommends no packages.
>
> Versions of packages openmpi-bin suggests:
> ii gfortran 4:4.3.3-2 The GNU Fortran 95 compiler
>
> -- no debconf information