[Pkg-openmpi-maintainers] Bug#584699: programs freeze on first MPI op. when run on multihomed IPv6 hosts
Ivan Shmakov
ivan at main.uusia.org
Sat Jun 5 19:31:15 UTC 2010
Source: openmpi
Version: 1.4.1-1
Programs run under mpirun(1) freeze on the first MPI operation
when an interface carries more than one global IPv6 address.
The configuration was roughly as follows:
$ ip addr
…
2: eth0: …
…
inet6 2001:db8::2XX:XXXX:XXXX:XXXX/64 scope global dynamic
…
inet6 2001:db8::17a:170:1:4/64 scope global
valid_lft forever preferred_lft forever
…
$
(I.e., one address was configured in interfaces(5), while the
other was assigned by stateless IPv6 autoconfiguration.)
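A quick way to spot nodes in this state is to count the
global-scope IPv6 addresses per interface. The following is only
a rough sketch: it assumes the one-line-per-address format
printed by `ip -o -6 addr show`, the function names are
hypothetical, and the autoconfigured address in SAMPLE is made
up (the real one is elided above).

```python
from collections import defaultdict

# Hypothetical capture of `ip -o -6 addr show dev eth0`.
# The first address is invented for illustration; the second is
# the statically configured one from the report.
SAMPLE = """\
2: eth0    inet6 2001:db8::211:22ff:fe33:4455/64 scope global dynamic \\       valid_lft 2591990sec preferred_lft 604790sec
2: eth0    inet6 2001:db8::17a:170:1:4/64 scope global \\       valid_lft forever preferred_lft forever
2: eth0    inet6 fe80::211:22ff:fe33:4455/64 scope link \\       valid_lft forever preferred_lft forever
"""


def global_ipv6_by_interface(ip_output):
    """Map interface name -> list of global-scope IPv6 addresses."""
    found = defaultdict(list)
    for line in ip_output.splitlines():
        fields = line.split()
        # Expected shape: "<idx>: <ifname> inet6 <addr>/<len> scope <scope> ..."
        if len(fields) < 6 or fields[2] != "inet6":
            continue
        if fields[4] == "scope" and fields[5] == "global":
            found[fields[1]].append(fields[3])
    return dict(found)


def multihomed_interfaces(ip_output):
    """Keep only interfaces that carry more than one global IPv6 address."""
    by_iface = global_ipv6_by_interface(ip_output)
    return {name: addrs for name, addrs in by_iface.items() if len(addrs) > 1}


if __name__ == "__main__":
    for name, addrs in multihomed_interfaces(SAMPLE).items():
        print(name, addrs)
```

On a real node the text would come from running
`ip -o -6 addr show` there instead of from SAMPLE.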
mpirun(1) was invoked as follows:
$ mpirun -nperboard 4 -H node1…,node2…,node3… hpcc… < /dev/null &
which resulted in orted(1) being spawned on the nodes, with both
of the IPv6 addresses in the tcp6:// URIs. The payload
processes were consuming 100% CPU each, but no progress was
made. (While diagnosing the problem, we observed that the nodes
had actually formed two disjoint sets: the nodes of a single set
were able to participate in a parallel computation spawned at
any node, but every attempt to spawn a task using nodes from
different sets resulted in the behavior described above.
Apparently, the sets were formed with some dependence on the MAC
address.)
The behavior was 100% reproducible.
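For context on how the MAC address can end up in an address at
all: with stateless autoconfiguration the interface identifier
is commonly derived from the MAC using the modified EUI-64
scheme (RFC 4291, Appendix A). The sketch below illustrates that
derivation only. It assumes the nodes used this scheme, which
was not verified here, and it does not explain why the sets
formed; the function name and the MAC address in the usage line
are hypothetical.

```python
import ipaddress


def slaac_address(prefix, mac):
    """Combine a /64 prefix with the modified EUI-64 interface ID of a MAC."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC such as 00:11:22:33:44:55")
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("modified EUI-64 needs a /64 prefix")
    octets[0] ^= 0x02  # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    interface_id = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


if __name__ == "__main__":
    # Hypothetical MAC address, for illustration only.
    print(slaac_address("2001:db8::/64", "00:11:22:33:44:55"))
```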
Switching stateless autoconfiguration off on the nodes with
sysctl(8) and removing the extra IPv6 address fixed the problem.
# sysctl -w net.ipv6.conf.eth0.accept_ra=0
# ip addr del 2001:db8::2XX:XXXX:XXXX:XXXX/64 dev eth0
#
--
FSF associate member #7257