[Pkg-openmpi-maintainers] Bug#592326: Failure of AZTEC test case run.

Jeff Squyres (jsquyres) jsquyres at cisco.com
Fri Sep 3 10:30:43 UTC 2010


Adding pthread could fix something, but I'm a little dubious. It seems unlikely. 

You should probably contact the Aztec authors at this point. 

Sent from my PDA. No type good. 

On Sep 3, 2010, at 3:05 AM, Rachel Gordon <rgordon at techunix.technion.ac.il> wrote:

> Dear Jeff, Ralf and Manuel,
> 
> There is some good news:
> I added -pthread to both the compile and link commands for
> az_tutorial_with_MPI.f, and I also compiled Aztec with -pthread.
> Now the code runs OK for np=1 and np=2.
> 
> Now the bad news: when I try running with 3, 4 or more processors I get a similar error message:
> 
> mpirun -np 3 sample
> 
> [cluster:25805] *** Process received signal ***
> [cluster:25805] Signal: Segmentation fault (11)
> [cluster:25805] Signal code:  (128)
> [cluster:25805] Failing at address: (nil)
> [cluster:25805] [ 0] /lib/libpthread.so.0 [0x7fbe20cb5a80]
> [cluster:25805] [ 1] /shared/lib/libmpi.so.0 [0x7fbe221325f7]
> [cluster:25805] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7fbe22160a48]
> [cluster:25805] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25805] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25805] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25805] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25805] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25805] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25805] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fbe209721a6]
> [cluster:25805] [10] sample [0x4073b9]
> [cluster:25805] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 25805 on node cluster exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> When I try running on 4 processors I get a double message (from 2 processors).
>
> mpirun -np 4 sample
> 
> [cluster:25946] *** Process received signal ***
> [cluster:25946] Signal: Segmentation fault (11)
> [cluster:25946] Signal code:  (128)
> [cluster:25946] Failing at address: (nil)
> [cluster:25947] *** Process received signal ***
> [cluster:25947] Signal: Segmentation fault (11)
> [cluster:25947] Signal code:  (128)
> [cluster:25947] Failing at address: (nil)
> [cluster:25946] [ 0] /lib/libpthread.so.0 [0x7f4ae4c6ba80]
> [cluster:25946] [ 1] /shared/lib/libmpi.so.0 [0x7f4ae60e85f7]
> [cluster:25946] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f4ae6116a48]
> [cluster:25946] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25946] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25947] [ 0] /lib/libpthread.so.0 [0x7f7dc5350a80]
> [cluster:25946] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25946] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25946] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25946] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25946] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f4ae49281a6]
> [cluster:25946] [10] sample [0x4073b9]
> [cluster:25946] *** End of error message ***
> [cluster:25947] [ 1] /shared/lib/libmpi.so.0 [0x7f7dc67cd5f7]
> [cluster:25947] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f7dc67fba48]
> [cluster:25947] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25947] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25947] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25947] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25947] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25947] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25947] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f7dc500d1a6]
> [cluster:25947] [10] sample [0x4073b9]
> [cluster:25947] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 25946 on node cluster exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> 
> 
> 
> Attached is the file md_wrap_mpi_c.c from the AZTEC sources.
> It might give you a further hint.
> 
> 
> 
> Rachel
> 
>  Dr.  Rachel Gordon
>  Senior Research Fellow           Phone: +972-4-8293811
>  Dept. of Aerospace Eng.        Fax:   +972 - 4 - 8292030
>  The Technion, Haifa 32000, Israel     email: rgordon at tx.technion.ac.il
> 
> 
> On Thu, 2 Sep 2010, Ralf Wildenhues wrote:
> 
>> Hello Rachel, Jeff,
>> 
>> * Rachel Gordon wrote on Thu, Sep 02, 2010 at 01:35:37PM CEST:
>>> The cluster I am trying to run on has only the openmpi MPI version.
>>> So, mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent
>>> to mpicc.openmpi
>>> 
>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
>>> The compilation and linkage stage ran with no problem:
>>> 
>>> mpif77 -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample
>> 
>> Can you retry, but this time add -pthread to both the compile and
>> link commands?
>> 
>> There were other reports on the Open MPI devel list that some pthread
>> flags have gone missing somewhere.  That may well have caused the
>> Open MPI libraries themselves to be built wrongly, or only the
>> application; I'm not sure.  But the segfault inside libpthread is
>> suspicious.
>> 
>> Thanks,
>> Ralf
>> 
>>> But again when I try to run 'sample' I get:
>>> 
>>> mpirun -np 1 sample
>>> 
>>> 
>>> [cluster:24989] *** Process received signal ***
>>> [cluster:24989] Signal: Segmentation fault (11)
>>> [cluster:24989] Signal code: Address not mapped (1)
>>> [cluster:24989] Failing at address: 0x100000098
>>> [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>> [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f50594ce34e]
>>> [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>> [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>> [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>> [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>> [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>> [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
>>> [cluster:24989] [ 8] sample [0x407459]
>>> [cluster:24989] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24989 on node cluster exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>> 
>> 
> <md_wrap_mpi_c.c>





