[Pkg-openmpi-maintainers] Bug#592326: Bug#592326: Failure of AZTEC test case run.
Jeff Squyres (jsquyres)
jsquyres at cisco.com
Fri Sep 3 10:30:43 UTC 2010
Adding pthread could fix something, but I'm a little dubious. It seems unlikely.
You should probably contact the Aztec authors at this point.
On Sep 3, 2010, at 3:05 AM, Rachel Gordon <rgordon at techunix.technion.ac.il> wrote:
> Dear Jeff, Ralf and Manuel
>
> There is some good news:
> I added -pthread to both the compile and link steps for
> az_tutorial_with_MPI.f, and I also compiled Aztec with -pthread.
> Now the code runs OK for np=1,2.
>
> Now the bad news: when I try running with 3, 4, or more processors I get a similar error message:
>
> mpirun -np 3 sample
>
> [cluster:25805] *** Process received signal ***
> [cluster:25805] Signal: Segmentation fault (11)
> [cluster:25805] Signal code: (128)
> [cluster:25805] Failing at address: (nil)
> [cluster:25805] [ 0] /lib/libpthread.so.0 [0x7fbe20cb5a80]
> [cluster:25805] [ 1] /shared/lib/libmpi.so.0 [0x7fbe221325f7]
> [cluster:25805] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7fbe22160a48]
> [cluster:25805] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25805] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25805] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25805] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25805] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25805] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25805] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fbe209721a6]
> [cluster:25805] [10] sample [0x4073b9]
> [cluster:25805] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 25805 on node cluster exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> When I try running on 4 processors I get a double message (from 2 processes).
> mpirun -np 4 sample
>
> [cluster:25946] *** Process received signal ***
> [cluster:25946] Signal: Segmentation fault (11)
> [cluster:25946] Signal code: (128)
> [cluster:25946] Failing at address: (nil)
> [cluster:25947] *** Process received signal ***
> [cluster:25947] Signal: Segmentation fault (11)
> [cluster:25947] Signal code: (128)
> [cluster:25947] Failing at address: (nil)
> [cluster:25946] [ 0] /lib/libpthread.so.0 [0x7f4ae4c6ba80]
> [cluster:25946] [ 1] /shared/lib/libmpi.so.0 [0x7f4ae60e85f7]
> [cluster:25946] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f4ae6116a48]
> [cluster:25946] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25946] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25947] [ 0] /lib/libpthread.so.0 [0x7f7dc5350a80]
> [cluster:25946] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25946] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25946] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25946] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25946] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f4ae49281a6]
> [cluster:25946] [10] sample [0x4073b9]
> [cluster:25946] *** End of error message ***
> [cluster:25947] [ 1] /shared/lib/libmpi.so.0 [0x7f7dc67cd5f7]
> [cluster:25947] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f7dc67fba48]
> [cluster:25947] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
> [cluster:25947] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
> [cluster:25947] [ 5] sample(AZ_transform+0x1c3) [0x418372]
> [cluster:25947] [ 6] sample(az_transform_+0x84) [0x407943]
> [cluster:25947] [ 7] sample(MAIN__+0x19a) [0x407708]
> [cluster:25947] [ 8] sample(main+0x2c) [0x44e00c]
> [cluster:25947] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f7dc500d1a6]
> [cluster:25947] [10] sample [0x4073b9]
> [cluster:25947] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 25946 on node cluster exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Attached is the Aztec source file md_wrap_mpi_c.c.
> It might give you a further hint.
>
> Rachel
>
> Dr. Rachel Gordon
> Senior Research Fellow Phone: +972-4-8293811
> Dept. of Aerospace Eng. Fax: +972 - 4 - 8292030
> The Technion, Haifa 32000, Israel email: rgordon at tx.technion.ac.il
>
>
> On Thu, 2 Sep 2010, Ralf Wildenhues wrote:
>
>> Hello Rachel, Jeff,
>>
>> * Rachel Gordon wrote on Thu, Sep 02, 2010 at 01:35:37PM CEST:
>>> The cluster I am trying to run on has only the Open MPI implementation
>>> of MPI installed, so mpif77 is equivalent to mpif77.openmpi and mpicc
>>> is equivalent to mpicc.openmpi.
>>>
>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
>>> The compilation and linkage stage ran with no problem:
>>>
>>> mpif77 -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000
>>> -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o
>>> az_tutorial_with_MPI.f
>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
>>
>> Can you retry, but this time add -pthread to both the compile and link
>> commands?
>>
>> There were other reports on the OpenMPI devel list that some pthread
>> flags have gone missing somewhere. It might well be that that caused
>> its libraries to already be built wrongly, or just the application,
>> I'm not sure. But the segfault inside libpthread is suspicious.
>>
>> Thanks,
>> Ralf
>>
>>> But again when I try to run 'sample' I get:
>>>
>>> mpirun -np 1 sample
>>>
>>>
>>> [cluster:24989] *** Process received signal ***
>>> [cluster:24989] Signal: Segmentation fault (11)
>>> [cluster:24989] Signal code: Address not mapped (1)
>>> [cluster:24989] Failing at address: 0x100000098
>>> [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>> [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f50594ce34e]
>>> [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>> [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>> [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>> [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>> [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>> [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
>>> [cluster:24989] [ 8] sample [0x407459]
>>> [cluster:24989] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24989 on node cluster exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>
>>
> <md_wrap_mpi_c.c>
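
For reference, the -pthread rebuild discussed in this thread can be sketched as follows. The flags are copied from the compile and link lines quoted above; the exact commands that were rerun are not shown in the thread, so the position of -pthread on each line and the shell variable names (FFLAGS, PTHREAD, COMPILE, LINK) are assumptions made for illustration. The Aztec library itself (../lib) was also rebuilt with -pthread, which this sketch does not cover.

```shell
#!/bin/sh
# Flags taken from the compile line quoted earlier in the thread.
FFLAGS="-O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000"

# The flag Ralf suggested adding to both steps (placement is an assumption).
PTHREAD="-pthread"

# Compile and link commands, assembled as strings so they can be inspected first.
COMPILE="mpif77 $PTHREAD $FFLAGS -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f"
LINK="mpif77 $PTHREAD az_tutorial_with_MPI.o -O -L../lib -laztec -o sample"

echo "$COMPILE"
echo "$LINK"

# Run the build only where the wrapper compiler and the source file exist.
if command -v mpif77 >/dev/null 2>&1 && [ -f az_tutorial_with_MPI.f ]; then
    $COMPILE && $LINK
fi
```

After a successful build, the failing cases reported above would be rerun with `mpirun -np 3 sample` and `mpirun -np 4 sample`.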