Linux workstation, MPI and QA tests

Dear Edo,

Thank you for the reply.

Quote:Edoapra Feb 8th 5:55 pm
Christof,
I see from your compilation settings that you have supplied a long list of compiler options with the variable FC. I see plenty of potential problems with this, since 1) the -openmp option is likely to cause runtime conflicts with the Global Arrays parallelization and 2) the makefile structure will be confused by such a long value for FC. If you really want to change the compiler options, the recommended way would be (for example)

I agree with you that too much optimization is potentially a problem. The large number of options resulted from adding "-fp-model precise -fp-model source" indiscriminately, to put some brakes on the compiler, once the build problems showed up.

I have now recompiled with
setenv BLASOPT "-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -I$MKLROOT/include" 
make FC="ifort -I$MKLROOT/include" CC="icc -DMKL_ILP64 -I$MKLROOT/include" FOPTIMIZE="-O2 -fp-model precise" COPTIMIZE="-O2 -fp-model precise"


Please note the change to the MKL linking: since I can no longer use -openmp, I now have to link against the sequential MKL. The other options in FC and CC follow the Intel MKL Link Line Advisor output.
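(For comparison only, and untested on my side: with OpenMP threading the Link Line Advisor would instead suggest the threaded ILP64 libraries, along the lines of

setenv BLASOPT "-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -I$MKLROOT/include"

so dropping -openmp essentially means swapping -lmkl_intel_thread and -liomp5 for -lmkl_sequential.)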

With these changes to the environment, the Global Arrays test now fails:
  Checking single precisions 
 ga_create ........................ OK
 ga_fill_patch .................... OK
 ga_copy_patch .................... OK
ERROR (proc=1): a [1,19,0] =3736.000000,  b [1,19,0] =10001.000000
ERROR (proc=2): a [20,0,0] =36.000000,  b [20,0,0] =101.000000
ERROR (proc=3): a [1,0,0] =0.000000,  b [1,0,0] =1.000000
 ga_copy_patch (transpose) ........ OK
 ga_scale_patch ................... OK
 ga_add_patch ..................... OK
 ga_sdot_patch .................... OK
 ga_destory ....................... OK
 Commencing NGA Test
 -------------------
 Checking 3-Dimensional Arrays
 ga_fill .......................... OK
ERROR (proc=0): a [20,19,0] =10036.000000,  b [20,19,0] =10101.000000
Last System Error Message from Task 0:: No such file or directory
Last System Error Message from Task 1:: No such file or directory
Last System Error Message from Task 2:: No such file or directory
Last System Error Message from Task 3:: No such file or directory
3:3:bye:: 0
(rank:3 hostname:neuro24a pid:32464):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
1:1:bye:: 0
(rank:1 hostname:neuro24a pid:32462):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
2:2:bye:: 0
(rank:2 hostname:neuro24a pid:32463):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
0:0:bye:: 0
(rank:0 hostname:neuro24a pid:32461):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0 
with errorcode 0.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on


Some of this output appears to be out of order, as is to be expected from a parallel crash.

I would like to add that Open MPI complains
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

but as far as I can tell from googling, this should be mostly harmless.

I will try again from a clean tar.gz tomorrow; maybe there is some other cruft.

Quote:Edoapra Feb 8th 5:55 pm

I will try to answer your questions next
1. The "Linux workstation platform" compilation instructions result in a binary to be used with mpirun, since USE_MPI is set equal to y

OK.
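(So presumably the resulting binary is then launched with something like the following, where the path and process count are just placeholders on my side:)

mpirun -np 4 $NWCHEM_TOP/bin/LINUX64/nwchem input.nw
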
Quote:Edoapra Feb 8th 5:55 pm

2. testtask should not fail for the 3-D Global Arrays test

Thought so :-)
Quote:Edoapra Feb 8th 5:55 pm

3. If no high-speed network is present, you can leave ARMCI_NETWORK undefined. Another option, recently introduced in GA/ARMCI and not yet thoroughly tested on our side, is MPI-MT. I would suggest first trying to get the vanilla ARMCI compilation to work,
and then you might try the ARMCI_NETWORK=MPI-MT setting

I fully agree with that. Also, I did not enable MPI_THREAD_MULTIPLE when compiling Open MPI, so testing that would take more effort. However, does your answer imply that running MPI across nodes over plain Ethernet should work with ARMCI_NETWORK unset? Or is ARMCI_NETWORK=MPI-MT the only chance, if any?
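If I do get around to trying MPI-MT, my understanding is that it would mean roughly the following (untested sketch; the install prefix is a placeholder and the configure flag depends on the Open MPI version):

# rebuild Open MPI with MPI_THREAD_MULTIPLE support
./configure --prefix=/opt/openmpi-mt --enable-mpi-thread-multiple
make all install

# then rebuild NWChem/GA against it
setenv PATH /opt/openmpi-mt/bin:$PATH
setenv ARMCI_NETWORK MPI-MT
cd $NWCHEM_TOP/src
make FC=ifort CC=icc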
Quote:Edoapra Feb 8th 5:55 pm

4. I am not sure how to answer this one, since I can see conflicting details in your question. If you compile NWChem with USE_MPI=y, tests have to be run with doqmtests.mpi, since only doqmtests.mpi uses the needed mpirun

So "QA/HOW-TO-RUN-TESTS" as distributed with the release tar.gz
   
c) Run the doqmtests and runtest.md scripts as described above, but first
      edit those files to substitute "runtests.mpi.unix" for "runtests.unix"
      and "runtest.unix"


is outdated? Could a link to up-to-date instructions be added right on the compilation how-to page? Some people might run the binary without testing ...
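Just to check that I understand point 4 correctly, the MPI-aware test run would then look roughly like this (process count and test name are only examples, and I have not verified the exact doqmtests.mpi arguments):

cd $NWCHEM_TOP/QA
setenv NWCHEM_TARGET LINUX64
./runtests.mpi.unix procs 4 dft_he2+
# or, for the full suite:
./doqmtests.mpi 4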

The testsuite failures I referred to were with the "serial" binary, not the MPI one. Of course these failures are quite impossible to diagnose/debug/assess over the internet. I hope the earlier forum thread on the testsuite failures with the supplied binaries gathers a few more helpful comments.


Best Regards

Christof

