
Compiling for MPI.

Viewed 1143 times, With a total of 6 Posts

Clicked A Few Times
Threads 13
Posts 27
I have been working with NWChem on a workstation for a few months, but now need to scale up the size of the simulations I am doing. I have been trying for a while, without success, to get NWChem 6 working with MPI. Any advice on resolving this would be appreciated.
The environment variables I have set for compiling (the combination with the greatest success so far) are:
export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"

export CC=gcc
export FC=gfortran

export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"

Compilation succeeds with gcc/gfortran, although switching everything to the corresponding Intel compilers and modules consistently fails. The cluster runs Scientific Linux over InfiniBand (IB) with either MVAPICH or OpenMPI, gcc/gfortran 4.4.5, and GNU Make 3.81.
The output when I run is:
[davis68@taub302 uo2-work]$ mpiexec ~/bin/nwchem lda-147.nw 
ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
-10012:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10012 hostname:taub448 pid:24214):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
12:Child process terminated prematurely, status=: 256
(rank:12 hostname:taub448 pid:24188):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 12
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:taub302 pid:21006):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
0:Child process terminated prematurely, status=: 256
(rank:0 hostname:taub302 pid:20981):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0

There is a long wait after the first line, "ARMCI configured for 2 cluster nodes...", before the other messages appear.
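The first line of the run output names the transport ARMCI was built for ('TCP/IP Sockets' in the output above). A minimal shell sketch for pulling that value out of a saved log, so it can be checked without waiting for the timeout. The function name `armci_protocol` is an assumption made for illustration and is not part of NWChem.

```shell
# Hypothetical helper: print the ARMCI network protocol reported in an NWChem log.
# Reads the log on stdin and prints the text between the single quotes of the
# first "Network protocol is '...'" line; prints nothing if no such line exists.
armci_protocol() {
    sed -n "s/.*Network protocol is '\([^']*\)'.*/\1/p" | head -n 1
}

# Example using the line from the output above; prints: TCP/IP Sockets
echo "ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'." | armci_protocol
```

On a cluster with an InfiniBand interconnect, a value of 'TCP/IP Sockets' here is a hint that the binary may have been built without an ARMCI_NETWORK setting.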

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 575
Please carefully read the section about OpenIB in the INSTALL file. You need to specify ARMCI_NETWORK and the location of the IB libraries.

Bert


Quote:Davis68 Dec 19th 8:48 pm

Clicked A Few Times
Threads 13
Posts 27
OK, so I have reverted to specifying ARMCI_NETWORK, as per Bert's advice, with these environment variables:
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"


This is mostly successful. Execution on two nodes yields the following output.
ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs API'.
 argument  1 = lda-147.nw



============================== echo of input deck ==============================
...
normal output for initial processing
...


NWChem correctly detects that there are 24 processors (2 nodes x 12), so the program is getting the MPI support information from the OS (great!). It then crashes with an ARMCI DASSERT failure, immediately as a PSPW geometry optimization starts. The errors follow (in order, I think, although stderr from the nodes is interleaved):
          *               NWPW PSPW Calculation              *
...
     >>>  JOB STARTED       AT Fri Dec 23 14:18:03 2011  <<<
          ================ input data ========================
 Pack_init:error pushing stack        0
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: 
...
Last System Error Message from Task X:: No such file or directory
(rank:X hostname:taub510 pid:18040):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
application called MPI_Abort(MPI_COMM_WORLD, 0) - process X
0:Terminate signal was sent, status=: 15

To make sure that my NW input file was not at fault, I ran the PSPW example for C2H6 and got the same results. What would you suggest to get past this impasse? Thanks.
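One thing that can be ruled out cheaply is a mismatch between IB_LIB and IB_LIB_NAME. The sketch below checks that each `-l<name>` flag has a matching library file in a given directory. The function name `check_ib_libs` is hypothetical, and the check is only an approximation of what the linker does: system libraries such as pthread are normally found through the default search paths, so a "missing" report for them is not necessarily a problem.

```shell
# Hypothetical helper: for each -l<name> flag, look for lib<name>.so* or
# lib<name>.a in the given directory. Returns non-zero if any is missing.
# Usage: check_ib_libs <directory> "<-l flags>"
check_ib_libs() {
    ib_dir=$1
    ib_flags=$2
    ib_missing=0
    for ib_flag in $ib_flags; do
        case $ib_flag in
            -l*) ib_name=${ib_flag#-l} ;;   # strip the leading -l
            *)   continue ;;                # ignore anything that is not a -l flag
        esac
        ib_found=0
        # An unmatched glob stays literal, so the -e test simply fails for it.
        for ib_file in "$ib_dir"/lib"$ib_name".so* "$ib_dir"/lib"$ib_name".a; do
            if [ -e "$ib_file" ]; then
                ib_found=1
            fi
        done
        if [ "$ib_found" -eq 1 ]; then
            echo "found:   lib$ib_name"
        else
            echo "missing: lib$ib_name in $ib_dir"
            ib_missing=1
        fi
    done
    return $ib_missing
}
```

With the variables from this post it would be called as `check_ib_libs "$IB_LIB" "$IB_LIB_NAME"`.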

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Could you send me the complete input and output files at bert.dejong@pnnl.gov? Also, can you tell me how much memory you have per node (each of which has 12 processors, I see)?

Bert


Quote:Davis68 Dec 26th 3:49 pm

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
The user specified 16 GB of memory in the input deck, but the memory value in the input is per processor.

Bert
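Because the memory value is per process, the demand on each node is that value multiplied by the number of processes on the node: 16 GB x 12 = 192 GB here. A small sketch of the arithmetic follows. The function names are hypothetical, and the 48 GB node size in the second call is a made-up figure, since the thread does not state how much memory the nodes have.

```shell
# Total memory (MB) requested on one node when every process asks for per_proc_mb.
node_demand_mb() {
    per_proc_mb=$1
    procs_per_node=$2
    echo $(( per_proc_mb * procs_per_node ))
}

# Largest per-process value (MB) that fits in node_mb with procs_per_node processes.
per_proc_budget_mb() {
    node_mb=$1
    procs_per_node=$2
    echo $(( node_mb / procs_per_node ))
}

node_demand_mb 16384 12       # 16 GB per process, 12 processes: prints 196608
per_proc_budget_mb 49152 12   # hypothetical 48 GB node: prints 4096
```

In practice the per-process value should be set somewhat below this budget to leave room for the operating system and the MPI library.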



Quote:Bert Dec 28th 9:08 pm

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
When compiling, it is recommended to set the following three environment variables:

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y

By adding the third environment variable and recompiling, the user (Neal) was able to run successfully.

Bert



Quote:Bert Dec 28th 10:31 pm

Clicked A Few Times
Threads 13
Posts 27
The solution, in this case, turned out to be adding another environment variable, USE_MPIF4=y. Thanks Bert.
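Collected in one place, the build settings from this thread look roughly like the following. Every value is copied from the posts above and is specific to that cluster (paths, MVAPICH2 version), and the sketch assumes the variables from the first post were kept unchanged in the final build. The loop at the end is an added sanity check, not part of the NWChem build.

```shell
# Build environment assembled from the posts in this thread (site-specific values).
export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export CC=gcc
export FC=gfortran

# MPI settings; USE_MPIF4 is the variable that resolved the crash.
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"

# ARMCI over InfiniBand.
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"

# Added sanity check: report any key variable that ended up empty.
for v in NWCHEM_TOP NWCHEM_TARGET USE_MPI USE_MPIF USE_MPIF4 ARMCI_NETWORK; do
    eval "val=\${$v}"
    [ -n "$val" ] || echo "unset: $v"
done
```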

