Compiling for MPI.
From NWChem
Viewed 1143 times, With a total of 6 Posts
Davis68
1:48:33 PM PST - Mon, Dec 19th 2011
I have been working with NWChem on a workstation for a few months, but now need to scale up the size of my simulations. I have been trying unsuccessfully for a while to get NWChem 6 to build and run with MPI. Any advice on resolving this would be appreciated.
The environment variables I set for compiling (the combination with the greatest success rate so far) are:
export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export CC=gcc
export FC=gfortran
export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"
Compilation succeeds with gcc/gfortran, although switching everything to the corresponding Intel compilers and modules consistently fails. The cluster runs Scientific Linux over InfiniBand (IB) with either MVAPICH or OpenMPI, gcc/gfortran 4.4.5, and GNU Make 3.81.
The output when I run is:
[davis68@taub302 uo2-work]$ mpiexec ~/bin/nwchem lda-147.nw
ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
-10012:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10012 hostname:taub448 pid:24214):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
12:Child process terminated prematurely, status=: 256
(rank:12 hostname:taub448 pid:24188):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 12
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:taub302 pid:21006):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
0:Child process terminated prematurely, status=: 256
(rank:0 hostname:taub302 pid:20981):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
There is a long wait after the first line, "ARMCI configured for 2 cluster nodes...", before the other messages appear.
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
3:28:47 PM PST - Thu, Dec 22nd 2011
Please read the INSTALL file carefully, in particular the section about OpenIB. You need to specify ARMCI_NETWORK and the location of the IB libraries.
Bert
Davis68
8:49:58 AM PST - Mon, Dec 26th 2011
OK, so I have gone back to specifying ARMCI_NETWORK as per Bert's advice, with the environment variables:
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"
This is mostly successful. Execution on two nodes yields the following output.
ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs API'.
argument 1 = lda-147.nw
============================== echo of input deck ==============================
...
normal output for initial processing
...
NWChem correctly detects that there are 24 processors (2 nodes x 12), so the program is getting the MPI support information from the OS (great!). It then crashes with an ARMCI DASSERT failure. The errors follow, in order as far as I can tell, although stderr from the nodes is interleaved. They appear immediately as a PSPW geometry optimization starts.
* NWPW PSPW Calculation *
...
>>> JOB STARTED AT Fri Dec 23 14:18:03 2011 <<<
================ input data ========================
Pack_init:error pushing stack 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
...
Last System Error Message from Task X:: No such file or directory
(rank:X hostname:taub510 pid:18040):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
application called MPI_Abort(MPI_COMM_WORLD, 0) - process X
0:Terminate signal was sent, status=: 15
In order to make sure that this wasn't my NW file's fault, I tried it with the pspw example for C2H6 with the same results. What would you suggest to get past this impasse? Thanks.
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
2:08:30 PM PST - Wed, Dec 28th 2011
Could you send me the complete input and output files at bert.dejong@pnnl.gov? Also, can you tell me how much memory you have per node (each of which has 12 processors, I see)?
Bert
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
3:31:06 PM PST - Wed, Dec 28th 2011
The user specified 16 GB of memory in the input deck, but the memory value in the input deck is per processor.
Bert
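As a rough illustration of the per-processor point above, here is a minimal shell sketch of the arithmetic. The 12 processes per node and the 16 GB figure come from this thread. NODE_MEM_GB is a hypothetical placeholder, since the actual memory per node was not posted, and nothing here establishes that memory was the cause of the crash.

```shell
#!/bin/sh
# Values taken from the thread.
PROCS_PER_NODE=12      # 24 processors reported across 2 nodes
MEM_PER_PROC_GB=16     # memory value given in the input deck

# Memory requested on each node when the setting applies per process.
REQUESTED_PER_NODE_GB=$((PROCS_PER_NODE * MEM_PER_PROC_GB))
echo "requested per node: ${REQUESTED_PER_NODE_GB} GB"

# HYPOTHETICAL node size, for illustration only; substitute the real figure.
NODE_MEM_GB=48

# Upper bound on a per-process setting that still fits in the node
# (before leaving any room for the OS and other processes).
BUDGET_PER_PROC_MB=$((NODE_MEM_GB * 1024 / PROCS_PER_NODE))
echo "upper bound per process: ${BUDGET_PER_PROC_MB} MB"
```

With the numbers from the thread the request comes to 192 GB per node, so the per-process value in the input deck would need to be well below 16 GB on most machines.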
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
4:57:42 PM PST - Wed, Dec 28th 2011
When compiling, it is recommended to set the following three environment variables:
USE_MPI=y
USE_MPIF=y
USE_MPIF4=y
By adding the third environment variable and recompiling, the user (Neal) was able to run successfully.
Bert
Davis68
4:58:00 PM PST - Wed, Dec 28th 2011
The solution, in this case, turned out to be adding another environment variable, USE_MPIF4=y. Thanks, Bert.
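For anyone landing here later, the settings posted in this thread combine into the build environment below. This is a sketch assembled from the posts, not a verified recipe: the paths (NWCHEM_TOP, MPI_LOC, IB_HOME) are specific to this cluster and this MVAPICH2 install, and will differ elsewhere.

```shell
# NWChem 6.0 with gcc/gfortran and MVAPICH2 over InfiniBand,
# as collected from the posts in this thread.
export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export CC=gcc
export FC=gfortran

# MPI settings; USE_MPIF4 is the variable that was missing originally.
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc   # site-specific path
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"

# ARMCI over InfiniBand instead of the default TCP/IP sockets.
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr                          # site-specific path
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"
```

After a rebuild with these settings, the first line of output should report 'OpenIB Verbs API' rather than 'TCP/IP Sockets' as the network protocol, as it did in the second run above.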