6.1.1 MPI build runs great, but only on 1 node

From NWChem

Just Got Here
Threads 1
Posts 4
NWChem 6.1.1 on SL6 Linux, built with gcc-4.4 and openmpi-1.4.3.

Here's what I did to build it:

export NWCHEM_TOP=$PWD
export NWCHEM_TARGET=LINUX64
export INSTALL_PREFIX=/opt/nwchem/6.1.1
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export MPI_LIB=/opt/openmpi/1.4.3/lib
export MPI_INCLUDE=/opt/openmpi/1.4.3/include
export FC=gfortran
export CC=gcc
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make
mkdir -p $INSTALL_PREFIX
mkdir -p $INSTALL_PREFIX/bin
mkdir -p $INSTALL_PREFIX/data
cp $NWCHEM_TOP/bin/${NWCHEM_TARGET}/nwchem $INSTALL_PREFIX/bin
chmod 755 $INSTALL_PREFIX/bin/nwchem
cp -r $NWCHEM_TOP/src/basis/libraries $INSTALL_PREFIX/data
cp -r $NWCHEM_TOP/src/data $INSTALL_PREFIX
cp -r $NWCHEM_TOP/src/nwpw/libraryps $INSTALL_PREFIX/data
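After an install like the one above, a couple of quick sanity checks can confirm the binary landed where expected and links the intended Open MPI (a sketch using the paths from this post; `ldd` output will vary by system):

```shell
# Sketch: post-install checks for the build above (paths are from this post).
ls -l /opt/nwchem/6.1.1/bin/nwchem
# Confirm the binary links the Open MPI libraries it was built against:
ldd /opt/nwchem/6.1.1/bin/nwchem | grep -i mpi
```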

Here's how I run it (using PBS Professional 11.2):

  1. !/bin/bash
  2. PBS -N nwchem
  3. PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
  4. PBS -j oe
mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out

But all 16 processes appear on only one of the 2 nodes I've been allocated for this job. If I switch to running on only 1 node, everything looks great, but more than 1 node causes all of the processes to double-up on only the "master" node.

Any ideas/comments/suggestions?

Thanks a lot!

Just Got Here
Threads 1
Posts 4
Ooops! My PBS job script was autoformatted when I submitted. It should look like this:

#!/bin/bash
#PBS -N nwchem
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
#PBS -j oe

mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out

Just Got Here
Threads 1
Posts 4
I posted this in the "Compiling NWChem" section because I suspect that this problem is associated with the way I built my executable.

Bert (NWChem Developer)
Forum Vet
Threads 4
Posts 597
This is not a build issue as far as I can see. It is the mpiexec command that starts the 16 nwchem processes on one node; NWChem itself has nothing to do with that. You may want to look at the mpiexec manual. For example, adding "-npernode 8" might give you what you need. Alternatively, you may want to use mpirun.

Bert

Quote:Chemogan Jul 18th 5:31 pm
Ooops! My PBS job script was autoformatted when I submitted. It should look like this:

#!/bin/bash
#PBS -N nwchem
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
#PBS -j oe

mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out

Just Got Here
Threads 1
Posts 4
Thanks Bert. Yeah, I'm getting the impression that I did build NWChem successfully, and that I'm just having some trouble with OpenMPI (I'm more accustomed to MPICH2).

Ah-ha! Yes that was it. Works now.

I added "-hostfile" and "-npernode" to my command (mpiexec is just a synonym for mpirun; both are symbolic links to orterun):

mpiexec -n 16 -hostfile $PBS_NODEFILE -npernode 8 nwchem n2.mp2.ccsd.nwchem > n2.mp2.ccsd.out
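For anyone debugging a similar placement problem: the per-node slot count that -npernode should match can be read straight off the PBS nodefile, which lists each host once per allocated slot. A minimal sketch, with a fabricated nodefile standing in for $PBS_NODEFILE:

```shell
# Fabricated two-node, two-slots-each nodefile standing in for $PBS_NODEFILE:
printf 'node1\nnode1\nnode2\nnode2\n' > nodefile.txt
# Count slots per host; -npernode should not exceed these counts.
# Prints each host with its slot count (2 for node1 and node2 here).
sort nodefile.txt | uniq -c
```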

Sorry for posting in the "Compiling" section. Perhaps this thread should be moved to the "Running" section, if that's possible.

Thanks so much for your help!

Gets Around
Threads 17
Posts 72
Hi,

It seems this thread needs to be reopened.
NWChem 6.1.1 does not run across nodes on my system either; NWChem 6.0 runs fine.
Version 6.1.1 (and also the initial 6.1 release), when run across nodes, crashes with:

argument 1 = ../nwchem.nw
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:d071.dcsc.fysik.dtu.dk pid:20939):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
0:armci_rcv_data: read failed: -1
(rank:0 hostname:d071.dcsc.fysik.dtu.dk pid:20936):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/dataserv.c:armci_ReadFromDirect():439 cond:0
-10002:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10002 hostname:d031.dcsc.fysik.dtu.dk pid:22561):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
2:Child process terminated prematurely, status=: 256
(rank:2 hostname:d031.dcsc.fysik.dtu.dk pid:22558):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0

The http://www.nwchem-sw.org/images/Nwchem-6.1.1-src.2012-06-27.tar.gz was built against openmpi 1.3.3 with torque support, with the following script (irrelevant parts of the filesystem paths are replaced by ...) on CentOS 5, x86_64:

export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/.../lib64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/.../bin/mpiexec
export MPI_LIB=/.../lib64
export MPI_INCLUDE=/.../include/
export LIBMPI='-L/.../lib64 -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log

I run the following example (with mpiexec `which nwchem` nwchem.nw):

geometry noautoz noautosym
O 0.0 0.0 1.245956
O 0.0 0.0 0.0
end
basis spherical
* library cc-pvdz
end

dft
mult 3
xc xpbe96 cpbe96
smear 0.0
direct
noio
end

task dft energy

I have also tried specifying the PBS_NODEFILE explicitly with --hostfile ${PBS_NODEFILE}.
On the nodes, I see just one nwchem process per node sitting at 100% CPU; the other instances are at 0% CPU load.
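The per-node symptom described above can be inspected directly. A sketch (assumes passwordless ssh to the allocated nodes; not part of the original post):

```shell
# Sketch: list nwchem processes and their CPU usage on each allocated node.
for host in $(sort -u "$PBS_NODEFILE"); do
    echo "== $host =="
    ssh "$host" 'ps -C nwchem -o pid,pcpu,comm'
done
```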

Forum Vet
Threads 7
Posts 1267
Marcindulak
Could you please post the following files:
$NWCHEM_TOP/src/tools/build/config/makefile.h
$NWCHEM_TOP/src/tools/build/armci/config/makefile.h

Please send the output of the following command, too
mpiexec -V

It would be useful to see the full error/output file from NWChem,
with the -v option passed to mpiexec.

Gets Around
Threads 17
Posts 72
I run this time with:
mpiexec -wdir `pwd` --tmpdir `pwd` --debug-daemons --verbose `which nwchem` nwchem.nw

The resulting files are available:
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1-build_config.log
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1-build_armci_config.log
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1.err
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1.out
http://dcwww.camd.dtu.dk/~dulak/ompi_info

I would also like to see the rules about which characters can and cannot be used when posting on the forum; they are not clearly described. See
http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id338/I_can%27t_post_in_the_compili...
http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id493/%22The_specified_URL_cannot_b...
The forum should be self-contained, so one does not need to create external links in order to provide the requested files.

Forum Vet
Threads 7
Posts 1267
What Linux Distribution?
Marcindulak
What Linux distribution & version are you using?

Forum Vet
Threads 7
Posts 1267
BLAS size
Marcindulak
The only problem I spotted so far (and it should not explain the inter-node problem) is that you are using BLAS (and maybe LAPACK)
from /usr/lib64. My guess is that this library uses 32-bit integers. If this is indeed the case, you would need to tell the tools
configuration by setting the following environment variables:
BLAS_SIZE=4
LAPACK_SIZE=4
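Edo's suggestion amounts to the following fragment in the build environment (a sketch, assuming the distro BLAS/LAPACK in /usr/lib64 indeed use 32-bit integers; the rebuild step is an assumption about how the tools pick the setting up):

```shell
# Sketch: tell the tools configuration the external BLAS/LAPACK use
# 32-bit integers (assumption: typical for distro packages in /usr/lib64).
export BLAS_SIZE=4
export LAPACK_SIZE=4
export BLASOPT="-L/usr/lib64 -lblas -llapack"
# Rebuild the tools directory afterwards so the setting takes effect:
#   cd $NWCHEM_TOP/src/tools && make clean && make
```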

Forum Vet
Threads 7
Posts 1267
Marcindulak
I have managed to reproduce your problem.
However, I do not see any difference with 6.0 ... can you confirm that 6.0 works fine when using ARMCI_NETWORK=SOCKETS and using OpenMPI?

Cheers, Edo

Gets Around
Threads 17
Posts 72
I have compiled 6.1.1 with {BLAS,LAPACK}_SIZE=4 without solving the MPI problem; the only visible change was getting --with-blas4="-L/usr/lib64 -lblas -llapack" in the make stages. As a side comment, shouldn't the LAPACK_LIB variable be set too, and not only BLASOPT?
I see the LAPACK_LIB variable is not mentioned at http://www.nwchem-sw.org/index.php/Compiling_NWChem
With only BLASOPT set, the output looks like:
--without-lapack --with-blas8=-L/usr/lib64 -lblas -llapack

The 6.0 version i use is this one:
http://download.opensuse.org/repositories/home:/marcindulak/CentOS_CentOS-5/
with the log available:
https://build.opensuse.org/package/live_build_log?arch=x86_64&package=nwchem&proje...
It does not look like nwchem 6.0 prints anything about ARMCI_NETWORK, and I haven't set anything.

My impression is that the crashes across nodes started around the time I had to set
USE_MPIF4=y in order to compile NWChem.

Forum Vet
Threads 7
Posts 1267
Marcindulak,
Could you please send me the full stderr/stdout of a successful multinode run with 6.0?
Could you please add the following options to mpiexec/mpirun/orterun
--mca btl_base_verbose 50 --mca btl_openib_verbose 1
Thanks, Edo
Edited On 9:19:32 AM PDT - Wed, Aug 22nd 2012 by Edoapra

Forum Vet
Threads 7
Posts 1267
Please ignore previous post
Marcindulak,
Please ignore the previous post since I have managed to reproduce your findings using the 6.0 and 6.1.1 binaries from your RPMs
(it took me a while to figure out the right openmpi orterun option to get things working, however ...).
More later, Edo

Forum Vet
Threads 7
Posts 1267
How to revert 6.1 back to the 6.0 behavior for the tools directory
Marcindulak,
The following recipe might work to fix your 6.1 issues (it worked for me).
It allows you to link with the same parallel tools used in 6.0.

cd $NWCHEM_TOP/src/tools
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y clean
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y
cd ..
make FC=gfortran link

Cheers, Edo

Gets Around
Threads 17
Posts 72
The following fails for me when linking BLAS; are the steps in the right order?

export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/lib
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/usr/lib64/openmpi/1.4-gcc/bin/mpiexec
export MPI_LIB=/usr/lib64/openmpi/1.4-gcc/lib
export MPI_INCLUDE=/usr/lib64/openmpi/1.4-gcc/include
export LIBMPI='-L/usr/lib64/openmpi/1.4-gcc/lib -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log
cd $NWCHEM_TOP/src/tools
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y clean 2>&1 | tee ../make_ga_clean.log
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y 2>&1 | tee ../make_ga.log
cd ..
make FC=gfortran link 2>&1 | tee make_link.log

with:
/.../nwchem-6.1.1-src/src/task/task_bsse.F:1778: undefined reference to `ycopy_'

Please fix the problems with posting to the forum: I'm wasting about 5 minutes per post trying
to figure out, line by line, which characters are allowed and which are not.
This time I figured out that the single quote is not allowed.

