Optimizing parallel performance on Xeon(R)

From NWChem



Maxjr Member

5:37:24 PM PDT - Tue, Apr 5th 2016

Dear NWChem team,

I recently compiled NWChem on a heterogeneous Linux/Intel cluster using the Intel 2016 compilers and libraries and Intel MPI 5.1.3. The nwchem executable built successfully and seemed to work normally. However, when I ran some parallel tests with the "hess_nh3" test case provided in the nwchem-6.6/QA/tests folder, I found that the parallelization is quite inefficient. In fact, the total CPU time increases rather than decreases as the number of nodes grows. For example, on one node using all of its CPUs (2x8 physical cores per node), the total CPU time is 1045.3 s and the wall time is 1084.8 s, while on 2 nodes with 32 CPUs I get 1550.3 s of CPU time and 2155.5 s of wall time. The result gets even worse as I keep adding nodes: with 4 nodes the calculation already takes about twice the CPU time, and more than three times the wall time, of the single-node run. Furthermore, I verified that running the calculations with multithreading (OMP_NUM_THREADS=1), i.e., also using the virtual CPUs, is significantly slower than running the same calculation on the physical cores only. Below is a summary of the timings from my parallel tests:

1nodes_16cpus Total times cpu: 1045.3s wall: 1084.8s
2nodes_32cpus Total times cpu: 1550.3s wall: 2155.5s
3nodes_48cpus Total times cpu: 1547.8s wall: 2606.0s
4nodes_64cpus Total times cpu: 2071.2s wall: 3680.9s
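
To put the timings above in perspective, here is a minimal sketch that turns the wall times into a speedup and a parallel efficiency. It assumes the single-node run (16 CPUs) is the baseline and defines efficiency as the speedup divided by the increase in CPU count. The function name `scaling_report` is only illustrative and is not part of NWChem.

```sh
#!/bin/sh
# Print speedup and parallel efficiency relative to the first (baseline) row.
# Input on stdin, one run per line: <number of CPUs> <wall time in seconds>
# Assumption: efficiency = speedup / (cpus / baseline_cpus).
scaling_report() {
    awk 'NR == 1 { base_cpus = $1; base_wall = $2 }
         {
             speedup = base_wall / $2
             efficiency = speedup * base_cpus / $1
             printf "%d cpus: speedup %.2f, efficiency %.0f%%\n", $1, speedup, 100 * efficiency
         }'
}

# Wall times copied from the summary above.
scaling_report <<'EOF'
16 1084.8
32 2155.5
48 2606.0
64 3680.9
EOF
```

A speedup below 1.00 means the run is slower than the single-node baseline, which is the case for every multi-node run listed here.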

These tests were run on a queue of Intel Xeon(R) E5620 @ 2.40 GHz compute nodes with 8 physical CPUs (16 cores in total) and 12 MB of cache. Could someone please give me a tip on how to optimize the NWChem installation to reach its best parallel performance on this specific system? Does anyone have a suggestion for other tests to check the parallel performance?
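
Since the comparison between physical and virtual CPUs depends on how many of each a node really has, one way to check is to summarize the output of `lscpu`. The sketch below is only a rough helper: the function name `core_summary` is hypothetical, and the sample input is illustrative (it mirrors the "8 physical, 16 in total" description above) rather than output captured from this cluster.

```sh
#!/bin/sh
# Summarize physical cores and logical CPUs from lscpu-style "Key: value" lines on stdin.
core_summary() {
    awk -F: '
        $1 == "Thread(s) per core" { threads = $2 + 0 }
        $1 == "Core(s) per socket" { cores = $2 + 0 }
        $1 == "Socket(s)"          { sockets = $2 + 0 }
        END {
            physical = cores * sockets
            printf "physical cores: %d, logical CPUs: %d\n", physical, physical * threads
        }'
}

# Illustrative sample input (assumed values, not measured on the cluster).
core_summary <<'EOF'
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
EOF
```

On a real compute node the same function could be fed with `LC_ALL=C lscpu | core_summary`. The resulting physical core count is then a natural upper limit for the number of MPI ranks per node when testing without the virtual CPUs.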

Below I include the build script (install.sh) that I used for the compilation, followed by a Dropbox link to the corresponding installation log:

====================================================================================

#!/bin/sh

module purge

module load compilers/intel/16.0
module load libraries/ipmi/5.1
module load libraries/mkl/16.0

export PATH=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64:$PATH

LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016/linux/mkl/lib/intel64/:/opt/intel/compilers_and_libraries_2016/linux/lib/intel64:/usr/lib64:/lib:/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_LONG_PATHS=Y

export USE_MPI=Y
export USE_MPIF=Y
export USE_MPIF4=Y
export MPI_LOC=/opt/intel/impi/5.1.3.181/
export MPI_INCLUDE="-I/opt/intel/impi/5.1.3.181/intel64/include"
export MPI_LIB="/opt/intel/impi/5.1.3.181/intel64/lib/release -L/opt/intel/impi/5.1.3.181/intel64/lib/"
export LIBMPI="-lmpifort -lmpi -lmpigi -ldl -lrt -lpthread"

export NWCHEM_MODULES="all python"
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE

export LIB_DEFINES=-DDFLT_TOT_MEM=16777216

export IPCCSD="y"
export EACCSD="y"
export MRCC_THEORY=TRUE

export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export USE_PYTHON64=y
export PYTHONLIBTYPE=so

sed -i 's/libpython$(PYTHONVERSION).a/libpython$(PYTHONVERSION).$(PYTHONLIBTYPE)/g' config/makefile.h

export HAS_BLAS=yes
export USE_SCALAPACK=y
export MKLLIB=/opt/intel/compilers_and_libraries_2016/linux/mkl/lib/intel64
export MKLINC=/opt/intel/compilers_and_libraries_2016/linux/mkl/include
export BLASOPT="-L$MKLLIB -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"
export LAPACK_LIBS="-L$MKLLIB -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"

export LAPACK_CPPFLAGS="-DMKL_ILP64 -I$MKLINC"

export SCALAPACK="-L$MKLLIB -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"

export SCALAPACK_CPPFLAGS="-DMKL_ILP64 -I$MKLINC"

export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export USE_64TO32=y

export FC=ifort
export CC=icc

echo "cd $NWCHEM_TOP/src"
cd $NWCHEM_TOP/src

echo "BEGIN --- make realclean "
make realclean
echo "END --- make realclean "

echo "BEGIN --- make nwchem_config "
make nwchem_config
echo "END --- make nwchem_config "

echo "BEGIN --- make"
make CC=icc FC=ifort FOPTIMIZE="-O3 -msse2 -no-prec-div -funroll-loops -unroll-aggressive"
echo "END --- make "

cd $NWCHEM_TOP/src/util
make CC=icc FC=ifort FOPTIMIZE="-O3 -msse2 -no-prec-div -funroll-loops -unroll-aggressive" version
make CC=icc FC=ifort FOPTIMIZE="-O3 -msse2 -no-prec-div -funroll-loops -unroll-aggressive"
cd $NWCHEM_TOP/src
make CC=icc FC=ifort FOPTIMIZE="-O3 -msse2 -no-prec-div -funroll-loops -unroll-aggressive" link

====================================================================================

https://www.dropbox.com/s/aro9k74w6gphpys/install.log?dl=0

Please let me know if you find any ill-defined or missing flags.

I will be really grateful for any help!

All the best,

Max Pinheiro Jr
