Improper compilation causes memory error during running nwchem

From NWChem


Clicked A Few Times
Threads 2
Posts 6
Hi,
I am struggling to compile NWChem 6.6 on Ubuntu 16.10 with gfortran. My locales are set to German (LC_NUMERIC="de_DE.UTF-8"). Following the instructions in "Documentation > Compiling NWChem > 3.1 NWChem 6.6 on Ubuntu 14.04 (Trusty Tahr)", compilation finishes without error messages. As a simple test case I then use the example input file:
title "Nitrogen cc-pvdz SCF geometry optimization"
geometry
n 0 0 0
n 0 0 1.08
end
basis
n library cc-pvdz
end
task scf optimize
My own compiled version ends with this error:
from getmem: mem. needed= 248762 , mem. available= 209363
Error no. 1 in getmem memory overflow : call no., amount requested : 85 49790
0:texas: nerror called:Received an Error in Communication
Comparing the output with a pre-built version of NWChem, which shows no memory issue, I get the following diff
(side-by-side: pre-built version | source-compiled version with error):
          Memory information                   |           Memory information
          ------------------                   |           ------------------
heap     = 13107198 doubles = 100.0 Mbytes     | heap     = 13107200 doubles = 100.0 Mbytes
stack    = 13107195 doubles = 100.0 Mbytes     | stack    = 13107197 doubles = 100.0 Mbytes
global   = 26214400 doubles = 200.0 Mbytes     | global   = 26214400 doubles = 200.0 Mbytes
           (distinct from heap & stack)        |            (distinct from heap & stack)
total    = 52428793 doubles = 400.0 Mbytes     | total    = 52428797 doubles = 400.0 Mbytes
verify   = yes                                 | verify   = yes
hardfail = no                                  | hardfail = no
and (pre-built version first, then the source-compiled version):

Pre-built version:

Forming initial guess at 0.1s

Superposition of Atomic Density Guess
-------------------------------------
Sum of atomic energies: -108.60004629

Non-variational initial energy
------------------------------
Total energy = -109.172911
1-e energy   = -194.701220
2-e energy   =   61.519341
HOMO         =   -0.421673
LUMO         =    0.042733

Symmetry analysis of molecular orbitals - initial
-------------------------------------------------
!! scf_movecs_sym_adapt: 4 vectors were symmetry contaminated
Symmetry fudging
!! scf_movecs_sym_adapt: 4 vectors were symmetry contaminated
Numbering of irreducible representations:

Source-compiled version:

Forming initial guess at 0.0s

Superposition of Atomic Density Guess
-------------------------------------
Sum of atomic energies: -108.60004629

from getmem: mem. needed= 248762 , mem. available= 209363
------------------------------------------------------------------------
texas: nerror called 0
------------------------------------------------------------------------
current input line :
9: task scf optimize
------------------------------------------------------------------------
An error occured while computing integrals
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documen

For further details see manual section:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Since the pre-built version runs as expected, I assume the memory issue is related to compiling from source. Does anyone have an idea what might have gone wrong during compilation?
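(As background on the getmem line: the texas integral package asked for more local memory than the 100 MB heap/stack defaults visible in the memory table shown earlier. Independent of the build question, those per-job limits can be raised with NWChem's standard memory directive near the top of the input file; an illustrative line with arbitrary values:

```
memory heap 200 mb stack 200 mb global 400 mb
```

This does not explain the difference between the two binaries, but it is a quick way to check whether a job is simply short on heap.)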

Forum Vet
Threads 9
Posts 1389
Have you applied the patches listed at
http://www.nwchem-sw.org/index.php/Download#Patches_for_the_27746_revision_of_NWChem_6.6

Could you please post the output of the following commands

gcc -v
gfortran -v
mpicc -v
mpif90 -v
env | grep MPI
env | grep BLAS
env | grep SCALA
env | grep USE_6
env | egrep NWC
head -25 $NWCHEM_TOP/src/tools/build/config.log
grep -i gemm $NWCHEM_TOP/src/nwdft/xc/xc_tabcd.F

Clicked A Few Times
Threads 2
Posts 6
Solved! Thank you very much! Your first suggestion did the trick!

Somehow I missed the patches completely. After installing every patch (some seemed to be already included in the source), the simulation with my own compiled version now finishes successfully, without error messages. For lack of time I didn't figure out which of the patches was responsible for the error message mentioned above, sorry.

Thank you again for pointing me in the right direction!
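For anyone who lands here with the same problem: the procedure on the download page amounts to fetching each patch into $NWCHEM_TOP/src, unpacking it, and applying it with patch -p0. A minimal sketch, assuming the patch files are still hosted under http://www.nwchem-sw.org/images/ (the list below is only a subset of the patches named later in this thread; check the download page for the full, current set):

```shell
#!/bin/sh
# Sketch: fetch and apply NWChem 6.6 patches. The base URL and the patch
# names are assumptions taken from the download page and from this thread.
BASEURL="http://www.nwchem-sw.org/images"
PATCHES="Tddft_mxvec20 Config_libs66 Cosmo_meminit Sym_abelian Xccvs98"

# apply_patches echo  -> only print the commands (dry run)
# apply_patches run   -> execute them (run this from $NWCHEM_TOP/src)
apply_patches () {
    mode="$1"
    for p in $PATCHES; do
        for cmd in "wget $BASEURL/$p.patch.gz" \
                   "gzip -d $p.patch.gz" \
                   "patch -p0 < $p.patch"; do
            if [ "$mode" = "run" ]; then
                eval "$cmd" || return 1
            else
                echo "$cmd"
            fi
        done
    done
}

apply_patches echo   # dry run: show the commands without touching anything
```

The dry-run mode makes it easy to verify the command list before letting it modify the source tree.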

Gets Around
Threads 17
Posts 54
centos 7.3 NWChem 6.6 segmentation fault
I compiled NWC 6.6.revision.-src.2015-10-20 on an Intel system with 64 GB memory, basically following the standard procedure that has worked on other CentOS 7 systems. An nwchem binary was created in /bin/, which I usually take as a successful compilation. However, when I try to run a test job, a segmentation fault is printed to the console. Any ideas what went wrong?

Here are the details:
OS and installed programs:
OS: CentOS-7.3-x86_64 7-3.1611.el7
openmpi.x86_64 1.10.0-10.el7 and
openmpi-devel.x86_64 1.10.0-10.el7
make.x86_64 3.82-21.el7
python.x86_64 2.7.5-39.el7_2
python-devel.x86_64 2.7.5-39.el7_2
gcc.x86_64 4.8.5-4.el7
gcc-c++.x86_64 4.8.5-4.el7
gcc-gfortran.x86_64 4.8.5-4.el7
perl.x86_64 4:5.16.3-286.el7
perl-libs.x86_64 4:5.16.3-286.el7
tcsh.x86_64 4:5.16.3-286.el7
openssh.x86_64 6.6.1p1-25.el7_2
openssh-clients.x86_64 6.6.1p1-25.el7_2
openblas.x86_64 0.2.19-3.el7
openblas-devel.x86_64 0.2.19-3.el7
openblas-openmp.x86_64 0.2.19-3.el7
openblas-openmp64.x86_64 0.2.19-3.el7
openblas-openmp64_.x86_64 0.2.19-3.el7
openblas-serial64.x86_64 0.2.19-3.el7
openblas-serial64_.x86_64 0.2.19-3.el7
openblas-threads.x86_64 0.2.19-3.el7
openblas-threads64.x86_64 0.2.19-3.el7
openblas-threads64_.x86_64 0.2.19-3.el7
scalapack-openmpi-devel.x86_64 2.0.2-15.el7
scalapack-common.x86_64 2.0.2-15.el7
blas.x86_64 3.4.2-5.el7
blas-devel.x86_64 3.4.2-5.el7
environment-modules.x86_64 3.2.10-10.el7
hwloc-libs.x86_64 1.7-5.el7
infinipath-psm.x86_64 3.3-0.g6f42cdb1bb8.2.el7
lapack.x86_64 3.4.2-5.el7
lapack-devel.x86_64 3.4.2-5.el7
libfabric.x86_64 1.1.0-2.el7
libibumad.x86_64 1.3.10.2-1.el7
libpsm2.x86_64 0.7-4.el7
opensm-libs.x86_64 3.3.19-1.el7
elpa-openmpi.x86_64 2015.02.002-4.el7
elpa-openmpi-devel.x86_64 2015.02.002-4.el7
atlas.x86_64 3.10.1-10.el7
blacs-common.x86_64 2.0.2-15.el7
blacs-openmpi.x86_64 2.0.2-15.el7
compat-openmpi16.x86_64 1.6.4-10.el7
elpa-common.noarch 2015.02.002-4.el7
elpa-devel.noarch 2015.02.002-4.el7
libesmtp.x86_64 1.0.6-7.el7

Following patches were installed:
Tddft_mxvec20.patch
Config_libs66.patch
Cosmo_meminit.patch
Sym_abelian.patch
Xccvs98.patch
Dplot_tolrho.patch
Driver_smalleig.patch
Ga_argv.patch
Ga_defs.patch
Zgesvd.patch
Cosmo_dftprint.patch
Util_gnumakefile.patch
Util_getppn.patch
Notdir_fc.patch
Xatom_vdw.patch

The environment variables were set as follows:
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_INTERNALBLAS=y
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/usr/lib64/openmpi/bin/:$PATH
export NWCHEM_MODULES="all"
export NWCHEM_TOP=/usr/local/nwchem-6.6
export BLAS_SIZE=4
export SCALAPACK_SIZE=4
export USE_64TO32=y
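(A side note on this set: USE_INTERNALBLAS normally stands in for an external BLAS, while BLAS_SIZE/SCALAPACK_SIZE and USE_64TO32 matter when linking against an external 32-bit-integer BLAS such as the OpenBLAS packages listed above. A hedged sketch of the external-BLAS variant; the library paths and names here are assumptions for a CentOS 7 box, not taken verbatim from the NWChem docs:)

```shell
# Assumed external-BLAS setup for CentOS 7 with system OpenBLAS and Open MPI
export NWCHEM_TOP=/usr/local/nwchem-6.6
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
export USE_MPI=y
export BLASOPT="-L/usr/lib64 -lopenblas"    # external BLAS instead of USE_INTERNALBLAS
export BLAS_SIZE=4                          # the system BLAS uses 32-bit integers
export USE_64TO32=y                         # convert NWChem's 64-bit integer calls to match
export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
```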

The console command and reply were as follows:
mpirun -np 2 /usr/local/nwchem/bin/nwchem n2.in > n2-4.out

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7F8D4D4BE467
#1 0x7F8D4D4BEAAE
#2 0x7F8D4C7A924F
#0 0x7F35D21D7467
#1 0x7F35D21D7AAE
#2 0x7F35D14C224F
#3 0x2BCE6C0 in dcopy_
#3 0x2BCE6C0 in dcopy_
#4 0x2B310B3 in ycopy_
#4 0x2B310B3 in ycopy_
#5 0x9B6999 in pstat_init_ at pstat_init.F:32
#5 0x9B6999 in pstat_init_ at pstat_init.F:32
#6 0x406960 in MAIN__ at nwchem.F:204
#6 0x406960 in MAIN__ at nwchem.F:204
(the two MPI ranks print their backtraces interleaved)

The output.out file contents:

corsair3.cns.uaf.edu.1197hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
corsair3.cns.uaf.edu.1198hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
argument  1 = n2.in


mpirun noticed that process rank 0 with PID 1197 on node corsair3 exited on signal 11 (Segmentation fault).



Thanks for any suggestions,
John Keller

Gets Around
Threads 17
Posts 54
minor correction
The nwchem executable was created - apparently normally - in the /usr/local/nwchem-6.6/LINUX64/bin directory.
John K.
Edited On 10:57:41 AM PDT - Fri, Mar 24th 2017 by Jwkeller

Forum Vet
Threads 9
Posts 1389
Please send the output of the following commands

grep -i gemm $NWCHEM_TOP/src/nwdft/xc/xc_tabcd.F
ldd $NWCHEM_TOP/bin/LINUX64/nwchem
nm $NWCHEM_TOP/bin/LINUX64/nwchem|grep ygemm|head

Gets Around
Threads 17
Posts 54
[jkeller@corsair3 ~]$ grep -i gemm /usr/local/nwchem-6.6/src/nwdft/xc/xc_tabcd.F
                     call ygemm('T', 'N', nnia, nnja, nq, 1.d0, Bmat,
call ygemm('T', 'N', nnia, nnja, nq, 1.0d0, Emat,
call ygemm('T', 'N', nnia, nnja, nq, -1.d0, Bmat,
call ygemm('T', 'N', nnia, nnja, 3*nq,
call ygemm('T', 'N', nnia, nnja, 3*nq,

[jkeller@corsair3 ~]$ ldd /usr/local/nwchem-6.6/bin/LINUX64/nwchem
linux-vdso.so.1 => (0x00007fff2d18d000)
libmpi_usempi.so.5 => /usr/lib64/openmpi/lib/libmpi_usempi.so.5 (0x00007f14c81d3000)
libmpi_mpifh.so.12 => /usr/lib64/openmpi/lib/libmpi_mpifh.so.12 (0x00007f14c7f7d000)
libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f14c7c99000)
librt.so.1 => /lib64/librt.so.1 (0x00007f14c7a7d000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f14c7759000)
libm.so.6 => /lib64/libm.so.6 (0x00007f14c7457000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f14c723b000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f14c7024000)
libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f14c6de8000)
libc.so.6 => /lib64/libc.so.6 (0x00007f14c6a27000)
libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f14c67a9000)
libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f14c6505000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f14c6301000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f14c60fd000)
libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f14c5ec3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f14c83d7000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f14c5cb6000)
libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f14c5aac000)

[jkeller@corsair3 ~]$ nm /usr/local/nwchem-6.6/bin/LINUX64/nwchem|grep ygemm|head
0000000002b31100 T ygemm_

Gets Around
Threads 17
Posts 54
Edo,
Do I need to send anything more to the Forum relating to this issue?
John Keller

Forum Vet
Threads 9
Posts 1389
John, to be honest with you, I am not quite sure what is going wrong in your installation on CentOS 7.3.
Please send me the output of the following:
cd $NWCHEM_TOP/src/blas
make clean
make

Gets Around
Threads 17
Posts 54
Edo - I re-compiled as above, and now it works. (?) The only thing I did differently was to add "make clean" before "make".

However, I am getting messages at the top of the .log file, "hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds...", one per processor requested. According to some website, this is due to a bug in libfabric 1.3, which is included in CentOS 7.3. CentOS 7.2 has libfabric 1.1.

John K.

Forum Vet
Threads 9
Posts 1389
John,
Thanks for your feedback.
I think I have found a workaround for the fifteen-second hfi_wait_for_device issue. Please execute the following commands:
mkdir -p $HOME/.openmpi
echo "mtl = psm" >> $HOME/.openmpi/mca-params.conf

This trick worked for my CentOS 7.3 installation.

Gets Around
Threads 17
Posts 54
Edo - Thanks. That works, but only for the account running NWC on that machine ("corsair3").

However, the lines are still there at the top of the .out file when WebMO runs the job on that machine. WebMO supposedly runs applications on this server under this same user's account, but there must be some other way WebMO is launching NWC.

John K.
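The per-user $HOME/.openmpi/mca-params.conf indeed only covers that one account. Open MPI also reads MCA parameters from environment variables and from a system-wide parameter file, either of which may reach jobs started by a service such as WebMO. A sketch (the system-wide path is an assumption for this particular install; the Open MPI documentation describes the exact search order):

```shell
# MCA parameters can be passed through the environment of whatever launches the job,
# e.g. WebMO's wrapper script, using the OMPI_MCA_<param> naming convention:
export OMPI_MCA_mtl=psm

# There is also a system-wide file read by every user of a given Open MPI install,
# <prefix>/etc/openmpi-mca-params.conf; on this box that would presumably be:
#   echo "mtl = psm" >> /usr/lib64/openmpi/etc/openmpi-mca-params.conf
```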

Just Got Here
Threads 2
Posts 3
Problem linking python
I'm having problems linking python into NWChem. I'm following the instructions for setting the environment variables. Here's a script for compiling:
#!/bin/bash

export USE_MPI="y"
#export USE_PYTHONCONFIG="Y"
export PYTHONVERSION="2.7"
export PYTHONHOME="/scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh"

export PYTHONLIBTYPE="so"
export PYTHONCONFIGDIR="config/../.."

export BLASOPT="-llapack -lblas"
export BLAS_SIZE="8"
export USE_ARUR="n"

export NWCHEM_TOP="/scr_haswell/swsides/directpkgs/nwchem-6.6"
export NWCHEM_TARGET="LINUX64"


echo ""
echo "-------------------------------------------------------------"
echo "Setup environment settings"
echo ""
echo "NWCHEM_TOP =$NWCHEM_TOP"
echo "NWCHEM_TARGET=$NWCHEM_TARGET"
echo "-------------------------------------------------------------"

echo ""
echo "Running make nwchem_config..."
echo ""
make nwchem_config NWCHEM_MODULES="all python" > test-config.log
sleep 2

echo ""
echo "Running make"
echo ""
make -j 8
========================================================================


The error is:

make nwchem.o stubs.o
make[1]: Entering directory `/scr_haswell/swsides/directpkgs/nwchem-6.6/src'
gfortran -m64 -ffast-math -Warray-bounds -fdefault-integer-8 -march=native -mtune=native -finline-functions -O2 -g -fno-aggressive-loop-optimizations -g -O -I. -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/include -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/include -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DEXT_INT -DLINUX -DLINUX64 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/scr_haswell/swsides/directpkgs/nwchem-6.6'" -DNWCHEM_BRANCH="'6.6'" -c -o nwchem.o nwchem.F
gfortran -m64 -ffast-math -Warray-bounds -fdefault-integer-8 -march=native -mtune=native -finline-functions -O2 -g -fno-aggressive-loop-optimizations -g -O -I. -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/include -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/include -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DEXT_INT -DLINUX -DLINUX64 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/scr_haswell/swsides/directpkgs/nwchem-6.6'" -DNWCHEM_BRANCH="'6.6'" -c -o stubs.o stubs.F
make[1]: Leaving directory `/scr_haswell/swsides/directpkgs/nwchem-6.6/src'
gfortran -Wl,--export-dynamic -L/scr_haswell/swsides/directpkgs/nwchem-6.6/lib/LINUX64 -L/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/lib -o /scr_haswell/swsides/directpkgs/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil /scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh/lib/python2.7/config/../../libpython2.7.so -llapack -lblas -lnwclapack -lnwcblas -L/scr_haswell/swsides/opt/contrib-qmcpack/mpich-3.1.4-shared/lib -lmpifort -lmpi -lrt -lm -lpthread -lnwcutil -lpython2.7 -lpthread -ldl -lutil -lm
/usr/bin/ld: cannot find -lpython2.7
collect2: error: ld returned 1 exit status
make: *** [all] Error 1

I can adjust the

/scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh/lib/python2.7/config/../../libpython2.7.so

line to

-L /scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh/lib (where the shared lib actually is)

and this will work. But the entire logic of the makefiles is broken. Is there a fix or a patch? I've got an automatic build system that needs to use the in-place build system of the application, and I can't edit makefiles by hand.

Forum Vet
Threads 9
Posts 1389
Could you unset the following env. variables and try to link again: PYTHONLIBTYPE, PYTHONCONFIGDIR

cd $NWCHEM_TOP/src
unset PYTHONLIBTYPE
unset PYTHONCONFIGDIR
make link


Forum >> NWChem's corner >> Compiling NWChem


