Problem running NWChem 6.0

Hi,

I compiled NWchem 6.0 with the script build_nwchem

#!/bin/bash
#export USE_GPROF=yes
export USE_SUBGROUPS=yes
export USE_MPI=yes
#export OLD_GA=yes
export MSG_COMMS=MPI
export USE_PYTHON64=yes
export MPI_LOC=/pkg/mpi/gcc/mvapich2-1.6
export MPI_INCLUDE=$MPI_LOC/include
export MPI_LIB=$MPI_LOC/lib
export LIBMPI="-lfmpich -lmpich -lpthread" # MPICH2 1.2
#export LIBMPI="-lmpichf90 -lmpich -lmpl -lpthread" # MPICH2 1.3.1
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all python"
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/LINUX64/nwchem
export PYTHONPATH=./:$NWCHEM_TOP/contrib/python/
cd $NWCHEM_TOP/src
make DIAG=PAR FC=gfortran CC=gcc nwchem_config
make DIAG=PAR FC=gfortran CC=gcc $1

The executable runs normally on our cluster when a single compute node (48-core SMP) is
used; however, when using two or more nodes, I get this error message:



It appears that tasks allocated on the same host machine do not have
consecutive message-passing IDs/numbers. This is not acceptable
to the ARMCI library as it prevents SMP optimizations and would
lead to poor resource utilization.

Please contact your System Administrator or, if you can, modify the MPI
message-passing job startup configuration.

Last System Error Message from Task 0:: No such process

nwchem:4066 terminated with signal 11 at PC=29b4054 SP=7fff3b45c880. Backtrace:
./nwchem(_armci_buf_get+0x56)[0x29b4054]
./nwchem(_armci_buf_get_clear_busy+0x1f)[0x29b4b6b]
./nwchem(armci_serv_quit+0x4d)[0x29b3496]
./nwchem(armci_wait_for_server+0x28)[0x29ae47a]
./nwchem(ARMCI_Cleanup+0x5d)[0x299da51]
./nwchem(armci_abort+0x1a)[0x299dbc9]
./nwchem(dassertp_fail+0xbf)[0x299eb29]
./nwchem[0x29a31ba]
./nwchem(armci_init_clusinfo+0x193)[0x29a380b]
./nwchem(PARMCI_Init+0x52)[0x299e154]
./nwchem(PARMCI_Init_args+0x3a)[0x299df48]
./nwchem(ARMCI_Init_args+0x1d)[0x29a5725]
./nwchem(install_nxtval+0x55)[0x29bf375]
./nwchem(ALT_PBEGIN_+0xa5)[0x29be5c5]
./nwchem(PBEGIN_+0x1c)[0x29be60c]
./nwchem(pbeginf_+0x16f)[0x29be2df]
./nwchem(MAIN__+0x27)[0x5605e4]
./nwchem(main+0x2c)[0x29c115c]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x2b8cff88dbc6]
./nwchem[0x55f2d5]
MPI process (rank: 0) terminated unexpectedly on alps6-01.cluster.nchc.org.tw
Exit code -5 signaled from alps6-01
handle_mt_peer: fail to read...: Success




What can be done to avoid this problem? Each compute node of the cluster has
48 cores and 128 GB RAM, and the nodes are connected by a QDR InfiniBand network.
MVAPICH2 1.6 was built with the QLogic InfiniBand driver. Thanks for any suggestions.

Jyh-Shyong

Re.
Hi,

Sorry, I made a mistake in my LSF job script, which caused the failure when running NWChem on more than one node. The problem is now solved.

The problem was caused by a wrong hostfile format for the mpirun_rsh command. The correct
format is a list of node names in which each node name appears as many times as the number of cores used on that node, with all entries for a node listed consecutively; this is exactly the format that the variable $LSB_DJOB_HOSTFILE provides.
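As an illustration, a hostfile with the grouping described above can be generated like this (node names and core count here are hypothetical; under LSF, the file named by $LSB_DJOB_HOSTFILE already has this layout, so it can normally be passed to mpirun_rsh directly):

```shell
#!/bin/sh
# Hypothetical node names and a small core count for brevity
# (the original cluster had 48 cores per node).
NODES="alps6-01 alps6-02"
CORES_PER_NODE=4

# Write each node name once per core, with all entries for a node
# grouped consecutively, so ranks on the same host get consecutive IDs.
: > hosts.txt
for node in $NODES; do
  i=1
  while [ "$i" -le "$CORES_PER_NODE" ]; do
    echo "$node" >> hosts.txt
    i=$((i + 1))
  done
done

# The job would then be launched with something like (not run here):
# mpirun_rsh -np 8 -hostfile hosts.txt ./nwchem input.nw
```

With this grouping, ARMCI sees consecutive message-passing IDs on each host and its SMP optimizations work; an interleaved hostfile (alternating node names line by line) triggers the error above.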

Jyh-Shyong


Forum >> NWChem's corner >> Running NWChem




