Armci error 260 cond:0

From NWChem


Hello, for the last few weeks I have been trying to analyse an NWChem crash.
The input for the calculation comes from the Benchmarks section of this site and is called C240 Buckminster Fullerene.
It is being calculated on 32 nodes, each with 2 Xeon CPUs with hyperthreading enabled, so each compute
node has 4 computational units. The network interconnect is plain Gigabit Ethernet.

The first crashes were with a home-built binary with -O3 compiler optimisation. Then I built it again with
-O2 optimisation; everything stops at exactly the same spot, and both binaries stop after a computation
of almost equal duration. Both builds were done with Intel MKL, so the next step is to remove MKL and
see what happens. The program is built with MPICH2 and the ifort compiler.

It seems that ARMCI is somehow incorrectly configured or does not know how to communicate.
The significant error seems to be:
ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0

I still have not dug into the code to find out what that means.

Here is an excerpt from the nwchem log.

dft energy failed                                                                       0
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
   278: task dft energy
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 This type of error is most commonly 
 associatated with calculations not reaching convergence criteria
 ------------------------------------------------------------------------
 For more information see the NWChem manual at 
 http://www.emsl.pnl.gov/docs/nwchem/nwchem.html

 For further details see manual section:

0:0:dft energy failed:: 0
(rank:0 hostname:j314.jotunn.rhi.hi.is pid:13071):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: Inappropriate ioctl for device
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0


I am working on some alternatives to try out: eliminating MKL, eliminating BLAS altogether, and trying ATLAS and LAPACK.
Should I use Intel CC instead of GNU CC?
Best regards, Anna Jonna.

Running a precompiled binary from this site gives the same error, but much sooner.
This time an error is reported while creating Global Arrays.
This is with the same mpirun as previously, i.e. MPICH2 compiled with ifort.
I am going to build it again with the GNU Fortran compiler and see what happens.

     Screening Tolerance Information
      -------------------------------
          Density screening/tol_rho: 1.00D-10
          AO Gaussian exp screening on grid/accAOfunc:  14
          CD Gaussian exp screening on grid/accCDfunc:  20
          XC Gaussian exp screening on grid/accXCfunc:  20
          Schwarz screening/accCoul: 1.00D-08

 ------------------------------------------------------------------------
 dft_main0d:                     Error creating ga        0
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
   278: task dft energy
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.emsl.pnl.gov/docs/nwchem/nwchem.html


 For further details see manual section:
Last System Error Message from Task 0:: Inappropriate ioctl for device
0:0:dft_main0d:                     Error creating ga:: 0
(rank:0 hostname:j314.jotunn.rhi.hi.is pid:24216):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
  0: ARMCI aborting 0 (0).
  0: ARMCI aborting 0 (0).
system error message: Invalid argument
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)


This suggests you are not providing enough memory. Are you setting the memory keyword in the input? If so, 1) what is its size, and 2) how much memory do you have per node? Remember that the memory allocation is per core/processor in a node, and you need to leave some memory for the operating system.

Bert
Edited On 3:07:14 PM PDT - Thu, Nov 4th 2010 by Bert

Similar problems
Hello!

Firstly, I have to say that I have a similar problem, not an answer to the last post... sorry.

In my case I'm doing an RI-MP2 geometry optimization of a complex containing Ni, C, O and H atoms.

I did a first optimization successfully with the cc-pVDZ basis set for the light atoms and 6-31G (and, in a further calculation, the 6-31G** basis) for the Ni atom (a total of 388 basis functions).

The problems arise when I change to a bigger basis set for the Ni atom.
If I change to the cc-pVDZ basis set (specified explicitly, available from http://tyr0.chem.wsu.edu/~kipeters/basis.html) or the cc-pVTZ basis set (specified explicitly or with the line "Ni library cc-pVTZ"), the process fails with the error:

1:Bus error, status=: 7
(rank:1 hostname:cl1n006 pid:28807):ARMCI DASSERT fail. signaltrap.c:SigBusHandler():213 cond:0
4:Bus error, status=: 7
2:Bus error, status=: 7
(rank:2 hostname:cl1n006 pid:28813):ARMCI DASSERT fail. signaltrap.c:SigBusHandler():213 cond:0
(rank:4 hostname:cl1n006 pid:28809):ARMCI DASSERT fail. signaltrap.c:SigBusHandler():213 cond:0
"cokni_1h.out" 819L, 29602C

There are no input errors, and the result was the same for a single-point energy or a DFT (XC B3LYP) trial.

- Could the problem be related to the total memory specification?
I'm using the line:

memory 30000 mb

- Is there some rule to estimate the optimal amount of memory required for a calculation in NWChem?

Thanks in advance for any suggestion!

Good luck!

NWChem is running on an SGI Altix cluster, compiled with the Intel Fortran compilers, Infiniband support and MPI.

Quote:Diegoagomezh Nov 4th 6:41 pm

With that memory line you are requesting 30 GByte of memory per processor (per core). Memory needs depend on the calculation, but generally you should not allocate more than the memory in the system. I.e., if you have an 8-core node with 20 GByte of memory, I would probably use "memory 2000 mb" and leave some space for the operating system to run in.

Bert

Hello!

Bert, thank you for your answer!!

I have partially solved my problem...

I was forgetting the "spherical nosegment" keywords (required for the correlation-consistent basis sets) in the BASIS directive line.

Actually, I don't know why the calculations with cc-pVDZ (for C, O and H) and 6-31G (for the Ni atom) finished successfully without the "spherical nosegment" statement (perhaps because of the 6-31G presence?).

Well...
After this correction and fixing the memory line according to Bert's comment, the
"ARMCI DASSERT fail. signaltrap.c:SigBusHandler():213 cond:0" error apparently was solved.

However, now the calculation stops after the first SCF energy calculation, when the RI-MP2 module starts. I get the error:

1:Segmentation Violation error, status=: 11
(rank:1 hostname:cl1n006 pid:3330):ARMCI DASSERT fail. signaltrap.c:SigSegvHandler():301 cond:0

Anna:
In a trial I got the "Armci error 260 cond:0" error, and the problem was a wrong keyword in the BASIS directive: I wrote "nosegmented" (wrong) instead of "nosegment" (right). Perhaps your problem is related to this.

Thanks again for any reply!..

Quote:Diegoagomezh Nov 5th 10:48 am


Spherical nosegment should not be the issue. I would strongly recommend removing the nosegment keyword: there is no reason for using it (it is not required for the basis set) and it increases memory usage. I would have to see an input deck so that I could test it and provide you with more input.

Bert

Hello again...

Bert

Yes! The "nosegment" keyword is not required for the basis (the "spherical" keyword is the one required for the cc-xxx basis sets... I was wrong).

However, I tried without the "nosegment" keyword and the process stops, without any message, before the guess creation. When I add it back, the calculation does the first SCF calc. and stops when the RI-MP2 module starts. Here is a copy of my input...


Thank you again for your help!




START cokni_1h
title "Pw-ni + 1H2 P. optimization cc-pVDZ/cc-pVTZ(NiII)"
ECHO

memory 3000 mb noverify

Geometry units angstrom print NOAUTOZ
Ni 0.00000 0.00000 1.20508
Ni 0.00000 0.00000 -1.20508
O 1.87837 -0.04938 -1.13303
O 1.87837 -0.04938 1.13303
O 0.04938 1.87837 -1.13303
O 0.04938 1.87837 1.13303
O -1.87837 0.04938 -1.13303
O -1.87837 0.04938 1.13303
O -0.04938 -1.87837 -1.13303
O -0.04938 -1.87837 1.13303
C 0.05619 2.45769 0.00000
C 2.45769 -0.05619 0.00000
C -2.45769 0.05619 0.00000
C -0.05619 -2.45769 0.00000
C 3.96341 -0.04562 0.00000
H 4.35033 -0.53051 -0.90513
H 4.30455 1.00279 0.00000
H 4.35033 -0.53051 0.90513
C 0.04562 3.96341 0.00000
H -1.00279 4.30455 0.00000
H 0.53051 4.35033 0.90513
H 0.53051 4.35033 -0.90513
C -3.96341 0.04562 0.00000
H -4.30455 -1.00279 0.00000
H -4.35033 0.53051 0.90513
H -4.35033 0.53051 -0.90513
C -0.04562 -3.96341 0.00000
H 1.00279 -4.30455 0.00000
H -0.53051 -4.35033 0.90513
H -0.53051 -4.35033 -0.90513
H -0.00000103 -0.37759827 3.95105265
H 0.00000103 0.37759827 3.95105265
end

basis spherical nosegment noprint
O library cc-pVDZ
C library cc-pVDZ
H library cc-pVDZ
Ni library cc-pVTZ
end

basis "ri-mp2 basis"
O library cc-pVDZ-fit2-1
C library cc-pVDZ-fit2-1
H library cc-pVDZ-fit2-1
Ni S
  1997.8237701              1.0000000
Ni S
  1097.2416591              1.0000000
Ni S
   496.28986263             1.0000000
Ni S
   196.51539911             1.0000000
Ni S
    87.216971652            1.0000000
Ni S
    35.137417884            1.0000000
Ni S
    11.454909043            1.0000000
Ni S
     4.1042987424           1.0000000
Ni S
     2.7859778875           1.0000000
Ni S
     1.5965165491           1.0000000
Ni S
     0.49186103149          1.0000000
Ni S
     0.28416654012          1.0000000
Ni S
     0.12616265844          1.0000000
Ni P
   650.35199200             1.0000000
Ni P
   184.43208755             1.0000000
Ni P
    47.809364688            1.0000000
Ni P
    15.580224065            1.0000000
Ni P
     7.5148604075           1.0000000
Ni P
     3.8056570107           1.0000000
Ni P
     2.3464576499           1.0000000
Ni P
     0.93782324546          1.0000000
Ni P
     0.51838194511          1.0000000
Ni P
     0.21837657698          1.0000000
Ni P
     0.48219094780E-01            1.0000000
Ni D
   146.64155782             1.0000000
Ni D
    44.430191232            1.0000000
Ni D
    19.451082526            1.0000000
Ni D
     8.4371827589           1.0000000
Ni D
     4.1107905672           1.0000000
Ni D
     2.3210325339           1.0000000
Ni D
     0.97427936878          1.0000000
Ni D
     0.46721224829          1.0000000
Ni D
     0.20770386175          1.0000000
Ni F
    48.616960788            1.0000000
Ni F
    11.017689500            1.0000000
Ni F
     5.4975283859           1.0000000
Ni F
     2.7863055316           1.0000000
Ni F
     1.2328473329           1.0000000
Ni F
     0.61191666508          1.0000000
Ni F
     0.27072533276          1.0000000
Ni G
    18.775427434            1.0000000
Ni G
     6.3224049104           1.0000000
Ni G
     2.8789771807           1.0000000
Ni G
     0.93385065596          1.0000000
Ni G
     0.35342268260          1.0000000

Ni H
     9.6049097340           1.0000000
Ni H
     5.0249732323           1.0000000
Ni H
     1.9659049861           1.0000000
Ni I
     9.8702668554           1.0000000
Ni I
     4.5127683520           1.0000000
end

constraints
fix atom  1:30
end

scf
print low
end

mp2
freeze atomic
end

TASK RIMP2 optimize

Guest
Any updates on this issue? We have a user whose jobs are running into the "ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0" error in a spin-spin calculation.
I will post details once we do some more testing on the issue.

No solution yet. Any DASSERT error message is generally a message that points to you requesting more memory than is available on the system. What is the memory keyword in the spin-spin case, how much memory is on a node, and how many cores are on a node? You need to leave some memory for the operating system, etc.

Bert

Quote: Mar 25th 9:17 pm

Guest
We have tried using as low as "memory 1gb" for a job using all 8 cores from a node with 12 GB of memory.
I was told that a job doing only a geometry optimization also gave the same error.
I intend to take a look at the code to try to get a better idea of where/why this happens, but my programming skills are quite limited, as is my time.

I ran the input deck on 128 cores in under 2 hours with 3 GByte per core. In your case, with 12 GByte per node and 8 cores, I would recommend the 1 GByte memory setting in NWChem. How many cores did you run this test on? I will try to reproduce with your number of cores and your memory settings.

Bert


Quote: Mar 30th 5:57 pm

Problem with MD optimize
I run an MD calculation with mpirun -n 2 nwchem and get this error:

0:0:nga_put_common:cannot locate region: [28394:76246 ,28394:76246 ]:: -999
(rank:0 hostname:master.cluster pid:24491):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0

Do you have a suggestion?

Please start new threads for new items. Here is your item from another email:

Hello,

I have compiled NWChem 6.0 on the cluster. I use OpenMPI compiled with Intel, and I want to use Infiniband.
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

#!/bin/bash

export TCGRSH=/usr/bin/ssh
export USE_MPI=yes
export USE_MPIF=yes
export MPI_LOC=/usr/mpi/intel/openmpi-1.4.2/
export MPI_INCLUDE=$MPI_LOC/include
export MPI_LIB=$MPI_LOC/lib64
export LIBMPI="-L $MPI_LIB -lmpi -lopen-pal -lopen-rte -lmpi_f90 -lmpi_f77"
ARMCI_NETWORK=OPENIB
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/LINUX64/nwchem
cd $NWCHEM_TOP/src
make CC=icc FC=ifort -j4
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

OK, when I run mpirun -n 1 nwchem 1.rwc

it runs with no problem.

But when I run mpirun -n 2 nwchem 1.rwc
it stops:
0:0:nga_put_common:cannot locate region: [28394:76246 ,28394:76246 ]:: -999


What is the problem?

Could you please help me... I have spent a lot of time on this but nothing works.

Best, Christophe




From your build I am missing the setup for Infiniband. See the BUILD file, under Infiniband, for the environment variables to set to compile for IB.
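For reference, the Infiniband-related variables look roughly like the following; the paths are assumptions for a typical system and must be adjusted to match your cluster. Note also that in the script above ARMCI_NETWORK was set but never exported, so make would not have seen it.

```shell
# Select the OpenIB/Infiniband ARMCI transport (must be exported, not just set)
export ARMCI_NETWORK=OPENIB
# Where the OFED/verbs headers and libraries live -- adjust for your system
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -lpthread"
```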

Bert







Quote:Bovigny Jun 17th 6:25 am

