Help with large CCSD(T) Calculation

From NWChem


Just Got Here
Threads 1
Posts 4
Hi all. I'm working on some CCSD(T) calculations of CO2 dimers using aug-cc-pvqz basis sets, and I realize that this is a very large job. I've previously run a few of these calculations with Molpro (on XSEDE Blacklight). I don't have my notes on me, but if I recall correctly they took about 20 hours on 16 cores and required ~256GB of memory.

I would like to try running these jobs in NWChem instead, but I'm having two problems: 1) tweaking the performance options, and 2) my jobs are dying due to a file writing error.

First, here is my input file. I've not included the basis set specification, as it's a long copy/paste from BSEL

title "co2 test"
#memory stack 9600 mb heap 800 mb global 4800 mb // tried this also, same error
memory stack 1500 mb heap 100 mb global 1400 mb
geometry
	symmetry c1
	C   2.12544   0.00000   0.00000
	O   1.82852  -0.93172  -0.62769
	O   2.42235   0.93172   0.62769
	C  -2.12544   0.00000   0.00000
	O  -1.20623  -0.32695  -0.63119
	O  -3.04465   0.32695   0.63119
end
basis
	## *snip*  
end
bsse
	mon firstmonomer 1 2 3
	mon secondmonomer 4 5 6
end
scf
	singlet
	rhf
end
tce
	ccsd(t)
	2eorb
	io ga
	tilesize 10 # also tried 15 and 20
end
task tce energy


I'm running the jobs on XSEDE Trestles, on 8 cores (mpirun_rsh) over 2 nodes (64GB mem/node), with the environment variable ARMCI_DEFAULT_SHMMAX=2048. I've also tried running without the variable set, with the same results.
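
A quick check of the two memory lines in the input, as a minimal sketch. It assumes that the NWChem memory directive applies to each process, and that 8 processes share each 64GB node (an assumption; the output reports 16 processors in total, spread over two hostnames). The helper name per_node_mb is made up for illustration.

```python
# Rough per-node memory footprint implied by an NWChem "memory" line.
# Assumption: stack, heap and global are allocated by every process.

def per_node_mb(stack_mb, heap_mb, global_mb, procs_per_node):
    """Total MB requested on one node by all processes running there."""
    return (stack_mb + heap_mb + global_mb) * procs_per_node

NODE_MB = 64 * 1024      # 64GB per node, as stated in the post
PROCS_PER_NODE = 8       # assumption: 16 processes over 2 nodes

settings = {
    "commented-out line": (9600, 800, 4800),
    "active line": (1500, 100, 1400),
}

for label, (stack, heap, glob) in settings.items():
    total = per_node_mb(stack, heap, glob, PROCS_PER_NODE)
    verdict = "fits" if total <= NODE_MB else "exceeds the node"
    print(f"{label}: {total} MB per node ({verdict})")
```

Under these assumptions the commented-out setting asks for 121600 MB on a 65536 MB node, while the active setting asks for 24000 MB, so the larger request could not have been satisfied with 8 processes per node.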

So now the results. The job runs for a while, and generates ~150GB of temp files before dying. I've pasted the relevant output below.
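
The ~150GB of temp files is roughly what the integral-file lines in the output below imply. The following is a minimal sketch of that estimate, assuming the cached AO integrals are written to disk as full records (the output reports "Max. records in memory = 0" and "#cached =100.0%"). The function name is made up for illustration.

```python
import math

# Values copied from the integral-file lines of the output in this post.
RECORD_SIZE_DOUBLES = 65536      # "Record size in doubles"
INTEGRALS_PER_RECORD = 32766     # "No. of integs per rec"
N_INTEGRALS = 1.013e10           # "#integrals"
BYTES_PER_DOUBLE = 8

def integral_file_bytes(n_integrals, ints_per_record, record_doubles):
    """Estimated total size of the cached AO integral files, in bytes."""
    n_records = math.ceil(n_integrals / ints_per_record)
    return n_records * record_doubles * BYTES_PER_DOUBLE

total = integral_file_bytes(N_INTEGRALS, INTEGRALS_PER_RECORD, RECORD_SIZE_DOUBLES)
print(f"{total / 2**30:.0f} GiB in total across all processes")
```

Each record holds 32766 integrals in 65536 doubles, i.e. about 16 bytes per integral, so 1.013D+10 integrals come to about 150 GiB. If that reading is right, the temp files are the cached AO integrals from the SCF step rather than anything written by the TCE step that fails afterwards.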

*snip*
           General Information
            -------------------
      Number of processors :    16
         Wavefunction type : Restricted Hartree-Fock
          No. of electrons :    44
           Alpha electrons :    22
            Beta electrons :    22
           No. of orbitals :  1234
            Alpha orbitals :   617
             Beta orbitals :   617
        Alpha frozen cores :     0
         Beta frozen cores :     0
     Alpha frozen virtuals :     0
      Beta frozen virtuals :     0
         Spin multiplicity : singlet 
    Number of AO functions :   630
       Number of AO shells :   120
        Use of symmetry is : off
      Symmetry adaption is : off
         Schwarz screening : 0.10D-09
  !! WARNING !! The number of MO is less than the number of AO
          Correlation Information
          -----------------------
          Calculation type : Coupled-cluster singles & doubles w/ perturbation           
   Perturbative correction : (T)                                                         
            Max iterations :      100
        Residual threshold : 0.10D-06
          DIIS level shift : 0.00D+00
    CC-LR DIIS level shift : 0.00D+00
    CC-IR DIIS level shift : 0.00D+00
          Amplitude update :  5-th order DIIS
                I/O scheme : Global Array Library
            Memory Information
            ------------------
          Available GA space size is    ********** doubles
          Available MA space size is     681563897 doubles
 Maximum block size supplied by input
 Maximum block size        20 doubles
 tile_dim =     20
 Block   Spin    Irrep     Size     Offset   Alpha
 -------------------------------------------------
   1    alpha     a     11 doubles       0       1
   2    alpha     a     11 doubles      11       2
   3    beta      a     11 doubles      22       1
   4    beta      a     11 doubles      33       2
   5    alpha     a     19 doubles      44       5
   6    alpha     a     20 doubles      63       6
   7    alpha     a     20 doubles      83       7
   8    alpha     a     20 doubles     103       8
   9    alpha     a     20 doubles     123       9
  10    alpha     a     20 doubles     143      10
  11    alpha     a     19 doubles     163      11
  12    alpha     a     20 doubles     182      12
  13    alpha     a     20 doubles     202      13
  14    alpha     a     20 doubles     222      14
  15    alpha     a     20 doubles     242      15
  16    alpha     a     20 doubles     262      16
  17    alpha     a     19 doubles     282      17
  18    alpha     a     20 doubles     301      18
  19    alpha     a     20 doubles     321      19
  20    alpha     a     20 doubles     341      20
  21    alpha     a     20 doubles     361      21
  22    alpha     a     20 doubles     381      22
  23    alpha     a     19 doubles     401      23
  24    alpha     a     20 doubles     420      24
  25    alpha     a     20 doubles     440      25
  26    alpha     a     20 doubles     460      26
  27    alpha     a     20 doubles     480      27
  28    alpha     a     20 doubles     500      28
  29    alpha     a     19 doubles     520      29
  30    alpha     a     20 doubles     539      30
  31    alpha     a     20 doubles     559      31
  32    alpha     a     20 doubles     579      32
  33    alpha     a     20 doubles     599      33
  34    alpha     a     20 doubles     619      34
  35    beta      a     19 doubles     639       5
  36    beta      a     20 doubles     658       6
  37    beta      a     20 doubles     678       7
  38    beta      a     20 doubles     698       8
  39    beta      a     20 doubles     718       9
  40    beta      a     20 doubles     738      10
  41    beta      a     19 doubles     758      11
  42    beta      a     20 doubles     777      12
  43    beta      a     20 doubles     797      13
  44    beta      a     20 doubles     817      14
  45    beta      a     20 doubles     837      15
  46    beta      a     20 doubles     857      16
  47    beta      a     19 doubles     877      17
  48    beta      a     20 doubles     896      18
  49    beta      a     20 doubles     916      19
  50    beta      a     20 doubles     936      20
  51    beta      a     20 doubles     956      21
  52    beta      a     20 doubles     976      22
  53    beta      a     19 doubles     996      23
  54    beta      a     20 doubles    1015      24
  55    beta      a     20 doubles    1035      25
  56    beta      a     20 doubles    1055      26
  57    beta      a     20 doubles    1075      27
  58    beta      a     20 doubles    1095      28
  59    beta      a     19 doubles    1115      29
  60    beta      a     20 doubles    1134      30
  61    beta      a     20 doubles    1154      31
  62    beta      a     20 doubles    1174      32
  63    beta      a     20 doubles    1194      33
  64    beta      a     20 doubles    1214      34
 Global array virtual files algorithm will be used
 Parallel file system coherency ......... OK
 Integral file          = ./co2.aoints.00
 Record size in doubles =  65536        No. of integs per rec  =  32766
 Max. records in memory =      0        Max. records in file   = ******
 No. of bits per label  =     16        No. of bits per value  =     64
 #quartets = 1.807D+07 #integrals = 1.013D+10 #direct =  0.0% #cached =100.0%
File balance: exchanges=    63  moved=  7630  time=   5.1
 Fock matrix recomputed
 1-e file size   =           380689
 1-e file name   = ./co2.f1            
 Cpu & wall time / sec          137.1          183.2
 4-electron integrals stored in orbital form
  available GA memory                2516039248  bytes
 available GA memory                2516392056  bytes
 createfile: failed ga_create size=*********
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: 
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 For more information see the NWChem manual at 
 http://www.emsl.pnl.gov/docs/nwchem/nwchem.html
 For further details see manual section: 
14:14:createfile: failed ga_create size=:: 2137779302
(rank:14 hostname:trestles-2-4.local pid:10520):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 14:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 14
*snip* (the other ranks print the same messages, interleaved with each other: ranks 0-7 on trestles-2-32.local, ranks 8-15 on trestles-2-4.local, each failing with ga_create size 2137779302 and reporting 2516039248, 2516039256 or 2516392056 bytes of available GA memory; one rank gives the current input line as "289: task tce energy")


Any help would be very very very appreciated. Thanks.

Keith McLaughlin
University of South Florida

Forum Vet
Threads 3
Posts 865
Keith,
The input you are using can be run with the TCE module if you increase the number of processors.
If you want to stick to 16 processors, you might want to switch to the older "CCSD" module, which
has smaller memory requirements.
I have been trying to reproduce the behavior of the input you are using, and
I have come up with the two input files below.
Please keep in mind that, because of the different memory requirements
for CCSD and the (T) part, you will have to use two different input files:
 
start ccsd
title "CCSD input"
memory stack 800 mb heap 100 mb global 750 mb
geometry
        C   2.12544   0.00000   0.00000
        O   1.82852  -0.93172  -0.62769
        O   2.42235   0.93172   0.62769
        C  -2.12544   0.00000   0.00000
        O  -1.20623  -0.32695  -0.63119
        O  -3.04465   0.32695   0.63119
end
basis
* library aug-cc-pvqz
end
scf
 direct
 thresh 1d-8
end
ccsd
 diisbas 2
 freeze atomic
 nodisk
 tol2e 1d-14
end
task ccsd


restart ccsd
title "CCSD(T) input"
memory stack 400 mb heap 100 mb global 950 mb
task ccsd(t)
Edited On 12:13:15 PM PST - Fri, Dec 26th 2014 by Edoapra

Just Got Here
Threads 1
Posts 4
Hi Edoapra, thanks for your reply.

It seems you're correct that the job will run if I request more cores. I'm now running on 64 cores, but I've hit a new error.

*snip*
Global array virtual files algorithm will be used
 Parallel file system coherency ......... OK
 Integral file          = ./n2.aoints.00
 Record size in doubles =  65536        No. of integs per rec  =  32766
 Max. records in memory =   1874        Max. records in file   = ******
 No. of bits per label  =     16        No. of bits per value  =     64
 #quartets = 2.929D+06 #integrals = 1.446D+09 #direct =  0.0% #cached =100.0%
File balance: exchanges=   254  moved=  1805  time=   0.1
 Fock matrix recomputed
 1-e file size   =           173056
 1-e file name   = ./n2.f1
 Cpu & wall time / sec           12.3           15.4
 4-electron integrals stored in orbital form
1: WARNING:armci_set_mem_offset: offset changed 0 to 26914816
33: WARNING:armci_set_mem_offset: offset changed 0 to 22720512
(rank:32 hostname:trestles-4-13.local pid:26022):ARMCI DASSERT fail. openib.c:armci_server_register_region():964 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 32:: Cannot allocate memory
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
*snip*


Any suggestions?
Edited On 11:24:59 PM PDT - Sat, Aug 24th 2013 by Kmclaugh

  • Karol Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Clicked A Few Times
Threads 1
Posts 31
Hi Keith,
You have two options:
1.) Run the TCE, but with other options for the 4-index transformation, which is currently causing problems. Instead of "2eorb", please use the sequence:
2eorb
2emet 13

If your job is still crashing, please use

2eorb
2emet 14
split 2

You may also make the "split" value bigger (for example "split 4", which means that the atomic 2-electron integrals will be divided into 4 batches, reducing the memory required to perform the 4-index transformation). The TCE code will also require more processors (according to my estimates, 128 or more should be fine). Please also use ARMCI_DEFAULT_SHMMAX=4096.

2.) Run the "old" spin-free version of CCSD(T) for the closed shell.

Best,
Karol

Just Got Here
Threads 1
Posts 4
Thanks for your help. I will try your suggestions.

Just Got Here
Threads 1
Posts 4
I'm still having some issues, but I have been able to get some smaller jobs to complete.

I don't quite understand the output. I'm trying to calculate the CBS-extrapolated interaction energy, but I noticed that with my current input file the interaction energy is not given. In Molpro I'd usually use the "dummy" command (to get the BSSE-corrected interaction energy), but I'm not sure how to do this in NWChem. Please advise.
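
For the CBS part, one common choice is a two-point inverse-cubic extrapolation of the correlation energy. The post does not say which scheme is intended, so the sketch below is an assumption rather than the poster's method; it takes the form E(X) = E_CBS + A / X**3, where X is the cardinal number of the basis set (3 for a triple-zeta set, 4 for a quadruple-zeta set). The function name and the numbers are placeholders.

```python
def cbs_two_point(e_x, e_y, x, y):
    """Two-point X**-3 extrapolation of a correlation energy.

    e_x and e_y are correlation energies obtained with basis sets of
    cardinal numbers x and y. Assumes E(X) = E_CBS + A / X**3.
    """
    if x == y:
        raise ValueError("cardinal numbers must differ")
    return (x**3 * e_x - y**3 * e_y) / (x**3 - y**3)

# Placeholder correlation energies in hartree, not results from this thread.
e_tz, e_qz = -0.500, -0.520
print(cbs_two_point(e_tz, e_qz, 3, 4))
```

The same function can be applied to the dimer and to each monomer separately, with the interaction energy then formed from the extrapolated values.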

Forum Vet
Threads 3
Posts 865
The documentation for running BSSE in NWChem is at
http://www.nwchem-sw.org/index.php/Top-level#TASK_Directive_for_BSSE_calculations

Clicked A Few Times
Threads 0
Posts 34
Quote:Kmclaugh Aug 25th 6:24 am

*snip*
(rank:32 hostname:trestles-4-13.local pid:26022):ARMCI DASSERT fail. openib.c:armci_server_register_region():964 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 32:: Cannot allocate memory
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
*snip*

Any suggestions?


ARMCI-MPI (wiki.mpich.org/armci-mpi/index.php/NWChem) eliminates all ARMCI-related segfaults on InfiniBand. If it doesn't run with ARMCI-MPI, you need more nodes, which is to say, you've exceeded the actual limit of memory. On the other hand, ARMCI-OPENIB segfaults for any number of reasons, many of which are not actually running out of memory.

Jeff

Clicked A Few Times
Threads 19
Posts 43
Quote:Edoapra Aug 23rd 9:22 pm
Keith
The input you are using can be run with the TCE module if you increase the number of processors.
*snip*


Hello Edoapra,
I just need to understand something.
How can the interaction energy between the two CO2 molecules be obtained from this input of yours?
I mean, which portion of the input defines that?
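
As far as Edoapra's two input files show, they contain a single dimer geometry and no bsse block, so they would give only the total energy of the dimer. A minimal sketch of the remaining step, assuming the two monomer energies come from separate runs at the same level of theory (the function name and the energies are placeholders):

```python
HARTREE_TO_KCAL_MOL = 627.5095  # approximate conversion factor

def interaction_energy(e_dimer, e_monomer_a, e_monomer_b):
    """Supermolecular interaction energy, in the units of the inputs.

    For a counterpoise-corrected value, both monomer energies must come
    from runs in the full dimer basis (ghost functions on the partner).
    """
    return e_dimer - e_monomer_a - e_monomer_b

# Placeholder energies in hartree, not results from this thread.
e_int = interaction_energy(-10.003, -5.000, -5.001)
print(f"{e_int * HARTREE_TO_KCAL_MOL:.2f} kcal/mol")
```

With the bsse block from the first post, NWChem is meant to do this bookkeeping itself; whether that works together with a given correlated module is something to check against the BSSE documentation linked above.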

