Is there a systematic way of finding out how much memory is needed?

From NWChem


Just Got Here
Threads 1
Posts 4
Dear NWChem users,
I'm using NWChem on an InfiniBand cluster and struggling with memory problems when doing TDDFT. The input is:

Title "dye2Nex"
Start dye2Nex
set fock:replicated logical .false.

permanent_dir /Data/Users/syesylevsky/QM/dye2/N
memory total 400 mb

echo
charge 0

geometry noautosym units angstrom
C     0.00000     0.00000     0.00000
C     1.36800     0.00000     0.00000
C    -0.77400     1.26900     0.00000
C     0.05600     2.48300     0.00300
O     2.11800     1.15900    -0.00500
O    -0.65200    -1.20200     0.00400
C     2.28500     3.49300     0.00900
C     1.70000     4.74800     0.01600
C     0.30900     4.88600     0.01300
C    -0.50700     3.76700     0.00500
O    -1.99700     1.27200     0.00200
C     1.45200     2.36000     0.00300
H    -1.58400    -1.02300     0.05500
H     3.37500     3.38100     0.00500
H     2.33400     5.64100     0.02400
H    -0.13500     5.88700     0.01600
H    -1.59900     3.87300    -0.00100
C     2.22300    -1.17800    -0.00300
C     4.14100    -2.24800     0.31300
O     3.55400    -0.99900     0.41400
C     5.46200    -2.57000     0.62200
C     5.82700    -3.89700     0.44300
C     4.91700    -4.85600    -0.02400
C     3.60400    -4.52700    -0.33000
C     1.97000    -2.48200    -0.35600
C     3.20900    -3.20300    -0.15800
H     6.16900    -1.81700     0.98400
H     5.25600    -5.89000    -0.14900
H     2.89600    -5.27800    -0.69300
H     1.03900    -2.91400    -0.71700
H     6.85300    -4.20500     0.67200
end

ecce_print ecce.out

basis "ao basis" spherical print
H library "3-21G"
O library "3-21G"
C library "3-21G"
END

dft
 mult 1
XC b3lyp
iterations 5000
mulliken
direct
end

driver
 default
maxiter 2000
end

tddft
 nroots 3
target 1
end

task tddft optimize


When I run this I get the following error:

2: error ival=5
(rank:2 hostname:mesocomte87 pid:9679):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
1: error ival=10
(rank:1 hostname:mesocomte65 pid:18582):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_send_complete():459
cond:(pdscr->status==IBV_WC_SUCCESS)
5: error ival=10
(rank:5 hostname:mesocomte19 pid:20956):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
0:Terminate signal was sent, status=: 15
(rank:0 hostname:mesocomte21 pid:30562):ARMCI DASSERT fail.
../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0


As advised on this forum, I set
export ARMCI_DEFAULT_SHMMAX=2048
but this does not help. I spent a lot of time trying different memory values and finally got it working with

memory stack 150 mb heap 50 mb global 200 mb

but this was blind guesswork, which I really don't want to repeat for every new system or basis level.

EDIT: it crashed after a few hours. I still can't get it running.

Is there a good systematic way of finding out how much memory a particular job needs to run normally in a parallel environment? Which diagnostic messages should I use for this?

Thank you very much in advance!

Semen
Edited On 5:34:33 AM PST - Tue, Nov 13th 2012 by Yesint

Clicked A Few Times
Threads 4
Posts 13
I had some related problems recently. How much memory do you have on your system? Try increasing the total memory drastically.

For an 8 processor job, I use:

memory total 22 gb

Just Got Here
Threads 1
Posts 4
Quote:Andrew.yeung Nov 13th 9:30 am
I had some related problems recently. How much memory do you have on your system? Try increasing total memory drastically?

For an 8 processor job, I use:

memory total 22 gb

In principle I can ask for up to 12 GB per process, but then this job will stay in the queue forever (it will saturate the nodes completely and get very low priority). My objective is to allocate just enough to get it running while keeping the waiting time reasonable. My molecule is rather small and runs in under 1 GB of memory on 1 CPU, but I can't work out how to estimate memory consumption in parallel mode.

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
How memory allocation works in NWChem
Let's start at the beginning:

A. The memory keyword in the input specifies the memory per process (generally per processor core), NOT per job.

Hence, if you specify "memory total 22 gb" with 8 processes on one node, you are asking for 176 GB on that node to make the job run.

B. When you specify "memory total xxx mb", the amount xxx is split into 25% heap, 25% stack, and 50% global.

 Heap: For most applications the heap is not important and can be a much smaller block of memory. We generally set it to at most 100 mb when specifying it explicitly.

 Stack: Effectively the local memory each process uses for its calculations.

 Global: Memory used to store globally accessible arrays. Effectively each node holds a block of <size global> times <# of processes used on the node>, which can get very big.
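The default split and the per-node global block can be sketched in a few lines (plain Python for illustration, not NWChem code; the function name is mine):

```python
# Sketch of the default split of "memory total X" described above:
# 25% heap, 25% stack, 50% global -- all per process.
def default_split(total_mb):
    heap = total_mb // 4   # 25% heap
    stack = total_mb // 4  # 25% stack
    glob = total_mb // 2   # 50% global
    return heap, stack, glob

# The original input's "memory total 400 mb" gives each process:
print(default_split(400))  # (100, 100, 200)
```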

C. When specifying memory explicitly, I recommend you use the format:

   memory heap 100 mb stack 1000 mb global 2400 mb

This example makes 3500 mb (3.5 GB) available per process and requires 3.5 GB times the number of processes on the node to be physically available. You cannot use virtual memory. You also need to leave space for the OS; we use the above example when we have 8 processors and 32 GB of memory per node.
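Because the memory keyword is per process, the physical memory a node needs follows directly. A small sketch (illustrative Python, not part of NWChem):

```python
# Physical memory a node needs for an explicit NWChem memory line:
# the per-process total times the number of processes on that node.
def node_requirement_mb(heap_mb, stack_mb, global_mb, procs_per_node):
    return (heap_mb + stack_mb + global_mb) * procs_per_node

# The example above ("heap 100 mb stack 1000 mb global 2400 mb") on 8 processes:
print(node_requirement_mb(100, 1000, 2400, 8))  # 28000 MB, i.e. 28 GB,
# which fits in a 32 GB node with room left for the OS.
```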

D. How much memory does the calculation need? The amount and distribution of stack and global memory needed is strongly dependent on the application. Generally an equal distribution works fine to start with. The code will indicate if it runs out of local or global memory, and you can then redistribute. For coupled cluster (TCE) calculations you will generally need more global than stack memory (the above example is a TCE-style input). Tiling is important for TCE, to reduce local memory requirements.

E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX. On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
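The rule of thumb in point E amounts to a simple product (illustrative Python; the example values are made up, not from this thread's cluster):

```python
# Bert's rule of thumb for ARMCI_DEFAULT_SHMMAX (in MB):
# global memory per process times the processes actually used on the node.
def armci_shmmax_mb(global_per_process_mb, procs_used_on_node):
    return global_per_process_mb * procs_used_on_node

# e.g. with "global 200 mb" and only 4 processes placed on the node:
print(armci_shmmax_mb(200, 4))  # 800, i.e. export ARMCI_DEFAULT_SHMMAX=800
```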

Hope this helps,

Bert

Clicked A Few Times
Threads 4
Posts 13
Thanks for correcting my mistake, Bert.

Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?

Clicked A Few Times
Threads 2
Posts 8
Hi Bert!

Thanks for your post, but I still have a question about the ARMCI_DEFAULT_SHMMAX.

Suppose I use 2 nodes with 16 cores each; each core has 4 GB of memory, and I specify:
memory heap 100 mb stack 400 mb global 3200 mb

That is to say, I use 3200 MB × 16 = 51200 MB of global memory per node.
If I set
setenv ARMCI_DEFAULT_SHMMAX 51200

this warning comes out:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200

Do you know what the problem is?

Thank you!



Quote:Bert Nov 14th 1:20 pm
E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX. On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.

Hope this helps,

Bert

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
Simply because most codes do not use that much stack memory, so it would be wasted.

Bert

Quote:Andrew.yeung Nov 14th 11:27 pm
Thanks for correcting my mistake, Bert.

Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
Yes, the code currently has some internal limits. Hence, you cannot set it to more than 8000 mb, mainly because this was based on fewer cores per node. I'll look at having this updated and tested.
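Taken together with the warning Psd quoted ("should be <1,8192>mb and 2^N"), a valid setting has to be a power of two no larger than the internal limit. A hypothetical helper (not part of NWChem or ARMCI) that clamps a request to the largest valid value:

```python
# Clamp a requested ARMCI_DEFAULT_SHMMAX (MB) to the largest power of two
# within the reported limit. Hypothetical helper, not part of NWChem/ARMCI.
def clamp_shmmax(requested_mb, limit_mb=8192):
    value = 1
    while value * 2 <= min(requested_mb, limit_mb):
        value *= 2
    return value

print(clamp_shmmax(51200))  # 8192: the 51200 MB request exceeds the limit
print(clamp_shmmax(800))    # 512: largest power of two not above 800
```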

I would suggest you do not set the stack that small if you want to run coupled cluster calculations; it will be more expensive as you are forced to use smaller blocks.

Bert


Quote:Psd Nov 16th 8:54 am
Hi Bert!

Thanks for your post, but I still have a question about the ARMCI_DEFAULT_SHMMAX.

Suppose I use 2 nodes with 16 cores each; each core has 4 GB of memory, and I specify:
memory heap 100 mb stack 400 mb global 3200 mb

That is to say, I use 3200 MB × 16 = 51200 MB of global memory per node.
If I set
setenv ARMCI_DEFAULT_SHMMAX 51200

this warning comes out:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200

Do you know what the problem is?

Thank you!




Just Got Here
Threads 1
Posts 4
I understand these things in theory, but in practice I still can't get it working. Currently I have

memory total 4000 mb

It runs for a few hours and then fails. The end of the log is the following:


           Memory Information
------------------
Available GA space size is 524244319 doubles
Available MA space size is 65513497 doubles
Length of a trial vector is 9864
Algorithm : Incore multiple tensor contraction
Estimated peak GA usage is 182779852 doubles
Estimated peak MA usage is 6600 doubles

3 smallest eigenvalue differences (eV)


 No. Spin  Occ  Vir  Irrep   E(Vir)    E(Occ)   E(Diff)


   1    1   72   73 a        -0.071    -0.208     3.744
   2    1   71   73 a        -0.071    -0.239     4.578
   3    1   70   73 a        -0.071    -0.245     4.747



 Entering Davidson iterations
Restricted singlet excited states

 Iter   NTrls   NConv    DeltaV     DeltaE      Time   
---- ------ ------ --------- --------- ---------
0: error ival=-1
(rank:0 hostname:mesocomte68 pid:30430):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_rdma_strided_to_contig():3239 cond:(rc==0)


As far as I can see from the Memory Information, I have a lot of free memory, but it still fails. Could you please tell me what's wrong? I wonder what armci_server_rdma_strided_to_contig() is...

  • Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
It's not clear; it seems to be related to the system. I would try to reduce the memory footprint. The output suggests you do not need that much memory in the first place.

Doing the numbers, it looks like you are running on 2 processor cores, with each core on a different node connected by IB? How many cores and how much memory do you have per node? You may be able to run this on a single node.

Bert


Quote:Yesint Nov 18th 8:36 am
I understand these things in theory, but in practice I still can't get it working. Currently I have

memory total 4000 mb

It runs for a few hours and then fails with the same ARMCI DASSERT fail in armci_server_rdma_strided_to_contig().

Just Got Here
Threads 1
Posts 4
Quote:Bert Nov 18th 7:09 am
It's not clear; it seems to be related to the system. I would try to reduce the memory footprint. The output suggests you do not need that much memory in the first place.

Doing the numbers, it looks like you are running on 2 processor cores, with each core on a different node connected by IB? How many cores and how much memory do you have per node? You may be able to run this on a single node.

Bert


It runs over IB, one core per node. Each node has at least 12 GB of RAM. I'll try putting it on a single node, but this is not what we want to do normally.

