Memory problem on AIX

From NWChem


Clicked A Few Times
Threads 2
Posts 9
Dear All,

I've compiled NWChem-6.0 on an IBM machine, and both the serial and parallel versions work, except
when I set the memory to a value greater than 2 GB.

Here is the compilation and system info. The machine has 32 IBM power6 cores and 256 GB of memory.

$ uname -a 
AIX wcu02 3264193612 3 5 00C28FA44C00
$ xlf -qversion
IBM XL Fortran Enterprise Edition for AIX, V11.1
Version: 11.01.0000.0008
$ xlc -qversion 
IBM XL C/C++ Enterprise Edition for AIX, V9.0
Version: 09.00.0000.0007


I compiled the code with gmake 3.8 after setting the following variables:

setenv NWCHEM_TOP /home/user/Source/nwchem-6.0
setenv NWCHEM_TARGET IBM
setenv LD_LIBRARY_PATH /usr/lpp/ppe.poe/lib
setenv INCLUDE /usr/include
setenv USE_MPI y
setenv LARGE_FILES TRUE
setenv MPI_LIB /usr/lpp/ppe.poe/lib
setenv MPI_INCLUDE "/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include"
setenv LIBMPI "-binitfini:poe_remote_main -lmpi_r -lvtd_r -lpthreads"
setenv NWCHEM_MODULES all
setenv HAS_BLAS TRUE
setenv BLASOPT "-lessl"
gmake nwchem_config
gmake >& make.log &

If needed I can also upload my make.log file.

Now when I run the program with the memory set above 2 GB, I get the following error:

MA fatal error: MA_sizeof: invalid nelem: -1988100096

which is the only thing in the output file except the list of arguments used to run the job (in my case only the input file name).

Does anyone know what could be the source of the problem and how to solve it?
(recompile with other variables set/define additional variables when running?)

Thanks in advance,
Lukasz

Forum Vet
Threads 7
Posts 1296
NWCHEM_TARGET=IBM is a 32-bit platform
Lukasz,
Since you compiled with NWCHEM_TARGET=IBM, you have generated a 32-bit executable that cannot address more than 2 GB of memory. To overcome this limit, you have to generate a 64-bit binary using NWCHEM_TARGET=IBM64.

Cheers, Edo
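
The negative element count in the original error message is consistent with a byte count above 2 GB wrapping around in a 32-bit signed integer. Below is a minimal Python sketch of that arithmetic. It assumes a request of 2200 MB with 1 MB = 2**20 bytes; the thread does not state the value actually used, and the helper name as_int32 is purely illustrative.

```python
MB = 1024 * 1024  # assumption: "mb" in the memory directive means 2**20 bytes


def as_int32(value):
    """Reinterpret an integer as a 32-bit two's-complement signed value."""
    return (value + 2**31) % 2**32 - 2**31


requested_bytes = 2200 * MB        # hypothetical request just above 2 GB
print(requested_bytes)             # 2306867200
print(as_int32(requested_bytes))   # -1988100096, the nelem in the MA error
print(as_int32(2047 * MB))         # 2146435072, still representable
```

Any byte count of 2**31 or more becomes negative under this reinterpretation, which is why a 64-bit target is needed before larger memory settings can even be parsed into a valid size.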

Clicked A Few Times
Threads 2
Posts 9
Dear Edo,

Thanks for your immediate response. I recompiled the code with NWCHEM_TARGET=IBM64
(after performing make realclean) and it compiled without problems. Now there seems to be another
issue. I tried to run one of the examples shipped with NWChem, namely
$NWCHEM_TOP/examples/rimp2/hf-scf.nwc and I got the following output:

 argument  1 = hf-scf.nwc
              Northwest Computational Chemistry Package (NWChem) 6.0
              ------------------------------------------------------
                    Environmental Molecular Sciences Laboratory
                       Pacific Northwest National Laboratory
                                Richland, WA 99352
                              Copyright (c) 1994-2010
                       Pacific Northwest National Laboratory
                            Battelle Memorial Institute
             NWChem is an open-source computational chemistry package
                        distributed under the terms of the
                      Educational Community License (ECL) 2.0
             A copy of the license is included with this distribution
                              in the LICENSE.TXT file
                                  ACKNOWLEDGMENT
                                  --------------
            This software and its documentation were developed at the
            EMSL at Pacific Northwest National Laboratory, a multiprogram
            national laboratory, operated for the U.S. Department of Energy
            by Battelle under Contract Number DE-AC05-76RL01830. Support
            for this work was provided by the Department of Energy Office
            of Biological and Environmental Research, Office of Basic
            Energy Sciences, and the Office of Advanced Scientific Computing.
           Job information
           ---------------
    hostname      = wcu02
    program       = ../../bin/IBM64/nwchem
    date          = Tue Apr 17 06:19:13 2012
    compiled      = Tue_Apr_17_06:07:58_2012
    source        = /wcu/w01/Source/nwchem-6.0
    nwchem branch = 6.0
    input         = hf-scf.nwc
    prefix        = hf.
    data base     = ./hf.db
    status        = startup
    nproc         =        1
    time left     =     -1s
           Memory information
           ------------------
    heap     =   23107201 doubles =    176.3 Mbytes
    stack    =   23107201 doubles =    176.3 Mbytes
    global   =   46214400 doubles =    352.6 Mbytes (distinct from heap & stack)
    total    =   92428802 doubles =    705.2 Mbytes
    verify   = yes
    hardfail = no 
           Directory information
           ---------------------
  0 permanent = .
  0 scratch   = .
 ------------------------------------------------------------------------
  util_set_rtdb_state: rtdb_put failed      911
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: task scf
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 An error occured in the Runtime Database
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
 For further details see manual section:
rtdb_seq_put: put failed for "" in ./hf.db
Last System Error Message from Task 0:: A file or directory in the path name does not exist.
0:0: util_set_rtdb_state: rtdb_put failed:: 911
(rank:0 hostname:wcu02 pid:438646):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
ERROR: 0031-250  task 0: Terminated


I saw a post on this forum with a similar error but there was no solution given. Do you know
how this problem could be solved?

Thanks again for your time,
Cheers, Lukasz

Forum Vet
Threads 7
Posts 1296
Does any NWChem input file fail?
Lukasz,
Does any NWChem input file (e.g. $NWCHEM_TOP/src/nwchem.nw) fail with IBM64 or is this failure specific to hf-scf.nwc?

Thanks, Edo

Clicked A Few Times
Threads 2
Posts 9
I've tried 10 different jobs and in all cases I get the same error as in my previous post.

Forum Vet
Threads 7
Posts 1296
Please recompile the rtdb directory
Lukasz,
I would suggest using NWChem 6.1 (if possible), since it contains some IBM fixes (however, I don't see anything related to the rtdb problem you are seeing).

Could you please recompile the rtdb directory with no optimization and then relink?
Here are the instructions:

cd $NWCHEM_TOP/src/rtdb
make COPTIMIZE="-O0 -g"
cd ..
make FC=xlf link

Please let me know if this fixes the problem you are facing.

Cheers, Edo

Clicked A Few Times
Threads 2
Posts 9
Dear Edo,

I'm not using nwchem-6.1 since I cannot make the binary. Here are last few lines from my make.log
source='../ga-5-1/pario/elio/stat.c' object='pario/elio/stat.lo' libtool=yes \
DEPDIR=.deps depmode=aix /bin/sh ../ga-5-1/build-aux/depcomp \
/bin/sh ./libtool  --tag=CC   --mode=compile xlc -DHAVE_CONFIG_H -I. -I../ga-5-1     -I/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include  -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf    -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg     -c -o pario/elio/stat.lo ../ga-5-1/pario/elio/stat.c
libtool: compile:  xlc -DHAVE_CONFIG_H -I. -I../ga-5-1 -I/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg -c -M ../ga-5-1/pario/elio/stat.c -o pario/elio/stat.o
"../ga-5-1/pario/elio/stat.c", line 80.13: 1506-007 (S) "struct STATVFS" is undefined.
"../ga-5-1/pario/elio/stat.c", line 81.9: 1506-334 (S) Identifier bsize has already been defined on line 78 of "../ga-5-1/pario/elio/stat.c".
gmake[4]: *** [pario/elio/stat.lo] Error 1
gmake[3]: *** [all-recursive] Error 1
gmake[2]: *** [all] Error 2
gmake[1]: *** [build/.libs/libga.a] Error 1
gmake: *** [libraries] Error 1


But I think it should be addressed in a separate topic.

I followed your advice and recompiled rtdb in nwchem-6.0 without optimization. Now I can allocate up to 3400 mb;
allocating anything more results in an error:

 argument  1 = RbYb_RSC_CCSDT_09.00.inp
MA error: MA_init: could not allocate 1835008208 bytes
 ------------------------------------------------------------------------
 nwchem.F: ma_init failed (ga_uses_ma=F)      911
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: memory total 3500 mb
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
 For further details see manual section:
0:0:nwchem.F: ma_init failed (ga_uses_ma=F):: 911
(rank:0 hostname:wcu02 pid:471870):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: There is not enough memory available now.
ERROR: 0031-250  task 0: Terminated

I'm sure I didn't exceed the available memory limit since at the time of running the job there
was 200 GB available:
Total Memory        = 252672 mb
Memory              = 252672 mb
FreeRealMemory      = 194432 mb

I would be grateful for any suggestions.

Thanks, Lukasz

Forum Vet
Threads 7
Posts 1296
Lukasz,
I am quite clueless about why your memory allocation is failing under IBM64 (especially since
I have no access to an IBM64 platform, and things work OK under LINUX64).
Could you please try the following memory lines and tell me what happens?
Since NWChem uses local (a.k.a. MA) memory and global (GA) memory, I would like to see
what happens if you keep the GA memory small and keep increasing the MA memory.
Please try the following sequence (in separate input files, of course):

memory stack 1000 mb heap 300 mb global 250 mb

memory stack 1250 mb heap 300 mb global 250 mb

memory stack 1500 mb heap 300 mb global 250 mb

memory stack 1750 mb heap 300 mb global 250 mb

...
and so on by increasing the stack value until NWChem crashes.

Please let me know the outcome of this process, Edo

PS To fix the 6.1 tools compilation problem, you might want to use gcc instead of xlc as the C compiler by setting CC=gcc
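
The sequence of memory lines suggested above follows a simple pattern: heap and global stay fixed while the stack grows in 250 mb steps. A minimal Python sketch that generates those lines is shown here. The function name and its defaults are illustrative assumptions; only the first four lines are taken from the post itself.

```python
def memory_lines(start_mb=1000, step_mb=250, count=8, heap_mb=300, global_mb=250):
    """Return NWChem memory directives with a growing stack and fixed heap/global."""
    return [
        f"memory stack {start_mb + i * step_mb} mb heap {heap_mb} mb global {global_mb} mb"
        for i in range(count)
    ]


# Print the first four directives, matching the ones listed in the post.
for line in memory_lines(count=4):
    print(line)
```

Each generated line would replace the memory directive in a separate copy of the input file, and the first copy that fails marks the upper bound on local memory.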

Clicked A Few Times
Threads 2
Posts 9
Dear Edo,

Thanks again for your valuable advice. I ran the jobs you asked for, and the code already crashed for memory stack 1500 mb heap 300 mb global 250 mb, with a message similar to the previous one:
MA error: MA_init: could not allocate 1887437008 bytes
 ------------------------------------------------------------------------
 nwchem.F: ma_init failed (ga_uses_ma=F)      911
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: memory stack 1500 mb heap 300 mb global 250 mb
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
 For further details see manual section:
0:0:nwchem.F: ma_init failed (ga_uses_ma=F):: 911
(rank:0 hostname:wcu02 pid:356372):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: There is not enough memory available now.
ERROR: 0031-250  task 0: Terminated


Compiling nwchem-6.1 fails also with gcc (my gcc version is 4.2.0) when compiling the same
routine as previously, here's the last portion of the make.log
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../ga-5-1 -I/usr/lpp/ppe.poe/include/thread64 -I/usr/lpp/ppe.poe/include -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg -MT pario/elio/stat.lo -MD -MP -MF pario/elio/.deps/stat.Tpo -c ../ga-5-1/pario/elio/stat.c -o pario/elio/stat.o
../ga-5-1/pario/elio/stat.c: In function 'elio_stat':
../ga-5-1/pario/elio/stat.c:77: error: storage size of 'ufs_statfs' isn't known
gmake[4]: *** [pario/elio/stat.lo] Error 1
gmake[3]: *** [all-recursive] Error 1
gmake[2]: *** [all] Error 2
gmake[1]: *** [build/.libs/libga.a] Error 1
gmake: *** [libraries] Error 1


I have had trouble compiling other software on AIX in the past because the OS
lacks some of the standard Linux headers, libraries and commands, but I was always able to
figure out what was missing and install it. In this case I have no idea where the problem lies.

Anyway I would appreciate any piece of advice on how to get one of the versions of nwchem working.

Cheers, Lukasz

Forum Vet
Threads 7
Posts 1296
Lukasz,
The memory experiment showed that you cannot go beyond 1.8 GB of local memory, and I have no explanation for this, since we have not seen this problem on 64-bit Linux. Do you define ARMCI_NETWORK, by any chance?
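
The byte counts in the two failed MA_init messages can be decoded to see how much local memory was actually requested. A small Python sketch of the arithmetic follows, assuming 1 MB = 2**20 bytes. Reading the leftover 208 bytes as allocator bookkeeping is a guess and is not confirmed anywhere in the thread.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def split_request(nbytes):
    """Split a byte count into whole MB and leftover bytes."""
    return divmod(nbytes, MB)


print(split_request(1835008208))  # (1750, 208): half of "memory total 3500 mb"
print(split_request(1887437008))  # (1800, 208): stack 1500 mb + heap 300 mb
```

Both failing requests are for stack plus heap only and are below 2 GB, while the runs reported as working asked for at most 1700 MB (half of 3400 mb) of local memory.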

As far as the nwchem-6.1 compilation problem is concerned,
you need to edit $NWCHEM_TOP/src/tools/ga-5-1/pario/elio/eliop.h
and add the following 3 lines just after line 42:

#else
# include <sys/statvfs.h>
# define STATVFS statvfs

Or, in patch format:

$ svn diff
Index: eliop.h
=======================================================
--- eliop.h (revision 9865)
+++ eliop.h (working copy)
@@ -40,6 +40,9 @@
 # include <sys/vfs.h>
 # define STATVFS statfs
 # define NO_F_FRSIZE
+#else
+# include <sys/statvfs.h>
+# define STATVFS statvfs
 #endif
 
 #ifdef WIN32

Clicked A Few Times
Threads 2
Posts 9
Edo,

The patch works fine. Now the 6.1 version compiles, but I have the same problem as with 6.0: I cannot allocate more than 1.8 GB of local memory. I don't define ARMCI_NETWORK.

Cheers, Lukasz

Forum Vet
Threads 7
Posts 1296
Lukasz
Have you ever managed to use 2 GB of memory (or more) with any other program on your AIX system?
What is the output of "ulimit -a"?

Cheers, Edo
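
The per-process limits that "ulimit -a" reports can also be read programmatically. Here is a minimal sketch using Python's standard resource module, which is Unix-only. Whether Python is installed on the AIX machine in question is an assumption, and the helper names are illustrative.

```python
import resource


def fmt(limit):
    """Render a limit in bytes, or 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def memory_limits():
    """Return (soft, hard) limits for the data segment and the stack."""
    names = {"data": resource.RLIMIT_DATA, "stack": resource.RLIMIT_STACK}
    return {
        name: tuple(fmt(value) for value in resource.getrlimit(res))
        for name, res in names.items()
    }


for name, (soft, hard) in memory_limits().items():
    print(f"{name:5s} soft={soft} hard={hard}")
```

A soft data limit below the requested memory would make large allocations fail regardless of how much physical memory is free.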

Forum Vet
Threads 7
Posts 1296
Might have found the culprit
Lukasz
Please ignore the posting I made a few minutes ago, since I might have found the root cause of the problem.

The NWChem makefile structure uses a hardwired link option that limits the amount of memory to less than 2 GB (bmaxdata:0x80000000). To use, say, 8 GB, you would need to set bmaxdata:0x200000000.
This is set at line 933 of $NWCHEM_TOP/src/config/makefile.h
The line should be changed from
LDOPTIONS += -bmaxstack:0x80000000 -bmaxdata:0x80000000 # needed because of bigtoc
to
LDOPTIONS += -bmaxstack:0x80000000 -bmaxdata:0x200000000 # needed because of bigtoc

You do not need to recompile NWChem to do this; just re-link it by typing

make FC=xlf link

Let me know how it goes
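
The two hexadecimal values above are byte counts, and the conversion is easy to check. A short Python sketch follows; the helper maxdata_flag is an illustrative name, and it only formats the option string in the same form as the makefile line shown above.

```python
GB = 1024 ** 3  # 2**30 bytes


def maxdata_flag(gigabytes):
    """Format a -bmaxdata linker option for a limit of `gigabytes` GB."""
    return f"-bmaxdata:0x{gigabytes * GB:X}"


print(0x80000000 // GB)    # 2  (the hardwired limit)
print(0x200000000 // GB)   # 8  (the suggested replacement)
print(maxdata_flag(8))     # -bmaxdata:0x200000000
```

The same helper gives the value for other sizes, for example maxdata_flag(16) for a 16 GB limit.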

Clicked A Few Times
Threads 2
Posts 9
Edo,

It seems that this was the cause of the memory problem. After changing bmaxdata, all the
limitations on memory are gone. I ran a few test jobs (serial and parallel) and the program works fine.

Thanks again for your kind help.
Cheers, Lukasz

