ECCE and SLURM batch system


Clicked A Few Times
Threads 2
Posts 11
Our supercomputer administrators have recently switched from PBS to SLURM. For now they are supporting PBS submissions to SLURM, but do not know their long-term plans for it. How difficult is it to add a new queueing system? Where do I find the scripts to make it happen?

Matthew Asplund

Clicked A Few Times
Threads 2
Posts 11
Follow-up to my own post
I have edited the QueueManagers file to add a new set of SLURM commands, but I am not certain whether I also have to edit something to make parsing of the output from the SLURM commands work.
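For anyone following along, a SLURM block modelled on the existing PBS entry in that file might look roughly like the lines below; the key names, the ##script##/##id## placeholders and the jobIdParseExpression value are assumptions copied from the style of the other entries rather than a verified configuration, so compare them against your own copy of the file.

SLURM|submitCommand:        sbatch ##script##
SLURM|cancelCommand:        scancel ##id##
SLURM|queryJobCommand:      squeue
SLURM|queryMachineCommand:  sinfo
SLURM|queryQueueCommand:    squeue -a
SLURM|queryDiskCommand:     df
SLURM|jobIdParseExpression: Submitted batch job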

Matthew Asplund

Gets Around
Threads 14
Posts 111
Matthew,
Let me know how it goes. I'm (slowly) working on setting up SLURM on my cluster (Debian Jessie no longer packages SGE) and will try to get ECCE working with it.

Gets Around
Threads 14
Posts 111
I've set up slurm on my cluster and have configured ECCE to work with it. See here: [1]

It works, but can probably be improved upon.

Clicked A Few Times
Threads 2
Posts 11
I actually edited the submit.site file to add explicit support for SLURM by adding the following lines:


SLURM {
#SBATCH --time=$wallTime
#SBATCH --ntasks=$totalprocs
#SBATCH --nodes=$nodes
#SBATCH -C 'avx'
#SBATCH --mem-per-cpu=4096M
}
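(The $wallTime, $totalprocs and $nodes variables are presumably the values ECCE substitutes from the launcher settings; the -C 'avx' constraint and the per-CPU memory limit are site-specific choices rather than anything SLURM or ECCE requires.)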

I am still having problems with job monitoring, so I will try applying your changes to eccejobmonitor in my installation.

Gets Around
Threads 14
Posts 111
Matt,
the key to getting the job monitoring to work is to edit apps/scripts/eccejobmonitor.
Beware that $q contains the name of the queue manager in lower case, regardless of how you've defined it in QueueManagers.
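If it helps, the SLURM queries that such a branch ends up issuing are roughly the ones below; the squeue/sacct options are standard SLURM, but how eccejobmonitor wraps them around its own state handling is left as an assumption here.

# is job <id> still queued or running? prints a state such as PENDING or RUNNING,
# and nothing once the job has left the queue
squeue -h -j <id> -o %T
# after squeue stops listing it, ask the accounting records how it finished
sacct -n -j <id> --format=State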

Other than that, it was pretty straightforward (setting up SLURM itself was a bigger challenge), and I've been using it for a day and a bit now without issue.
Edited On 4:30:47 PM PDT - Wed, Jul 29th 2015 by Ohlincha

Clicked A Few Times
Threads 2
Posts 11
So, I stopped playing with this, but am getting back to it. My problem right now is that I get the error "Unable to parse job id. Cannot monitor job." when I submit things. When I run the sbatch command to submit a job myself, it returns the output "Submitted batch job 9488438" (or whatever the job ID is). I tried writing a wrapper script to reduce the output to just the job id, but that didn't help. Is there a way to trace what is actually happening during the submit process? I tried setting the ECCE_DEBUG and ECCE_RCOM_LOGMODE environment variables, but that just outputs the ssh communication.
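For what it's worth, sbatch has a --parsable option that prints just the job id (followed by ";cluster" on multi-cluster setups), so a wrapper along the lines of the sketch below is about as small as it gets; whether ECCE then accepts the bare number still depends on how the job id is parsed on the ECCE side, which is the assumption here.

#!/bin/sh
# hypothetical sbatch wrapper: pass all arguments through, print only the numeric job id
sbatch --parsable "$@" | cut -d';' -f1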

Gets Around
Threads 14
Posts 111
Add a
#SBATCH --output=slurm.out

line so that messages get logged.

I read this as the submission failing, i.e. the jobs never run?

Log onto the submit node and run the submit_xxxxxx file manually. See what happens and if it runs. You might be able to narrow it down to either communication issues or something to do with slurm.
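In practice that boils down to something like the following on the submit node (the submit_xxxxxx name and the run directory are whatever ECCE created for the job, and slurm.out assumes the --output line above):

cd <the job's run directory>
sbatch submit_xxxxxx   # does sbatch accept it and print a job id?
squeue -u $USER        # is the job queued or running?
cat slurm.out          # any errors from the #SBATCH directives or the run itself?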

Clicked A Few Times
Threads 2
Posts 11
Actually, the jobs submit and run just fine, but I get an error
ERROR: Unable to parse job id. Cannot monitor job.
WARNING: Launch aborted...

So it is in the submit step, where the job id gets parsed, that things are failing.
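A quick way to see exactly what has to be parsed is to capture the sbatch output and pull the id out of the "Submitted batch job 9488438" line yourself; the grep below only illustrates that parse step, it is not what ECCE itself runs.

sbatch submit_xxxxxx | tee submit.log
# -> Submitted batch job 9488438
grep -oE '[0-9]+' submit.log   # should print just 9488438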

