Slurm Errors & FAQ¶
Here you can find an overview of common errors and how to fix them, as well as a list of pending job reasons and their meanings.
Job Submission Errors¶
Failure
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
Description / Solution:
- Please use the correct syntax for specifying your account and a partition that is valid for your compute time project.
# Syntax for account specification
#SBATCH --account=<project>
#SBATCH -A <project>
# Syntax for partition specification
#SBATCH --partition=<partition>
#SBATCH -p <partition>
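The same options can also be given on the sbatch command line when submitting, which overrides the directives in the script; for example (project, partition and script name are placeholders):
sbatch --account=<project> --partition=<partition> jobscript.sh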
Failure
sbatch: error: Batch job submission failed: Requested node configuration is not available
Description / Solution:
- Please make sure that the total number of CPUs you request actually exists on the requested nodes.
- Please make sure that the requested amount of memory per node is actually available on those nodes.
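As an orientation, a request that fits within a hypothetical node with 48 cores and 192 GB of memory could look like the sketch below; the actual limits depend on the nodes in your partition:
# Hypothetical example: two full nodes, assuming 48 cores and 192 GB of memory per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --mem=180G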
Job Failures¶
Failure
Job fails and generates no output files
Description / Solution:
- If the job cannot write its output to the files specified in the batch script, it will die and no output files will be created.
- The batch script has to specify a valid path to an existing directory, and the output file must be writable.
- Do not use $HOME, $WORK, ~ or similar variables in this path! Use only explicit full paths and Slurm filename patterns.
Please check typos and the correct syntax of:
#SBATCH --output=/path/to/file/file_name
#SBATCH --error=/path/to/file/file_name
#SBATCH -o /path/to/file/file_name
#SBATCH -e /path/to/file/file_name
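Slurm also supports filename patterns in these paths, e.g. %j for the job ID and %x for the job name, which prevents different jobs from overwriting the same file; for example (the directory is a placeholder):
#SBATCH --output=/path/to/dir/%x_%j.out
#SBATCH --error=/path/to/dir/%x_%j.err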
Failure
slurmstepd: error: JOB xxxxxxx ON XXXX CANCELLED AT XXXXXXXXXXXX DUE TO TIME LIMIT srun: error: XXXXX : task XXXX: Killed
Description / Solution:
- Job fails due to time limit
- Slurm allocates ONLY the requested resources defined in the submission batch script. If your program runs for longer than the originally requested time limit, it will be killed without warning and results from the computation will not be guaranteed.
Please ensure you request enough time (with the correct syntax) for your program to finish computations across all nodes:
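# Syntax for time limit specification
#SBATCH --time=<time>
#SBATCH -t <time>
# Accepted formats include mm, mm:ss, hh:mm:ss, d-hh, d-hh:mm and d-hh:mm:ss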
Failure
slurmstepd: error: Detected X oom-kill event(s) in XXXXXXXXXXXXX Some of your processes may have been killed by the cgroup out-of-memory handler. srun: error: XXXXX : task XXXX: Killed
Description / Solution:
- Job fails due to memory limit (oom-kill)
- Slurm allocates ONLY the requested resources defined in the submission batch script. If your program uses more memory than the originally requested memory limit per node, it will be killed without warning and results from the computation will not be guaranteed.
Please ensure you request enough memory (with the correct syntax) for your program's computations on each node:
#SBATCH --mem=<size>[units]
# Or, if memory should be requested per CPU:
#SBATCH --mem-per-cpu=<size>[units]
Different units can be specified using the suffix [K|M|G|T]; the default is M for megabytes.
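To get an idea of how much memory a completed job actually used, you can query the accounting database, for example (the job ID is a placeholder):
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State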
Failure
slurmstepd: error: Can't propagate RLIMIT_NPROC of XXXXXXX from submit host: Invalid argument
Description / Solution:
- Job runs but RLIMIT_NPROC is shown as an error
- This is a known problem with our current installation of Slurm and can be ignored. It should also not affect your program or results in any meaningful way.
Job Wait Statuses¶
The squeue command displays a pending reason for each waiting job; the most common reasons and their meanings are discussed below.
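The reason is shown, for example, in the %r field of squeue's output format (adjust the format string to your needs):
squeue -u $USER -o "%.12i %.20j %.8T %.10M %r"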
- None - The job is freshly queued and has not yet been considered by Slurm.
  - Solution: just wait for a real reason to appear.
- Resources - The job is waiting for the requested resources and will be next in line once they become free.
  - Solution: wait until the job is executed.
- Priority - There is at least one job with a higher priority ahead of it in the queue.
  - Solution: wait until the job is executed.
- Dependency - The job is waiting for a dependency to be fulfilled.
  - Solution: just wait; the job should run as soon as the dependency is fulfilled. You can also check the job's dependencies for errors.
- JobArrayTaskLimit - The maximum number of simultaneously running array tasks has been reached, e.g., if you used --array=0-15%4 and four tasks are running, the remaining eligible tasks will show this pending reason.
  - Solution: wait for the running array tasks to finish.
- AssocMaxWallDurationPerJobLimit - The requested wallclock time exceeds the maximum allowed for the job / partition / account. This job will never start.
  - Solution: delete the job, compare the corresponding "Max. allowed wallclocktime:" entry of the r_wlm_usage -q output, correct your script accordingly and resubmit the modified job.
- AssocMaxCpuPerJobLimit - The requested number of cores exceeds the maximum allowed for the job / partition / account. This job will never start.
  - Solution: delete the job, compare the corresponding "Max. allowed cores per job:" entry of the r_wlm_usage -q output, correct your script accordingly and resubmit the modified job (see the example below).