Slurm Errors & FAQ¶
Here you can find an overview of common errors and how to fix them, as well as a list of pending job reasons and their meanings.
Job Submission Errors¶
Failure
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
Description / Solution:
- Please use the correct syntax for specifying your account and a partition that is valid for your compute time project.
# Syntax for account specification
#SBATCH --account=<project>
#SBATCH -A <project>
# Syntax for partition specification
#SBATCH --partition=<partition>
#SBATCH -p <partition>
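The same options can also be given on the sbatch command line when submitting, which overrides the directives in the script; for example (project, partition and script name are placeholders):
sbatch --account=<project> --partition=<partition> jobscript.sh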
Failure
sbatch: error: Batch job submission failed: Requested node configuration is not available
Description / Solution:
- Please make sure that the total number of CPUs you request actually exists on the requested nodes.
- Please make sure that the requested amount of memory per node is actually available on those nodes.
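As an orientation, a request that fits within a hypothetical node with 48 cores and 192 GB of memory could look like the sketch below; the actual limits depend on the nodes in your partition:
# Hypothetical example: two full nodes, assuming 48 cores and 192 GB of memory per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --mem=180G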
Job Failures¶
Failure
Job fails and generates no output files
Description / Solution:
- If the job cannot write its output to the files specified in the batch script, it will die and no output files will be created.
- The batch script has to specify a valid path to an existing directory, and the output file must be writable.
- Do not use $HOME, $WORK, ~ or similar variables in this path! Use only explicit full paths and Slurm filename patterns.
Please check typos and the correct syntax of:
#SBATCH --output=/path/to/file/file_name
#SBATCH --error=/path/to/file/file_name
#SBATCH -o /path/to/file/file_name
#SBATCH -e /path/to/file/file_name
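Slurm also supports filename patterns in these paths, e.g. %j for the job ID and %x for the job name, which prevents different jobs from overwriting the same file; for example (the directory is a placeholder):
#SBATCH --output=/path/to/dir/%x_%j.out
#SBATCH --error=/path/to/dir/%x_%j.err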
Failure
slurmstepd: error: JOB xxxxxxx ON XXXX CANCELLED AT XXXXXXXXXXXX DUE TO TIME LIMIT srun: error: XXXXX : task XXXX: Killed
Description / Solution:
- Job fails due to time limit
- Slurm allocates ONLY the requested resources defined in the submission batch script. If your program runs for longer than the originally requested time limit, it will be killed without warning and results from the computation will not be guaranteed.
Please ensure you request enough time (with the correct syntax) for your program to finish computations across all nodes:
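# Syntax for time limit specification
#SBATCH --time=<time>
#SBATCH -t <time>
# Accepted formats include mm, mm:ss, hh:mm:ss, d-hh, d-hh:mm and d-hh:mm:ss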
Failure
slurmstepd: error: Detected X oom-kill event(s) in XXXXXXXXXXXXX Some of your processes may have been killed by the cgroup out-of-memory handler. srun: error: XXXXX : task XXXX: Killed
Description / Solution:
- Job fails due to memory limit (oom-kill)
- Slurm allocates ONLY the requested resources defined in the submission batch script. If your program uses more memory than the originally requested memory limit per node, it will be killed without warning and results from the computation will not be guaranteed.
Please ensure you request enough memory (with the correct syntax) for your program's computations on each node:
#SBATCH --mem=<size>[units]
# Or, if memory should be requested per CPU:
#SBATCH --mem-per-cpu=<size>[units]
Different units can be specified using the suffix [K|M|G|T]; the default is M for megabytes.
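To get an idea of how much memory a completed job actually used, you can query the accounting database, for example (the job ID is a placeholder):
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State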
Failure
slurmstepd: error: Can't propagate RLIMIT_NPROC of XXXXXXX from submit host: Invalid argument
Description / Solution:
- Job runs but RLIMIT_NPROC is shown as an error
- This is a known problem with our current installation of Slurm and can be ignored. It should also not affect your program or results in any meaningful way.
Job Wait Statuses¶
The squeue command displays a pending reason for each waiting job; the most common reasons and their meanings are discussed below.
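The reason is shown, for example, in the %r field of squeue's output format (adjust the format string to your needs):
squeue -u $USER -o "%.12i %.20j %.8T %.10M %r"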
- None - The job is freshly queued and has not yet been considered by Slurm.
  - Solution: just wait for a real reason to appear.
- Resources - The job is waiting for the requested resources and will be next in line once they become free.
  - Solution: wait until the job is executed.
- Priority - There is at least one job with a higher priority ahead of it in the queue.
  - Solution: wait until the job is executed.
- Dependency - The job is waiting for a dependency to be fulfilled.
  - Solution: just wait; the job should run as soon as the dependency is fulfilled. You can also check the job's dependencies for errors.
- JobArrayTaskLimit - The maximum number of simultaneously running array tasks has been reached, e.g., if you used --array=0-15%4 and four tasks are running, the remaining eligible tasks will show this pending reason.
  - Solution: wait for the running array tasks to finish.
- AssocMaxWallDurationPerJobLimit - The requested wallclock time exceeds the maximum allowed for the job / partition / account. This job will never start.
  - Solution: delete the job, compare the corresponding "Max. allowed wallclocktime:" entry of the r_wlm_usage -q output, correct your script accordingly and resubmit the modified job.
- AssocMaxCpuPerJobLimit - The requested number of cores exceeds the maximum allowed for the job / partition / account. This job will never start.
  - Solution: delete the job, compare the corresponding "Max. allowed cores per job:" entry of the r_wlm_usage -q output, correct your script accordingly and resubmit the modified job (see the example below).