Poll Job Runtime Stats
sstat
This command lets a user pull up status information about their running jobs: CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM).
Basic usage is as follows, but the --format flag can and should also be used to limit output to relevant details.
sstat --jobs=your_job-id
Some relevant --format variables are listed in the table below:
Variable | Description |
---|---|
avecpu | Average (system + user) CPU time of all tasks in job. |
averss | Average resident set size of all tasks in job. |
jobid | The number of the job or step. |
maxdiskwrite | Maximum number of bytes written by all tasks in job. |
ntasks | Total number of tasks in a job or step. |
See the example below with the --format flag, which prints only the job's ID, average CPU time, and number of tasks.
sstat --jobs=your_job-id --format=jobid,avecpu,ntasks
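To watch a running job over time, the same sstat call can be repeated at intervals. The following is a minimal sketch, not part of the original documentation: the helper name `poll_sstat`, its default interval and sample count, and the chosen field list are all assumptions for illustration.

```shell
# poll_sstat JOBID [INTERVAL_SECONDS] [COUNT]
# Hypothetical helper: print sstat output for a running job COUNT times,
# waiting INTERVAL_SECONDS between samples.
# The body runs in a subshell ( ... ) so its variables stay local.
poll_sstat() (
    job_id=$1
    interval=${2:-30}   # assumed default: 30 seconds between samples
    count=${3:-5}       # assumed default: 5 samples
    if [ -z "$job_id" ]; then
        echo "usage: poll_sstat JOBID [INTERVAL_SECONDS] [COUNT]" >&2
        exit 2
    fi
    i=0
    while [ "$i" -lt "$count" ]; do
        sstat --jobs="$job_id" --format=jobid,avecpu,averss,ntasks
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
)
```

For example, `poll_sstat your_job-id 60 10` would sample the job once a minute, ten times.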
sacct
This command allows a user to easily pull up information about past jobs that have completed.
Specify either a job ID or a username with the --jobs or --user flag, respectively, to pull up all information on a job:
sacct --jobs=<jobid[,additional_jobids]>
Some available --format variables are listed in the table below, and may be passed as a comma-separated list:
Variable | Description |
---|---|
account | Account the job ran under. |
alloctres | Allocated trackable resources (e.g. cores/RAM). |
avecpu | Average CPU time of all tasks in job. |
cputime | Formatted (elapsed time * core count) used. |
elapsed | Job's elapsed time, formatted as DD-HH:MM:SS. |
state | The job’s state. |
jobid | The ID of the job. |
jobname | The name of the job. |
maxdiskread | Maximum number of bytes read. |
maxdiskwrite | Maximum number of bytes written. |
maxrss | Maximum RAM use of all job tasks. |
ncpus | The number of allocated CPUs. |
nnodes | The number of allocated nodes. |
ntasks | Number of tasks in a job. |
priority | Slurm priority. |
qos | Quality of service. |
user | Username of the person who ran the job. |
Examples for better understanding job hardware utilization
Note that by default, only jobs run on the current day will be listed. To search within a different period of time, use the --starttime flag. The --long flag can also be used to show a non-abbreviated version of sacct output. For example, to list detailed job characteristics for a user’s jobs since December 15th, 2020:
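A sketch of such a command, wrapped in a helper function so the date can be changed easily. The function name `sacct_since` and the use of $USER for the username are assumptions for illustration; only the --user, --starttime, and --long flags come from the text above.

```shell
# sacct_since YYYY-MM-DD
# Hypothetical helper: show the long (non-abbreviated) accounting output
# for the current user's jobs that started on or after the given date.
sacct_since() {
    if [ -z "$1" ]; then
        echo "usage: sacct_since YYYY-MM-DD" >&2
        return 2
    fi
    sacct --user="$USER" --starttime="$1" --long
}
```

Calling `sacct_since 2020-12-15` runs `sacct --user=$USER --starttime=2020-12-15 --long`.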
This produces a lot of output. As an example for formatted output, the following complete command will list information about jobs that ran today for a user, specifically information about the job’s id, average CPU use, maximum amount of RAM (memory) used, the core time (wall time multiplied by number of cores allocated), and the job’s state:
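The fields described above correspond to the sacct format names jobid, avecpu, maxrss, cputime, and state. A sketch of the command, assuming $USER holds the username and using a hypothetical wrapper name `job_usage` so that extra flags can be appended:

```shell
# job_usage [extra sacct flags...]
# Hypothetical helper: list job ID, average CPU time, maximum memory used,
# core time, and state for the current user's jobs (today's jobs by default).
job_usage() {
    sacct --user="$USER" --format=jobid,avecpu,maxrss,cputime,state "$@"
}
```

With no arguments this lists today's jobs; `job_usage --starttime=2020-12-15` widens the window.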
The above command, in conjunction with appropriate --starttime filtering, is very useful for working out more efficient hardware requests for future jobs. For instance, if maxrss is 1 GB, then the default memory allocated to a job (4 GB+) is more than sufficient.
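The comparison between memory used and memory allocated can be made concrete with a little arithmetic. The helpers below are an illustrative sketch: the names `to_mb` and `mem_percent` are hypothetical, and they assume memory values carry a K, M, G, or T suffix in binary multiples (a value with no suffix is assumed to be plain bytes).

```shell
# to_mb VALUE
# Convert a memory value such as 2048K, 512M, or 1.5G to megabytes.
# Assumption: binary multiples; a value without a suffix is treated as bytes.
to_mb() {
    printf '%s\n' "$1" | awk '{
        unit = substr($0, length($0), 1)   # last character is the unit suffix
        num = $0 + 0                       # numeric coercion drops the suffix
        if (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else mb = num / (1024 * 1024)
        printf "%.1f\n", mb
    }'
}

# mem_percent USED ALLOCATED
# Print the used memory as a whole-number percentage of the allocated memory.
mem_percent() {
    used=$(to_mb "$1")
    allocated=$(to_mb "$2")
    awk -v u="$used" -v a="$allocated" 'BEGIN {
        if (a <= 0) exit 1                 # avoid dividing by zero
        printf "%.0f\n", 100 * u / a
    }'
}
```

For the case in the text, `mem_percent 1G 4G` prints 25, i.e. the job used a quarter of its allocation.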
Another useful format field is allocTRES%42, which prints the allocated “trackable resources” associated with the job with a width of 42 characters; e.g. billing=1,cpu=1,mem=4G,node=1 would be printed for a 1-core job. The allocTRES field is helpful for comparing against the avecpu and maxrss values, for instance.
If a + is listed at the end of a field, then that field has likely been truncated to fit into a fixed number of characters. Consider increasing the width by appending a % followed by a number to specify a new width. For example, allocTRES%42 overrides the default width to 42 characters.
mysacct
For convenience, the command mysacct has been added to the system. This is equivalent to sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state and accepts the same flags that sacct would, e.g. --starttime=YYYY-MM-DD or --endtime=YYYY-MM-DD.
https://slurm.schedmd.com/sstat.html
https://slurm.schedmd.com/sacct.html
man sacct