Pulling Job Statistics
Overview
This page helps researchers pull system usage information about their running and completed jobs.
Running Jobs
For currently running jobs, seff and sacct will not report accurate statistics. To see the current CPU, memory, and GPU usage, you need to connect to the node the job is running on.
If you have an sbatch script running, you can use the myjobs command to find the node the job is running on. Look for the NODELIST column.
[jeburks2@login02:~]$ myjobs
JobID ... PARTITION/QOS NAME STATE ... Node/Core/GPU NODELIST(REASON)
11273558 ... general/public myjob RUNNING ... 1/1/NA c008
In the example above, the job is running on node c008. We will then connect directly to that node with ssh.
You can only ssh to nodes where you have a job running; otherwise the ssh connection will fail. When you ssh to a node, you join the cgroup on that node that runs your job.
[jeburks2@login02:~]$ ssh c008
[jeburks2@c008:~]$
Notice the bash prompt changed from username@login02 to username@c008, indicating that we are now on node c008. We can now use the following commands to view information about our job:
top -u $USER  # Shows the CPU and memory usage of your processes
This shows your processes and their CPU and memory usage. CPU usage is a percentage, where 100% CPU equals 1 CPU core. A process using 8 cores may therefore show 800%, or top may list 8 processes at 100% each.
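As a rough cross-check of the percentage-to-cores rule, the sketch below sums a column of %CPU values and divides by 100. The function name cpu_percent_to_cores is hypothetical. Feeding it from ps is an assumption about how you might collect the numbers, and ps reports CPU usage averaged over each process's lifetime, so it will not match the instantaneous values in top exactly.

```shell
# Hypothetical helper: read one %CPU value per line from stdin and print
# the approximate number of cores in use (100% CPU = 1 core).
cpu_percent_to_cores() {
    awk '{ total += $1 } END { printf "%.2f\n", total / 100 }'
}

# On the compute node, feed it the %CPU column of your own processes:
#   ps -u "$USER" -o pcpu= | cpu_percent_to_cores
```

A single process at 800%, or eight processes at 100% each, both come out as 8.00 cores.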
Press q to quit top.
If the job uses a GPU, run nvtop on the node to see the GPU usage percentage and GPU memory usage.
Press F10 to exit nvtop.
When done viewing job statistics, type exit to return to the login node.
Completed Jobs
Once a job has completed, been canceled, or failed, pulling its statistics is simple. There are two main commands for this: seff and mysacct.
Seff
seff is short for “Slurm efficiency” and displays the percentage of CPU and memory a job used relative to how long the job ran. The goal is high efficiency, so that jobs do not allocate resources they do not use.
Example of seff for an inefficient job
This shows the job was allocated a CPU for 7 minutes but only used it for 59 seconds, resulting in a CPU efficiency of 12%. It did, however, make use of its memory.
Example of seff for a CPU efficient job
In this example, the job used all four cores it was allocated for 98% of the time the job ran. The core-wall time is the number of CPU cores multiplied by the length of the job, so this 15-minute job with 4 CPUs had a core-wall time of 1:00:00. However, the memory efficiency is rather low. This tells us that if we run this job again, we can allocate less memory, which reduces the impact on our fair share and uses the system more efficiently.
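The arithmetic behind these figures can be sketched in a few lines of shell. The function names core_walltime and cpu_efficiency are hypothetical, and the 3528 CPU-seconds figure is an assumed value chosen to reproduce the 98% in the example, not a number taken from real seff output.

```shell
# core_walltime CORES ELAPSED_SECONDS
# Print core-wall time (cores * elapsed time) as HH:MM:SS.
core_walltime() {
    awk -v cores="$1" -v elapsed="$2" 'BEGIN {
        total = cores * elapsed
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# cpu_efficiency CPU_SECONDS_USED CORES ELAPSED_SECONDS
# Print the CPU time used as a percentage of core-wall time.
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v elapsed="$3" 'BEGIN {
        printf "%.1f%%\n", 100 * used / (cores * elapsed)
    }'
}

core_walltime 4 900          # 15-minute job on 4 cores: prints 01:00:00
cpu_efficiency 3528 4 900    # assumed 3528 CPU-seconds used: prints 98.0%
```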
Note: seff does not display statistics for GPUs, so a GPU-heavy job will likely have inaccurate seff results.
sacct / mysacct
The sacct and mysacct commands let a user easily pull up information about past jobs.
Specify either a job ID or a username with the --jobs or --user flag, respectively, to pull up all information on a job:
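A minimal sketch of the two forms follows. The wrapper names sacct_job and sacct_user are hypothetical and exist only to make the example self-contained; the job ID is the one from the myjobs example earlier on this page.

```shell
# Hypothetical wrappers around the two sacct forms described above.

# Accounting records for a single job ID.
sacct_job() {
    sacct --jobs="$1"
}

# Accounting records for one user's jobs (defaults to the current user).
sacct_user() {
    sacct --user="${1:-$USER}"
}

# Usage on a login node:
#   sacct_job 11273558
#   sacct_user
```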
Some available --format variables are listed in the table below and may be passed as a comma-separated list.
Variable | Description |
---|---|
account | Account the job ran under. |
alloctres | Allocated trackable resources (e.g. cores/RAM) |
avecpu | Average CPU time of all tasks in job. |
cputime | Formatted (Elapsed time * core) count used |
elapsed | Job's elapsed time formatted as DD-HH:MM:SS. |
state | The job’s state |
jobid | The id of the job. |
jobname | The name of the job. |
maxdiskread | Maximum number of bytes read |
maxdiskwrite | Maximum number of bytes written |
maxrss | Maximum RAM use of all job tasks |
ncpus | The number of allocated CPUs |
nnodes | The number of allocated nodes |
ntasks | Number of tasks in a job |
priority | Slurm priority |
qos | Quality of service |
user | Username of the person who ran the job |
For convenience, the command mysacct has been added to the system. This is equivalent to sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state and accepts the same flags that sacct would, e.g. --starttime=YYYY-MM-DD or --endtime=YYYY-MM-DD.
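To illustrate that equivalence, here is one way such a helper could be written as a shell function. This is a sketch under the assumption that extra flags are simply forwarded to sacct; the name mysacct_sketch is hypothetical, and the real mysacct on the system may be implemented differently.

```shell
# Sketch only: a mysacct-style helper that fixes the user and format
# and forwards any extra flags to sacct unchanged.
mysacct_sketch() {
    sacct --user="$USER" \
          --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state \
          "$@"
}

# Usage on a login node:
#   mysacct_sketch --starttime=2020-12-15
```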
Examples for better understanding job hardware utilization
Note that by default, only jobs run on the current day will be listed. To search within a different period of time, use the --starttime flag. The --long flag can also be used to show a non-abbreviated version of sacct output. For example, to list detailed job characteristics for a user’s jobs since December 15th, 2020:
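The original command for this example is not shown here, so the following is a reconstruction from the flags named in the text, not necessarily the exact command the page used. The wrapper name sacct_since is hypothetical.

```shell
# Hypothetical helper: detailed (--long) records for your jobs since a date.
# The date uses the YYYY-MM-DD form accepted by --starttime.
sacct_since() {
    sacct --user="$USER" --starttime="$1" --long
}

# Usage on a login node:
#   sacct_since 2020-12-15
```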
This produces a lot of output. For more focused, formatted output, the following complete command lists the jobs a user ran today, showing each job’s ID, average CPU use, maximum amount of RAM (memory) used, core time (wall time multiplied by the number of cores allocated), and state:
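The command itself is not shown here; a plausible form, assembled from the --format variables that match the fields listed above, is sketched below. The wrapper name sacct_today_summary is hypothetical, and the exact field list is an assumption.

```shell
# Hypothetical helper: summary of the jobs you ran today.
# Fields: job ID, average CPU time, peak memory, core time, final state.
sacct_today_summary() {
    sacct --user="$USER" --format=jobid,avecpu,maxrss,cputime,state
}

# Usage on a login node:
#   sacct_today_summary
```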
Additional Help