FairShare Scheduling Notes

With faculty governance board approval, on September 4th, 2019, ASU Research Computing began running a new compute job scheduling algorithm called FairShare on the main cluster.

Here are the important facts to understand about this change:

  • FairShare is implemented only on the public Agave cluster.  No changes were made to the legacy Saguaro cluster or the private Ocotillo cluster.
  • No jobs that were running at the time we made the change were interrupted.  
  • FairShare replaced the monthly 30,000 CPU-hour credit system, under which jobs submitted beyond that balance became preemptable.
  • CPU hours are now relative: they affect scheduling priority instead of counting against a fixed balance.  Under FairShare, all jobs that start on public cluster resources will run to completion.  No jobs run on public resources will be preempted.
  • FairShare is not a free-for-all, however.  When the cluster is under heavy load and jobs are waiting to start, the order in which those waiting jobs will be scheduled to run is determined by each researcher's FairShare score.
  • The more CPU hours a researcher has used in a given month, the lower that researcher's FairShare score becomes.  Newly submitted jobs from that researcher will have a lower priority than those submitted by a researcher with a higher FairShare score.
  • All submitted jobs will eventually run, but the order will be determined by the FairShare scores of each user, not the order in which those jobs are submitted.
  • When the cluster has unused resources and no jobs are waiting to run, a low FairShare score will not prevent a researcher's jobs from running.  FairShare only becomes important when there are jobs waiting to be run.  However, jobs that run when the cluster is relatively idle will still affect a researcher's FairShare score.
  • A researcher may expect their FairShare score to halve for every 10,000 CPU hours recently used.  The score recovers exponentially, as a researcher's usage history decays with a half-life of one week.
  • The Wildfire queue will no longer exist on public resources, but will still exist on private cluster resources, such as faculty-owned GPU systems.
  • Wildfire jobs run on private cluster resources are still subject to preemption by the owners of these resources.
  • Wildfire jobs run on private cluster resources DO NOT affect a researcher's FairShare score.
  • Privileged jobs run by the owners of private resources DO NOT affect the owner’s FairShare score.
  • The aggressive queue will still exist in name, but using this queue will have no impact on job scheduling.  The name is being kept active for the convenience of our customers.
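
The scoring and ordering rules above can be illustrated with a small Python sketch.  It is a toy model only, not the scheduler's actual formula: it assumes that each usage record is discounted by half for every week of age, and that the score is 1.0 for a researcher with no recent usage and halves for every 10,000 decayed CPU hours.  The names decayed_usage, fairshare_score, PendingJob and schedule_order are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

HALVING_HOURS = 10_000.0  # from the notes: score halves per 10,000 CPU hours
HALF_LIFE_DAYS = 7.0      # from the notes: usage history has a one-week half-life


def decayed_usage(records):
    """Sum CPU hours, discounting each record by its age.

    records: iterable of (cpu_hours, age_in_days) pairs.
    """
    return sum(hours * 0.5 ** (age / HALF_LIFE_DAYS) for hours, age in records)


def fairshare_score(records):
    """Toy score in (0, 1]: 1.0 with no usage, halved per 10,000 decayed hours.

    Assumption: the real scheduler may normalize or weight usage differently.
    """
    return 0.5 ** (decayed_usage(records) / HALVING_HOURS)


@dataclass
class PendingJob:
    job_id: int
    user: str
    submit_order: int  # lower means submitted earlier


def schedule_order(pending, scores):
    """Order waiting jobs by owner's score, highest first.

    Submission order is used only to break ties between equal scores.
    """
    return sorted(pending, key=lambda j: (-scores[j.user], j.submit_order))


if __name__ == "__main__":
    usage = {
        "alice": [(10_000, 0)],  # 10,000 CPU hours used today
        "bob": [(10_000, 7)],    # the same usage, but one week ago
        "carol": [],             # no recent usage
    }
    scores = {user: fairshare_score(recs) for user, recs in usage.items()}
    waiting = [
        PendingJob(1, "alice", 0),
        PendingJob(2, "bob", 1),
        PendingJob(3, "carol", 2),
    ]
    for job in schedule_order(waiting, scores):
        print(job.job_id, job.user, round(scores[job.user], 3))
```

In this sketch the job from carol is placed first, then bob's, then alice's, even though alice submitted first.  The ordering matters only while jobs are waiting; on an idle cluster every job would start immediately regardless of score.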

The decision to change the job scheduler from CPU hour balances to FairShare was motivated by several factors including:

  1. Compute Efficiency – FairShare eliminates job preemption, which wastes cycles with or without checkpointing.
  2. Scheduler Efficiency – FairShare will increase utilization of the cluster through more efficient use of resources.
  3. Modernization – Switching to FairShare follows guidance provided by a peer review from other HPC centers and also brings ASU Research Computing into line with accepted best practices in cluster management.

You can see your FairShare score in real time by running the mybalance command from any Agave node.