
What load-balancing strategy in k8s would be recommended for distributing jobs between Slurm (think SGE) clusters?

The reason I raise this is that Slurm has a Pythonic API governing the queuing system.

Information can freely pass from the queuing system to the Nginx load balancer (in k8s) via that API. The load balancer here is not the one within a cluster (which uses "round robin" to keep all nodes equally active); it is the Nginx load balancer that decides which cluster a job is placed on. The jobs on these clusters vary from very quick (seconds) to very slow (days), so ideally the exact queue state of each cluster would drive which cluster is allocated.
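As a sketch of what "exact queue state" could look like on the balancer side, assuming it reads each head node's `squeue` output (the Pythonic route, e.g. pyslurm or the Slurm REST API, would expose the same fields; the chosen format string is illustrative):

```python
# Sketch: summarise a cluster's queue from `squeue` output.
# Assumes output produced by: squeue --noheader --format="%T %l"
# (job state and time limit) -- the field choice is an assumption.

def parse_timelimit(s: str) -> int:
    """Convert a Slurm time limit like '1-02:30:00' or '15:00' to seconds."""
    days = 0
    if "-" in s:
        d, s = s.split("-", 1)
        days = int(d)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:          # pad short forms to [h, m, s]
        parts.insert(0, 0)
    h, m, sec = parts
    return (days * 24 + h) * 3600 + m * 60 + sec

def queue_summary(squeue_text: str) -> dict:
    """Return the pending-job count and total estimated backlog in seconds."""
    pending, backlog = 0, 0
    for line in squeue_text.strip().splitlines():
        state, limit = line.split()
        if state == "PENDING":
            pending += 1
            backlog += parse_timelimit(limit)
    return {"pending": pending, "backlog_s": backlog}
```

A summary like this, polled per cluster, is what the Nginx-side decision would consume.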

The load-balancing strategies I am familiar with are:

  • "Round robin" (wouldn't work here)

However, others that seem more applicable could be:

  • Weighted round-robin method*
  • Dynamic load balancing*
  • Least connection method
  • Resource-based method*

... any others welcome.

* These methods would work if the state of each server's queue (head node) weighted or shifted the allocation of jobs, e.g. based on queue size and the estimated time each job will take.
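For illustration, a minimal version of that weighting (structure and names are hypothetical): rather than "least connections", pick the cluster whose head node currently reports the smallest estimated backlog, breaking ties on queue size:

```python
# Sketch: choose a cluster by estimated queued work, not connection count.
# Each entry holds the pending-job count and estimated seconds of queued
# work as reported by that cluster's head node; the shape is an assumption.

def pick_cluster(queues: dict) -> str:
    """Return the cluster with the least estimated queued work,
    breaking ties on the number of pending jobs."""
    return min(queues, key=lambda name: (queues[name]["backlog_s"],
                                         queues[name]["pending"]))

queues = {
    "on-prem": {"pending": 40, "backlog_s": 3 * 24 * 3600},  # days of work queued
    "aws":     {"pending": 5,  "backlog_s": 120},            # nearly idle
}
# pick_cluster(queues) -> "aws"
```

This is effectively "dynamic" and "resource-based" at once: the weighting is recomputed from live queue state on every allocation.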


Just to mention, "auto-scaling" (e.g. AWS) has been suggested as a solution. That's a great idea. AWS is indeed one of the Slurm clusters in the description, and it does offer auto-scaling; ParallelCluster (AWS) can be configured to auto-scale without using AWS Auto Scaling directly, as it's an inbuilt ability. Anyway, I fully appreciate the comment by @JamesShewey: in a really complex architecture the suggestion is doable, i.e. creating a new ParallelCluster on demand, but spinning up a whole cluster is more demanding than launching a single instance.

What I think is that the "resource-based method" would work, because it would be aware of head-node activity.
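One way a resource-based method could feed back into Nginx (details hypothetical: endpoint addresses and the reload mechanism are assumptions, and the generated block would be written to a config file followed by an out-of-band `nginx -s reload`): translate each head node's backlog into an inverse upstream weight, so lightly loaded clusters receive proportionally more jobs.

```python
# Sketch: turn per-cluster backlog into an nginx upstream block.
# The upstream name and server addresses are placeholders.

def upstream_block(queues: dict, endpoints: dict) -> str:
    """Weight each cluster inversely to its estimated backlog (in minutes),
    clamped to a minimum of 1 so no cluster is starved entirely."""
    lines = ["upstream slurm_clusters {"]
    for name, q in queues.items():
        weight = max(1, 1000 // (1 + q["backlog_s"] // 60))
        lines.append(f"    server {endpoints[name]} weight={weight};")
    lines.append("}")
    return "\n".join(lines)
```

Regenerating this block on a short polling interval would approximate the "exact queue state" behaviour described above without any custom Nginx module.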
