Slug24

The tl;dr of the annual Slurm User Group 2024 conference in Oslo, Norway

Lenny · 2024-09-22

Summary

  • First off, the picture above is of the Oslo Opera House, not the conference venue. It's one of my favourite buildings — you can walk from the sea to the apex in one go.
  • Official Slurm-on-Kubernetes support from SchedMD is coming by the end of the year (apparently in November).
  • Debian packaging support from 23.11.
  • Moving towards TLS for all Slurm communication (including slurmd).
  • Nvidia IMEX support and an Nvidia plugin for autodetection of Nvidia GPUs.

Slurm and Kubernetes

SchedMD announced that they will be releasing Slurm with Kubernetes support by the end of the year. This feels like a big deal for those looking to run Slurm on Kubernetes. The integration, called Slinky, should allow users to run Slurm in Kubernetes, which, if you are already using k8s, will make it easier to manage Slurm.

There are two variants of the integration:

1. Slurm Operator.

I wasn't 100% clear on this one; however, my understanding is that slurmd (and slurmctld, slurmdbd, etc.) would run in a pod, and the operator would manage the lifecycle of the pod. This would mean that Slurm-scheduled workloads would run inside a pod. IMHO putting all the Slurm components into containers/pods makes complete sense, except for slurmd. However, I do see how this would be an easy win for any Platform-Engineer/DevOps/Admin who is familiar with k8s but has a research team that really wants Slurm. It should in theory also make it much easier to scale Slurm up and down based on needs (e.g. training vs inference).
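Purely as an illustration of what driving such an operator might feel like (it hadn't been released at the time of the talk, so the API group, the SlurmCluster kind, and every field below are my own invention, not anything SchedMD showed), creating a cluster through it from Python could look roughly like this:

```python
# Hypothetical sketch: asking a Slurm operator to stand up a cluster by
# creating a custom resource. The group/version/kind and spec fields are
# invented for illustration; the real CRDs may differ entirely.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() from inside a pod
api = client.CustomObjectsApi()

slurm_cluster = {
    "apiVersion": "slurm.example.com/v1alpha1",  # assumed API group/version
    "kind": "SlurmCluster",                      # assumed kind
    "metadata": {"name": "demo"},
    "spec": {
        # e.g. how many slurmd pods the operator should keep running
        "computeNodes": 4,
        "slurmctld": {"replicas": 1},
        "slurmdbd": {"enabled": True},
    },
}

api.create_namespaced_custom_object(
    group="slurm.example.com",
    version="v1alpha1",
    namespace="slurm",
    plural="slurmclusters",
    body=slurm_cluster,
)
```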

2. Slurm-Bridge.

This is a bridge between Slurm and Kubernetes that forwards all scheduling decisions for a k8s namespace to Slurm. In Kubernetes language, the Slurm-Bridge is a controller. As I understand it, this will mean running both slurmd and the kubelet on the same node, and submitting jobs via the k8s API. Of the two approaches, this one makes the most sense to me: one of Slurm's advantages over k8s is its ability to run non-containerised workloads as close to bare metal as possible, and the bridge keeps that option open.
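As a rough sketch of the user-side flow under my assumptions (the namespace name and schedulerName below are guesses, not anything SchedMD confirmed): you keep creating ordinary pods via the k8s API, and the bridge defers the placement decision to Slurm.

```python
# Hypothetical sketch: submitting a pod into a namespace whose scheduling is
# delegated to Slurm via the bridge. The namespace ("slurm-bridge") and the
# schedulerName are assumptions for illustration only.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="hello-slurm", namespace="slurm-bridge"),
    spec=client.V1PodSpec(
        scheduler_name="slurm-bridge",  # assumed: defer scheduling to the bridge
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="main",
                image="busybox",
                command=["echo", "placed by Slurm, run by the kubelet"],
            )
        ],
    ),
)

v1.create_namespaced_pod(namespace="slurm-bridge", body=pod)
```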

In summary, it's an interesting first (second*) step. There has obviously been a decent amount of thought given to the use cases of Slurm with k8s, and I'm very excited to see an officially supported integration.

*Second, because there was a Coreweave+SchedMD solution announced last year, but it seems there may have been some technical disagreements between the two companies.

Nvidia

As of 24.05 there is:

  • a new switch/nvidia_imex plugin for IMEX channel management on NVIDIA systems.
  • a new RestrictedCoresPerGPU option at the Node level, designed to ensure GPU workloads always have access to a certain number of CPUs even when nodes are running non-GPU workloads concurrently.
  • and a new Nvidia autodetection plugin that has no dependency on NVML or CUDA libraries, BUT no longer supports core locality and utilisation.

Compact placement and Topology

There's not much to say here except that the Topology/block plugin ("compact placement" in cloud speak) was redesigned for 23.11. It looks comprehensive, has hierarchy, and has been built with GPU/networking in mind.

Trailblazing Turtle

Simon Guilbault has been working on a monitoring dashboard for his HPC cluster that runs Slurm. It talks directly to MySQL, stores metrics in Prometheus, and scrapes job scripts from the Slurm REST API. He called it Trailblazing Turtle (repo) because it would be easy to Google.
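For a feel of the plumbing involved (this isn't Simon's code; the slurmrestd address, API version, and token setup below are assumptions about a local deployment), pulling job data out of the Slurm REST API looks roughly like this:

```python
# Rough sketch of querying slurmrestd (the Slurm REST API daemon) for jobs.
# Not taken from Trailblazing Turtle; the base URL and API version (v0.0.40)
# are assumptions, and SLURM_JWT would hold a token from `scontrol token`.
import os
import requests

BASE_URL = "http://slurmrestd.example.internal:6820"  # assumed address/port

headers = {
    "X-SLURM-USER-NAME": os.environ["USER"],
    "X-SLURM-USER-TOKEN": os.environ["SLURM_JWT"],
}

resp = requests.get(f"{BASE_URL}/slurm/v0.0.40/jobs", headers=headers)
resp.raise_for_status()

for job in resp.json().get("jobs", []):
    print(job.get("job_id"), job.get("name"))
```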

In particular, some administrators of academic clusters were very interested in:

  • CO2 costs
  • Energy cost
  • Monetary cost

Anecdotes

As with most conferences, it's the networking where the real value lies. So here are some of the coffee-break queue notes.

Debian Packaging (and Rocky)

Everyone I spoke to was using RHEL or Rocky — no one was using a Debian flavour (except us and one other person). However, it was announced that Debian packaging support will be available from 23.11. For us, and I'm sure for other AI Labs out there, this is good news.

CGroups V2

Lots of issues with CGroups V2, especially for those who upgraded an existing machine rather than starting from a fresh build.

TLS

SchedMD's focus on TLS feels like a move towards a more modern/k8s/cloud-native approach.

Physicists

Finger in the air, I reckon about 80% of the attendees were ex-physicists. I spoke to a few and I think the conclusion was that physicists were the academic group that originally needed supercomputers for things like simulations, so it makes sense that they are the ones with (in most cases decades of) experience with HPC.

Coreweave and SchedMD

Apparently I was the fool that raised their hand and asked the question, "what happened to Coreweave?" — it seems others wanted to know but were too polite to ask.

Join the beta

See how Clusterfudge can accelerate your research.