
Introduction

This project was developed by the Paul Scherrer Institute, and many of the design choices reflect our specific setup and software environment (e.g. Slurm, GPFS, systemd). Nevertheless, we believe it is general enough to be applicable to similar environments with some customization.

The software has been presented at: https://www.interactivehpc.com/previous-editions/isc25-cfp

Description

Flurm allows dynamic reallocation of compute nodes between two or more Slurm clusters, where typically one cluster has more resources by default. The cluster with more resources is referred to as offline, while the one with fewer resources is called online. These terms originate from our setup, in which a larger cluster is used for offline analysis, and the online cluster supports real-time analysis during beamline experiments.

In addition to Slurm, other aspects of the system can be changed. So far we support (or wish to support) the following:

  • configuration of GPFS mounts (e.g. RW vs RO);
  • dedicated firewall configurations;
  • various personalizations (e.g. /etc/motd);
  • automatic revert of the configuration to its default.

It is expected that configuration changes will be applied after a reboot and remain persistent across reboots.

Although it is possible to apply some changes live, this is not guaranteed to be reliable for all system components. For example, filesystem-related changes, such as switching a mount from read-write to read-only, cannot always be safely applied while the filesystem is in use and often require a reboot to ensure consistency.
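
As an illustration, one of the simpler aspects above (the /etc/motd personalization) could be handled by a per-cluster oneshot service started by the corresponding systemd target described later in this document. This is only a sketch; the unit name, template instance and file paths are our assumptions, not Flurm's actual ones:

# Hypothetical per-cluster oneshot unit; %i is the cluster/feature name.
cat > /etc/systemd/system/flurm-motd@.service <<'EOF'
[Unit]
Description=Set /etc/motd for cluster %i

[Service]
Type=oneshot
RemainAfterExit=yes
# Copy a pre-staged motd for this cluster into place (path is an assumption).
ExecStart=/usr/bin/cp /opt/flurm/etc/motd.%i /etc/motd
EOF
systemctl daemon-reload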

Configuration file

The main configuration file used by the tools is /opt/flurm/etc/clusters.ini and should look something like the following, where all sections and parameters are mandatory:

[offline]
cluster=cluster-A
server=controller1-cluster-a.domain.com,controller2-cluster-a.domain.com
features=main
[online]
cluster=cluster-B
server=controller1-cluster-b.domain.com
features=beamline1,beamline2
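
The values can be consumed with standard tools; for instance, an illustrative one-liner (not part of the Flurm tooling) to extract the controller of the online cluster:

awk -F= '/^\[/{sec=$0} sec=="[online]" && $1=="server" {print $2}' /opt/flurm/etc/clusters.ini
# -> controller1-cluster-b.domain.com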

If not all options are clear at this point, continue reading this document.

Architecture

We rely heavily on systemd (targets, services and timers) to coordinate configuration changes.

The overall logic is described in this diagram:

[Diagram: overall logic of a node move, coordinated via systemd targets, services and timers]

Triggering the Move from Cluster A to Cluster B

A configuration change is triggered by the creation of a specific single-node Slurm reservation, named according to the schema flurm_<feature>_X. Every node runs a recurring timer that scans Slurm reservations; if it finds one matching this naming schema, it triggers the configuration change and a reboot.
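
To make this concrete, the following shows roughly what such a trigger reservation looks like if created by hand, and what the per-node timer essentially does to detect it. The exact reservation options and names used by Flurm may differ; flurm-reserve_nodes (shown later in this document) normally creates the reservation for you:

# Create a single-node trigger reservation by hand (illustrative values).
scontrol create reservation ReservationName=flurm_beamline1_1 \
    Nodes=compute-node-1 Users=root StartTime=now Duration=3-00:00:00 Flags=MAINT

# What the per-node timer essentially does: list reservations and look for the naming schema.
scontrol show reservation | grep -o 'ReservationName=flurm_[^ ]*'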

If the reservation has no end date (or its duration is one year, due to a possible Slurm bug), the move is considered permanent and no move-back mechanism is triggered.

If the reservation has an end date, a new reservation will be created in Cluster B, triggering the move back to Cluster A through the same mechanism. This is done by creating a drop-in override of the check-reservation service, which will create a pre-defined reservation.

Self-registration to cluster B, partition BL

After the reboot, the node finds itself in a new systemd target, which starts all the services (simple and oneshot) required by the new configuration.

If a drop-in override is present, it will be triggered by the recurring reservation-check timer and will create the move-back reservation.
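
A minimal sketch of what such a drop-in override might look like, assuming a reservation-check service named flurm-check-reservation.service (the unit name and reservation parameters here are assumptions, not necessarily Flurm's actual ones):

mkdir -p /etc/systemd/system/flurm-check-reservation.service.d
cat > /etc/systemd/system/flurm-check-reservation.service.d/move-back.conf <<'EOF'
[Service]
# After the regular check, create the pre-defined reservation that triggers the move back.
# %H expands to the node's hostname (adjust if your Slurm node names differ);
# a one-year duration marks the resulting move as permanent (see above).
ExecStartPost=/usr/bin/scontrol create reservation ReservationName=flurm_main_1 Nodes=%H Users=root StartTime=now Duration=365-00:00:00
EOF
systemctl daemon-reload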

Optional return to Cluster A

The normal move mechanism will be triggered by the optional move-back reservation.

Slurm functionalities

To achieve our goal, we combine two features of Slurm: configless nodes and dynamic nodes.

Configless node

A configless node is a node that does not have local Slurm configuration files and receives them from the controller at startup. The controller to contact is specified via --conf-server slurm-controller-hostname.

The configuration files received from the server will appear in /var/run/slurm/conf.

In our setup all nodes are configless. This is not strictly necessary, but it helps us avoid overlaps between Flurm and our Puppet-managed configuration.
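
Assuming the slurmd.service shipped with the Slurm packages, which reads $SLURMD_OPTIONS from /etc/sysconfig/slurmd (the location may differ on your distribution), the option can be set for example as follows:

echo 'SLURMD_OPTIONS="--conf-server controller1-cluster-a.domain.com"' > /etc/sysconfig/slurmd
systemctl restart slurmd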

Dynamic node

A dynamic node is a compute node whose slurmd service, at startup, announces one or more features (via -Z --conf Feature=feat) to the Slurm controller. A feature is just a string used to group nodes together and can be used in the Slurm configuration, e.g. to add dynamic nodes to a partition. Dynamic nodes can be mixed with static nodes.
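
On the controller side, a feature can then be referenced in slurm.conf to place the dynamic nodes into a partition. A minimal sketch, where the node set, partition and feature names are illustrative and the configuration path depends on your installation:

cat >> /etc/slurm/slurm.conf <<'EOF'
# Group all dynamic nodes that announced Feature=beamline1 into a node set...
NodeSet=beamline1_nodes Feature=beamline1
# ...and expose that node set through a partition.
PartitionName=xyz Nodes=beamline1_nodes State=UP
EOF
scontrol reconfigure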

Dynamic and configless

A Slurm node that is both configless and dynamic will start slurmd specifying both its features and the controller to contact.

In those cases the slurmd service will start with options similar to the following:

-Z --conf Feature=experiment1 --conf-server controller1-cluster-b.domain.com
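
Once slurmd has started with these options, the node registers itself with the controller. One quick way to verify that the node and its features showed up (the format string is just an example):

sinfo -N -o '%N %f %P'    # node name, available features, partition(s)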

Online and offline cluster

In the following we distinguish two clusters: online and offline. Both are Slurm clusters that include dynamic, configless nodes.

Cluster setup

Both the online and the offline cluster use some dynamic nodes, and those are the only ones that can be moved between the two clusters.

A further requirement for the dynamic nodes to work is that the slurm.conf parameter MaxNodeCount is set and is large enough to accommodate the dynamic nodes. See the slurm.conf man page for further details.
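
A quick way to verify this against a given controller (the value shown is only an example; size it for the number of dynamic nodes you expect):

scontrol show config | grep MaxNodeCount
# MaxNodeCount            = 512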

Move of the nodes

Here is a typical flow for a compute node.

The node is initially in the base (offline) cluster:

[root@compute-node-1 ~]# flurm-status
Current cluster                 offline
No pending moves
Next move check  active. Next run: Tue 2025-05-27 21:10:00 CEST

We then move one node to the online cluster, to the beamline xyz, for 3 days. Internally this happens through a combination of Slurm reservations and systemd timers; our tool is essentially an advanced wrapper around them:

[root@login-node ~]# flurm-reserve_nodes -p xyz -c online -n 1 -d 3
Reservation flurm_offline_xyz_20250527210439_0 created

The reservation is registered. A recurring timer will apply the change at its next run (by default every 10 minutes):

[root@compute-node ~]# flurm-status
Current cluster                 offline
Moving to cluster=online and partition=xyz at  2025-05-27T21:04:39
Next move check  active. Next run: Tue 2025-05-27 21:10:00 CEST

After the timer has fired and the node has rebooted, it is in the online cluster:

[root@compute-node ~]# flurm-status
Current cluster                 online
Moving to cluster=offline and partition=main at  2025-05-30T17:09:02
Next move check active. Next run: Tue 2025-05-27 21:10:00 CEST

Targets

Flurm makes heavy use of systemd targets. For example, the offline and online clusters are both represented by systemd targets, and moving a node between clusters happens by changing the default target (or via systemctl isolate, in case the node does not get rebooted).
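
In practice, switching a node boils down to standard systemctl operations; a minimal sketch, assuming the targets are literally named offline.target and online.target:

systemctl set-default online.target    # persistent: the node will boot into the online cluster
systemctl isolate online.target        # live switch, for the cases where no reboot is needed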

The target files can be created manually, or they can be generated via the provided bash templates in the system/templates/ directory.

Here's an example:

[~/flurm ]$ NAME=cluster-A CONFLICT="clusterB" envsubst < system/templates/generic.target.in
[Unit]
Description=cluster-A target
Requires=multi-user.target [email protected] [email protected]
Conflicts=clusterB
After=multi-user.target
AllowIsolate=yes

Complex target hierarchies

It is possible to define sub-targets for each cluster. For example, each beamline can have its own target defined as a sub-target of the online cluster.

For these cases, we provide an additional template:

[~/flurm ]$ NAME=beamlineX CONFLICT="beamlineY" PARENT=clusterB envsubst < templates/generic.sub-target.in
[Unit]
Description=beamlineX target
Requires=clusterB.target
Conflicts=beamlineY.target
After=clusterB.target
AllowIsolate=yes
[~/flurm ]$ NAME=beamlineY CONFLICT="beamlineX" PARENT=clusterB envsubst < templates/generic.sub-target.in
[Unit]
Description=beamlineY target
Requires=clusterB.target
Conflicts=beamlineX.target
After=clusterB.target
AllowIsolate=yes
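
The generated units can then be installed like any other unit file; one possible way (the destination path simply follows standard systemd conventions):

NAME=beamlineX CONFLICT="beamlineY" PARENT=clusterB envsubst \
    < templates/generic.sub-target.in > /etc/systemd/system/beamlineX.target
systemctl daemon-reload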

Installation

We are currently using two ways to deploy Flurm:

  1. local deployment (local_deployment.sh), where files are copied to the proper locations on the local filesystem; this is useful when developing or debugging;
  2. RPMs, built with package.sh.

About

Flurm (flexible Slurm) is a way to use Slurm and systemd to manage dynamic compute clusters.
