This project was developed by the Paul Scherrer Institute, and many of the design choices reflect our specific setup and software environment (e.g. Slurm, GPFS, systemd). Nevertheless, we believe it is general enough to be applicable to similar environments with some customization.
The software has been presented at: https://www.interactivehpc.com/previous-editions/isc25-cfp
Flurm allows dynamic reallocation of compute nodes between two or more Slurm clusters, where typically one cluster has more resources by default. The cluster with more resources is referred to as offline, while the one with fewer resources is called online. These terms originate from our setup, in which a larger cluster is used for offline analysis, and the online cluster supports real-time analysis during beamline experiments.
In addition to Slurm, other aspects of the system can be reconfigured. So far we support (or plan to support) the following:
- configuration of GPFS mounts (e.g. RW vs RO);
- dedicated firewall configurations;
- various personalizations (e.g. /etc/motd);
- automatic revert of the configuration to its default.
It is expected that configuration changes will be applied after a reboot and remain persistent across reboots.
Although it is possible to apply some changes live, this is not guaranteed to be reliable for all system components. For example, filesystem-related changes, such as switching a mount from read-write to read-only, cannot always be safely applied while the filesystem is in use, and often require a reboot to ensure consistency.
The main configuration file used by the tools is /opt/flurm/etc/clusters.ini and should look something like the following, where all sections and parameters are mandatory:
[offline]
cluster=cluster-A
server=controller1-cluster-a.domain.com,controller2-cluster-a.domain.com
features=main
[online]
cluster=cluster-B
server=controller1-cluster-b.domain.com
features=beamline1,beamline2
If some of these options are not clear at this point, continue reading: they are explained below.
We rely heavily on systemd (targets, services and timers) to coordinate configuration changes.
The overall logic is described in this diagram:
A configuration change is triggered by the creation of a specific single-node Slurm reservation, named with the schema flurm_<feature>_X. Every node has a regular timer that scans Slurm reservations; if it finds one matching the naming schema, it triggers the configuration change and a reboot.
If the reservation has no end date (or its duration is one year, due to a possible Slurm bug), the move is considered permanent and no move-back mechanism is triggered.
If the reservation has an end date, a new reservation will be created in cluster B, triggering the move back to cluster A using the same mechanism. This is done by creating a drop-in override of the reservation-check service, which will create a pre-defined reservation.
After a reboot, the node will find itself in a new systemd target, which will start all the services (simple and oneshot) required by the configuration.
If a drop-in is present, it will be picked up by the regular timer checking reservations and will create a move-back reservation. This move-back reservation then triggers the normal move mechanism.
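As a rough illustration of the check performed by the timer, it could look like the following shell sketch; the mapping from the reservation to a destination target is an assumption, and this is not the actual flurm code:

#!/bin/bash
# Hypothetical sketch of the periodic reservation check (not the real flurm script).
node=$(hostname -s)

# One reservation per line, easy to parse.
scontrol -o show reservation | while read -r line; do
    name=$(echo "$line" | sed -n 's/.*ReservationName=\([^ ]*\).*/\1/p')
    nodes=$(echo "$line" | sed -n 's/.* Nodes=\([^ ]*\).*/\1/p')
    case "$name" in
        flurm_*)
            # Is this node part of the reservation?
            if [ -n "$nodes" ] && scontrol show hostnames "$nodes" | grep -qx "$node"; then
                # Here flurm derives the destination from the reservation name,
                # prepares the configuration change and reboots; the target name
                # below is only a placeholder.
                systemctl set-default flurm-destination.target
                systemctl reboot
            fi
            ;;
    esac
done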
To achieve our goal, we combine two Slurm features: configless nodes and dynamic nodes.
A configless node is a node that does not have Slurm configuration files and receives them from the controller at startup. The controller to contact is specified via --conf-server slurm-controller-hostname. The configuration files received from the server will appear in /var/run/slurm/conf.
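On a running configless node you can quickly verify that the configuration was fetched from the controller, for example:

[root@compute-node ~]# ls /var/run/slurm/conf

The slurm.conf received from the controller should be listed there.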
In our setup all nodes are configless. This is not strictly necessary, but it helps us avoid overlap between our system and Puppet.
A dynamic node is a compute node whose slurmd service, at startup, announces one or more features (via -Z --conf Feature=feat) to the Slurm controller. A feature is just a string used to group nodes together and can be used in the Slurm configuration, e.g. to add dynamic nodes to a partition. Dynamic nodes can be mixed with static nodes.
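For example, on the controller side a feature can be used to collect dynamic nodes into a partition; a minimal, hypothetical slurm.conf excerpt (all names are placeholders) could look like this:

NodeSet=beamline1_nodes Feature=beamline1
PartitionName=beamline1 Nodes=beamline1_nodes State=UP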
A Slurm node that is both configless and dynamic will start slurmd specifying both some features and the controller to contact. In those cases the slurmd service will start with options similar to the following:
-Z --conf Feature=experiment1 --conf-server controller1-cluster-b.domain.com
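How these options reach slurmd depends on the local setup; with the stock slurmd.service shipped by Slurm, which reads SLURMD_OPTIONS from /etc/sysconfig/slurmd, one possible way is:

# /etc/sysconfig/slurmd (or /etc/default/slurmd on Debian-based systems)
SLURMD_OPTIONS="-Z --conf Feature=experiment1 --conf-server controller1-cluster-b.domain.com"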
In the following we distinguish two clusters: online and offline. They are both Slurm clusters that include dynamic+configless nodes. Both clusters use some dynamic nodes, and those are the only ones that can be moved between the two clusters.
A further requirement for dynamic nodes to work is that the Slurm parameter MaxNodeCount is set and is large enough to accommodate the dynamic nodes. See the slurm.conf man page for further details.
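For example (the value is only a placeholder):

# slurm.conf on the controller
MaxNodeCount=512

The value currently in effect can be checked in the output of scontrol show config.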
Here is a typical flow for a compute node.
The node is initially in the base (offline) cluster:
[root@compute-node-1 ~]# flurm-status
Current cluster offline
No pending moves
Next move check active. Next run: Tue 2025-05-27 21:10:00 CEST
We then move one node from the offline cluster to the online cluster, to beamline xyz, for 3 days. Internally this happens through a combination of Slurm reservations and timers; our tool is essentially an advanced wrapper around them:
[root@login-node ~]# flurm-reserve_nodes -p xyz -c online -n 1 -d 3
Reservation flurm_offline_xyz_20250527210439_0 created
The reservation is registered. The regular timer will apply it at its next run (by default every 10 minutes):
[root@compute-node ~]# flurm-status
Current cluster offline
Moving to cluster=online and partition=xyz at 2025-05-27T21:04:39
Next move check active. Next run: Tue 2025-05-27 21:10:00 CEST
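The pieces behind this can be inspected directly: the underlying Slurm reservation with scontrol, and the timer with systemctl (the grep pattern assumes the timer unit name contains "flurm", which may differ in your installation):

[root@compute-node ~]# scontrol show reservation flurm_offline_xyz_20250527210439_0
[root@compute-node ~]# systemctl list-timers | grep -i flurm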
After the timer has fired and the node has rebooted, it is in the online cluster.
[root@compute-node ~]# flurm-status
Current cluster online
Moving to cluster=offline and partition=main at 2025-05-30T17:09:02
Next move check active. Next run: Tue 2025-05-27 21:10:00 CEST
Flurm makes heavy use of systemd targets. For example, offline and online are both systemd targets, and the move of a node between clusters happens by changing the default target (or via isolate, in case the node does not get rebooted).
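In practice this boils down to standard systemctl operations; for example, with the cluster-A target generated in the example below:

# set the target for the next boot (the normal, reboot-based path)
systemctl set-default cluster-A.target
# or switch immediately without a reboot, when that is safe for all involved services
systemctl isolate cluster-A.target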
The target files can be created manually or generated from the provided bash templates in the system/templates/ directory. Here's an example:
[~/flurm ]$ NAME=cluster-A CONFLICT="clusterB" envsubst < system/templates/generic.target.in
[Unit]
Description=cluster-A target
Requires=multi-user.target [email protected] [email protected]
Conflicts=clusterB
After=multi-user.target
AllowIsolate=yes
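The generated unit can then be installed like any other systemd unit; for example (the destination file name is an assumption):

[~/flurm ]$ NAME=cluster-A CONFLICT="clusterB" envsubst < system/templates/generic.target.in | sudo tee /etc/systemd/system/cluster-A.target
[~/flurm ]$ sudo systemctl daemon-reload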
It is possible to define sub-targets for each cluster, e.g. a separate target for each beamline as a sub-target of the online cluster. For these cases, we provide an additional template:
[~/flurm ]$ NAME=beamlineX CONFLICT="beamlineY" PARENT=clusterB envsubst < templates/generic.sub-target.in
[Unit]
Description=beamlineX target
Requires=clusterB.target
Conflicts=beamlineY.target
After=clusterB.target
AllowIsolate=yes
[~/flurm ]$ NAME=beamlineY CONFLICT="beamlineX" PARENT=clusterB envsubst < templates/generic.sub-target.in
[Unit]
Description=beamlineY target
Requires=clusterB.target
Conflicts=beamlineX.target
After=clusterB.target
AllowIsolate=yes
We are currently using two ways to deploy flurm:
- local deployment (local_deployment.sh), where files are copied to the proper places on the local filesystem; this is useful when developing or debugging;
- RPMs, built with package.sh.
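For example, a typical cycle during development could be (this assumes the scripts need no arguments, which may not hold for your environment):

[~/flurm ]$ sudo ./local_deployment.sh     # copy the files into place for testing
[~/flurm ]$ ./package.sh                   # build the RPMs for a production rollout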