Running the cluster¶
Now that you have a cluster which is up and running, it’s worth knowing what you can do with it.
A full Slurm tutorial is outside the scope of this document, but the cluster is configured in a fairly standard way.
By default there’s a single partition called compute which contains all the compute nodes.
A simple first Slurm script, test.slm, could look like:
```shell
#! /bin/bash
#SBATCH --job-name=test
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00
#SBATCH --exclusive

echo start
srun -l hostname
echo end
```
which you could run with:

```shell
[matt@mgmt ~]$ sbatch test.slm
```
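sbatch prints the new job’s ID, which you can then follow with Slurm’s standard query tools. A quick sketch (the job ID 42 below is purely illustrative):

```shell
# List your queued and running jobs
[matt@mgmt ~]$ squeue -u $USER

# Inspect a single job in detail (42 is an illustrative job ID)
[matt@mgmt ~]$ scontrol show job 42

# By default each job writes its output to slurm-<jobid>.out
# in the directory you submitted from
[matt@mgmt ~]$ cat slurm-42.out
```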
Slurm elastic scaling¶
Slurm is configured to use its elastic computing mode. This allows Slurm to automatically turn off any nodes which are not currently being used for running jobs and turn on any nodes which are needed for running jobs. This is particularly useful in the cloud as a node which has been shut down will not be charged for.
Slurm does this by calling a script
/usr/local/bin/startnode as the slurm user.
If necessary, you can call this yourself from the
opc user like:

```shell
[opc@mgmt ~]$ sudo -u slurm /usr/local/bin/startnode compute001
```

to turn on the node compute001.
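While a node is starting you can watch its state with sinfo; in Slurm’s elastic/cloud mode a ~ suffix on a node’s state means it is powered down and a # suffix means it is currently powering up:

```shell
# Show per-node state; state suffixes: '~' = powered down, '#' = powering up
[opc@mgmt ~]$ sinfo -N -l
```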
You should never need to do anything to explicitly shut down the cluster; it will automatically turn off all nodes which are not in use after a timeout. The management node will always stay running, which is why it’s worth using only a relatively cheap VM for it.
Currently, due to a quirk in OCI, it seems that while all VMs and most bare-metal nodes are not charged for while stopped, the DenseIO nodes are. This means that the auto-shutdown will not work as well for those shapes and you will be charged. Development is ongoing to avoid this.
The rate at which Slurm shuts down idle nodes is managed in
/mnt/shared/apps/slurm/slurm.conf by the SuspendTime parameter.
See the slurm.conf documentation for more details.
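For reference, the power-saving section of slurm.conf looks something like the following sketch. The startnode path is the one used above; the stopnode counterpart and the numeric values shown are assumptions, so check your generated slurm.conf for the real settings:

```
ResumeProgram=/usr/local/bin/startnode    # turns nodes on (as used above)
SuspendProgram=/usr/local/bin/stopnode    # assumed counterpart that turns nodes off
SuspendTime=300                           # illustrative: seconds idle before shutdown
ResumeTimeout=600                         # illustrative: seconds to wait for a node to boot
```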
A common task is to run commands across all nodes in a cluster. By default you have access to clustershell; read its documentation for details of how to use the tool.
The gist is that you give it a hostname or a group, and a command to run.
You can see a list of the available groups with:

```shell
[opc@mgmt ~]$ cluset --list-all
@compute
@state:idle
@role:mgmt
```
You can then run a command with:

```shell
[opc@mgmt ~]$ clush -w @compute uname -r
compute001: 3.10.0-862.2.3.el7.x86_64
compute002: 3.10.0-862.2.3.el7.x86_64
compute003: 3.10.0-862.2.3.el7.x86_64
compute004: 3.10.0-862.2.3.el7.x86_64
```
You can combine the output from different nodes using the -b flag:

```shell
[opc@mgmt ~]$ clush -w @compute -b uname -r
---------------
compute[001-004] (4)
---------------
3.10.0-862.2.3.el7.x86_64
```
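clush can also copy a file out to every node, which is useful for anything not kept on the shared filesystem (the file names here are just examples):

```shell
# Copy a local file to /tmp on every compute node
[opc@mgmt ~]$ clush -w @compute --copy ./setup.sh --dest /tmp/setup.sh
```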
Installing software on your cluster¶
In order to do any actual work you will likely need to install some software.
There are many ways to do this, but I would recommend either using
clush to install the software
or, preferably, creating a local Ansible playbook which installs it for you across the cluster.
In the latter case, you can use
/home/opc/hosts as an inventory file and point your playbook at it.
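As a sketch of the Ansible route (the playbook name install.yml, the group name compute, and the package htop are all just examples; match the group name to whatever groups /home/opc/hosts actually defines):

```shell
# install.yml -- a hypothetical playbook; the 'compute' group name is an
# assumption, so check it against the groups in /home/opc/hosts
[opc@mgmt ~]$ cat install.yml
- hosts: compute
  become: yes
  tasks:
    - name: Install an example package on all compute nodes
      yum:
        name: htop
        state: present

# Run it against the provided inventory
[opc@mgmt ~]$ ansible-playbook -i /home/opc/hosts install.yml
```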
The cluster automatically collects data from all the nodes and makes it available in a web dashboard.
It is available at the IP address of your management node on port 3000. Point your browser at http://your.mgmt.ip.address:3000 and log in with the username admin and the password admin. You will be prompted to set a new password before you continue.
Destroying the whole cluster¶
When you’ve completely finished with the cluster, you can destroy it using Terraform:

```shell
$ terraform destroy
```

Please bear in mind that this will also destroy your file system, which contains your users’ home areas and any data stored on the cluster.