
We have a faculty HPC system, called fawcett, for developing and running computationally intensive tasks.

Account

In order to obtain an account on fawcett please email your request to help@maths.cam.ac.uk stating which research group you are in.

Access

Once you have been informed that the account has been activated you can log in to fawcett using ssh. Please note that direct connections to fawcett are possible from computers connected to the Maths main network and from ssh.maths.cam.ac.uk. For ways to configure access from other computers please look at:

Ssh access to fawcett
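
For example, one way to reach fawcett from an outside machine is to jump through ssh.maths.cam.ac.uk with an ~/.ssh/config entry like the following. This is only a sketch: CRSID is a placeholder for your own username, it assumes the short hostname fawcett resolves from the gateway, and the page linked above describes the supported methods in detail.

Host fawcett
    HostName fawcett
    User CRSID
    ProxyJump CRSID@ssh.maths.cam.ac.uk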

Head-node etiquette

Running resource-intensive computations on the head node (i.e. where you find yourself having logged in to fawcett using ssh) is not allowed - it is the single entry point to Fawcett for all of its users, so overusing its resources is very much antisocial. Please use the queuing system (see below) instead; it can run both batch and interactive jobs. The only exception to the above rule is short-lived interactive tasks such as compilation of software - if it is not expected to take more than 5-10 minutes and you do not leave it unattended, it is okay to run it on the head node.

Visual Studio Code users, please note that you are required to configure it appropriately in order not to overload the head node. See the section Software below for details.

Long-running CPU-intensive processes on the head node may be terminated with no advance warning, and repeat offenders may have their Fawcett accounts suspended.

Hardware configuration

  1. 1 shared-memory node with 32x SkyLake 6154 18-core 3GHz processors and 6TB of RAM
  2. 4 nodes, each with 2x SkyLake 6140 18-core 2.3GHz processors and 384GB of RAM
  3. 2 nodes, each with 2x NVidia Pascal P100 GPUs (each with 16 GB of memory), 2x SkyLake 6140 18-core 2.3GHz processors and 384GB of RAM
  4. 23 nodes, each with 1x Intel Xeon Phi 7210 (KNL) 64-core 1.3GHz processor and 96GB of RAM
  5. Intel Omni-Path HPC interconnect
  6. 220TB of dedicated storage (not backed up)

Disk space

Home directories and data disks on fawcett are separate from the ones on other Maths systems. Home directories have relatively small quotas, so they should not be used to keep big data generated by computing jobs. For this purpose every user should have access to at least one of the subdirectories of /nfs/st01/hpc-*.

Note that on Fawcett the command quota does not return correct information regarding storage occupancy. To find out how much space you still have available, use the command df on the relevant directory, e.g.

  • df /home/CRSID to see your home-directory quota
  • df /nfs/st01/hpc-GROUPNAME to see the data-disk quota shared between all members of group hpc-GROUPNAME

Software

Most of the software is provided in the form of environment modules. The list of available modules can be checked with the command:

module av

A module can be loaded with the command:

module load <modulename>

Other useful commands:

module      
   (no arguments)              print usage instructions
   list                        print list of loaded modules
   whatis                      as above with brief descriptions
   unload <modulename>         remove a module
   purge                       remove all modules

Most useful modules:

gcc
openmpi/4.1.2/gcc-7.5.0-qc6lyum
openmpi/4.1.2/intel-2021.5.0-h6p5iq3
python
intel-oneapi-compilers
cuda
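
For example, a typical sketch of preparing a build environment with the modules listed above (hello.c is a placeholder for your own MPI source file):

module purge
module load gcc openmpi/4.1.2/gcc-7.5.0-qc6lyum
mpicc -O2 -o hello hello.c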

New modules can be requested by sending email to help@maths.cam.ac.uk.

Python modules

Please note that while there are some Python modules available on Fawcett, they are there as dependencies of other modules rather than as packages for end-users to import. They may or may not work.

In case of problems, or when in doubt, please use Anaconda packages (see below) for anything Pythonic.

Anaconda setup

Some software is also installed as conda environments. These tend to interact badly with software installed with environment modules. All loaded environment modules can be unloaded with the command:

module purge

The next step is to load the miniforge3 module:

module load miniforge3

Afterwards you may check the list of available conda environments with the command:

conda info -e

Currently available environments are:

  • anaconda - contains the full list of packages that comes with a standard Anaconda installation, plus a few more which were requested by users
  • tensorflow - different versions
  • pytorch - different versions

A particular environment can be activated with a command like the following (note that the standard conda activate ... does not work on fawcett):

source activate anaconda-2021.11

By default conda stores the environments created by users in the directory /nfs/software/Conda/users/<username>/envs. It also aggressively caches downloaded packages in /nfs/software/Conda/users/<username>/pkgs. As both these directories can become quite large, it is recommended to clean the conda caches regularly. The details can be found in the conda documentation:

https://conda.io/projects/conda/en/latest/user-guide/configuration/use-c...

https://conda.io/projects/conda/en/latest/user-guide/configuration/use-c...
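
For example, a minimal sketch of creating a personal environment and then cleaning up the package cache afterwards (myenv and the package list are placeholders; conda create and conda clean are standard conda commands):

# create a personal environment; it will end up under /nfs/software/Conda/users/<username>/envs
conda create -n myenv python numpy
source activate myenv

# remove cached package tarballs and unused packages from /nfs/software/Conda/users/<username>/pkgs
conda clean --all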

Visual Studio Code

In its default configuration, Visual Studio Code spawns multiple copies of the JavaScript server Node.JS on remote hosts. On multi-user systems such as the Fawcett head node this can, and has been observed to, quickly exhaust the available resources and render them virtually unusable. Therefore, users wishing to use Visual Studio Code to work on Fawcett are now required to adjust their configuration as follows:

  1. Hit the Extensions button (on the left toolbar, looks like building blocks)
  2. Locate the extension "TypeScript and JavaScript Language Features"; searching for "@builtin TypeScript" ought to do it
  3. Disable that extension
  4. Reload VS Code

Queuing system

Fawcett uses the Slurm workload manager for managing resources. If you are not familiar with Slurm, or with workload managers / batch-queuing systems in general, you might want to have a look at this FAQ before proceeding.

Some useful commands:

squeue      - show the state of jobs in the queue
sinfo       - show the state of nodes and partitions
sview       - graphical overview of jobs, nodes and partitions
scontrol show job <job_number> - examine the job with the given job number
scontrol show node <nodename> - examine the node with the given name
sbatch      - submits an executable script to the queueing system
srun        - run a command either as a new job or within an existing job
scancel     - delete a job

Submitting jobs

To submit a job one first needs to create a submission script. It is a shell script containing special comment lines with the prefix

#SBATCH

which provide instructions to the queuing system about the required resources. For example:

#!/bin/bash
#! Which partition (queue) should be used
#SBATCH -p gpu
#! Number of required nodes
#SBATCH -N 1
#! Number of MPI ranks running per node
#SBATCH --ntasks-per-node=2
#! Number of GPUs per node if required
#SBATCH --gres=gpu:2
#! How much wallclock time will be required (HH:MM:SS)
#SBATCH --time=02:00:00

srun a.out

To submit a script to the queuing system use the command:

sbatch <scriptname>
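
For example, assuming the script above was saved as submit.sh (a placeholder name), submit it and then check the state of your jobs with squeue:

sbatch submit.sh
squeue -u $USER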

Interactive jobs

It is possible to request an interactive job with the command srun. For example:

srun --pty -p skylake -n 2 --time=02:00:00 bash

would reserve two cores in the skylake partition for two hours and run bash there.
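
Similarly, a sketch of an interactive session on a GPU node (assuming a single GPU and a single task are enough for your testing):

srun --pty -p gpu --gres=gpu:1 -n 1 --time=01:00:00 bash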

Notes and comments

  • MPI jobs should be launched the same way as non-MPI ones, i.e. with srun. In the case of Intel MPI this provides better integration with Slurm than using mpirun or mpiexec, and jobs linked against OpenMPI might downright refuse to start inside Slurm jobs if one of the latter two commands is used.
  • Intel MPI jobs running on cosmosx occasionally refuse to start, citing problems acquiring "hfi" resources. This is because Intel MPI tries to allocate resources for both intra- and internode communication even if only the former is required (as is the case here, given there is only one node available in the cosmosx partitions), and the highly parallel nature of cosmosx means the latter tend to run out rather quickly. To prevent this from happening, restrict Intel MPI to intranode communication only by setting
export I_MPI_FABRICS=shm

in your job-submission script.
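
Putting the notes above together, a minimal sketch of a submission script for an Intel MPI job on the cosmosx partition might look like the following (a.out and the task count are placeholders; load whichever compiler and MPI modules your code was built with before the srun line):

#!/bin/bash
#SBATCH -p cosmosx
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00

#! Restrict Intel MPI to intranode (shared-memory) communication
export I_MPI_FABRICS=shm

srun a.out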

Memory available to jobs

By default, on all nodes except the KNL ones, the memory available to a job is proportional to the number of allocated cores. The limits are:

  • 10740 MB per core on skylake nodes
  • 10740 MB per core on GPU nodes
  • 10417 MB per core on cosmosx node

It is possible to request more memory using the --mem=size[units] option, where units is one of K, M, G, or T (M is the default).
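
For example, to request 32 GB of memory (an arbitrary figure chosen for illustration) add the following line to your submission script:

#SBATCH --mem=32G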

Available queues

The following queues are available:

  1. cosmosx - a shared memory node, although it is also possible to run MPI jobs on it
  2. skylake - two socket SkyLake nodes
  3. gpu - GPU nodes
  4. knl - KNL nodes
  5. knl-long - queue for longer jobs on KNL nodes
  6. skylake-long - queue for longer jobs on SkyLake nodes
  7. cosmosx-long - queue for longer jobs on the shared memory node

The maximum wall time is 12 hours for normal queues and 72 hours for long queues. Access to long queues is at the discretion of the PI concerned. In order to facilitate higher throughput of jobs and better utilisation of the system, jobs in the long queues can use no more than 25% of resources.
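
For example, a sketch of the header lines for a job targeting the skylake-long queue at the 72-hour limit:

#SBATCH -p skylake-long
#SBATCH --time=72:00:00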