
We have a faculty HPC system, called fawcett, for developing and running computationally intensive tasks.

Account

In order to obtain an account on fawcett please email your request to help@maths.cam.ac.uk stating which research group you are in.

Access

Once you have been informed that the account has been activated you can log in to fawcett using ssh. Please note that direct connections to fawcett are possible from computers connected to the Maths main network and from ssh.maths.cam.ac.uk. For ways to configure access from other computers please look at:

Ssh access to fawcett
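
For example, one way to reach fawcett from an outside machine is to jump through ssh.maths.cam.ac.uk with an ~/.ssh/config entry like the following. This is only a sketch: CRSID is a placeholder for your own username, it assumes the short hostname fawcett resolves from the gateway, and the page linked above describes the supported methods in detail.

Host fawcett
    HostName fawcett
    User CRSID
    ProxyJump CRSID@ssh.maths.cam.ac.uk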

Head-node etiquette

Running resource-intensive computations on the head node (i.e. where you find yourself having logged in to fawcett using ssh) is not allowed - it is the single entry point to Fawcett for all of its users, so overusing its resources is very much antisocial. Please use the queuing system (see below) instead; it can run both batch and interactive jobs. The only exception to the above rule is short-lived interactive tasks such as compilation of software - if it is not expected to take more than 5-10 minutes and you do not leave it unattended, it is okay to run it on the head node.

Visual Studio Code users, please note that you are required to configure it appropriately in order not to overload the head node. See the section Software below for details.

Long-running CPU-intensive processes on the head node may be terminated with no advance warning, and repeat offenders may have their Fawcett accounts suspended.

Hardware configuration

  1. 1 shared-memory node with 32x SkyLake 6154 18-core 3GHz processors and 6TB of RAM
  2. 4 nodes, each with 2x SkyLake 6140 18-core 2.3GHz processors and 384GB of RAM
  3. 2 nodes, each with 2x NVidia Pascal P100 GPUs (each with 16 GB of memory), 2x SkyLake 6140 18-core 2.3GHz processors and 384GB of RAM
  4. 23 nodes, each with 1x Intel Xeon Phi 7210 (KNL) 64-core 1.3GHz processor and 96GB of RAM
  5. Intel Omni-Path HPC interconnect
  6. 220TB of dedicated storage (not backed up)

Disk space

Home directories and data disks on fawcett are separate from the ones on other Maths systems. Home directories have relatively small quotas, so they should not be used to keep big data generated by computing jobs. For this purpose every user should have access to at least one of the subdirectories of /nfs/st01/hpc-*.

Note that on Fawcett the command quota does not return correct information regarding storage occupancy. To find out how much space you still have available, use the command df on the relevant directory, e.g.

  • df /home/CRSID to see your home-directory quota
  • df /nfs/st01/hpc-GROUPNAME to see the data-disk quota shared between all members of group hpc-GROUPNAME

Software

Most of the software is provided in the form of environment modules. The list of available modules can be checked with the command:

module av

A module can be loaded with the command:

module load <modulename>

Other useful commands:

module      
   (no arguments)              print usage instructions
   list                        print list of loaded modules
   whatis                      as above with brief descriptions
   unload <modulename>         remove a module
   purge                       remove all modules

Most useful modules:

gcc
openmpi/4.1.2/gcc-7.5.0-qc6lyum
openmpi/4.1.2/intel-2021.5.0-h6p5iq3
python
intel-oneapi-compilers
cuda
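
For example, a typical sketch of preparing a build environment with the modules listed above (hello.c is a placeholder for your own MPI source file):

module purge
module load gcc openmpi/4.1.2/gcc-7.5.0-qc6lyum
mpicc -O2 -o hello hello.c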

New modules can be requested by sending email to help@maths.cam.ac.uk.

Python modules

Please note that while there are some Python modules available on Fawcett, they are there as dependencies of other modules rather than as packages for end-users to import. They may or may not work.

In case of problems, or when in doubt, please use Anaconda packages (see below) for anything Pythonic.

Anaconda setup

Some software is also installed as conda environments. These tend to interact badly with software installed with environment modules. All loaded environment modules can be unloaded with the command:

module purge

The next step is to load the miniforge3 module:

module load miniforge3

Afterwards you may check the list of available conda environments with the command:

conda info -e

Currently available environments are:

  • anaconda - contains the full list of packages that comes with a standard Anaconda installation, plus a few more which were requested by users
  • tensorflow - different versions
  • pytorch - different versions

A particular environment can be activated with a command like the following (note that the standard conda activate ... does not work on fawcett):

source activate anaconda-2021.11

By default conda stores the environments created by users in the directory /nfs/software/Conda/users/<username>/envs. It also aggressively caches downloaded packages in /nfs/software/Conda/users/<username>/pkgs. As both these directories can become quite large, it is recommended to clean the conda caches regularly. The details can be found in the conda documentation:

https://conda.io/projects/conda/en/latest/user-guide/configuration/use-c...

https://conda.io/projects/conda/en/latest/user-guide/configuration/use-c...
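
For example, a minimal sketch of creating a personal environment and then cleaning up the package cache afterwards (myenv and the package list are placeholders; conda create and conda clean are standard conda commands):

# create a personal environment; it will end up under /nfs/software/Conda/users/<username>/envs
conda create -n myenv python numpy
source activate myenv

# remove cached package tarballs and unused packages from /nfs/software/Conda/users/<username>/pkgs
conda clean --all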

Visual Studio Code

In its default configuration, Visual Studio Code spawns multiple copies of the JavaScript server Node.JS on remote hosts. On multi-user systems such as the Fawcett head node this can, and has been observed to, quickly exhaust the available resources and render them virtually unusable. Therefore, users wishing to use Visual Studio Code to work on Fawcett are now required to adjust their configuration as follows:

  1. Hit the Extensions button (on the left toolbar, looks like building blocks)
  2. Locate the extension "TypeScript and JavaScript Language Features"; searching for "@builtin TypeScript" ought to do it
  3. Disable that extension
  4. Reload VS Code

Queuing system

Fawcett uses the Slurm workload manager for managing resources. If you are not familiar with Slurm, or with workload managers / batch-queuing systems in general, you might want to have a look at this FAQ before proceeding.

Some useful commands:

squeue      - show the state of jobs in the queue
sinfo       - show the state of nodes and partitions
sview       - graphical overview of jobs, nodes and partitions
scontrol show job <job_number> - examine the job with the given job number
scontrol show node <nodename> - examine the node with the given name
sbatch      - submits an executable script to the queueing system
srun        - run a command either as a new job or within an existing job
scancel     - delete a job

Submitting jobs

To submit a job one first needs to create a submission script. It is a shell script containing special comment lines with the prefix

#SBATCH

which provide instructions to the queuing system about the required resources. For example:

#!/bin/bash
#! Which partition (queue) should be used
#SBATCH -p gpu
#! Number of required nodes
#SBATCH -N 1
#! Number of MPI ranks running per node
#SBATCH --ntasks-per-node=2
#! Number of GPUs per node if required
#SBATCH --gres=gpu:2
#! How much wallclock time will be required (HH:MM:SS)
#SBATCH --time=02:00:00

srun a.out

To submit a script to the queuing system use the command:

sbatch <scriptname>
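
For example, assuming the script above was saved as submit.sh (a placeholder name), submit it and then check the state of your jobs with squeue:

sbatch submit.sh
squeue -u $USER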

Interactive jobs

It is possible to request an interactive job with the command srun. For example:

srun --pty -p skylake -n 2 --time=02:00:00 bash

would reserve two cores in the skylake partition for two hours and run bash there.
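
Similarly, a sketch of an interactive session on a GPU node (assuming a single GPU and a single task are enough for your testing):

srun --pty -p gpu --gres=gpu:1 -n 1 --time=01:00:00 bash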

Notes and comments

  • MPI jobs should be launched the same way as non-MPI ones, i.e. with srun. In the case of Intel MPI this provides better integration with Slurm than using mpirun or mpiexec, and jobs linked against OpenMPI might downright refuse to start inside Slurm jobs if one of the latter two commands is used.
  • Intel MPI jobs running on cosmosx occasionally refuse to start, citing problems acquiring "hfi" resources. This is because Intel MPI tries to allocate resources for both intra- and internode communication even if only the former is required (as is the case here, given there is only one node available in the cosmosx partitions), and the highly parallel nature of cosmosx means the latter tend to run out rather quickly. To prevent this from happening, restrict Intel MPI to intranode communication only by setting
export I_MPI_FABRICS=shm

in your job-submission script.
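
Putting the notes above together, a minimal sketch of a submission script for an Intel MPI job on the cosmosx partition might look like the following (a.out and the task count are placeholders; load whichever compiler and MPI modules your code was built with before the srun line):

#!/bin/bash
#SBATCH -p cosmosx
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00

#! Restrict Intel MPI to intranode (shared-memory) communication
export I_MPI_FABRICS=shm

srun a.out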

Memory available to jobs

By default, on all nodes except the KNL ones, the memory available to a job is proportional to the number of allocated cores. The limits are:

  • 10740 MB per core on skylake nodes
  • 10740 MB per core on GPU nodes
  • 10417 MB per core on cosmosx node

It is possible to request more memory using the --mem=size[units] option, where units is one of K, M, G, or T (M is the default).
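
For example, to request 32 GB of memory (an arbitrary figure chosen for illustration) add the following line to your submission script:

#SBATCH --mem=32G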

Available queues

The following queues are available:

  1. cosmosx - a shared memory node, although it is also possible to run MPI jobs on it
  2. skylake - two socket SkyLake nodes
  3. gpu - GPU nodes
  4. knl - KNL nodes
  5. knl-long - queue for longer jobs on KNL nodes
  6. skylake-long - queue for longer jobs on SkyLake nodes
  7. cosmosx-long - queue for longer jobs on the shared memory node

The maximum wall time is 12 hours for normal queues and 72 hours for long queues. Access to long queues is at the discretion of the PI concerned. In order to facilitate higher throughput of jobs and better utilisation of the system, jobs in the long queues can use no more than 25% of resources.
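
For example, a sketch of the header lines for a job targeting the skylake-long queue at the 72-hour limit:

#SBATCH -p skylake-long
#SBATCH --time=72:00:00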