Running long jobs on Maths workstations

Introduction

It is a good thing when the computers are used for scientific work that will benefit the University. It is a bad thing when all of the computers, or all of the newer and more capable ones, are being used as it makes it difficult for other people to use them for scientific work. It is an even worse thing when a computer is overloaded with jobs and not making much progress on any of them because it is spending all its time swapping back and forth between them. A computer in this overloaded state is also likely to be irritatingly slow for the console user (the person sitting at it).

This page is mostly aimed at those who wish to run jobs on the Maths computers, but some sections will be helpful if your computer is running slowly because of somebody else's jobs, particularly those which mention how to measure the load. If you are suffering from jobs running on your desktop computer:

  • the htop command (explained later) will show whose jobs are causing the trouble.
  • contact whoever is running the jobs and ask them, directly and politely, to do something soon. Remember to mention the name of your computer, which is usually on a sticker attached to the case, or can be obtained by typing hostname.
  • do NOT reboot or switch off the computer as other people may be logged in
  • DO contact the HelpDesk if asking does not produce a result within a reasonable length of time

Before selecting a machine to run your jobs on, please consult our automatically generated list of Maths computers and their specs. Remember that you can filter it to just the machines to which you have access and sort it by memory or number of CPUs.

If the Maths desktops are not sufficient for your computing needs, some other HPC resources are listed here:

Checklist

Long-running jobs must not impact upon the console user's interactive use of the computer. If a job is causing a machine to produce a noticeable delay in responding to keystrokes or changing focus, it is inappropriate. A long job is one that requires more than about 10 minutes of CPU time. This includes Matlab, Maple and Mathematica as well as conventional FORTRAN or C programs.

If you are running lots of jobs on the public computers please take care that one or two of each vintage are left free for people to try out jobs/codes on. And if you are asked to relinquish some of the computers for other people to run jobs on please be collaborative, professional and polite about the discussions.

Obviously you need a Maths computer account and must be contactable while your jobs are running. Your (login ID)@maths.cam.ac.uk email address, where the login ID is your CRSID, must work. If you are not contactable then we will have to close your account.

Arbitration of any disputes is done by help@maths.cam.ac.uk and is very rarely necessary.

A checklist of points to remember if you are running jobs:

Measurement - htop, top

htop and top are interactive process viewers which display information about all the active processes on a system, ranked in descending order of CPU usage by default. htop is a newer and more user-friendly program which allows you to click on a column header to sort by that column (e.g. memory instead of CPU) and to renice or kill your own processes without having to know the process ID. In top you can press "<" and ">" to change the sorting column and R (capital R, i.e. Shift+r) to reverse the sorting order.

The USER column in both htop and top shows the login name of the user who owns the process. Their email address will be (login name)@maths.cam.ac.uk - please email them directly if their jobs are causing a problem for you.

Note: Sometimes htop makes it look as if a user is running multiple identical processes when they are not. This is because each process is made up of multiple "threads" or lightweight processes. To stop htop displaying each thread as a separate process:

  • Press F2 (Setup)
  • Go to Display Options
  • Ensure that "Hide userland process threads" is checked
  • Click "Done"

Always check the load on a computer with htop or top before running your own jobs. Specific ways in which a computer can be overloaded (e.g. more jobs than CPUs, swapping due to insufficient memory) will be discussed in detail later.

To exit htop or top press Q for Quit.

Preparing programs for running, checkpointing

You should always take care to ensure that your program has been optimised (if you don't know about optimisation, ask).

You should also make sure that long jobs are restartable. This means that if you expect the run to last more than a day, dump internal tables and data to a file periodically so that in the event of a computer being restarted you will not have to start your job from the beginning. Keep this file in your home directory or scratch space, not in /tmp which may be cleared on reboot. This is commonly known as checkpointing. If you have difficulty with this, you should split your long jobs up into sections which will take no more than about a day to run.

Jobs which do periodic checkpointing should also trap the TERM signal and perform a checkpoint when they receive it. This means that if the jobs are killed during a shutdown or reboot then less of the work will be wasted.
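As a rough illustration of the two points above, here is a minimal sketch of a shell job which checkpoints periodically and also on TERM. The function names, the checkpoint file and the "every 100 steps" interval are all made up for the example; a real job would save whatever internal state it needs, not just a step counter.

```shell
# Hypothetical names throughout: save_checkpoint, load_checkpoint, run_job.

# save_checkpoint FILE STEP: write to a temporary file, then rename it,
# so that an interrupted write never leaves a half-written checkpoint.
save_checkpoint() {
    printf '%s\n' "$2" > "$1.tmp" && mv "$1.tmp" "$1"
}

# load_checkpoint FILE: print the saved step, or 0 if no checkpoint exists.
load_checkpoint() {
    if [ -r "$1" ]; then cat "$1"; else echo 0; fi
}

# run_job FILE TOTAL: resume from the checkpoint in FILE and run up to TOTAL steps.
run_job() {
    ckpt_file=$1
    ckpt_total=$2
    ckpt_step=$(load_checkpoint "$ckpt_file")
    # On TERM (sent at shutdown, or by kill), save progress and stop.
    # 143 is the conventional exit status for "killed by TERM" (128 + 15).
    trap 'save_checkpoint "$ckpt_file" "$ckpt_step"; exit 143' TERM
    while [ "$ckpt_step" -lt "$ckpt_total" ]; do
        : # placeholder: one unit of real work goes here
        ckpt_step=$((ckpt_step + 1))
        # Periodic checkpoint every 100 steps (interval chosen arbitrarily).
        if [ $((ckpt_step % 100)) -eq 0 ]; then
            save_checkpoint "$ckpt_file" "$ckpt_step"
        fi
    done
    save_checkpoint "$ckpt_file" "$ckpt_step"
    trap - TERM
}
```

Calling run_job "$HOME/myjob.checkpoint" 100000 a second time after an interruption picks up from the last saved step. Note that the shell only runs a trap between commands, so each unit of work should be reasonably short. Compiled programs can do the same thing with a signal handler.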

Background jobs (nohup) and logout

Do not leave a program running on the console, except on your own office computer which nobody else sits at. Run jobs in the background and use the nohup command to ensure that they continue running after you log out.

    nohup nice -19 command <infile >outfile & disown $!

  • nohup: do not terminate the program when the user logs out
  • nice -19: run at the lowest priority (a nice value of 19) so the job does not hog the CPU
  • command: replace by the name of your program
  • <infile: receive input from the file "infile" rather than the keyboard
  • >outfile: send output to the file "outfile" rather than the terminal. For some programs the names of the input and output files are hardwired in, so these options can be omitted
  • &: run in the background, returning to the shell prompt while the process is still running
  • disown $!: remove the job just started from the shell's job list, so that the shell does not send it a hangup signal at logout

taskset (using fewer cores)

Most modern computers have more than one processor core (they may be referred to as dual-core or quad-core), and with hyper-threading on Intel processors one physical core acts as two logical cores.

Many jobs (e.g. matrix multiplication in Matlab) default to using all the logical cores they can find, but the taskset command restricts a job to a specified set of cores. For example, this confines a job to three cores (numbers 0, 1 and 2):

taskset -c 0-2 myjob

You can find out the number of logical cores of the machine by doing:

grep -c ^processor /proc/cpuinfo

Load average (CPU usage)

Load average is basically the average number of processes which are ready to run and competing for a share of the computer's processors, i.e. not halted or waiting for input. On Linux the figure also counts processes in uninterruptible sleep, which usually means waiting for disk I/O. Therefore, if there are three non-interactive jobs running on a computer, you would expect the load average to be about 3. Other user and system processes also increase the load average temporarily, so the actual reading may be somewhat more than 3.

The w command displays a computer's load average over the past 1, 5 and 15 minutes, and these figures are also listed in the summary section of top and htop's displays.

The load average on a computer should be kept at no more than the number of logical cores, less half a core for interactive use. For example a computer with four logical cores should have a load average of 3.5 or less. If the load average is much higher than the number of logical cores then the computer is spending too much time switching back and forth between jobs and not enough working on any one job.
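The rule above can be turned into a quick check before starting a job. This is only a sketch: the function names are invented for the example, it assumes your new job will keep one core busy, and it uses the 1-minute load average, which can be misleading if the load has only just changed.

```shell
# room_for_job LOAD CORES: succeed if one more single-core job would still
# leave the load at or below CORES minus half a core.
room_for_job() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 1 <= cores - 0.5) }'
}

# current_load: the 1-minute load average (first field of /proc/loadavg).
current_load() {
    cut -d' ' -f1 /proc/loadavg
}

# logical_cores: the number of logical cores reported by the kernel.
logical_cores() {
    grep -c '^processor' /proc/cpuinfo
}

# Example use on the machine you are thinking of using:
#   if room_for_job "$(current_load)" "$(logical_cores)"; then
#       echo "there is room for another job"
#   else
#       echo "this machine is busy, please try another"
#   fi
```

On a machine with four logical cores this allows a new job only while the load average is 2.5 or less.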

nice (reducing the priority of your job)

Always run long jobs in the background with a "nice" value of 19:

    nice -19 command <input >output &       # with sh, bash
    nice +19 command <input >output &       # csh, tcsh

The nice command minimises the effect of your job on other logged-on users who need a good interactive response for editing etc. Use the appropriate form of the command for the shell which you use. (The default shell in Maths is bash.) If you start a job and it runs longer than you expect you can alter its nice value with the renice command (see man renice), for example:

    ps -ef | grep program_name   # find the PID of the job/process (second column)
    renice +19 -p PID            # replace PID with the number found above

The NI column in top and htop shows the nice value of each process, which must be 19 for long jobs. You can also renice your own processes from within these commands. In htop you can click on a process and press F8 to increase its niceness. In top, press "r" for renice; you will be prompted for the process ID. Enter it, press Enter/Return, and then enter the new nice value, i.e. 19.
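If you have several jobs running it is easy to miss one. The following sketch lists your own processes which have used more than 10 minutes of CPU time but are not at nice 19. The function names are made up for the example, and it assumes a ps which accepts the standard pid, nice, time and comm output fields.

```shell
# cputime_to_seconds TIME: convert ps CPU time ([DD-]HH:MM:SS) to seconds.
cputime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4   # DD-HH:MM:SS
        else         print $1 * 3600 + $2 * 60 + $3                # HH:MM:SS
    }'
}

# list_long_unniced_jobs: print "PID NICE CPUTIME COMMAND" for each of your
# processes with more than 600 seconds of CPU time and a nice value below 19.
list_long_unniced_jobs() {
    ps -u "$(id -un)" -o pid= -o nice= -o time= -o comm= |
    while read -r pid ni cputime comm; do
        # Skip rows where ps reports no numeric nice value.
        case $ni in ''|-|*[!0-9-]*) continue ;; esac
        secs=$(cputime_to_seconds "$cputime")
        if [ "$secs" -gt 600 ] && [ "$ni" -lt 19 ]; then
            echo "$pid $ni $cputime $comm"
        fi
    done
}
```

Any PID which list_long_unniced_jobs prints is a candidate for renice as described above.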

Memory and swapping

You should always be aware of your program's approximate memory requirements, the total amount of memory available on the computer, and how much memory is being used by other users.

Linux and UNIX operating systems provide "virtual memory" for executing processes using a combination of real memory (RAM) and "swap space" on disk. Our computers typically have 8GB of swap space regardless of the amount of real memory (at the date of writing this is 16GB for new desktops).

The htop command tells you how much memory each process is using. VIRT (called SIZE in some older versions of top) is the total virtual size of the process: all the address space it has allocated, whether currently in real memory, swapped out to disk, or not yet used. RES is the resident size of the process: the amount of real memory it currently occupies, which approximates the process's "working set", the code and data which are being accessed frequently. Both figures are in kilobytes unless followed by a unit suffix such as M or G. Little-used data and code (e.g. initialisation code) may get swapped out to disk when memory is short and will then not be included in the RES figure.

The total amount of virtual memory required by all the processes running in the computer is often greater than the amount of real memory, and the operating system has to move "pages" of memory to and from disk as different processes become active. (A process cannot be executed when essential pages of memory are swapped out to the disk). In normal operations the system can select unused or seldom used pages of memory as candidates for swapping and the computer works at maximum efficiency. When the total amount of resident memory needed by active processes exceeds the amount of available real memory, the computer will "page" to disk excessively and give a much reduced overall performance.

This will often make the computer unusable by other logged-on users, and lead to greatly increased elapsed time to completion of all jobs running on the computer. Users should always check that their jobs do not require more resident working set memory than is available on the computer, and if necessary use a computer with more memory.

The free command displays the total amount of physical memory and swap space for the system, and how much of each is in use. The used swap space should be kept close to zero. Ideally 1GB of memory should be left free for the console user. However, it is quite normal for the "free" figure to be very low, since the system uses otherwise unused memory for buffering files and releases it again when programs need it.
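Because the "free" figure is misleading, it can be more useful to look at the kernel's own estimate of available memory. The sketch below reads it, together with the swap in use, from /proc/meminfo. The function name is invented for the example, and it assumes a kernel recent enough to report MemAvailable (Linux 3.14 or later).

```shell
# mem_summary [FILE]: print available memory and used swap in megabytes.
# FILE defaults to /proc/meminfo, whose values are in kilobytes.
mem_summary() {
    awk '
        $1 == "MemAvailable:" { avail  = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree  = $2 }
        END {
            printf "available_mb=%d swap_used_mb=%d\n",
                   avail / 1024, (stotal - sfree) / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Example use:
#   mem_summary
```

If available_mb is smaller than the resident memory your job needs plus about 1GB for the console user, or swap_used_mb is well above zero, the machine is not a good choice for the job.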

It is generally obvious when a computer is swapping too much:

  • It runs very slowly
  • The console user may see or hear excessive disk activity
  • The use of swap space goes up
  • The load average goes up

The vmstat command can be used to display the computer's paging activity (the si and so columns, for swap-in and swap-out) in Kbytes/s. A computer typically becomes overloaded once paging rates exceed several Mbyte/s.
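To pick the paging figures out of vmstat's rather wide output, something like the following sketch can be used. The function name is invented for the example, and it assumes the usual Linux vmstat layout in which si and so are the 7th and 8th columns.

```shell
# paging_rate_kb: read vmstat output on standard input and print si + so
# (Kbytes/s swapped in plus swapped out) from the last sample.
paging_rate_kb() {
    awk 'NR > 2 { rate = $7 + $8 } END { print rate + 0 }'
}

# Example use: take two samples 5 seconds apart. The first sample is an
# average since boot, so only the last one reflects current activity.
#   vmstat 5 2 | paging_rate_kb
```

A result in the thousands means the computer is paging at several Mbyte/s and is probably overloaded.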

I/O

A job which does very large amounts of I/O may be inappropriate for running on a machine that someone is using as a console: Linux's handling of I/O is still far from ideal. The usual rule applies - if the job is slowing the computer down for the console user, it is inappropriate.

The iostat command will allow you to see the I/O traffic to and from local devices such as scratch space. However, it will not report activity on network-mounted file systems such as your home directory or store space.