James Brind

Getting simulations done (productivity tips)

A large part of my work involves solving the Navier–Stokes equations to simulate the fluid flow in part of a gas turbine. This can take a while: typically, running a new case from scratch takes from 24 hours to several days of compute time. If the cluster is busy, there may also be a wait of a day or so before a computer becomes available. When the simulation is finished, looking at the results and deciding what to do next takes comparatively little time.

This post collects tips I have picked up on how to efficiently get these lengthy simulations done, be more productive, and finish that paper or thesis on time. The advice could apply equally to any computationally intensive work. In short:

  • Validate input data before starting a new case;
  • Automate setting up and post-processing;
  • Check in on your running simulations now and again;
  • Know how the queue system works; and finally
  • Wait usefully if your simulation capacity is maxed out.

Optimising your time

I focus on optimising the use of ‘James time’, not computer time. We will assume that our code is economical with computing resources and there is capacity available to eventually do all the simulations we want, even if we have to queue.

The University of Cambridge GPU cluster has become an order of magnitude faster since I started my PhD, very roughly consistent with Moore’s Law. Increasing availability of raw computing power helps to get more simulations done, but the nature of research means that the quantity and size of the simulations we want to run only increase to match.

At the frontier of computational fluid dynamics, the limiting factor on research progress is the feedback loop between starting a simulation, inspecting the results, and running the next case.

Validate input data

Before burning computational resources or, more importantly, wasting our own time, we should take steps to ensure that a simulation will produce useful results. A well-written code will fail gracefully with an informative error message when given invalid input data, but this is of little consolation when the job has already sat in the computing cluster queue for several hours.

A good habit to get into is to maintain a validation script that you always run on input data before submitting a job to the queue. Every time a simulation fails because of a bug in your setup code, or plain user error, add a test to the script that halts submission if the problem reoccurs.

Also useful are pre-submission warnings, where the validation script raises alerts for parameters that are strictly valid, but outside the usual range. For example, failing to convert millimetres to metres will result in a Reynolds number off by a factor of 1000. While not forbidden by the laws of physics, on grounds of practicality we can rule out gas turbines with a radius of 1 kilometre!
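As a minimal sketch, a pre-submission check combining both kinds of test might look like the following, where the input file format and field names are hypothetical,

#!/bin/bash
# validate.sh: check an input file before every submission (sketch)
INPUT_FILE=$1

# Hard error: halt submission if a required field is missing
grep -q "inlet_stagnation_pressure" "$INPUT_FILE" || {
    echo "ERROR: inlet stagnation pressure not set"; exit 1;
}

# Soft warning: flag values that are valid but outside the usual range
awk -F= '/reynolds_number/ && $2 > 1.0e8 {
    print "WARNING: Reynolds number", $2, "looks too big; check unit conversions"
}' "$INPUT_FILE"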

Automate

Automation increases the number of simulations we can get done for two reasons. First, it reduces the chance of user error in setting up a case. Second, it allows us to run many simulations over a range of parameters in a way that would be too time-consuming to do manually.

The question, of course, is how much time to spend implementing and debugging the automation system. If we can accurately estimate the timescales involved, xkcd have a chart for us. I would argue that, in a research context, we should favour automation because the number of potentially interesting lines of enquiry grows rapidly as we learn from doing more simulations.

On the other hand, it is a mistake to attempt to build a fully general architecture for running simulations at first. Instead, automate gradually and incrementally. Start with a simple script for one case, implement a loop over parameters, then extract into a function. As the project progresses, you will want to change things that did not occur to you at the beginning.
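For instance, a parameter sweep can start life as a simple shell loop. In this sketch the setup script, parameter name, and file layout are hypothetical stand-ins,

# Sweep over flow coefficients, validating each case before submission
for PHI in 0.4 0.5 0.6; do
    ./make_input.sh --flow-coefficient $PHI > input_phi_$PHI.dat
    ./validate.sh input_phi_$PHI.dat && sbatch run_case.sh input_phi_$PHI.dat
done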

Know the queue

The University of Cambridge HPC cluster uses the SLURM workload manager. Other clusters may use different software, but the principle that we should learn the useful features of our queue system applies everywhere.

For debugging purposes, we might want just to see whether a job starts successfully on a new input file, without caring about the results in the first instance. On the Cambridge Wilkes3 system, this use case is given a special extra-high priority that jumps the queue, for jobs of up to one hour only. We can run $JOB_SCRIPT with this queue-jump priority using,

sbatch --qos=intr $JOB_SCRIPT

Dependent jobs are useful to chain a set of simulations that do not fit within the 36-hour job time limit. The dependent job starts after the previous job finishes, with no manual intervention. Combined with automation, dependencies can make running a new case very slick.

On the SLURM queue, we can start $NEXT_SCRIPT once $JOBID has finished successfully using the syntax,

sbatch --dependency=afterok:$JOBID $NEXT_SCRIPT
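To chain a long run in one go, we can capture each job ID as we submit. This sketch uses SLURM’s --parsable flag, which prints the new job ID in machine-readable form,

# Submit a run as three chained segments, each within the time limit
JOBID=$(sbatch --parsable $JOB_SCRIPT)
for SEGMENT in 2 3; do
    JOBID=$(sbatch --parsable --dependency=afterok:$JOBID $JOB_SCRIPT)
done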

Check in

Time-marching computations are susceptible to divergence. The code should look for NaNs while running and automatically abort if they are encountered. Even with a NaN watch, it is worth checking in on running simulations and inspecting their output.

You may spot that the computation is converging to the wrong solution, or that an input parameter is set incorrectly. It is particularly useful to check an hour or so after the job has started, to kill unwanted simulations before they waste too much time.
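A check-in can be as simple as glancing at the tail of the log file and searching it for trouble; the log file naming here is a hypothetical convention,

# Inspect the latest residuals, and kill the job if NaNs have appeared
tail -n 20 log_$JOBID.txt
grep -q -i "nan" log_$JOBID.txt && scancel $JOBID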

Being willing to put in an occasional bit of work on evenings or weekends can accelerate progress dramatically. Suppose you are waiting for the results of some key computations that finish on a Saturday. Either you can wait until Monday morning to look at the results, and idle for the rest of Monday while your next batch runs; or, with just an hour of overtime, you can start the next batch on Saturday, and come Monday you will be a day ahead.

Along similar lines, it can be a good idea to spread out starting simulations. Starting everything on Monday morning means that all the results become available at the same time. If, instead, you put something in the queue daily, you can inspect results on a rolling basis throughout the week. This requires good record keeping: I find it helpful to assign each simulation a unique identifier that corresponds to a row in a spreadsheet or a line in a paper log book.
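One way to keep the identifiers consistent is to issue them straight from the log itself; this sketch assumes a one-line-per-run runs.csv file whose first field is the numeric ID,

# Take the next free run ID from the CSV log and record the new case
ID=$(( $(tail -n 1 runs.csv | cut -d, -f1) + 1 ))
echo "$ID,$(date +%F),$DESCRIPTION" >> runs.csv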

Wait usefully

Inevitably, there will be times when you have many simulations going and it is not productive to set up any more. Organise a stack of ancillary work to make good use of waiting time while your simulations run. This could be: going through a list of interesting papers ‘to read’; improving automation and tooling; working on a less computationally intensive side project; or even blogging.

If you have lots of waiting time, you are probably not thinking hard enough.

“The author finds that running time can double up with thinking time, and so it is not desirable for it to become too short.” Denton (2017).