Parallel processing

nctoolkit is written to enable rapid processing and analysis of netCDF files, and this includes the ability to process in parallel. Two methods of parallel processing are available. The first is the ability to carry out operations on multi-file datasets in parallel. The second is the ability to define a processing chain in nctoolkit and then use the multiprocessing or multiprocess package to process files in parallel with that chain. The multiprocessing package is not compatible with nctoolkit internals on macOS, so the multiprocess package should be used instead.

Parallel processing of multi-file datasets

If you have a multi-file dataset, processing the files within it in parallel is easy. All you need to do is the following:

nc.options(cores = 6)

This will tell nctoolkit to process the files in multi-file datasets in parallel and to use 6 cores when doing so. You can, of course, set the number of cores as high as you want; nctoolkit will simply cap it at the number of cores available on your machine.
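
For example, if the files making up a multi-file dataset are stored in a folder called ensemble, they could be opened and averaged over time as follows. The folder name and the use of tmean here are purely illustrative:

import nctoolkit as nc

# process the files within multi-file datasets in parallel, using 6 cores
nc.options(cores = 6)

# open every file in the folder as a single multi-file dataset
ensemble = nc.create_ensemble("ensemble")
ds = nc.open_data(ensemble)

# calculate the temporal mean of each file; the files are handled in parallel
ds.tmean()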

Parallel processing using multiprocessing or multiprocess

A common task is taking a set of files in a folder, applying the same processing to each, and then saving a modified version of each file in a new folder. We want to be able to parallelize that, and we can do so using the multiprocessing package in the usual way.

But first, we need to change the global settings:

import nctoolkit as nc
nc.options(parallel = True)

This tells nctoolkit that we are about to do something in parallel. This is critical because of the internal workings of nctoolkit. Behind the scenes nctoolkit is constantly creating and deleting temporary files. It manages this process by creating a safe-list, i.e. a list of files in use that should not be deleted. But if you are running in parallel, you are adding to this list in parallel, and this can cause problems. Setting parallel to True makes nctoolkit switch to a type of list that can be safely added to from multiple processes.

We can illustrate the use of nctoolkit to post-process multiple files in parallel with a simple chain that takes files containing temperature in degrees Celsius, converts the temperature to Kelvin, and saves each result as a separate file.

First, we would define a function that can take the input file, carry out the necessary processing and then save the output file in a new directory.

In this case, the original files are in a directory called ensemble and we will put the outputs in a new one called new.

import os

def process_chain(infile):
    '''
    This function takes a file, converts the temperature to Kelvin and then saves the output in a new directory
    '''

    # define the outfile name
    outfile = infile.replace('ensemble', 'new')
    # create the directory for the outfile if it does not already exist
    # exist_ok avoids errors if parallel workers try to create it at the same time
    os.makedirs(os.path.dirname(outfile), exist_ok = True)
    ds = nc.open_data(infile)
    # convert the temperature (sst) from degrees Celsius to Kelvin and store it as tos
    ds.assign(tos = lambda x: x.sst + 273.15)
    # save the output
    ds.to_nc(outfile)
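
Before running this in parallel, it can be worth checking that the function behaves as expected on a single file, for example:

# quick check: run the function on one file from the ensemble directory
ensemble = nc.create_ensemble("ensemble")
process_chain(ensemble[0])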

We now want to loop through all of the files in the ensemble folder, apply the function to each of them in parallel, and so save the results in the new folder.

import multiprocessing as mp
# on macOS, use:
#import multiprocess as mp

# identify files in the ensemble directory
ensemble = nc.create_ensemble("ensemble")
# create a pool of 3 worker processes
pool = mp.Pool(3)
# apply the function to each file in the ensemble
for ff in ensemble:
    pool.apply_async(process_chain, [ff])
# close the pool to any new tasks
pool.close()
# wait for the workers to finish
pool.join()

The number 3 passed to mp.Pool signifies that 3 worker processes are to be used.
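
Note that apply_async does not raise errors that occur inside process_chain; they only surface if you ask for the results. A simple way to make failures visible is to keep the AsyncResult objects returned by apply_async and call get on each one once the pool has finished, for example:

pool = mp.Pool(3)
# keep the AsyncResult returned for each file
results = [pool.apply_async(process_chain, [ff]) for ff in ensemble]
pool.close()
pool.join()
# get will re-raise any exception that occurred inside a worker
for result in results:
    result.get()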

Please note that if you are working interactively or in a Jupyter notebook, it is best to reset parallel as follows once you have stopped any parallel processing:

nc.options(parallel = False)

This is because manually terminating commands can leave the multiprocessing lists that nctoolkit uses in parallel mode in a bad state. This appears to be due to a bug in multiprocessing, which is hard to avoid.
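
One way to make sure the setting is always reset, even if the processing fails or is interrupted, is to wrap the parallel section in a try/finally block. Reusing the pool set-up from above as a sketch:

nc.options(parallel = True)
try:
    pool = mp.Pool(3)
    for ff in ensemble:
        pool.apply_async(process_chain, [ff])
    pool.close()
    pool.join()
finally:
    # reset the option whether or not the parallel run succeeded
    nc.options(parallel = False)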