Multi-file datasets
nctoolkit is built to handle multi-file datasets easily and efficiently. Parallel processing of files, ensemble averaging and merging are all easily done.
To create a multi-file dataset, you just need to supply a list of files to open_data. Alternatively, you can use wild cards. The following will create a multi-file dataset with all of the files in the foo folder:
import nctoolkit as nc
ds = nc.open_data("foo/*.nc")
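The list-based alternative can be sketched as follows. This is a minimal sketch, not part of nctoolkit itself: list_netcdf_files and open_folder are hypothetical helper names introduced here for illustration, and the only nctoolkit call used is open_data with a list of paths, as described above.

```python
from pathlib import Path


def list_netcdf_files(folder):
    """Return the .nc files in `folder` as a sorted list of path strings."""
    return sorted(str(path) for path in Path(folder).glob("*.nc"))


def open_folder(folder):
    """Open every .nc file in `folder` as one multi-file dataset."""
    # Imported here so the listing helper can be used without nctoolkit installed.
    import nctoolkit as nc

    files = list_netcdf_files(folder)
    if not files:
        raise FileNotFoundError(f"no .nc files found in {folder!r}")
    # Passing a list of paths is the explicit alternative to nc.open_data("foo/*.nc").
    return nc.open_data(files)
```

Building the list yourself is useful when you want to filter or reorder the files before opening them; for the simple case of every file in a folder, the wild card form is shorter.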
Standard nctoolkit methods can then be applied to each file within the ensemble. For example, if we wanted a temporal mean of each file, we would do the following:
ds.tmean()
Note, to avoid any confusion: this operation will only apply to individual members of the multi-file dataset. We will later discuss ensemble methods such as ensemble_mean, which let you calculate statistics across the ensemble.
Merging multi-file datasets
There are two ways to merge multi-file datasets: time-based and variable-based.
Merging by time is done as follows:
ds.merge("time")
This will join the files end to end along the time dimension. It should be used when files have the same variables and grids, but distinct times.
The second merging method is joining variables. In this case files should have the same time steps or one file should have at most one time step. This is done as follows:
ds.merge("variable")
By default, nctoolkit uses variable-based merging.
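To make the difference between the two merge types concrete, here is a toy model in plain Python. It is only an illustration of the semantics described above, not how nctoolkit performs the merge: each "file" is a hypothetical dictionary with a list of time stamps and a mapping of variable names to values, and the special case of a file with a single time step is left out.

```python
def merge_time(files):
    """Concatenate files that share variables but cover distinct times."""
    variables = set(files[0]["data"])
    merged = {"time": [], "data": {name: [] for name in files[0]["data"]}}
    # Ordering by first time stamp is a choice made for this toy model.
    for f in sorted(files, key=lambda f: f["time"][0]):
        if set(f["data"]) != variables:
            raise ValueError("time-based merging expects identical variables")
        merged["time"].extend(f["time"])
        for name, values in f["data"].items():
            merged["data"][name].extend(values)
    return merged


def merge_variables(files):
    """Combine files that share time steps but hold different variables."""
    time = files[0]["time"]
    merged = {"time": list(time), "data": {}}
    for f in files:
        if f["time"] != time:
            raise ValueError("variable-based merging expects identical time steps")
        merged["data"].update({name: list(v) for name, v in f["data"].items()})
    return merged


# Same variable, different months: merge by time.
jan = {"time": ["2000-01"], "data": {"sst": [10.0]}}
feb = {"time": ["2000-02"], "data": {"sst": [11.0]}}
by_time = merge_time([feb, jan])

# Same months, different variables: merge by variable.
sst = {"time": ["2000-01", "2000-02"], "data": {"sst": [10.0, 11.0]}}
sal = {"time": ["2000-01", "2000-02"], "data": {"salinity": [35.0, 35.1]}}
by_variable = merge_variables([sst, sal])
```

In the first case the result has one variable and a longer time axis; in the second it has the original time axis and more variables.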
Speeding up multi-file processing
If you have access to multiple cores, it is very easy to ensure files within a multi-file dataset are processed in parallel. Just set the number of cores to be used. In the following case, we set it to 6:
nc.options(cores = 6)
This results in files being processed simultaneously with 6 cores.
If you are working on multi-file datasets, it is almost always much faster to set the number of cores to a high number and carry out operations on the files before merging them using merge, and not the other way round.
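The recommended ordering can be sketched using only the calls shown on this page. The function name reduce_then_merge and its default arguments are hypothetical, and the sketch assumes each file in the folder covers a distinct period, so that merging the per-file temporal means by time makes sense. Choose a number of cores that your machine actually provides.

```python
def reduce_then_merge(pattern="foo/*.nc", cores=6):
    """Take the temporal mean of each file in parallel, then merge by time."""
    # Imported lazily so that defining this function does not require nctoolkit.
    import nctoolkit as nc

    nc.options(cores=cores)      # request parallel processing of the files
    ds = nc.open_data(pattern)   # multi-file dataset
    ds.tmean()                   # per-file operation, applied to every member
    ds.merge("time")             # merge last, once each file has been reduced
    return ds
```

Doing the merge first would leave a single large file, and the later operations could no longer be spread across files.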
Ensemble statistics
In some cases, you will want to calculate averages and other statistics across the multi-file dataset. For example, each file in a dataset could be from a different climate model and you might simply want the mean value across them. This is very easily done. We can just calculate the ensemble mean as follows:
ds.ensemble_mean()
This will calculate the mean for each time step. For example, if you have an ensemble which is made of monthly projections of temperature from 20 different climate models, ensemble_mean will calculate the monthly mean of those 20 models.
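The difference between a per-file temporal mean and an ensemble mean is easiest to see on a small example. The following is a plain Python illustration with made-up numbers, not nctoolkit code: each inner list stands for one ensemble member, and the function names temporal_means and ensemble_means are hypothetical.

```python
from statistics import fmean

# Toy ensemble: three members (think: three files), each with four time steps.
ensemble = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
]


def temporal_means(members):
    """One value per member: average over that member's own time steps."""
    return [fmean(series) for series in members]


def ensemble_means(members):
    """One value per time step: average across members at that step."""
    return [fmean(step_values) for step_values in zip(*members)]


print(temporal_means(ensemble))  # [2.5, 3.5, 4.5]
print(ensemble_means(ensemble))  # [2.0, 3.0, 4.0, 5.0]
```

The first result has one entry per member, which is the kind of output a per-file method such as tmean gives. The second has one entry per time step, which is what ensemble_mean is described as producing.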
Multiple ensemble methods are available: ensemble_mean, ensemble_percentile, ensemble_stdev, ensemble_var, ensemble_max, ensemble_min, ensemble_range and ensemble_sum.