nctoolkit: Fast and easy analysis of netCDF data in Python

nctoolkit is a comprehensive Python package for analyzing netCDF data on Linux and macOS.

Core abilities include:

  • Cropping to geographic regions

  • Interactive plotting of data

  • Subsetting to specific time periods

  • Calculating time averages

  • Calculating spatial averages

  • Calculating rolling averages

  • Calculating climatologies

  • Creating new variables using arithmetic operations

  • Calculating anomalies

  • Horizontally and vertically remapping data

  • Calculating the correlations between variables

  • Calculating vertical averages for the likes of oceanic data

  • Calculating ensemble averages

  • Calculating phenological metrics

Installation

How to install nctoolkit

nctoolkit is available from the Python Packaging Index. To install nctoolkit using pip:

$ pip install numpy
$ pip install nctoolkit

If you already have numpy installed, ignore the first line. This is only included as it will make installing some dependencies smoother. nctoolkit partly relies on cartopy for plotting. This has some additional dependencies, so you may need to follow their guide here to ensure cartopy is installed fully. If you install nctoolkit using conda, you will not need to worry about that.

If you install nctoolkit from pypi, you will need to install the system dependencies listed below.

nctoolkit can also be installed using conda, as follows:

$ conda install -c conda-forge nctoolkit

Note that recent releases are not available for Python 3.8 on macOS on conda. This issue is being investigated at the minute, and will hopefully be resolved shortly. In the meantime, if you are using macOS and Python 3.8, it is best to install using pip.

At present this can be slow due to the time taken to resolve dependency versions. If you run into problems just use pip.

To install the development version from GitHub:

$ pip install git+https://github.com/r4ecology/nctoolkit.git

Python dependencies

System dependencies

There are two main system dependencies: Climate Data Operators, and NCO. The easiest way to install them is using conda:

$ conda install -c conda-forge cdo

$ conda install -c conda-forge nco

CDO is necessary for the package to work. NCO is an optional dependency and does not have to be installed.

If you want to install CDO from source, you can use one of the bash scripts available here.

Introduction to nctoolkit

nctoolkit is a multi-purpose tool for analyzing and post-processing netCDF files. We will see what it is capable of by carrying out an exploratory analysis of sea surface temperature since 1850.

We will use the temperature data set COBE2 from the National Oceanic and Atmospheric Administration. This is a global dataset of sea surface temperature at a horizontal resolution of 1 degree for every month since 1850.

It is best to import nctoolkit as follows:

[45]:
import nctoolkit as nc

We will set the file paths as follows:

[46]:
ff = "sst.mon.mean.nc"

nctoolkit works with datasets. These contain either a single file or a list of files that we will work with. We can create one using the temperature file, as follows:

[47]:
ds = nc.open_data(ff)

We can easily access the attributes of the dataset. For example, if we wanted to find out the number of years in the dataset, we can do this:

[48]:
[min(ds.years), max(ds.years)]
[48]:
[1850, 2019]

All years from 1850 to 2019 are available. We could find out the variables available like so:

[49]:
ds.variables
[49]:
['sst']

Now, if we want to do anything with a dataset, we need to use nctoolkit’s many methods. Let’s say we want to map mean temperature for the year 2000. We could do that as follows:

[50]:
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.plot()

This was carried out in 3 clear steps. First, we selected the year 2000. Second, we calculated the temporal mean for that year. And we then plotted the result.

We might want to do something more interesting. We have a dataset of sea surface temperature. How much has the ocean warmed over this time? We can calculate that as follows:

[51]:
ds = nc.open_data(ff)
ds.tmean("year")
ds.spatial_mean()
ds.plot("sst")

Here we did the calculation in two steps. First, we used tmean to calculate the annual mean for each grid cell since 1850. We used the year keyword to tell nctoolkit that the mean should be calculated for each year. We then used spatial_mean to calculate the spatial mean.

Now, we might want to map how much the oceans have warmed over the last century. We could do this as follows:

[52]:
ds_start = nc.open_data(ff)
ds_start.select(years = range(1900, 1920))
ds_start.tmean()

ds_increase = nc.open_data(ff)
ds_increase.select(years = range(2000, 2020))
ds_increase.tmean()
ds_increase.subtract(ds_start)

First, we created a dataset giving the mean temperature between 1900 and 1919. We then created a second dataset, which initially holds the mean temperature between 2000 and 2019, and subtracted the 1900-19 temperature from it. We can now plot the results:

[53]:
ds_increase.plot()

You can see that most of the world’s oceans have warmed, but some have warmed more than others.

We might want to know how much oceans have warmed or cooled relative to the rest of the planet. We can do this using the assign method:

ds_increase.assign(sst = lambda x: x.sst - spatial_mean(x.sst))
ds_increase.plot()

Areas in the red warmed more than the global average.

Under the hood

Let’s revisit the first code example to see how nctoolkit works behind the scenes:

[54]:
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()

The plotting part has been removed. Each dataset is made up of files. We can see what they are as follows:

[55]:
ds.current
[55]:
['sst.mon.mean.nc']

You can see that this is just the file we started with. What’s going on? The answer: nctoolkit works lazily. All calculations are carried out only when the user says to, or when they have to be. To force calculations to be carried out, we use run. The plot method will, of course, force everything to be evaluated before plotting.

[56]:
ds.run()

We can now see that the file in the dataset has changed:

[57]:
ds.current
[57]:
['/tmp/nctoolkitutksgqhmnctoolkittmplg5lkgk4.nc']

This is now a new temporary file. Under the hood, nctoolkit uses Climate Data Operators (CDO). CDO is a powerful and ultra-efficient system for working with netCDF files. nctoolkit requires no knowledge of CDO, but if you want to understand it further you can read their excellent user guide.

We can see the CDO commands by accessing the history attribute:

[58]:
ds.history
[58]:
['cdo -L -timmean -selyear,2000 sst.mon.mean.nc /tmp/nctoolkitutksgqhmnctoolkittmplg5lkgk4.nc']

You can see that 2 nctoolkit methods have been converted into one CDO call.

And don’t worry, nctoolkit will automatically remove all of the temporary files once they are no longer needed.

Click on the tabs on the left to find out what nctoolkit is capable of.

News

Release of v0.3.5

Version 0.3.5 was released in May 2021.

This is a minor release focusing on some under-the-hood improvements in performance and a couple of new methods.

It drops support for CDO version 1.9.3, as supporting it was becoming too time-consuming given the increasingly low reward.

A couple of new methods have been added. distribute enables files to be split up spatially into equally sized m by n rectangles. collect is the reverse of distribute. It will collect distributed data into one file.

In prior releases assign calls could not be split over multiple lines. This is now fixed.

There was a bug in previous releases where regrid did not work with multi-file datasets. This was due to the enabling of parallel processing with nctoolkit. The issue is now fixed.

The deprecated methods mutate and transmute have now been removed. Variable creation should now use assign.

Release of v0.3.4

Version 0.3.4 was released in April 2021.

This was a minor release focusing on performance improvements, the removal of deprecated methods, and the introduction of one new method.

A new method fill_na has been introduced that allows missing values to be filled with the distance-weighted average.

The methods remove_variables and cell_areas have been removed and are replaced permanently by drop and cell_area.

Release of v0.3.2

Version 0.3.2 was released in March 2021. This was a quick release to fix a bug that caused to_nc not to save output in the base directory.

Release of v0.3.1

Version 0.3.1 was released in March 2021. This is a minor release that includes new methods, under-the-hood improvements and the removal of deprecated methods.

New methods have been introduced for identifying the time steps when specific numerical thresholds are first exceeded or fallen below, and so on: first_above, first_below, last_above and last_below. The thresholds are either single numbers or can come from a gridded dataset for grid-cell-specific thresholds.

Methods to compare a dataset with another dataset or netCDF file have been added: gt and lt, which stand for ‘greater than’ and ‘less than’.

Users are now able to recycle the weights calculated when interpolating data. This can enable much faster interpolation of multiple files with the same grid.

The temporal methods replaced by tmean etc. have now been removed from the package. So monthly_mean etc. can no longer be used.

Release of v0.3.0

Version 0.3.0 was released in February 2021. This is a major release introducing significant improvements to the package.

A new method assign is now available for generating new variables. This replaces mutate and transmute, which were placeholder methods in the early releases of nctoolkit until a proper method for creating variables was put in place. assign operates in the same way as the assign method in pandas. Users can generate new variables using lambda functions.

A major change in this release is that evaluation is now lazy by default. The previous default of non-lazy evaluation was designed to make life slightly easier for new users of the package, but it was probably overly annoying for users to have to set evaluation to lazy each time they used the package.

This release features a subtle shift in how datasets work, so that they have consistent list-like properties. Previously, the files in a dataset given by the current attribute could be either a str or a list, depending on whether there was one file or more in the dataset. This now always gives a list. As a result, datasets in nctoolkit have list-like properties, with append and remove methods available for adding and removing files. remove is a new method in this release. As before, datasets are iterable.

This release also allows users to run nctoolkit in parallel. Previous releases allowed files in multi-file datasets to be processed in parallel. However, it was not possible to create processing chains and process files in parallel. This is now possible in this version thanks to under-the-hood changes in nctoolkit’s code base.

Users are now able to add a configuration file, which means global settings do not need to be set in every session or in every script.

Datasets

nctoolkit works with what it calls datasets. Each dataset is made up of one or more netCDF files.

Opening datasets

There are 3 ways to create a dataset: open_data, open_url or open_thredds.

If the data you want to analyze is available on your computer, use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards. So if you wanted to open all of the files in a folder called data as a dataset, you could do the following:

ds = nc.open_data("data/*.nc")

If you want to use data that can be downloaded from a url, just use open_url. This will download the netCDF files to a temporary folder, where they can then be analyzed.

If you want to analyze data that is available from a thredds server or OPeNDAP, then use open_thredds. The file paths should end with .nc.
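For example, the following opens a COBE sea surface temperature climatology from NOAA’s PSL thredds server (the same file used in later examples):

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")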

Visualization of datasets

You can visualize the contents of a dataset using the plot method. Below, we will plot a monthly sea surface temperature climatology:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.plot()

Please note that bugs in nctoolkit’s dependencies may cause problems when plotting some data types. If data does not plot, raise an issue here.

Modifying datasets and lazy evaluation

nctoolkit works by performing operations and then saving the results as either a temporary file or in a file specified by the user. We can illustrate this with the following code. This will select the first time step from a file available over thredds and will plot the results.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()

You will notice, once this is done, that the file associated with the dataset is now a temporary file.

ds.current

This will happen each time nctoolkit carries out an operation. This is potentially an invitation to slow-running code. You do not want to be constantly reading and writing data. Ideally, you want a processing chain which minimizes IO. nctoolkit enables this by allowing method chaining, thanks to the method chaining of its computational back-end CDO.

Let’s look at this chain of code:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()

What is potentially wrong with this? It carries out four operations, so we absolutely do not want to create a temporary file at each step. Instead of evaluating the operations line by line, nctoolkit only evaluates them either when you tell it to or when it has to. In the code example above we have told nctoolkit what to do to that dataset, but have not told it to actually do any of it.

The quickest way to evaluate everything is to use run. The code above would become:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.run()

Evaluation is, to use the technical term, lazy within nctoolkit. It only evaluates things when it needs to or is forced to.

This allows us to create efficient processing chains where we read the input file and write to the output file with no intermediate file writing. If, in the example above, we wanted to save the output file, we could do this:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.to_nc("foo.nc")

List-like behaviour of datasets

If you want to view the files within a dataset, view the current attribute.

This is a list giving the file(s) within the dataset. To make working with these files easier, nctoolkit datasets feature a number of list-like methods.

First, datasets are iterable. So, you can loop through each element of a dataset as follows:
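For example, a minimal sketch, assuming a multi-file dataset called ds, prints the path of each file:

for ff in ds:
    print(ff)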

You can find out how many files are in a dataset, using len:
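For example:

len(ds)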

You can add a new file to a dataset using append:
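A minimal sketch, using a hypothetical file path:

ds.append("extra_file.nc")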

This method also lets you add the files from another dataset.

Similarly, you can remove files from a dataset using remove:
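For example, to remove the hypothetical file added above:

ds.remove("extra_file.nc")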

In line with typical list behaviours, you can also create empty datasets as follows:
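A minimal sketch; calling open_data with no arguments should give an empty dataset that files can later be appended to:

ds = nc.open_data()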

This is particularly useful if you need to create an ensemble based on multiple files that need significant processing before being added to the dataset.

Dataset attributes

We can find out key information about a dataset using its attributes.

If we want to know the variables available in a dataset called ds, we would do:

ds.variables

If we want to know the vertical levels available in the dataset, we use the following.

ds.levels

If we want to know the files in a dataset, we would do this. nctoolkit works by generating temporary files, so if you have carried out any operations, this will show a list of temporary files.

ds.current

If we want to find out what times are in the dataset we do this:

ds.times

If we want to find out what months are in the dataset:

ds.months

If we want to find out what years are in the dataset:

ds.years

We can also access the history of operations carried out on the dataset. This will show the operations carried out by nctoolkit’s computational back-end CDO:

ds.history

Cheat sheet

A cheat sheet providing a quick 2-page overview of nctoolkit is available here.

Plotting

nctoolkit provides automatic plotting of netCDF data in a similar way to the command line tool ncview.

If you have a dataset, simply use the plot method to create an interactive plot that matches the data type.

We can illustrate this using a sea surface temperature dataset available here.

Let’s start by calculating mean sea surface temperature for the year 2000 and plotting it:

[1]:
import nctoolkit as nc
ff =  "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.plot()
nctoolkit is using CDO version 1.9.9

We might be interested in the zonal mean. nctoolkit can automatically plot this easily:

[2]:
ff =  "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.zonal_mean()
ds.plot()

nctoolkit can also easily handle heat maps. So, we can easily plot the change in zonal mean over time:

[3]:
ff =  "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.zonal_mean()
ds.annual_anomaly(baseline = [1850, 1869], window = 20)
ds.plot()

In a similar vein, it can automatically handle time series. Below we plot a time series of global mean sea surface temperature since 1850:

[4]:
ff =  "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.spatial_mean()
ds.plot()

Internal: ncplot

Plotting is carried out using the ncplot package. If you come across any errors, please raise an issue here.

This package aims to be easy to use. Colour scales for heat maps default to a diverging blue-to-red palette when there are both positive and negative values, and a viridis palette otherwise.

Importing and exporting data

nctoolkit can work with data available on local file systems, urls and over thredds and OPeNDAP.

Opening single files and ensembles

If you want to import a single netCDF file as a dataset, do the following:

import nctoolkit as nc
ds = nc.open_data(infile)

The open_data function can also import multiple files. This can be done in two ways. If we have a list of files we can do the following:

import nctoolkit as nc
ds = nc.open_data(file_list)

Alternatively, open_data is capable of handling wildcards. So if we have a folder called data, we can import all files in it as follows:

import nctoolkit as nc
ds = nc.open_data("data/*.nc")

Opening files from urls/ftp

If we want to work with a file that is available at a url or ftp, we can use the open_url function. This will start by downloading the file to a temporary folder, so that it can be analysed.

import nctoolkit as nc
ds = nc.open_url("www.foo.nc")

Opening data available over thredds servers or OPeNDAP

If you want to work with data that is available over a thredds server or OPeNDAP, you can use the open_thredds method. This will require that the url ends with “.nc”.

import nctoolkit as nc
ds = nc.open_thredds("www.foo.nc")

Exporting datasets

nctoolkit has a number of built in methods for exporting data to netCDF, pandas dataframes and xarray datasets.

Save as a netCDF

The method to_nc lets users export a dataset to a netCDF file. If you want this to be a zipped netCDF file, use the zip method before to_nc. An example of usage is as follows:

ds = nc.open_data(infile)
ds.tmean()
ds.zip()
ds.to_nc(outfile)

Convert to xarray Dataset

The method to_xarray lets users export a dataset to an xarray dataset. An example of usage is as follows:

ds = nc.open_data(infile)
ds.tmean()
xr_ds = ds.to_xarray()

Convert to pandas dataframe

The method to_dataframe lets users export a dataset to a pandas dataframe.

ds = nc.open_data(infile)
ds.tmean()
df = ds.to_dataframe()

Temporal statistics

nctoolkit has a number of built-in methods for calculating temporal statistics, all of which are prefixed with t: tmean, tmin, tmax, trange, tpercentile, tmedian, tvariance, tstdev and tcumsum.

These methods allow you to quickly calculate temporal statistics over specified time periods using the over argument.

By default the methods calculate the value over all time steps available. For example the following will calculate the temporal mean:

import nctoolkit as nc
ds = nc.open_data("sst.mon.mean.nc")
ds.tmean()

However, you may want to calculate, for example, an annual average. To do this we use over. This is a list which tells the function which time periods to average over. For example, the following will calculate an annual average:

ds.tmean(["year"])

If you are only averaging over one time period, as above, you can simply use a character string:

ds.tmean("year")

The possible options for over are “day”, “month”, “year”, and “season”. In this case “day” stands for day of year, not day of month.

In the example below we are calculating the maximum value in each month of each year in the dataset.

ds.tmax(["month", "year"])

Calculating rolling averages

nctoolkit has a range of methods to calculate rolling averages: rolling_mean, rolling_min, rolling_max, rolling_range and rolling_sum. These methods let you calculate rolling statistics over a specified time window. For example, if you had daily data and you wanted to calculate a rolling weekly mean value, you could do the following:

ds.rolling_mean(7)

If you wanted to calculate a rolling weekly sum, this would do it:

ds.rolling_sum(7)

Calculating anomalies

nctoolkit has two methods for calculating anomalies: annual_anomaly and monthly_anomaly. Both methods require a baseline period to calculate the anomaly against, specified as the minimum and maximum years of the climatological period to compare with.

So, if you wanted to calculate the annual anomaly compared with a baseline period of 1950-1969, you would do this:

ds.annual_anomaly(baseline = [1950, 1969])

By default, the annual anomaly is calculated as the absolute difference between the annual mean in a year and the mean across the baseline period. However, in some cases this is not suitable. Instead you might want the relative change. In that case, you would do the following:

ds.annual_anomaly(baseline = [1950, 1969], metric = "relative")

You can also smooth out the anomalies, so that they are calculated on a rolling basis. The following will calculate the anomaly using a rolling window of 10 years.

ds.annual_anomaly(baseline = [1950, 1969], window = 10)

Monthly anomalies are calculated in the same way:

ds.monthly_anomaly(baseline = [1950, 1969])

Here the anomaly is the difference between the value in each month compared with the mean in that month during the baseline period.

Calculating climatologies

The over argument also means we can easily calculate climatologies. For example, the following will calculate a seasonal climatology:

ds.tmean("season")

These methods allow partial matches for the arguments, which means you do not need to remember the precise argument each time. For example, the following will also calculate a seasonal climatology:

ds.tmean("Seas")

Calculating a climatological monthly mean would require the following:

ds.tmean("month")

and daily would be the following:

ds.tmean("day")

Cumulative sums

We can calculate the cumulative sum as follows:

ds.tcumsum()

Please note that this can only calculate over all time periods, and does not accept an over argument.

Subsetting data

nctoolkit has many built in methods for subsetting data. The main method is select. This lets you select specific variables, years, months, seasons and timesteps.

Selecting variables

If you want to select specific variables, you would do the following:

ds.select(variables = ["var1", "var2"])

If you only want to select one variable, you can do this:

ds.select(variables = "var1")

Selecting years

If you want to select specific years, you can do the following:

ds.select(years = [2000, 2001])

Again, if you want a single year the following will work:

ds.select(years = 2000)

The select method allows partial matches for its arguments. So if we want to select the year 2000, the following will work:

ds.select(year = 2000)

In this case we can also select a range. So the following will work:

ds.select(years = range(2000, 2010))

Selecting months

You can select months in the same way as years. The following examples will all do the same thing:

ds.select(months = [1,2,3,4])
ds.select(months = range(1,5))
ds.select(mon = [1,2,3,4])

Selecting seasons

You can easily select seasons. For example if you wanted to select winter, you would do the following:

ds.select(season = "DJF")

Selecting timesteps

You can select specific timesteps from a dataset in a similar manner. For example, if you wanted to select the first two timesteps in a dataset, either of the following will work:

ds.select(time = [0,1])
ds.select(time = range(0,2))

Geographic subsetting

If you want to select a geographic subregion of a dataset, you can use crop. This method will select all data within a specific longitude/latitude box. You just need to supply the minimum and maximum longitudes and latitudes required. In the example below, a dataset is cropped to longitudes between -80 and 90 and latitudes between 50 and 80:

ds.crop(lon = [-80, 90], lat = [50, 80])

Manipulating variables

Creating new variables

Variable creation in nctoolkit can be done using the assign method, which works in a similar way to the method available in pandas.

The assign method works using lambda functions. Let’s say we have a dataset with a variable ‘var’ and we simply want to add 10 to it and call the new variable ‘new’. We would do the following:

ds.assign(new = lambda x: x.var + 10)

If you are unfamiliar with lambda functions, note that the x after lambda signifies that x represents the dataset in whatever comes after ‘:’, which is the actual equation to evaluate. The x.var term is var from the dataset.

By default assign keeps the original variables in the dataset. However, we may only want the new variable or variables. In that case you can use the drop argument:

ds.assign(new = lambda x: x.var + 10, drop = True)

This results in only one variable.

Note that the assign method uses kwargs for the lambda functions, so drop can be positioned anywhere. The following will both do the same thing:

ds.assign(new = lambda x: x.var + 10, drop = True)
ds.assign(drop = True, new = lambda x: x.var + 10)

At present, assign requires that it is written on a single line. So avoid doing something like the following:

ds.assign(new = lambda x: x.var + 10,
drop = True)

The assign method will evaluate the lambda functions sent to it for each dataset grid cell at each time step, so every part of the lambda function must evaluate to a number. The following will work:

k = 273.15
ds.assign(drop = True, sst_k = lambda x: x.sst + k)

However, if you set k to a string or anything other than a number it will throw an error. For example, this will throw an error:

k = "273.15"
ds.assign(drop = True, sst_k = lambda x: x.sst + k)

Applying mathematical functions to dataset variables

As part of your lambda function you can use a number of standard mathematical functions. These all have the same names as those in numpy: abs, floor, ceil, sqrt, exp, log10, sin, cos, tan, arcsin, arccos and arctan.

For example if you wanted to calculate the ceiling of a variable you could do the following:

ds.assign(new = lambda x: ceil(x.old))

An example of using logs would be the following:

ds.assign(new = lambda x: log10(x.old+1))

Using spatial statistics

The assign method carries out its calculations in each time step, and you can access spatial statistics for each time step when generating new variables. A series of functions are available that have the same names as nctoolkit methods for spatial statistics: spatial_mean, spatial_max, spatial_min, spatial_sum, vertical_mean, vertical_max, vertical_min, vertical_sum, zonal_mean, zonal_max, zonal_min and zonal_sum.

An example of the usefulness of these functions would be if you were working with global temperature data and you wanted to map regions that are warmer than average. You could do this by working out the difference between temperature in one location and the global mean:

ds.assign(temp_comp = lambda x: x.temperature - spatial_mean(x.temperature), drop = True)

You can also do comparisons. In the above case, we instead might simply want to identify regions that are hotter than the global average. In that case we can simply do this:

ds.assign(temp_comp = lambda x: x.temperature > spatial_mean(x.temperature), drop = True)

Let’s say we wanted to map regions which are 3 degrees hotter than average. We could do that as follows:

ds.assign(temp_comp = lambda x: x.temperature > spatial_mean(x.temperature + 3), drop = True)

or like this:

ds.assign(temp_comp = lambda x: x.temperature > (spatial_mean(x.temperature)+3), drop = True)

Logical operators work in the standard Python way. So if we had a dataset with a variable called ‘var’ and we wanted to find cells with values between 1 and 10, we could do this:

ds.assign(one2ten = lambda x: x.var > 1 & x.var < 10)

You can process multiple variables at once using assign. Variables will be created in the order given, and variables created by the first lambda function can be used by the next one, and so on. The simple example below shows how this works. First we create a var1, which is temperature plus 1. Then var2, which is var1 plus 1. Finally, we calculate the difference between var1 and var2, and this should be 1 everywhere:

ds.assign(var1 = lambda x: x.var + 1, var2 = lambda x: x.var1 + 1, diff = lambda x: x.var2 - x.var1)

Functions that work with nctoolkit variables

The following functions can be used on nctoolkit variables as part of lambda functions.

Function        Description                                 Example
abs             Absolute value                              abs(x.sst)
ceiling         Ceiling of variable                         ceiling(x.sst - 1)
cell_area       Area of grid cell (m2)                      cell_area(x.var)
cos             Trigonometric cosine of variable            cos(x.var)
day             Day of the month of the variable            day(x.var)
exp             Exponential of variable                     exp(x.sst)
floor           Floor of variable                           floor(x.sst + 8.2)
hour            Hour of the day of the variable             hour(x.var)
isnan           Is variable a missing value/NA?             isnan(x.var)
latitude        Latitude of the grid cell                   latitude(x.var)
level           Vertical level of variable                  level(x.var)
log             Natural log of variable                     log(x.sst + 1)
log10           Base-10 log of variable                     log10(x.sst + 1)
longitude       Longitude of the grid cell                  longitude(x.var)
month           Month of the variable                       month(x.var)
sin             Trigonometric sine of variable              sin(x.var)
spatial_max     Spatial max of variable at time step        spatial_max(x.var)
spatial_mean    Spatial mean of variable at time step       spatial_mean(x.var)
spatial_min     Spatial min of variable at time step        spatial_min(x.var)
spatial_sum     Spatial sum of variable at time step        spatial_sum(x.var)
sqrt            Square root of variable                     sqrt(x.sst + 273.15)
tan             Trigonometric tangent of variable           tan(x.var)
timestep        Time step of variable (Python indexing)     timestep(x.var)
year            Year of the variable                        year(x.var)
zonal_max       Zonal max of variable at time step          zonal_max(x.var)
zonal_mean      Zonal mean of variable at time step         zonal_mean(x.var)
zonal_min       Zonal min of variable at time step          zonal_min(x.var)
zonal_sum       Zonal sum of variable at time step          zonal_sum(x.var)

Simple mathematical operations on variables

If you want to do simple operations like adding or subtracting numbers from the variables in datasets you can use the add, subtract, divide and multiply methods. For example if you wanted to add 10 to every variable in a dataset, you would do the following:

ds.add(10)

If you wanted to multiply everything by 10, you would do this:

ds.multiply(10)

These methods will also let you use other datasets or netCDF files. So, you could add the values in a dataset called ds2 to a dataset called ds1 as follows:

ds1.add(ds2)

Please note that this requires the datasets to be structured in a way that makes the operation sensible: each dimension in the datasets must either be identical, or one dataset must have only a single value for that dimension. For example, if ds2 above has data covering only one timestep, but ds1 has multiple timesteps, the data from that single time step will be added to all timesteps in ds1. If the time steps match, then the data from the first time step in ds2 will be added to the data in the first time step in ds1, and so on for the following time steps.
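As a rough sketch of this behaviour, the following subtracts a single-time-step mean (created with tmean) from every time step of the monthly sea surface temperature file used elsewhere in this documentation:

ds1 = nc.open_data("sst.mon.mean.nc")
ds2 = nc.open_data("sst.mon.mean.nc")
ds2.tmean()
ds1.subtract(ds2)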

Simple numerical comparisons

If you want to do something as simple as working out whether the values of the variables in a dataset are greater than zero, you can use the compare method. This method accepts a simple comparison formula, which follows Python conventions. For example, if you wanted to figure out if the values in a dataset were greater than zero, you would do the following:

ds.compare(">0")

If you wanted to know if they were equal to zero you would do this:

ds.compare("==0")

Interpolation

nctoolkit features built in methods for horizontal and vertical interpolation.

Interpolating to a set of coordinates

If you want to regrid a dataset to a specified set of coordinates, you can use regrid with a pandas dataframe. The first column of the dataframe should be the longitudes and the second should be the latitudes. The example below regrids a sea-surface temperature dataset to a single location with longitude -30 and latitude 50.

import nctoolkit as nc
import pandas as pd
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = range(0, 12))
coords = pd.DataFrame({"lon":[-30], "lat":[50]})
ds.regrid(coords)

Interpolating to a regular latlon grid

If you want to interpolate to a regular latlon grid, you can use to_latlon. lon and lat specify the minimum and maximum longitudes and latitudes, while res, a 2 variable list specifies the resolution. For example, if we wanted to regrid the globe to 0.5 degree north-south by 1 degree east-west resolution, we could do the following:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.to_latlon(lon = [-79.5, 79.5], lat = [0.75, 89.75], res = [1, 0.5])

Interpolating to another dataset’s grid

If we are working with two datasets and want to put them on a common grid, we can interpolate one onto the other’s grid. We can illustrate this with a dataset of global sea surface temperature. Let’s start by regridding the first timestep in this dataset to a regular latlon grid covering the North Atlantic.

ds1 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds1.select(timestep = 0)
ds1.to_latlon(lon = [-79.5, 79.5], lat = [-0.75, 89.75], res = [1, 0.5])

We can then use this new dataset as the target grid in regrid:

ds2 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds2.select(timestep = 0)
ds2.regrid(ds1)

This method will also work using netCDF files. So, if you want, you can also use a path to a netCDF file as the target grid.
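For example, assuming a hypothetical file called target_grid.nc holding the desired grid:

ds2.regrid("target_grid.nc")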

How to reuse the weights for regridding

Under the hood nctoolkit regrids data by first generating a weights file. There are situations where you will want to be able to re-use these weights, for example if you are post-processing a large number of files one after the other. To make this easier, nctoolkit lets you recycle the regridding info. This lets you interpolate using either regrid or to_latlon, but keep the regridding data for future use by regrid.

The example below illustrates this. First, we regrid a global dataset to a regular latlon grid covering the North Atlantic, setting the recycle argument to True.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.to_latlon(lon = [-79.5, 79.5], lat = [-0.75, 89.75], res = [1, 0.5], recycle = True)

We can then use the grid from ds for regridding:

ds1 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds1.select(timestep = 0)
ds1.regrid(ds)

This, of course, requires that the grids in the datasets are consistent. If you want to access the weights and grid files generated, you can do the following:

These files are deleted either when the dataset is deleted or when the Python session is exited.

Resampling

If you want to make data spatially coarser, just use the resample_grid method. This will, for example, let you select every 2nd grid cell in a north-south and east-west direction. This is illustrated in the example below, where a dataset with a spatial resolution of 1 by 1 degrees is coarsened, so that only every 10th cell is selected in the north-south and east-west directions. In other words, it is now a 10-degree by 10-degree dataset.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.resample_grid(10)

Vertical interpolation

We can carry out vertical interpolation using the vertical_interp method. This is particularly useful for oceanic data. This is illustrated below by interpolating ocean temperatures from NOAA’s World Ocean Atlas for January to a depth of 500 metres. The vertical_interp method requires a levels argument, which is sea-depth in this case.

ds = nc.open_thredds("https://data.nodc.noaa.gov/thredds/dodsC/ncei/woa/temperature/A5B7/1.00/woa18_A5B7_t01_01.nc")
ds.select(variables="t_an")
ds.vertical_interp(levels= [500])

Ensemble methods

Merging files with different variables

This notebook will outline some general methods for doing comparisons of multiple files. We will work with two different sea surface temperature data sets from NOAA and the Met Office Hadley Centre.

[1]:
import nctoolkit as nc
import pandas as pd
import xarray as xr
import numpy as np
nctoolkit is using CDO version 1.9.8

Let’s start by downloading the files using wget. Uncomment the code below to do this (note: you will need to extract the HadISST dataset):

[2]:
# ! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
# ! wget https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz

The first step is to get the data. We will start by creating two separate datasets for each file.

[3]:
sst_noaa = nc.open_data("sst.mon.mean.nc")
sst_hadley = nc.open_data("HadISST_sst.nc")

We can see that both datasets have sea surface temperature labelled as sst, so we will need to change that.

[4]:
sst_noaa.variables
[4]:
['sst']
[5]:
sst_hadley.variables
[5]:
['sst', 'time_bnds']
[6]:
sst_noaa.rename({"sst":"noaa"})
sst_hadley.rename({"sst":"hadley"})

The data sets also cover different time periods, and only overlap between 1870 and 2018, so we will need to select those years.

[7]:
sst_noaa.select(years = range(1870, 2019))
sst_hadley.select(years = range(1870, 2019))

We also have a problem in that there are two horizontal grids in the Hadley Centre file. We can solve this by selecting the sst variable only.

[8]:
sst_hadley.select(variables = "hadley")

At this point, the datasets have the same number of time steps and months covered. However, the grids are still a bit different, so we want to unify them by regridding one dataset onto the other’s grid. This can be done using regrid, with either dataset’s grid, or any grid of your choosing, as the target.

[9]:
sst_noaa.regrid(grid = sst_hadley)

We now have two separate datasets. Let’s create a new dataset that has both of them, and then merge them. When doing this we need to make sure NAs are treated properly. In this case the Hadley Centre missing values are not being treated as NAs, so we need to fix that. The merge method also requires a strict matching criterion for the dates in the merging files. In this case the Hadley Centre and NOAA data sets both give monthly means, but use a different day of the month, so we will set match to [“year”, “month”] to ensure there are no mismatches.

[10]:
all_sst = nc.merge(sst_noaa, sst_hadley, match = ["year", "month"])
all_sst.set_missing([-9000, -900])

Let’s work out what the global mean SST was over the time period. Note that this will not be totally accurate as there are some missing values here and there that might bias things.

[11]:
all_sst.spatial_mean()
all_sst.tmean("year")
all_sst.rolling_mean(10)
[12]:
all_sst.plot("noaa")

We can also work out the difference between the two. Here we will work out the monthly bias per cell, then calculate the mean global difference per year, and then calculate a rolling 10-year mean.

[13]:
all_sst = nc.open_data([sst_noaa.current, sst_hadley.current])
all_sst.merge(match = ["year", "month"])
all_sst.assign(bias = lambda x: x.hadley - x.noaa, drop = True)
all_sst.set_missing([-9000, -900])
all_sst.spatial_mean()
all_sst.tmean("year")
all_sst.rolling_mean(10)
all_sst.plot("bias")

You can see that there is a notable difference at the start of the time series.

Merging files with different times

TBC

Ensemble averaging

TBC

Parallel processing

nctoolkit is written to enable rapid processing and analysis of netCDF files, and this includes the ability to process in parallel. Two methods of parallel processing are available. First is the ability to carry out operations on multi-file datasets in parallel. Second is the ability to define a processing chain in nctoolkit, and then use the multiprocessing package to process files in parallel using that chain.

Parallel processing of multi-file datasets

If you have a multi-file dataset, processing the files within it in parallel is easy. All you need to do is the following:

nc.options(cores = 6)

This will tell nctoolkit to process the files in multi-file datasets in parallel and to use 6 cores when doing so. You can, of course, set the number of cores as high as you want; nctoolkit will simply limit it to the number of cores available on your machine.
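As a minimal sketch, assuming a folder called data containing several netCDF files, the files in the multi-file dataset below would then be processed in parallel:

nc.options(cores = 6)
ds = nc.open_data("data/*.nc")
ds.tmean("year")
ds.run()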

Parallel processing using multiprocessing

A common task is taking a bunch of files in a folder, doing things to them, and then saving a modified version of each file in a new folder. We want to be able to parallelize that, and we can using the multiprocessing package in the usual way.

But first, we need to change the global settings:

import nctoolkit as nc
nc.options(parallel = True)

This tells nctoolkit that we are about to do something in parallel. This is critical because of the internal workings of nctoolkit. Behind the scenes nctoolkit is constantly creating and deleting temporary files. It manages this process by creating a safe-list, i.e. a list of files in use that should not be deleted. But if you are running in parallel, you are adding to this list in parallel, and this can cause problems. Telling nctoolkit it will be run in parallel tells it to switch to using a type of list that can be safely added to in parallel.

We can use multiprocessing to do the following: take all of the files in folder foo, do a bunch of things to them, then save the results in a new folder:

We start with a function giving a processing chain. There are obviously different ways of doing this, but I like to use a function that takes the input file and output file:

def process_chain(infile, outfile):
    ds = nc.open_data(infile)
    ds.assign(tos = lambda x: x.sst + 273.15)
    ds.tmean()
    ds.to_nc(outfile)

We now want to loop through all of the files in a folder, apply the function to them and then save the results in a new folder called new:

ensemble = nc.create_ensemble("../../data/ensemble")
import multiprocessing
pool = multiprocessing.Pool(3)
for ff in ensemble:
    pool.apply_async(process_chain, [ff, ff.replace("ensemble", "new")])
pool.close()
pool.join()

The number 3 in this case signifies that 3 cores are to be used.

Please note that if you are working interactively or in a Jupyter notebook, it is best to reset parallel as follows once you have stopped any parallel processing:

nc.options(parallel = False)

This is because of the effects of manually terminating commands on multiprocessing lists, which nctoolkit uses when in parallel mode.

Examples

This tutorial runs through a number of example work flows.

Global sea surface temperature since 1850

This example analyzes a global sea surface temperature dataset, covering the years since 1850. The data is available from the National Oceanic and Atmospheric Administration (NOAA) here.

We are looking at global sea surface temperature since 1850, so an obvious question is how much the oceans have warmed over this time period. We can use nctoolkit’s spatial_mean method to calculate this:

Once the file is downloaded, we should set it to ff:

[1]:
import nctoolkit as nc
ff = "sst.mon.mean.nc"
nctoolkit is using CDO version 1.9.9
[2]:
ds = nc.open_data(ff)
ds.spatial_mean()
ds.plot()

We can see a clear temperature rise of about 1 degree Celsius. But this is monthly data, so it is a bit noisy. We can smooth it out by taking an annual mean. In this case we send “year” to tmean to tell it to calculate the mean for each year:

[3]:
ds = nc.open_data(ff)
ds.tmean(["year"])
ds.spatial_mean()
ds.plot("sst")

That is getting better. But again, we possibly want a rolling average. We can use the rolling_mean method to calculate the mean over every 20-year period:

[4]:
ds = nc.open_data(ff)
ds.tmean(["year"])
ds.spatial_mean()
ds.rolling_mean(20)
ds.plot("sst")

We’ll finish things off by tweaking this so that we can work out how much temperature has increased since the first 20 years in the time period. For this we can use the annual_anomaly method.

[5]:
ds = nc.open_data(ff)
ds.annual_anomaly(baseline = [1850, 1869], window = 20)
ds.spatial_mean()
ds.plot("sst")

More to come….

Random Data Hacks

nctoolkit features a number of useful methods to tweak data.

Shifting time

Sometimes the times in datasets are not quite what we want, and we need some way to adjust them. An example of this is when you are missing a year of data and want to copy data from the prior year and use it. But first you would need to shift the times in that year forward by a year. You can do this with the shift method. This lets you shift time forward by a specified number of hours, days, months or years. You just need to supply hours, days, months or years as an argument. So, if you wanted to shift time backward by one year, you would do the following:

ds.shift(years = -1)

If you wanted to shift time forward by 12 hours, this would do it:

ds.shift(hours = 12)

Note: this method allows partial matches to the arguments, so you could use hour, day, month or year just as easily.

Adding cell areas to a dataset

You can add grid cell areas to a dataset as follows:

ds.cell_area()

By default, this will add the cell area (in square metres) to the dataset. If you want the dataset to only include cell areas you need to set the join argument to False:

ds.cell_area(join = False)

Of course, this method will only work if it is possible to calculate the areas of the grid cells.

Changing the format of the netCDF files in a dataset

Sometimes you will want to change the format of the files in a dataset. You can do this using the format method. This lets you set the format, with the following options:

  • netCDF = “nc1”

  • netCDF version 2 (64-bit offset) = “nc2”/”nc”

  • netCDF4 (HDF5) = “nc4”

  • netCDF4 classic = “nc4c”

  • netCDF version 5 (64-bit data) = “nc5”

So, if you want to set the format to netCDF4, you would do the following:

ds.format("nc4")

Getting rid of dimensions with only one value

Sometimes you will have a dataset that has a dimension with only one value, and you might want to get rid of that dimension. For example, you might only have one time step, and keeping that dimension may have no value. Getting rid of it can be done using the reduce_dims method. It works as follows:

ds.reduce_dims()

Global settings

nctoolkit lets you set global settings using options.

The most important setting, and the one we recommend changing, is evaluation, which can be set to lazy. This can be done as follows:

nc.options(lazy = True)

This means that commands will only be evaluated when you either request them to be or when they need to be.

For example, in the code below the three specified commands will only be calculated once run is called. This cuts down on IO, and can result in significant improvements in run time. At present lazy defaults to False, but this may change in a future release of nctoolkit.

nc.options(lazy = True)
data.tmean()
data.crop(lat = [0,90])
data.spatial_mean()
data.run()

If you are working with ensembles, you may want to change the number of cores used for processing multiple files. For example, you can process multiple files in parallel using 6 cores as follows. By default cores = 1. Most methods can run in parallel when working with multi-file datasets.

nc.options(cores = 6)

By default nctoolkit uses the OS’s temporary directories when it needs to create temporary files. In most cases this is optimal, as reading and writing to temporary folders is usually faster. However, in some cases this may not be a good idea, because you may not have enough space in the temporary folder. In this case you can change the directory used for saving temporary files as follows:

nc.options(temp_dir = "/foo")

Setting global settings using a configuration file

You may want to set some global settings either permanently or on a project level. You can do this by setting up a configuration file. This should be a plain text file called .nctoolkitrc or nctoolkitrc, placed in one of two locations: your working directory or your home directory. When nctoolkit is imported, it will look first in your working directory and then in your home directory for such a file, and it will use the first one it finds to change the global settings from the defaults.

The structure of this file is straightforward. For example, if you wanted to set evaluation to lazy and the number of cores used for processing multi-file datasets, you would put the following in your configuration file:

lazy : True

cores : 6

The file roughly follows Python dictionary syntax, with the setting and value separated by :. Note that unless a setting is specified in the file, the defaults will be used. If you do not provide a configuration file, nctoolkit will use the default settings.

API Reference

Session options

options(**kwargs)

Define session options.

Opening/copying data

open_data([x, checks])

Read netCDF data as a DataSet object

open_url([x, ftp_details, wait, file_stop])

Read netCDF data from a url as a DataSet object

open_thredds([x, wait, checks])

Read thredds data as a DataSet object

DataSet.copy(self)

Make a deep copy of a DataSet object
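
A minimal sketch of opening and copying a dataset, using an illustrative file name:

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")   # read a netCDF file into a DataSet
ds_backup = ds.copy()                  # deep copy; later operations on ds do not modify ds_backup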

Merging or analyzing multiple datasets

merge(*datasets[, match])

Merge datasets

cor_time([x, y])

Calculate the temporal correlation coefficient between two datasets This will calculate the temporal correlation coefficient, for each grid cell, between two datasets.

cor_space([x, y])

Calculate the spatial correlation coefficient between two datasets This will calculate the spatial correlation coefficient, for each time step, between two datasets.
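
For example, the correlation through time between two datasets on the same grid might be calculated as follows; the file names are illustrative.

import nctoolkit as nc

ds1 = nc.open_data("model_sst.nc")       # illustrative file
ds2 = nc.open_data("observed_sst.nc")    # illustrative file
ds_cor = nc.cor_time(x = ds1, y = ds2)   # correlation through time for each grid cell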

Adding file(s) to a dataset

append

Accessing attributes

DataSet.variables

List variables contained in a dataset

DataSet.years

List years contained in a dataset

DataSet.months

List months contained in a dataset

DataSet.times

List times contained in a dataset

DataSet.levels

List levels contained in a dataset

DataSet.size

The size of an object This will print the number of files, total size, and smallest and largest files in a DataSet object.

DataSet.current

The current file or files in the DataSet object

DataSet.history

The history of operations on the DataSet

DataSet.start

The starting file or files of the DataSet object

Plotting

DataSet.plot(self[, vars])

Variable modification

DataSet.assign(self[, drop])

Create new variables Existing columns that are re-assigned will be overwritten. :param drop: Set to True if you want existing variables to be removed once the new ones have been created. Defaults to False.

DataSet.rename(self, newnames)

Rename variables in a dataset

DataSet.set_missing(self[, value])

Set the missing value for a single number or a range

DataSet.sum_all(self[, drop])

Calculate the sum of all variables for each time step
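
As a small sketch, a variable could be renamed as follows; this assumes rename accepts a dictionary mapping old names to new ones.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")   # illustrative file name
ds.rename({"sst": "temperature"})      # assumed: {old_name: new_name}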

netCDF file attribute modification

DataSet.set_longnames(self[, name_dict])

Set the long names of variables

DataSet.set_units(self[, unit_dict])

Set the units for variables
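
A short sketch of setting attributes, assuming both methods accept dictionaries keyed by variable name:

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")                    # illustrative file name
ds.set_units({"sst": "degC"})                           # assumed: {variable: unit}
ds.set_longnames({"sst": "Sea surface temperature"})    # assumed: {variable: long name}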

Vertical/level methods

DataSet.surface(self)

Extract the top/surface level from a dataset This extracts the first vertical level from each file in a dataset.

DataSet.bottom(self)

Extract the bottom level from a dataset This extracts the bottom level from each netCDF file.

DataSet.vertical_interp(self[, levels])

Vertically interpolate a dataset based on given vertical levels This is calculated for each time step and grid cell

DataSet.vertical_mean(self)

Calculate the depth-averaged mean for each variable This is calculated for each time step and grid cell

DataSet.vertical_min(self)

Calculate the vertical minimum of variable values This is calculated for each time step and grid cell

DataSet.vertical_max(self)

Calculate the vertical maximum of variable values This is calculated for each time step and grid cell

DataSet.vertical_range(self)

Calculate the vertical range of variable values This is calculated for each time step and grid cell

DataSet.vertical_sum(self)

Calculate the vertical sum of variable values This is calculated for each time step and grid cell

DataSet.vertical_cumsum(self)

Calculate the vertical cumulative sum of variable values This is calculated for each time step and grid cell

DataSet.invert_levels(self)

Invert the levels of 3D variables This is calculated for each time step and grid cell

DataSet.bottom_mask(self)

Create a mask identifying the deepest cell without missing values.
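
For example, a depth-averaged field could be produced from a hypothetical file with a vertical dimension:

import nctoolkit as nc

ds = nc.open_data("ocean_temperature.nc")   # hypothetical file with vertical levels
ds.vertical_mean()                          # depth-averaged mean for each time step and grid cell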

Rolling methods

DataSet.rolling_mean(self[, window])

Calculate a rolling mean based on a window

DataSet.rolling_min(self[, window])

Calculate a rolling minimum based on a window

DataSet.rolling_max(self[, window])

Calculate a rolling maximum based on a window

DataSet.rolling_sum(self[, window])

Calculate a rolling sum based on a window

DataSet.rolling_range(self[, window])

Calculate a rolling range based on a window
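
Following the earlier sea surface temperature example, a rolling mean of annual means could be produced like this:

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.tmean(["year"])       # annual means
ds.rolling_mean(20)      # rolling mean over a 20 time step window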

Evaluation setting

DataSet.run(self)

Run all stored commands in a dataset

Cleaning functions


Ensemble creation

create_ensemble([path, recursive])

Generate an ensemble

Arithmetic methods

DataSet.assign(self[, drop])

Create new variables Existing columns that are re-assigned will be overwritten. :param drop: Set to True if you want existing variables to be removed once the new ones have been created. Defaults to False.

DataSet.add(self[, x, var])

Add to a dataset This will add a constant, another dataset or a netCDF file to the dataset. :param x: An int, float, single file dataset or netCDF file to add to the dataset. If a dataset or netCDF file is supplied, this must have only one variable, unless var is provided. The grids must be the same. :type x: int, float, DataSet or netCDF file :param var: A variable in the x to use for the operation :type var: str.

DataSet.subtract(self[, x, var])

Subtract from a dataset This will subtract a constant, another dataset or a netCDF file from the dataset. :param x: An int, float, single file dataset or netCDF file to subtract from the dataset. If a dataset or netCDF is supplied this must only have one variable, unless var is provided. The grids must be the same. :type x: int, float, DataSet or netCDF file :param var: A variable in the x to use for the operation :type var: str.

DataSet.multiply(self[, x, var])

Multiply a dataset This will multiply a dataset by a constant, another dataset or a netCDF file. :param x: An int, float, single file dataset or netCDF file to multiply the dataset by. If multiplying by a dataset or single file there must only be a single variable in it, unless var is supplied. The grids must be the same. :type x: int, float, DataSet or netCDF file :param var: A variable in the x to multiply the dataset by :type var: str.

DataSet.divide(self[, x, var])

Divide the data This will divide the dataset by a constant, another dataset or a netCDF file. :param x: An int, float, single file dataset or netCDF file to divide the dataset by. If a dataset or netCDF file is supplied, this must have only one variable, unless var is provided. The grids must be the same. :type x: int, float, DataSet or netCDF file :param var: A variable in the x to use for the operation :type var: str.
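
As a simple illustration, sea surface temperature in degrees Celsius could be converted to Kelvin either by modifying the existing variable or by creating a new one with assign; the lambda syntax shown for assign is an assumption.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.add(273.15)                                  # add a constant to every variable (degC to K)

ds2 = nc.open_data("sst.mon.mean.nc")
ds2.assign(sst_k = lambda x: x.sst + 273.15)    # assumed lambda syntax for a new variable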

Ensemble statistics

DataSet.ensemble_mean(self[, nco, ignore_time])

Calculate an ensemble mean

DataSet.ensemble_min(self[, nco, ignore_time])

Calculate an ensemble min

DataSet.ensemble_max(self[, nco, ignore_time])

Calculate an ensemble maximum

DataSet.ensemble_percentile(self[, p])

Calculate an ensemble percentile This will calculate the percentiles for each time step in the files.

DataSet.ensemble_range(self)

Calculate an ensemble range The range is calculated for each time step; for example, if each file in the ensemble has 12 months of data the statistic will be calculated for each month.

DataSet.ensemble_sum(self)

Calculate an ensemble sum The sum is calculated for each time step; for example, if each file in the ensemble has 12 months of data the statistic will be calculated for each month.
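
For example, an ensemble mean across a directory of files might be sketched as follows; the directory path is illustrative.

import nctoolkit as nc

files = nc.create_ensemble("/path/to/ensemble")   # illustrative directory of netCDF files
ds = nc.open_data(files)
ds.ensemble_mean()                                # mean across the files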

Subsetting operations

DataSet.crop(self[, lon, lat, nco, nco_vars])

Crop to a rectangular longitude and latitude box

DataSet.select(self, **kwargs)

A method for subsetting datasets to specific variables, years, longitudes etc.

DataSet.drop(self[, vars])

Remove variables This will remove stated variables from files in the dataset.
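
For example, the sea surface temperature dataset could be narrowed to the North Atlantic in summer as follows; the select keywords are assumptions based on the method description.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.select(years = range(2000, 2010), months = [6, 7, 8])   # assumed keyword arguments
ds.crop(lon = [-80, 0], lat = [30, 70])                    # rectangular lon/lat box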

Time-based methods

DataSet.set_date(self[, year, month, day, …])

Set the date in a dataset You should only do this if you have to fix/change a dataset with a single date, not multiple dates.

DataSet.shift(self, **kwargs)

Shift the times in a dataset by a specified number of hours, days, months or years.

Interpolation and resampling methods

DataSet.regrid(self[, grid, method, recycle])

Regrid a dataset to a target grid

DataSet.to_latlon(self[, lon, lat, res, …])

Regrid a dataset to a regular latlon grid

DataSet.resample_grid(self[, factor])

Resample the horizontal grid of a dataset

DataSet.time_interp(self[, start, end, …])

Temporally interpolate variables based on date range and time resolution

DataSet.timestep_interp(self[, steps])

Temporally interpolate a dataset to given number of time steps between existing time steps
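
For example, a dataset could be regridded to a regular half-degree grid as follows; the exact form of the res argument is an assumption.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.to_latlon(lon = [-80, 20], lat = [30, 70], res = [0.5, 0.5])   # assumed res format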

Masking methods

DataSet.mask_box(self[, lon, lat])

Mask a lon/lat box

Statistical methods

DataSet.tmean(self[, over])

Calculate the temporal mean of all variables

DataSet.tmin(self[, over])

Calculate the temporal minimum of all variables

DataSet.tmedian(self[, over])

Calculate the temporal median of all variables :param over: Time periods to average over.

DataSet.tpercentile(self[, p, over])

Calculate the temporal percentile of all variables

DataSet.tmax(self[, over])

Calculate the temporal maximum of all variables

DataSet.tsum(self[, over])

Calculate the temporal sum of all variables

DataSet.trange(self[, over])

Calculate the temporal range of all variables

DataSet.tvariance(self[, over])

Calculate the temporal variance of all variables

DataSet.tstdev(self[, over])

Calculate the temporal standard deviation of all variables

DataSet.tcumsum(self)

Calculate the temporal cumulative sum of all variables

DataSet.cor_space(self[, var1, var2])

Calculate the correlation coefficient between two variables in space This is calculated for each time step.

DataSet.cor_time(self[, var1, var2])

Calculate the correlation coefficient in time between two variables The correlation is calculated for each grid cell, ignoring missing values.

DataSet.spatial_mean(self)

Calculate the area weighted spatial mean for all variables This is performed for each time step.

DataSet.spatial_min(self)

Calculate the spatial minimum for all variables This is performed for each time step.

DataSet.spatial_max(self)

Calculate the spatial maximum for all variables This is performed for each time step.

DataSet.spatial_percentile(self[, p])

Calculate the spatial percentile for all variables This is performed for each time step.

DataSet.spatial_range(self)

Calculate the spatial range for all variables This is performed for each time step.

DataSet.spatial_sum(self[, by_area])

Calculate the spatial sum for all variables This is performed for each time step.

DataSet.centre(self[, by, by_area])

Calculate the latitudinal or longitudinal centre for each year/month combination in files. This applies to each file in an ensemble. by : str Set to ‘latitude’ if you want the latitudinal centre calculated. ‘longitude’ for longitudinal. by_area : bool If the variable is a value/m2 type variable, set to True, otherwise set to False.

DataSet.zonal_mean(self)

Calculate the zonal mean for each year/month combination in files.

DataSet.zonal_min(self)

Calculate the zonal minimum for each year/month combination in files.

DataSet.zonal_max(self)

Calculate the zonal maximum for each year/month combination in files.

DataSet.zonal_range(self)

Calculate the zonal range for each year/month combination in files.

DataSet.meridonial_mean(self)

Calculate the meridonial mean for each year/month combination in files.

DataSet.meridonial_min(self)

Calculate the meridonial minimum for each year/month combination in files.

DataSet.meridonial_max(self)

Calculate the meridonial maximum for each year/month combination in files.

DataSet.meridonial_range(self)

Calculate the meridonial range for each year/month combination in files.
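
For example, a zonal mean of the long-term mean field could be calculated like this:

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.tmean()          # temporal mean of all variables
ds.zonal_mean()     # average across longitudes for each latitude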

Merging methods

DataSet.merge(self[, match])

Merge a multi-file ensemble into a single file Merging will occur based on the time steps in the first file.

DataSet.merge_time(self)

Time-based merging of a multi-file ensemble into a single file This method is ideal if you have the same data split over multiple files covering different time periods.
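
For example, files covering different periods of the same data could be joined along the time axis; the file names are illustrative.

import nctoolkit as nc

ds = nc.open_data(["sst_1950_1999.nc", "sst_2000_2019.nc"])   # illustrative file names
ds.merge_time()                                               # combine into a single file in time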

Splitting methods

DataSet.split(self[, by])

Split the dataset Each file in the ensemble will be separated into new files based on the splitting argument.
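
For example, a multi-year dataset might be split into one file per year; "year" is an assumed value of the by argument.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.split(by = "year")   # assumed splitting value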

Output and formatting methods

DataSet.to_nc(self, out[, zip, overwrite])

Save a dataset to a named file This will only work with single file datasets.

DataSet.to_xarray(self[, decode_times, …])

Open a dataset as an xarray object

DataSet.to_dataframe(self[, decode_times, …])

Open a dataset as a pandas data frame

DataSet.zip(self)

Zip the dataset This will compress the files within the dataset.

DataSet.format(self[, ext])

Change the format of the files within the dataset This works lazily. :param ext: New format. Must be one of “nc”, “nc1”, “nc2”, “nc4” and “nc5”. netCDF = nc1 netCDF version 2 (64-bit offset) = nc2/nc netCDF4 (HDF5) = nc4 netCDF4 classic = nc4c netCDF version 5 (64-bit data) = nc5 :type ext: str.
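
For example, a processed dataset could be written to a new netCDF file or handed over to xarray or pandas; the output file name is illustrative.

import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.tmean()
ds.to_nc("sst_mean.nc", zip = True, overwrite = True)    # illustrative output name
sst_xr = ds.to_xarray()                                  # as an xarray object
sst_df = ds.to_dataframe()                               # as a pandas data frame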

Miscellaneous methods

DataSet.cell_area(self[, join])

Calculate the area of grid cells.

DataSet.first_above(self[, x])

Identify the time step when a value is first above a threshold This will do the comparison with either a number, a Dataset or a netCDF file. :param x: An int, float, single file dataset or netCDF file to use for the threshold(s). If comparing with a dataset or single file there must only be a single variable in it. The grids must be the same. :type x: int, float, DataSet or netCDF file.

DataSet.first_below(self[, x])

Identify the time step when a value is first below a threshold This will do the comparison with either a number, a Dataset or a netCDF file. :param x: An int, float, single file dataset or netCDF file to use for the threshold(s). If comparing with a dataset or single file there must only be a single variable in it. The grids must be the same. :type x: int, float, DataSet or netCDF file.

DataSet.last_above(self[, x])

Identify the final time step when a value is above a threshold This will do the comparison with either a number, a Dataset or a netCDF file. :param x: An int, float, single file dataset or netCDF file to use for the threshold(s). If comparing with a dataset or single file there must only be a single variable in it. The grids must be the same. :type x: int, float, DataSet or netCDF file.

DataSet.cdo_command(self[, command])

Apply a cdo command

DataSet.nco_command(self[, command, ensemble])

Apply an nco command

DataSet.compare(self[, expression])

Compare all variables to a constant

DataSet.gt(self, x)

Method to calculate if a variable in the dataset is greater than that in another file or dataset This currently only works with single file datasets

DataSet.lt(self, x)

Method to calculate if a variable in the dataset is less than that in another file or dataset This currently only works with single file datasets

DataSet.reduce_dims(self)

Reduce dimensions of data This will remove any dimensions with only one value.

DataSet.reduce_grid(self[, mask])

Reduce the dataset to non-zero locations in a mask :param mask: single variable dataset or path to .nc file. The mask must have an identical grid to the dataset. :type mask: str or dataset.

Ecological methods

DataSet.phenology(self[, var, metric, p])

Calculate phenologies from a dataset Each file in an ensemble must only cover a single year, and ideally have all days.

Package info

This package was created by Robert Wilson at Plymouth Marine Laboratory (PML).

Acknowledgements

The current codebase of nctoolkit was developed using funding from the NERC Climate Linked Atlantic Sector Science programme (NE/R015953/1) and a combination of UK Research and Innovation (UKRI) and European Research Council (ERC) funded research projects.

Bugs and issues

If you identify bugs or issues with the package please raise an issue at PML’s Marine Systems Modelling group’s GitHub page here or contact nctoolkit’s creator at rwi@pml.ac.uk.

Contributions welcome

The package is new, with new features being added each month. There remain a large number of features that could be added, especially for dealing with atmospheric data. If package users are interested in contributing or suggesting new features, they are welcome to raise an issue at the package’s GitHub page or contact me.