nctoolkit: Fast and easy analysis of netCDF data in Python¶
nctoolkit is a comprehensive Python package for analyzing netCDF data on Linux and macOS.
Core abilities include:
Cropping to geographic regions
Interactive plotting of data
Subsetting to specific time periods
Calculating time averages
Calculating spatial averages
Calculating rolling averages
Calculating climatologies
Creating new variables using arithmetic operations
Calculating anomalies
Horizontally and vertically remapping data
Calculating the correlations between variables
Calculating vertical averages for the likes of oceanic data
Calculating ensemble averages
Calculating phenological metrics
Installation¶
How to install nctoolkit¶
nctoolkit is available from the Python Packaging Index. To install nctoolkit using pip:
$ pip install numpy
$ pip install nctoolkit
If you already have numpy installed, ignore the first line. This is only included as it will make installing some dependencies smoother. nctoolkit partly relies on cartopy for plotting. This has some additional dependencies, so you may need to follow their guide here to ensure cartopy is installed fully. If you install nctoolkit using conda, you will not need to worry about that.
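If cartopy proves awkward to install via pip, one option (a suggestion, not part of the official guide) is to install it from conda-forge first:
$ conda install -c conda-forge cartopy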
If you install nctoolkit from pypi, you will need to install the system dependencies listed below.
nctoolkit can also be installed using conda, as follows:
$ conda install -c conda-forge nctoolkit
Note that recent releases are not available for Python 3.8 on macOS on conda. This issue is being investigated at the moment, and will hopefully be resolved shortly. In the meantime, if you are using macOS and Python 3.8, it is best to install using pip.
At present this can be slow due to the time taken to resolve dependency versions. If you run into problems just use pip.
To install the development version from GitHub:
$ pip install git+https://github.com/r4ecology/nctoolkit.git
Python dependencies¶
System dependencies¶
There are two main system dependencies: Climate Data Operators (CDO) and NCO. The easiest way to install them is using conda:
$ conda install -c conda-forge cdo
$ conda install -c conda-forge nco
CDO is necessary for the package to work. NCO is an optional dependency and does not have to be installed.
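You can check that CDO is installed and on your path by printing its version:
$ cdo -V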
If you want to install CDO from source, you can use one of the bash scripts available here.
Introduction to nctoolkit¶
nctoolkit is a multi-purpose tool for analyzing and post-processing netCDF files. We will see what it is capable of by carrying out an exploratory analysis of sea surface temperature since 1850.
We will use the temperature data set COBE2 from the National Oceanic and Atmospheric Administration. This is a global dataset of sea surface temperature at a horizontal resolution of 1 degree for every month since 1850.
It is best to import nctoolkit as follows:
[45]:
import nctoolkit as nc
We will set the file paths as follows:
[46]:
ff = "sst.mon.mean.nc"
nctoolkit works with datasets. These contain either a single file or a list of files. We can create one using the temperature file, as follows:
[47]:
ds = nc.open_data(ff)
We can easily access the attributes of the dataset. For example, if we wanted to find out the number of years in the dataset, we can do this:
[48]:
[min(ds.years), max(ds.years)]
[48]:
[1850, 2019]
All years from 1850 to 2019 are available. We could find out the variables available like so:
[49]:
ds.variables
[49]:
['sst']
Now, if we want to do anything with a dataset, we need to use nctoolkit’s many methods. Let’s say we want to map mean temperature for the year 2000. We could do that as follows:
[50]:
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.plot()
This was carried out in 3 clear steps. First, we selected the year 2000. Second, we calculated the temporal mean for that year. And we then plotted the result.
We might want to do something more interesting. We have a dataset of sea surface temperature. How much has the ocean warmed over this time? We can calculate that as follows:
[51]:
ds = nc.open_data(ff)
ds.tmean("year")
ds.spatial_mean()
ds.plot("sst")
Here we did the calculation in two steps. First we used tmean to calculate the annual mean since 1850 for each grid cell. We use the year keyword to tell nctoolkit that the mean should be calculated for each year. We then use spatial_mean to calculate the spatial mean.
Now, we might want to map how much the oceans have warmed over the last century. We could do this as follows:
[52]:
ds_start = nc.open_data(ff)
ds_start.select(years = range(1900, 1920))
ds_start.tmean()
ds_increase = nc.open_data(ff)
ds_increase.select(years = range(2000, 2020))
ds_increase.tmean()
ds_increase.subtract(ds_start)
First, we created a dataset which gives the mean temperature between 1900 and 1919. We then created a second dataset, which initially holds the mean temperature between 2000 and 2019. We then subtracted the 1900-19 temperature from this dataset. We can now plot the results:
[53]:
ds_increase.plot()
You can see that most of the world’s oceans have warmed, but some have warmed more than others.
We might want to know how much oceans have warmed or cooled relative to the rest of the planet. We can do this using the assign method:
ds_increase.assign(sst = lambda x: x.sst - spatial_mean(x.sst))
ds_increase.plot()
Areas in red warmed more than the global average.
Under the hood¶
Let’s revisit the first code example to see how nctoolkit works behind the scenes:
[54]:
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
The plotting part has been removed. Each dataset is made up of files. We can see what they are as follows:
[55]:
ds.current
[55]:
['sst.mon.mean.nc']
You can see that this is just the file we started with. What’s going on? The answer: nctoolkit works lazily. All calculations are carried out only when the user requests it, or when they have to be. To force calculations to be carried out, we use run. The plot method will, of course, force everything to be evaluated before plotting.
[56]:
ds.run()
We can now see that the file in the dataset has changed:
[57]:
ds.current
[57]:
['/tmp/nctoolkitutksgqhmnctoolkittmplg5lkgk4.nc']
This is now a new temporary file. Under the hood, nctoolkit uses Climate Data Operators (CDO). CDO is a powerful and ultra-efficient system for working with netCDF files. nctoolkit requires no knowledge of CDO, but if you want to understand it further you can read their excellent user guide.
We can see the CDO commands by accessing the history attribute:
[58]:
ds.history
[58]:
['cdo -L -timmean -selyear,2000 sst.mon.mean.nc /tmp/nctoolkitutksgqhmnctoolkittmplg5lkgk4.nc']
You can see that two nctoolkit methods have been converted into a single CDO call.
And don’t worry, nctoolkit will automatically remove all of the temporary files once they are no longer needed.
Click on the tabs on the left to find out what nctoolkit is capable of.
News¶
Release of v0.3.5¶
Version 0.3.5 was released in May 2021.
This is a minor release focusing on some under-the-hood improvements in performance and a couple of new methods.
It drops support for CDO version 1.9.3, as supporting it is becoming too time-consuming given the increasingly low reward.
A couple of new methods have been added. distribute enables files to be split up spatially into equally sized m by n rectangles. collect is the reverse of distribute: it will collect distributed data into one file.
In prior releases, assign calls could not be split over multiple lines. This is now fixed.
There was a bug in previous releases where regrid did not work with multi-file datasets. This was due to the enabling of parallel processing in nctoolkit. The issue is now fixed.
The deprecated methods mutate and transmute have now been removed. Variable creation should use assign.
Release of v0.3.4¶
Version 0.3.4 was released in April 2021.
This was a minor release focusing on performance improvements, removal of deprecated methods and introduction of one new method.
A new method fill_na has been introduced that allows missing values to be filled with the distance-weighted average.
The methods remove_variables and cell_areas have been removed and are replaced permanently by drop and cell_area.
Release of v0.3.2¶
Version 0.3.2 was released in March 2021. This was a quick release to fix a bug causing to_nc to not save output in the base directory.
Release of v0.3.1¶
Version 0.3.1 was released in March 2021. This is a minor release that includes new methods, under-the-hood improvements and the removal of deprecated methods.
New methods are introduced for identifying the first time step when specific numerical thresholds are exceeded or fallen below: first_above, first_below, last_above and last_below. The thresholds are either single numbers or can come from a gridded dataset for grid-cell-specific thresholds.
Methods to compare a dataset with another dataset or netCDF file have been added: gt and lt, which stand for ‘greater than’ and ‘less than’.
Users are now able to recycle the weights calculated when interpolating data. This can enable much faster interpolation of multiple files with the same grid.
The temporal methods replaced by tmean etc. have now been removed from the package, so monthly_mean etc. can no longer be used.
Release of v0.3.0¶
Version 0.3.0 was released in February 2021. This is a major release introducing significant improvements to the package.
A new method assign is now available for generating new variables. This replaces mutate and transmute, which were placeholder functions in the early releases of nctoolkit until a proper method for creating variables was put in place. assign operates in the same way as the assign method in pandas. Users can generate new variables using lambda functions.
A major change in this release is that evaluation is now lazy by default. The previous default of non-lazy evaluation was designed to make life slightly easier for new users of the package, but it was probably overly annoying for users to have to set evaluation to lazy each time they used the package.
This release features a subtle shift in how datasets work, so that they have consistent list-like properties. Previously, the files in a dataset given by the current attribute could be either a str or a list, depending on whether there was one or more files in the dataset. This now always gives a list. As a result, datasets in nctoolkit have list-like properties, with append and remove methods available for adding and removing files. remove is a new method in this release. As before, datasets are iterable.
This release also allows users to run nctoolkit in parallel. Previous releases allowed files in multi-file datasets to be processed in parallel. However, it was not possible to create processing chains and process files in parallel. This is now possible in this version thanks to under-the-hood changes in nctoolkit’s code base.
Users are now able to add a configuration file, which means global settings do not need to be set in every session or in every script.
Datasets¶
nctoolkit works with what it calls datasets. Each dataset is made up of one or more netCDF files.
Opening datasets¶
There are 3 ways to create a dataset: open_data, open_url or open_thredds.
If the data you want to analyze is available on your computer, use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards. So if you wanted to open all of the files in a folder called data as a dataset, you could do the following:
ds = nc.open_data("data/*.nc")
If you want to use data that can be downloaded from a url, just use open_url. This will download the netCDF files to a temporary folder, where they can then be analyzed.
If you want to analyze data that is available from a thredds server or OPeNDAP, then use open_thredds. The file paths should end with .nc.
Visualization of datasets¶
You can visualize the contents of a dataset using the plot method. Below, we will plot a monthly sea surface temperature climatology:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.plot()
Please note there may be some issues due to bugs in nctoolkit’s dependencies that cause problems plotting some data types. If data does not plot, raise an issue here.
Modifying datasets and lazy evaluation¶
nctoolkit works by performing operations and then saving the results as either a temporary file or in a file specified by the user. We can illustrate this with the following code. This will select the first time step from a file available over thredds and will plot the results.
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()
You will notice, once this is done, that the file associated with the dataset is now a temporary file.
ds.current
This will happen each time nctoolkit carries out an operation. This is potentially an invitation to slow-running code. You do not want to be constantly reading and writing data. Ideally, you want a processing chain which minimizes IO. nctoolkit enables this by allowing method chaining, thanks to the method chaining of its computational back-end CDO.
Let’s look at this chain of code:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
What is potentially wrong with this? It carries out four operations, so we absolutely do not want to create a temporary file at each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them either when you tell it to or when it has to. So in the code example above we have told nctoolkit what to do to that dataset, but have not told it to actually do any of it.
The quickest way to evaluate everything is to use run. The code above would become:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.run()
Evaluation is, to use the technical term, lazy within nctoolkit. Things are only evaluated when they need to be or when you force them to be.
This allows us to create an efficient processing chain where we read the input file and write to the output file with no intermediate file writing. If, in the example above, we wanted to save the output file, we could do this:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/data.mon.ltm.1981-2010.nc")
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.to_nc("foo.nc")
List-like behaviour of datasets¶
If you want to view the files within a dataset, view the current attribute. This is a list that gives the file(s) within the dataset. To make processing these files easier, nctoolkit features a number of list-like methods.
First, datasets are iterable. So, you can loop through each element of a dataset as follows:
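A minimal sketch, assuming a dataset built from two hypothetical files:
ds = nc.open_data(["file1.nc", "file2.nc"])
for ff in ds:
    print(ff)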
You can find out how many files are in a dataset using len:
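Continuing the sketch above:
len(ds)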
You can add a new file to a dataset using append:
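For example, with a hypothetical third file:
ds.append("file3.nc")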
This method also lets you add the files from another dataset.
Similarly, you can remove files from a dataset using remove:
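For example:
ds.remove("file3.nc")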
In line with typical list behaviours, you can also create empty datasets as follows:
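A sketch, assuming that calling open_data with no arguments creates an empty dataset:
ds = nc.open_data()
for ff in ["file1.nc", "file2.nc"]:
    ds.append(ff)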
This is particularly useful if you need to create an ensemble based on multiple files that need significant processing before being added to the dataset.
Dataset attributes¶
We can find out key information about a dataset using its attributes.
If we want to know the variables available in a dataset called ds, we would do:
ds.variables
If we want to know the vertical levels available in the dataset, we use the following.
ds.levels
If we want to know the files in a dataset, we would do this. nctoolkit works by generating temporary files, so if you have carried out any operations, this will show a list of temporary files.
ds.current
If we want to find out what times are in the dataset we do this:
ds.times
If we want to find out what months are in the dataset:
ds.months
If we want to find out what years are in the dataset:
ds.years
We can also access the history of operations carried out on the dataset. This will show the operations carried out by nctoolkit’s computational back-end CDO:
ds.history
Plotting¶
nctoolkit provides automatic plotting of netCDF data in a similar way to the command line tool ncview.
If you have a dataset, simply use the plot method to create an interactive plot that matches the data type. We can illustrate this using a sea surface temperature dataset available here.
Let’s start by calculating mean sea surface temperature for the year 2000 and plotting it:
[1]:
import nctoolkit as nc
ff = "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.plot()
nctoolkit is using CDO version 1.9.9
We might be interested in the zonal mean. nctoolkit can automatically plot this easily:
[2]:
ff = "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.select(year = 2000)
ds.tmean()
ds.zonal_mean()
ds.plot()
nctoolkit can also handle heat maps. So, we can easily plot the change in the zonal mean over time:
[3]:
ff = "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.zonal_mean()
ds.annual_anomaly(baseline = [1850, 1869], window = 20)
ds.plot()
In a similar vein, it can automatically handle time series. Below we plot a time series of global mean sea surface temperature since 1850:
[4]:
ff = "sst.mon.mean.nc"
ds = nc.open_data(ff)
ds.spatial_mean()
ds.plot()
Internal: ncplot¶
Plotting is carried out using the ncplot package. If you come across any errors, please raise an issue here.
This package aims to be easy to use. Colour scales for heat maps default to a diverging blue-to-red palette when there are both positive and negative values, and a viridis palette otherwise.
Importing and exporting data¶
nctoolkit can work with data available on local file systems, urls and over thredds and OPeNDAP.
Opening single files and ensembles¶
If you want to import a single netCDF file as a dataset, do the following:
import nctoolkit as nc
ds = nc.open_data(infile)
The open_data function can also import multiple files. This can be done in two ways. If we have a list of files we can do the following:
import nctoolkit as nc
ds = nc.open_data(file_list)
Alternatively, open_data is capable of handling wildcards. So if we have a folder called data, we can import all files in it as follows:
import nctoolkit as nc
ds = nc.open_data("data/*.nc")
Opening files from urls/ftp¶
If we want to work with a file that is available at a url or ftp, we can use the open_url function. This will start by downloading the file to a temporary folder, so that it can be analysed.
import nctoolkit as nc
ds = nc.open_url("www.foo.nc")
Opening data available over thredds servers or OPeNDAP¶
If you want to work with data that is available over a thredds server or OPeNDAP, you can use the open_thredds method. This will require that the url ends with “.nc”.
import nctoolkit as nc
ds = nc.open_thredds("www.foo.nc")
Exporting datasets¶
nctoolkit has a number of built in methods for exporting data to netCDF, pandas dataframes and xarray datasets.
Save as a netCDF¶
The method to_nc lets users export a dataset to a netCDF file. If you want this to be a zipped netCDF file, use the zip method before to_nc. An example of usage is as follows:
ds = nc.open_data(infile)
ds.tmean()
ds.zip()
ds.to_nc(outfile)
Convert to xarray Dataset¶
The method to_xarray lets users export a dataset to an xarray dataset. An example of usage is as follows:
ds = nc.open_data(infile)
ds.tmean()
xr_ds = ds.to_xarray()
Convert to pandas dataframe¶
The method to_dataframe lets users export a dataset to a pandas dataframe.
ds = nc.open_data(infile)
ds.tmean()
df = ds.to_dataframe()
Temporal statistics¶
nctoolkit has a number of built-in methods for calculating temporal statistics, all of which are prefixed with t: tmean, tmin, tmax, trange, tpercentile, tmedian, tvariance, tstdev and tcumsum.
These methods allow you to quickly calculate temporal statistics over specified time periods using the over argument.
By default the methods calculate the value over all time steps available. For example the following will calculate the temporal mean:
import nctoolkit as nc
ds = nc.open_data("sst.mon.mean.nc")
ds.tmean()
However, you may want to calculate, for example, an annual average. To do this we use over. This is a list which tells the function which time periods to average over. For example, the following will calculate an annual average:
ds.tmean(["year"])
If you are only averaging over one time period, as above, you can simply use a character string:
ds.tmean("year")
The possible options for over are “day”, “month”, “year”, and “season”. In this case “day” stands for day of year, not day of month.
In the example below we are calculating the maximum value in each month of each year in the dataset.
ds.tmax(["month", "year"])
Calculating rolling averages¶
nctoolkit has a range of methods to calculate rolling averages: rolling_mean, rolling_min, rolling_max, rolling_range and rolling_sum. These methods let you calculate rolling statistics over a specified time window. For example, if you had daily data and you wanted to calculate a rolling weekly mean value, you could do the following:
ds.rolling_mean(7)
If you wanted to calculate a rolling weekly sum, this would do it:
ds.rolling_sum(7)
Calculating anomalies¶
nctoolkit has two methods for calculating anomalies: annual_anomaly and monthly_anomaly. Both methods require you to specify a baseline period to calculate the anomaly against, given as the minimum and maximum years of the climatological period to compare with.
So, if you wanted to calculate the annual anomaly compared with a baseline period of 1950-1969, you would do this:
ds.annual_anomaly(baseline = [1950, 1969])
By default, the annual anomaly is calculated as the absolute difference between the annual mean in a year and the mean across the baseline period. However, in some cases this is not suitable. Instead you might want the relative change. In that case, you would do the following:
ds.annual_anomaly(baseline = [1950, 1969], metric = "relative")
You can also smooth out the anomalies, so that they are calculated on a rolling basis. The following will calculate the anomaly using a rolling window of 10 years.
ds.annual_anomaly(baseline = [1950, 1969], window = 10)
Monthly anomalies are calculated in the same way:
ds.monthly_anomaly(baseline = [1950, 1969])
Here the anomaly is the difference between the value in each month compared with the mean in that month during the baseline period.
Calculating climatologies¶
The over argument means we can easily calculate climatologies. For example, the following will calculate a seasonal climatology:
ds.tmean("season")
These methods allow partial matches for the arguments, which means you do not need to remember the precise argument each time. For example, the following will also calculate a seasonal climatology:
ds.tmean("Seas")
Calculating a climatological monthly mean would require the following:
ds.tmean("month")
and daily would be the following:
ds.tmean("day")
Cumulative sums¶
We can calculate the cumulative sum as follows:
ds.tcumsum()
Please note that this can only calculate over all time periods, and does not accept an over argument.
Subsetting data¶
nctoolkit has many built-in methods for subsetting data. The main method is select. This lets you select specific variables, years, months, seasons and timesteps.
Selecting variables¶
If you want to select specific variables, you would do the following:
ds.select(variables = ["var1", "var2"])
If you only want to select one variable, you can do this:
ds.select(variables = "var1")
Selecting years¶
If you want to select specific years, you can do the following:
ds.select(years = [2000, 2001])
Again, if you want a single year the following will work:
ds.select(years = 2000)
The select method allows partial matches for its arguments. So if we want to select the year 2000, the following will work:
ds.select(year = 2000)
In this case we can also select a range. So the following will work:
ds.select(years = range(2000, 2010))
Selecting months¶
You can select months in the same way as years. The following examples will all do the same thing:
ds.select(months = [1,2,3,4])
ds.select(months = range(1,5))
ds.select(mon = [1,2,3,4])
Selecting seasons¶
You can easily select seasons. For example if you wanted to select winter, you would do the following:
ds.select(season = "DJF")
Selecting timesteps¶
You can select specific timesteps from a dataset in a similar manner. For example if you wanted to select the first two timesteps in a dataset the following two methods will work:
ds.select(time = [0,1])
ds.select(time = range(0,2))
Geographic subsetting¶
If you want to select a geographic subregion of a dataset, you can use crop. This method will select all data within a specific longitude/latitude box. You just need to supply the minimum and maximum longitudes and latitudes required. In the example below, a dataset is cropped to longitudes between -80 and 90 and latitudes between 50 and 80:
ds.crop(lon = [-80, 90], lat = [50, 80])
Manipulating variables¶
Creating new variables¶
Variable creation in nctoolkit can be done using the assign method, which works in a similar way to the method available in pandas.
The assign method works using lambda functions. Let’s say we have a dataset with a variable ‘var’ and we simply want to add 10 to it and call the new variable ‘new’. We would do the following:
ds.assign(new = lambda x: x.var + 10)
If you are unfamiliar with lambda functions, note that the x after lambda signifies that x represents the dataset in whatever comes after ‘:’, which is the actual equation to evaluate. The x.var term is var from the dataset.
By default assign keeps the original variables in the dataset. However, we may only want the new variable or variables. In that case you can use the drop argument:
ds.assign(new = lambda x: x.var + 10, drop = True)
This results in only one variable.
Note that the assign method uses kwargs for the lambda functions, so drop can be positioned anywhere. So the following will do the same thing:
ds.assign(new = lambda x: x.var + 10, drop = True)
ds.assign(drop = True, new = lambda x: x.var + 10)
At present, assign requires that it is written on a single line. So avoid doing something like the following:
ds.assign(new = lambda x: x.var + 10,
drop = True)
The assign method will evaluate the lambda functions sent to it for each dataset grid cell for each time step. So every part of the lambda function must evaluate to a number. So the following will work:
k = 273.15
ds.assign(drop = True, sst_k = lambda x: x.sst + k)
However, if you set k to a string or anything other than a number it will throw an error. For example, this will throw an error:
k = "273.15"
ds.assign(drop = True, sst_k = lambda x: x.sst + k)
Applying mathematical functions to dataset variables¶
As part of your lambda function you can use a number of standard mathematical functions. These all have the same names as those in numpy: abs, floor, ceil, sqrt, exp, log10, sin, cos, tan, arcsin, arccos and arctan.
For example if you wanted to calculate the ceiling of a variable you could do the following:
ds.assign(new = lambda x: ceil(x.old))
An example of using logs would be the following:
ds.assign(new = lambda x: log10(x.old+1))
Using spatial statistics¶
The assign method carries out its calculations in each time step, and you can access spatial statistics for each time step when generating new variables. A series of functions are available that have the same names as nctoolkit methods for spatial statistics: spatial_mean, spatial_max, spatial_min, spatial_sum, vertical_mean, vertical_max, vertical_min, vertical_sum, zonal_mean, zonal_max, zonal_min and zonal_sum.
An example of the usefulness of these functions would be if you were working with global temperature data and you wanted to map regions that are warmer than average. You could do this by working out the difference between temperature in one location and the global mean:
ds.assign(temp_comp = lambda x: x.temperature - spatial_mean(x.temperature), drop = True)
You can also do comparisons. In the above case, we instead might simply want to identify regions that are hotter than the global average. In that case we can simply do this:
ds.assign(temp_comp = lambda x: x.temperature > spatial_mean(x.temperature), drop = True)
Let’s say we wanted to map regions which are 3 degrees hotter than average. We could do that as follows:
ds.assign(temp_comp = lambda x: x.temperature > spatial_mean(x.temperature + 3), drop = True)
or like this:
ds.assign(temp_comp = lambda x: x.temperature > (spatial_mean(x.temperature)+3), drop = True)
Logical operators work in the standard Python way. So if we had a dataset with a variable called ‘var’ and we wanted to find cells with values between 1 and 10, we could do this:
ds.assign(one2ten = lambda x: (x.var > 1) & (x.var < 10))
You can process multiple variables at once using assign. Variables will be created in the order given, and variables created by the first lambda function can be used by the next one, and so on. The simple example below shows how this works. First we create var1, which is var plus 1. Then var2, which is var1 plus 1. Finally, we calculate the difference between var1 and var2, and this should be 1 everywhere:
ds.assign(var1 = lambda x: x.var + 1, var2 = lambda x: x.var1 + 1, diff = lambda x: x.var2 - x.var1)
Functions that work with nctoolkit variables¶
The following functions can be used on nctoolkit variables as part of lambda functions.
Function | Description
---|---
abs | Absolute value
ceil | Ceiling of variable
cell_area | Area of grid-cell (m2)
cos | Trigonometric cosine of variable
day | Day of the month of the variable
exp | Exponential of variable
floor | Floor of variable
hour | Hour of the day of the variable
isnan | Is variable a missing value/NA?
latitude | Latitude of the grid cell
level | Vertical level of variable
log | Natural log of variable
log10 | Base log10 of variable
longitude | Longitude of the grid cell
month | Month of the variable
sin | Trigonometric sine of variable
spatial_max | Spatial max of variable at time-step
spatial_mean | Spatial mean of variable at time-step
spatial_min | Spatial min of variable at time-step
spatial_sum | Spatial sum of variable at time-step
sqrt | Square root of variable
tan | Trigonometric tangent of variable
timestep | Time step of variable, using Python indexing
year | Year of the variable
zonal_max | Zonal max of variable at time-step
zonal_mean | Zonal mean of variable at time-step
zonal_min | Zonal min of variable at time-step
zonal_sum | Zonal sum of variable at time-step
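As a quick sketch of how these functions are used inside assign (the dataset ds and its sst variable are assumptions here), the following would flag all grid cells in the northern hemisphere:
ds.assign(north = lambda x: latitude(x.sst) > 0, drop = True)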
Simple mathematical operations on variables¶
If you want to do simple operations like adding or subtracting numbers from the variables in datasets, you can use the add, subtract, divide and multiply methods. For example, if you wanted to add 10 to every variable in a dataset, you would do the following:
ds.add(10)
If you wanted to multiply everything by 10, you would do this:
ds.multiply(10)
These methods will also let you use other datasets or netCDF files. So, you could add the values in a dataset ds2 to a dataset called ds1 as follows:
ds1.add(ds2)
Please note that this requires the datasets to be structured so that the operation makes sense. Each dimension in the datasets must either be identical, or one of the datasets must have a single value in a dimension. For example, if ds2 above had data covering only one timestep, but ds1 had multiple timesteps, the data from that single time step would be added to all timesteps in ds1. But if the time steps match, the data from the first time step in ds2 will be added to the data in the first time step in ds1, and the same will happen with the following time steps.
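A short sketch of this broadcasting behaviour, using the sea surface temperature file from earlier. Because ds2 is reduced to a single time step, its mean field is subtracted from every time step of ds1:
ds1 = nc.open_data("sst.mon.mean.nc")
ds2 = nc.open_data("sst.mon.mean.nc")
ds2.tmean()
ds1.subtract(ds2)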
Simple numerical comparisons¶
If you want to do something as simple as working out whether the values of the variables in a dataset are greater than zero, you can use the compare method. This method accepts a simple comparison formula, which follows Python conventions. For example, if you wanted to figure out if the values in a dataset were greater than zero, you would do the following:
ds.compare(">0")
If you wanted to know if they were equal to zero you would do this:
ds.compare("==0")
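Because compare converts values to 1s and 0s, it chains naturally with other methods. A sketch, assuming the sea surface temperature dataset used earlier; this would give the fraction of the ocean warmer than 20 degrees at each time step:
ds = nc.open_data("sst.mon.mean.nc")
ds.compare(">20")
ds.spatial_mean()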
Interpolation¶
nctoolkit features built in methods for horizontal and vertical interpolation.
Interpolating to a set of coordinates¶
If you want to regrid a dataset to a specified set of coordinates you
can regrid
and a pandas dataframe. The first column of the dataframe
should be the longitudes and the second should be latitudes. The example
below regrids a sea-surface temperature dataset to a single location
with longitude -30 and latitude 50.
import nctoolkit as nc
import pandas as pd
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = range(0, 12))
coords = pd.DataFrame({"lon":[-30], "lat":[50]})
ds.regrid(coords)
Interpolating to a regular latlon grid¶
If you want to interpolate to a regular latlon grid, you can use to_latlon. lon and lat specify the minimum and maximum longitudes and latitudes, while res, a 2-variable list, specifies the resolution. For example, if we wanted to regrid the globe to 0.5 degree north-south by 1 degree east-west resolution, we could do the following:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.to_latlon(lon = [-79.5, 79.5], lat = [0.75, 89.75], res = [1, 0.5])
Interpolating to another dataset’s grid¶
If we are working with two datasets and want to put them on a common grid, we can interpolate one onto the other’s grid. We can illustrate this with a dataset of global sea surface temperature. Let’s start by regridding the first timestep in this dataset to a regular latlon grid covering the North Atlantic.
ds1 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds1.select(timestep = 0)
ds1.to_latlon(lon = [-79.5, 79.5], lat = [-0.75, 89.75], res = [1, 0.5])
We can then use this new dataset as the target grid in regrid. So:
ds2 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds2.select(timestep = 0)
ds2.regrid(ds1)
This method will also work with netCDF files. So, if you wanted, you could use a path to a netCDF file as the target grid.
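For example, with a hypothetical file holding the target grid:
ds2.regrid("target_grid.nc")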
How to reuse the weights for regridding¶
Under the hood nctoolkit regrids data by first generating a weights file. There are situations where you
will want to be able to re-use these weights. For example, if you are post-processing a large number of files
one after the other. To make this easier nctoolkit let’s you recycle the regridding info. This let’s you interpolate
using either regrid
or to_latlon
, but keep the regridding data for future use by regrid
.
The example below illustrates this. First, we regrid a global dataset to a regular latlon grid covering the North Atlantic, setting the recycle argument to True.
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.to_latlon(lon = [-79.5, 79.5], lat = [-0.75, 89.75], res = [1, 0.5], recycle = True)
We can then use the grid from this dataset for regridding:
ds1 = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds1.select(timestep = 0)
ds1.regrid(ds)
This, of course, requires that the grids in the datasets are consistent. If you want to access the weights and grid files generated, you can do the following:
These files are deleted either when the dataset is deleted or when the Python session is exited.
Resampling¶
If you want to make data spatially coarser, just use the resample_grid method. This will, for example, let you select every 2nd grid cell in the north-south and east-west directions. This is illustrated in the example below, where a dataset with a spatial resolution of 1 by 1 degrees is coarsened, so that only every 10th cell is selected in the north-south and east-west directions. In other words, it is now a 10 degree by 10 degree dataset.
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.mean.nc")
ds.select(timestep = 0)
ds.resample_grid(10)
Vertical interpolation¶
We can carry out vertical interpolation using the vertical_interp method. This is particularly useful for oceanic data. This is illustrated below by interpolating ocean temperatures from NOAA’s World Ocean Atlas for January to a depth of 500 metres. The vertical_interp method requires a levels argument, which is sea depth in this case.
ds = nc.open_thredds("https://data.nodc.noaa.gov/thredds/dodsC/ncei/woa/temperature/A5B7/1.00/woa18_A5B7_t01_01.nc")
ds.select(variables="t_an")
ds.vertical_interp(levels= [500])
Ensemble methods¶
Merging files with different variables¶
This notebook will outline some general methods for doing comparisons of multiple files. We will work with two different sea surface temperature data sets from NOAA and the Met Office Hadley Centre.
[1]:
import nctoolkit as nc
import pandas as pd
import xarray as xr
import numpy as np
nctoolkit is using CDO version 1.9.8
Let’s start by downloading the files using wget. Uncomment the code below to do this (note: you will need to extract the HadISST dataset):
[2]:
# ! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
# ! wget https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz
The first step is to get the data. We will start by creating a separate dataset for each file.
[3]:
sst_noaa = nc.open_data("sst.mon.mean.nc")
sst_hadley = nc.open_data("HadISST_sst.nc")
We can see that both datasets have sea surface temperature labelled as sst. So we will need to change that.
[4]:
sst_noaa.variables
[4]:
['sst']
[5]:
sst_hadley.variables
[5]:
['sst', 'time_bnds']
[6]:
sst_noaa.rename({"sst":"noaa"})
sst_hadley.rename({"sst":"hadley"})
The data sets also cover different time periods, and only overlap between 1870 and 2018, so we will need to select those years.
[7]:
sst_noaa.select(years = range(1870, 2019))
sst_hadley.select(years = range(1870, 2019))
We also have a problem in that there are two horizontal grids in the Hadley Centre file. We can solve this by selecting the sst variable only.
[8]:
sst_hadley.select(variables = "hadley")
At this point, the datasets have the same number of time steps and months covered. However, the grids are still a bit different. So we want to unify them by regridding one dataset onto the other’s grid. This can be done using regrid, with any grid of your choosing.
[9]:
sst_noaa.regrid(grid = sst_hadley)
We now have two separate datasets. Let’s create a new dataset that has both of them, and then merge them. When doing this we need to make sure NAs are treated properly. In this case the Hadley Centre values are not treated as NAs when they should be, so we need to fix that. The merge method also requires a strict matching criteria for the dates in the merging files. In this case the Hadley Centre and NOAA data sets both give monthly means, but use a different day of the month. So we will set match to ["year", "month"]; this will ensure there are no mis-matches.
[10]:
all_sst = nc.merge(sst_noaa, sst_hadley, match = ["year", "month"])
all_sst.set_missing([-9000, -900])
Let’s work out what the global mean SST was over the time period. Note that this will not be totally accurate as there are some missing values here and there that might bias things.
[11]:
all_sst.spatial_mean()
all_sst.tmean("year")
all_sst.rolling_mean(10)
[12]:
all_sst.plot("noaa")
We can also work out the difference between the two. Here we will work out the monthly bias per cell, then calculate the mean global difference per year, and then calculate a rolling 10-year mean.
[13]:
all_sst = nc.open_data([sst_noaa.current, sst_hadley.current])
all_sst.merge(match = ["year", "month"])
all_sst.transmute({"bias":"hadley-noaa"})
all_sst.set_missing([-9000, -900])
all_sst.spatial_mean()
all_sst.tmean("year")
all_sst.rolling_mean(10)
all_sst.plot("bias")
You can see that there is a notable difference at the start of the time series.
Merging files with different times¶
TBC
Ensemble averaging¶
TBC
Parallel processing¶
nctoolkit is written to enable rapid processing and analysis of netCDF files, and this includes the ability to process in parallel. Two methods of parallel processing are available. First is the ability to carry out operations on multi-file datasets in parallel. Second is the ability to define a processing chain in nctoolkit, and then use the multiprocessing package to process files in parallel using that chain.
Parallel processing of multi-file datasets¶
If you have a multi-file dataset, processing the files within it in parallel is easy. All you need to do is the following:
nc.options(cores = 6)
This will tell nctoolkit to process the files in multi-file datasets in parallel and to use 6 cores when doing so. You can, of course, set the number of cores as high as you want. The only thing nctoolkit will do is limit it to the number of cores on your machine.
Parallel processing using multiprocessing¶
A common task is taking a bunch of files in a folder, doing things to them, and then saving a modified version of each file in a new folder. We want to be able to parallelize that, and we can using the multiprocessing package in the usual way.
But first, we need to change the global settings:
import nctoolkit as nc
nc.options(parallel = True)
This tells nctoolkit that we are about to do something in parallel. This is critical because of the internal workings of nctoolkit. Behind the scenes nctoolkit is constantly creating and deleting temporary files. It manages this process by creating a safe-list, i.e. a list of files in use that should not be deleted. But if you are running in parallel, you are adding to this list in parallel, and this can cause problems. Telling nctoolkit it will be run in parallel tells it to switch to using a type of list that can be safely added to in parallel.
We can use multiprocessing to do the following: take all of the files in folder foo, do a bunch of things to them, then save the results in a new folder:
We start with a function giving a processing chain. There are obviously different ways of doing this, but I like to use a function that takes the input file and output file:
def process_chain(infile, outfile):
    ds = nc.open_data(infile)
    ds.assign(tos = lambda x: x.sst + 273.15)
    ds.tmean()
    ds.to_nc(outfile)
We now want to loop through all of the files in a folder, apply the function to them and then save the results in a new folder called new:
ensemble = nc.create_ensemble("../../data/ensemble")
import multiprocessing
pool = multiprocessing.Pool(3)
for ff in ensemble:
    pool.apply_async(process_chain, [ff, ff.replace("ensemble", "new")])
pool.close()
pool.join()
The number 3 in this case signifies that 3 cores are to be used.
Please note that if you are working interactively or in a Jupyter notebook, it is best to reset parallel as follows once you have stopped any parallel processing:
nc.options(parallel = False)
This is because of the effects of manually terminating commands on multiprocessing lists, which nctoolkit uses when in parallel mode.
Examples¶
This tutorial runs through a number of example work flows.
Global sea surface temperature since 1850¶
This example analyzes a global sea surface temperature dataset, covering the years since 1850. The data is available from the National Oceanic and Atmospheric Administration (NOAA) here.
We are looking at global sea surface temperature since 1850, so an obvious question is how much the oceans have warmed over this time period. We can use nctoolkit’s spatial_mean method to calculate this:
Once the file is downloaded, we should set it to ff:
[1]:
import nctoolkit as nc
ff = "sst.mon.mean.nc"
nctoolkit is using CDO version 1.9.9
[2]:
ds = nc.open_data(ff)
ds.spatial_mean()
ds.plot()
We can see a clear temperature rise of about 1 degree Celsius. But this is monthly data, so it is a bit noisy. We can smooth it out by taking an annual mean. In this case we send “year” to tmean to tell it to calculate the mean for each year:
[3]:
ds = nc.open_data(ff)
ds.tmean(["year"])
ds.spatial_mean()
ds.plot("sst")
That is getting better. But again, we possibly want a rolling average. We can use the rolling_mean method to calculate the mean over every 20-year period:
[4]:
ds = nc.open_data(ff)
ds.tmean(["year"])
ds.spatial_mean()
ds.rolling_mean(20)
ds.plot("sst")
We’ll finish things off by tweaking this so that we can work out how much temperature has increased since the first 20 years in the time period. For this we can use the annual_anomaly method.
[5]:
ds = nc.open_data(ff)
ds.annual_anomaly(baseline = [1850, 1869], window = 20)
ds.spatial_mean()
ds.plot("sst")
More to come….¶
Random Data Hacks¶
nctoolkit features a number of useful methods to tweak data.
Shifting time¶
Sometimes the times in datasets are not quite what we want, and we need some way to adjust time. An example of this is when you are missing a year of data, so you want to copy data from the prior year and use it. But first you would need to shift the times in that year forward by a year. You can do this with the shift method. This lets you shift time forward by a specified number of hours, days, months or years. You just need to supply hours, days, months or years as an argument. So, if you wanted to shift time backward by one year, you would do the following:
ds.shift(years = -1)
If you wanted to shift time forward by 12 hours, this would do it:
ds.shift(hours = 12)
Note: this method allows partial matches to the arguments, so you could use hour, day, month or year just as easily.
Adding cell areas to a dataset¶
You can add grid cell areas to a dataset as follows:
ds.cell_area()
By default, this will add the cell area (in square metres) to the dataset. If you want the dataset to only include cell areas, you need to set the join argument to False:
ds.cell_area(join = False)
Of course, this method will only work if it is possible to calculate the areas of the grid cells.
Changing the format of the netCDF files in a dataset¶
Sometimes you will want to change the format of the files in a dataset. You can do this using the format method. This lets you set the format, with the following options:
netCDF version 1 = “nc1”
netCDF version 2 (64-bit offset) = “nc2”/”nc”
netCDF4 (HDF5) = “nc4”
netCDF4-classic = “nc4c”
netCDF version 5 (64-bit data) = “nc5”
So, if you want to set the format to netCDF4, you would do the following:
ds.format("nc4")
Getting rid of dimensions with only one value¶
Sometimes you will have a dataset that has a dimension with only one value, and you might want to get rid of that dimension. For example, you might only have one timestep and keeping it may have no value. Getting rid of that dimension can be done using the reduce_dims method. It works as follows:
ds.reduce_dims()
Global settings¶
nctoolkit lets you set global settings using options.
The most important, and recommended, setting to update is lazy evaluation. This can be done as follows:
nc.options(lazy = True)
This means that commands will only be evaluated when you either request them to be or when they need to be.
For example, in the code below the 3 specified commands will only be
calculated after it is told to run
. This cuts down on IO, and can
result in significant improvements in run time. At present lazy defaults
to False, but this may change in a future release of nctoolkit.
nc.options(lazy = True)
ds.tmean()
ds.crop(lat = [0, 90])
ds.spatial_mean()
ds.run()
If you are working with ensembles, you may want to change the number of cores used for processing multiple files; most methods can run in parallel when working with multi-file datasets. By default cores = 1. For example, you can process multiple files in parallel using 6 cores as follows:
nc.options(cores = 6)
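A sketch of how this might be used with a multi-file dataset (the file names here are hypothetical):
ds = nc.open_data(["sst_1850.nc", "sst_1851.nc"])
ds.tmean()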
By default nctoolkit uses the OS’s temporary directories when it needs to create temporary files. In most cases this is optimal, as reading and writing to temporary folders is usually faster. However, in some cases this may not be a good idea, because you may not have enough space in the temporary folder. In this case you can change the directory used for saving temporary files as follows:
nc.options(temp_dir = "/foo")
Setting global settings using a configuration file¶
You may want to set some global settings either permanently or on a project level. You can do this by setting up a configuration file. This should be a plain text file called .nctoolkitrc or nctoolkitrc, placed in one of two locations: your working directory or your home directory. When nctoolkit is imported, it will look first in your working directory and then in your home directory for a file with one of those names, and it will use the first one it finds to change the global settings from the defaults.
The structure of this file is straightforward. For example, if you wanted to set evaluation to lazy and use 6 cores for processing multi-file datasets, you would put the following in your configuration file:
lazy : True
cores : 6
The files roughly follow Python dictionary syntax, with the setting and value separated by :. Note that unless a setting is specified in the file, the defaults will be used. If you do not provide a configuration file, nctoolkit will use the default settings.
API Reference¶
Opening/copying data¶
Read netCDF data as a DataSet object
Read netCDF data from a url as a DataSet object
Read thredds data as a DataSet object
Make a deep copy of a DataSet object
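For example, a minimal sketch, assuming the deep-copy method is called copy:
ds = nc.open_data("sst.mon.mean.nc")
ds_backup = ds.copy()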
Merging or analyzing multiple datasets¶
Merge datasets
Calculate the temporal correlation coefficient between two datasets. This will calculate the temporal correlation coefficient, for each grid cell, between two datasets.
Calculate the spatial correlation coefficient between two datasets. This will calculate the spatial correlation coefficient, for each time step, between two datasets.
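As a sketch, assuming the package-level functions are named merge and cor_time, and that ds1 and ds2 are datasets on the same grid:
ds_merged = nc.merge(ds1, ds2)
ds_cor = nc.cor_time(ds1, ds2)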
Accessing attributes¶
List variables contained in a dataset
List years contained in a dataset
List months contained in a dataset
List times contained in a dataset
List levels contained in a dataset
The size of an object. This will print the number of files, total size, and smallest and largest files in a DataSet object.
The current file or files in the DataSet object
The history of operations on the DataSet
The starting file or files of the DataSet object
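For example, assuming the attribute names variables, years, months and times:
ds.variables
ds.years
ds.months
ds.times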
Plotting¶
Plot a dataset
Variable modification¶
Create new variables. Existing variables that are re-assigned will be overwritten. Set drop to True if you want existing variables to be removed once the new ones have been created; it defaults to False.
Rename variables in a dataset
Set the missing value for a single number or a range
Calculate the sum of all variables for each time step
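As a sketch, assuming the method names assign and rename, a new variable could be derived from sst and an existing one renamed:
ds.assign(sst_k = lambda x: x.sst + 273.15)
ds.rename(sst = "temperature")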
netCDF file attribute modification¶
Set the long names of variables
Set the units for variables
Vertical/level methods¶
Extract the top/surface level from a dataset. This extracts the first vertical level from each file in a dataset.
Extract the bottom level from a dataset. This extracts the bottom level from each netCDF file.
Vertically interpolate a dataset based on given vertical levels. This is calculated for each time step and grid cell.
Calculate the depth-averaged mean for each variable. This is calculated for each time step and grid cell.
Calculate the vertical minimum of variable values. This is calculated for each time step and grid cell.
Calculate the vertical maximum of variable values. This is calculated for each time step and grid cell.
Calculate the vertical range of variable values. This is calculated for each time step and grid cell.
Calculate the vertical sum of variable values. This is calculated for each time step and grid cell.
Calculate the vertical cumulative sum of variable values. This is calculated for each time step and grid cell.
Invert the levels of 3D variables. This is calculated for each time step and grid cell.
Create a mask identifying the deepest cell without missing values.
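For example, a sketch for oceanic data, assuming the method names top and vertical_mean:
ds.top()
Or, to depth-average instead:
ds.vertical_mean()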
Rolling methods¶
Calculate a rolling mean based on a window
Calculate a rolling minimum based on a window
Calculate a rolling maximum based on a window
Calculate a rolling sum based on a window
Calculate a rolling range based on a window
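For example, a 12-step rolling mean, assuming the method name rolling_mean with a window argument:
ds.rolling_mean(window = 12)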
Evaluation setting¶
Run all stored commands in a dataset
Cleaning functions¶
Ensemble creation¶
Generate an ensemble
Arithmetic methods¶
Create new variables. Existing variables that are re-assigned will be overwritten. Set drop to True if you want existing variables to be removed once the new ones have been created; it defaults to False.
Add to a dataset. This will add a constant, another dataset or a netCDF file to the dataset. The x argument is an int, float, single-file dataset or netCDF file to add to the dataset. If a dataset or netCDF file is supplied, it must have only one variable, unless var (a variable in x to use for the operation) is provided. The grids must be the same.
Subtract from a dataset. This will subtract a constant, another dataset or a netCDF file from the dataset. The x argument is an int, float, single-file dataset or netCDF file to subtract from the dataset. If a dataset or netCDF file is supplied, it must have only one variable, unless var (a variable in x to use for the operation) is provided. The grids must be the same.
Multiply a dataset. This will multiply a dataset by a constant, another dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to multiply the dataset by. If multiplying by a dataset or single file, it must have only one variable, unless var (a variable in x to multiply by) is supplied. The grids must be the same.
Divide the data. This will divide the dataset by a constant, another dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to divide the dataset by. If a dataset or netCDF file is supplied, it must have only one variable, unless var (a variable in x to use for the operation) is provided. The grids must be the same.
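As a sketch, assuming the method names add and multiply:
ds.add(273.15)
ds.multiply(0.5)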
Ensemble statistics¶
Calculate an ensemble mean
Calculate an ensemble minimum
Calculate an ensemble maximum
Calculate an ensemble percentile. This will calculate the percentiles for each time step in the files.
Calculate an ensemble range. The range is calculated for each time step; for example, if each file in the ensemble has 12 months of data the statistic will be calculated for each month.
Calculate an ensemble sum. The sum is calculated for each time step; for example, if each file in the ensemble has 12 months of data the statistic will be calculated for each month.
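For example, a sketch assuming the method name ensemble_mean and hypothetical file names:
ds = nc.open_data(["model1.nc", "model2.nc"])
ds.ensemble_mean()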
Subsetting operations¶
Crop to a rectangular longitude and latitude box
A method for subsetting datasets to specific variables, years, longitudes etc.
Remove variables. This will remove stated variables from files in the dataset.
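For example, a sketch cropping to the North Atlantic and then subsetting to a single variable (the method name subset is an assumption):
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.subset(variables = "sst")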
Time-based methods¶
Set the date in a dataset. You should only do this if you have to fix or change a dataset with a single date, not multiple dates.
Shift the times in a dataset forward or backward by a given number of hours, days, months or years.
Interpolation and resampling methods¶
Regrid a dataset to a target grid
Regrid a dataset to a regular latlon grid
Resample the horizontal grid of a dataset
Temporally interpolate variables based on date range and time resolution
Temporally interpolate a dataset to a given number of time steps between existing time steps
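As a sketch, regridding to a regular 1-degree latlon grid, assuming the method name to_latlon with lon, lat and res arguments:
ds.to_latlon(lon = [-180, 180], lat = [-90, 90], res = [1, 1])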
Masking methods¶
Mask a lon/lat box
Statistical methods¶
Calculate the temporal mean of all variables
Calculate the temporal minimum of all variables
Calculate the temporal median of all variables. The over argument sets the time periods to average over.
Calculate the temporal percentile of all variables
Calculate the temporal maximum of all variables
Calculate the temporal sum of all variables
Calculate the temporal range of all variables
Calculate the temporal variance of all variables
Calculate the temporal standard deviation of all variables
Calculate the temporal cumulative sum of all variables
Calculate the correlation coefficient between two variables in space. This is calculated for each time step.
Calculate the correlation coefficient in time between two variables. The correlation is calculated for each grid cell, ignoring missing values.
Calculate the area weighted spatial mean for all variables. This is performed for each time step.
Calculate the spatial minimum for all variables. This is performed for each time step.
Calculate the spatial maximum for all variables. This is performed for each time step.
Calculate the spatial sum for all variables. This is performed for each time step.
Calculate the spatial range for all variables. This is performed for each time step.
Calculate the spatial percentile for all variables. This is performed for each time step.
Calculate the latitudinal or longitudinal centre for each year/month combination in files. This applies to each file in an ensemble. Set by to ‘latitude’ if you want the latitudinal centre calculated, or ‘longitude’ for the longitudinal centre. Set by_area to True if the variable is a value-per-square-metre type variable, otherwise False.
Calculate the zonal mean for each year/month combination in files.
Calculate the zonal minimum for each year/month combination in files.
Calculate the zonal maximum for each year/month combination in files.
Calculate the zonal range for each year/month combination in files.
Calculate the meridional mean for each year/month combination in files.
Calculate the meridional minimum for each year/month combination in files.
Calculate the meridional maximum for each year/month combination in files.
Calculate the meridional range for each year/month combination in files.
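For example, a temporal mean followed by an area weighted spatial mean (both methods appear in the examples earlier in this documentation):
ds.tmean()
ds.spatial_mean()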
Merging methods¶
Merge a multi-file ensemble into a single file. Merging will occur based on the time steps in the first file.
Time-based merging of a multi-file ensemble into a single file. This method is ideal if you have the same data split over multiple files covering different time periods.
Splitting methods¶
Split the dataset. Each file in the ensemble will be separated into new files based on the splitting argument.
Output and formatting methods¶
Save a dataset to a named file. This will only work with single file datasets.
Open a dataset as an xarray object
Open a dataset as a pandas data frame
Zip the dataset. This will compress the files within the dataset.
Change the format of the files in a dataset. This works lazily. The ext argument gives the new format and must be one of “nc”, “nc1”, “nc2”, “nc4”, “nc4c” or “nc5”: netCDF = nc1, netCDF version 2 (64-bit offset) = nc2/nc, netCDF4 (HDF5) = nc4, netCDF4-classic = nc4c, netCDF version 5 (64-bit data) = nc5.
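For example, a sketch assuming the method names to_xarray and to_dataframe:
xr_ds = ds.to_xarray()
df = ds.to_dataframe()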
Miscellaneous methods¶
Calculate the area of grid cells.
Identify the time step when a value is first above a threshold. This will do the comparison with either a number, a Dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to use for the threshold(s); if comparing with a dataset or single file, it must contain only a single variable. The grids must be the same.
Identify the time step when a value is first below a threshold. This will do the comparison with either a number, a Dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to use for the threshold(s); if comparing with a dataset or single file, it must contain only a single variable. The grids must be the same.
Identify the final time step when a value is above a threshold. This will do the comparison with either a number, a Dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to use for the threshold(s); if comparing with a dataset or single file, it must contain only a single variable. The grids must be the same.
Identify the final time step when a value is below a threshold. This will do the comparison with either a number, a Dataset or a netCDF file. The x argument is an int, float, single-file dataset or netCDF file to use for the threshold(s); if comparing with a dataset or single file, it must contain only a single variable. The grids must be the same.
Apply a cdo command
Apply an nco command
Compare all variables to a constant
Calculate whether a variable in the dataset is greater than that in another file or dataset. This currently only works with single file datasets.
Calculate whether a variable in the dataset is less than that in another file or dataset. This currently only works with single file datasets.
Reduce dimensions of data. This will remove any dimensions with only one value.
Reduce the dataset to non-zero locations in a mask. The mask must be a single variable dataset or a path to a .nc file, and it must have an identical grid to the dataset.
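For example, a CDO operator that nctoolkit does not wrap directly could be applied as follows (a sketch; the exact argument format for cdo_command is an assumption):
ds.cdo_command("timmean")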
Ecological methods¶
Calculate phenologies from a dataset. Each file in an ensemble must only cover a single year, and ideally have all days.
Package info¶
This package was created by Robert Wilson at Plymouth Marine Laboratory (PML).
Acknowledgements¶
The current codebase of nctoolkit was developed using funding from the NERC Climate Linked Atlantic Sector Science programme (NE/R015953/1) and a combination of UK Research and Innovation (UKRI) and European Research Council (ERC) funded research projects.
Bugs and issues¶
If you identify bugs or issues with the package please raise an issue at PML’s Marine Systems Modelling group’s GitHub page here or contact nctoolkit’s creator at rwi@pml.ac.uk.
Contributions welcome¶
The package is new, with new features being added each month. There remain a large number of features that could be added, especially for dealing with atmospheric data. If package users are interested in contributing or suggesting new features, they are welcome to raise an issue at the package’s GitHub page or contact me.