nctoolkit: Fast and easy analysis of netCDF data in Python¶
nctoolkit is a comprehensive Python package for analyzing netCDF data on Linux and MacOS.
Core abilities include:
Cropping to geographic regions
Interactive plotting of data
Subsetting to specific time periods
Calculating time averages
Calculating spatial averages
Calculating rolling averages
Calculating climatologies
Creating new variables using arithmetic operations
Calculating anomalies
Horizontally and vertically remapping data
Calculating the correlations between variables
Calculating vertical averages for the likes of oceanic data
Calculating ensemble averages
Calculating phenological metrics
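As a quick taste, the short example below chains several of these abilities together (a minimal sketch; the file name sst.nc and variable name sst are placeholders):
import nctoolkit as nc
# open a local netCDF file as a dataset
data = nc.open_data("sst.nc")
# subset to the variable of interest
data.select_variables("sst")
# crop to a North Atlantic longitude/latitude box
data.clip(lon = [-80, 20], lat = [30, 70])
# average over all time steps
data.mean()
# interactive plot of the result
data.plot()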
Documentation¶
Getting Started
Installation¶
How to install nctoolkit¶
The easiest way to install the package is using conda, which will install nctoolkit and all system dependencies:
$ conda install -c conda-forge nctoolkit
nctoolkit is available from the Python Packaging Index. To install nctoolkit using pip:
$ pip install nctoolkit
If you install nctoolkit from pypi, you will need to install the system dependencies listed below.
To install the development version from GitHub:
$ pip install git+https://github.com/r4ecology/nctoolkit.git
Fixing plotting problem due to xarray bug¶
There is currently a bug in xarray caused by the update of pandas to version 1.1. As a result some plots will fail in nctoolkit. To fix this ensure pandas version 1.0.5 is installed. Do this after installing nctoolkit. This can be done as follows:
$ conda install -c conda-forge pandas=1.0.5
or:
$ pip install pandas==1.0.5
System dependencies¶
There are two main system dependencies: Climate Data Operators, and NCO. The easiest way to install them is using conda:
$ conda install -c conda-forge cdo
$ conda install -c conda-forge nco
CDO is necessary for the package to work. NCO is an optional dependency and does not have to be installed.
If you want to install CDO from source, you can use one of the bash scripts available here.
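To check that CDO is available after installation, you can ask for its version on the command line:
$ cdo --version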
Introduction tutorial¶
nctoolkit is designed for the efficient analysis and manipulation of netCDF files. This tutorial provides an overview of how to work with individual files.
Opening netcdf data¶
This tutorial will illustrate the basic usage using a dataset of average global sea surface temperature from NOAA, which is available here.
nctoolkit should be imported using the nc shorthand:
[1]:
import nctoolkit as nc
Reading in a dataset is straightforward:
[2]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
We might want to know some basic information about the file. This can be done easily. Listing the available variables is quick:
[3]:
sst.variables
[3]:
['sst', 'valid_yr_count']
The months available can be found using:
[4]:
sst.months
[4]:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
We have 12 months available. In this case it is the monthly average temperature from 1981-2010.
Modifying datasets¶
Each time nctoolkit executes a command that modifies a dataset, it will generate a new NetCDF file, which becomes the current
file in the dataset. Before any modification this is as follows:
[5]:
sst.current
[5]:
'sst.mon.ltm.1981-2010.nc'
We have seen that there are two variables in the dataset. But we only really care about sst. So let’s select that variable:
[6]:
sst.select_variables("sst")
We can now see that there is only one variable in the sst dataset:
[7]:
sst.variables
[7]:
['sst']
We can also see that a temporary file has been created with only this variable in it:
[8]:
sst.current
[8]:
'/tmp/nctoolkitesugmpemnctoolkittmpxmrohbap.nc'
We have data for 12 months. But what we might really want is an average of those values. This can be quickly calculated:
[9]:
sst.mean()
Once again a new temporary file has been generated.
[10]:
sst.current
[10]:
'/tmp/nctoolkitesugmpemnctoolkittmpgz_hzyoq.nc'
Do not worry about the temporary folder getting clogged up. nctoolkit cleans it up automatically.
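If you want to clear out any temporary files left over from prior or current sessions yourself, nctoolkit provides deep_clean (a minimal sketch):
nc.deep_clean()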
Quick visualization of netCDF data is always a good thing. So nctoolkit provides an easy autoplot feature.
[11]:
sst.plot()
[11]:
What we have seen so far is not computationally efficient. In the code below nctoolkit has generated temporary files twice:
[12]:
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.mean()
We can see what went on behind the scenes by accessing history:
[13]:
sst.history
[13]:
['cdo -L -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitesugmpemnctoolkittmpxpb_323a.nc',
'cdo -L -timmean /tmp/nctoolkitesugmpemnctoolkittmpxpb_323a.nc /tmp/nctoolkitesugmpemnctoolkittmp5agj679e.nc']
nctoolkit uses CDO under the hood. You do not need to understand how CDO works to use nctoolkit. But one nice feature of CDO is method chaining, which works in a similar way to Python’s. To take advantage of it you just need to set evaluation to lazy in nctoolkit. This means nothing is evaluated until you force it to be or it has to be.
[14]:
nc.options(lazy = True)
Now, let’s run the code again:
[15]:
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.mean()
sst.plot()
[15]:
When we look at history, we now see that only one temporary file was generated:
[16]:
sst.history
[16]:
['cdo -L -timmean -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitesugmpemnctoolkittmpooqi1xou.nc']
In the example above, the commands were only executed when plot was called. If we want to force commands to run, we use run:
[17]:
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.mean()
sst.run()
Ensemble methods¶
Merging files with different variables¶
This notebook will outline some general methods for doing comparisons of multiple files. We will work with two different sea surface temperature data sets from NOAA and the Met Office Hadley Centre.
[1]:
import nctoolkit as nc
import pandas as pd
import xarray as xr
import numpy as np
Let’s start by downloading the files using wget. Uncomment the code below to do this (note: you will need to extract the HadISST dataset):
[2]:
# ! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
# ! wget https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz
The first step is to get the data. We will start by creating a separate dataset for each file.
[3]:
sst_noaa = nc.open_data("sst.mon.mean.nc")
sst_hadley = nc.open_data("HadISST_sst.nc")
We can see that both datasets have sea surface temperature labelled as sst. So we will need to change that.
[4]:
sst_noaa.variables
[4]:
['sst']
[5]:
sst_hadley.variables
[5]:
['time_bnds', 'sst']
[6]:
sst_noaa.rename({"sst":"noaa"})
sst_hadley.rename({"sst":"hadley"})
The data sets also cover different time periods, and only overlap between 1870 and 2018, so we will need to select those years:
[7]:
sst_noaa.select_years(range(1870, 2019))
sst_hadley.select_years(range(1870, 2019))
We also have a problem in that there are two horizontal grids in the Hadley Centre file. We can solve this by selecting the temperature variable only:
[8]:
sst_hadley.select_variables("hadley")
At this point, the datasets have the same number of time steps and months covered. However, the grids are still a bit different. So we want to unify them by regridding one dataset onto the other’s grid. This can be done using regrid, with either dataset’s grid or any grid of your choosing.
[9]:
sst_noaa.regrid(grid = sst_hadley)
We now have two separate datasets. Let’s create a new dataset that holds both of them, and then merge them. When doing this we need to make sure NAs are treated properly. In this case the Hadley Centre missing values are not being read as NAs as they should be, so we need to fix that. The merge method also requires strict matching criteria for the dates in the files being merged. In this case the Hadley Centre and NOAA data sets both give monthly means, but they use different days of the month. So we will set match to [“year”, “month”], which will ensure there are no mismatches.
[10]:
all_sst = nc.merge(sst_noaa, sst_hadley, match = ["year", "month"])
all_sst.set_missing([-9000, -900])
Let’s work out what the global mean SST was over the time period. Note that this will not be totally accurate as there are some missing values here and there that might bias things.
[11]:
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
[12]:
all_sst.plot()
[12]:
We can also work out the difference between the two. Here we will work out the monthly bias per cell, then calculate the mean global difference per year, and then calculate a rolling 10-year mean.
[13]:
all_sst = nc.open_data([sst_noaa.current, sst_hadley.current])
all_sst.merge(match = ["year", "month"])
all_sst.transmute({"bias":"hadley-noaa"})
all_sst.set_missing([-9000, -900])
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
all_sst.plot()
[13]:
You can see that there is a notable difference at the start of the time series.
Merging files with different times¶
TBC
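In the meantime, this can be done with the merge_time method described in the A-Z guide below. A minimal sketch, assuming file_list is a list of files with the same variables but different time steps:
data = nc.open_data(file_list)
data.merge_time()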
Ensemble averaging¶
TBC
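In the meantime, the ensemble methods described in the A-Z guide below cover this. A minimal sketch of an ensemble mean over a multi-file dataset, assuming file_list is a list of files with comparable grids:
data = nc.open_data(file_list)
data.ensemble_mean()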
Speeding up code¶
Lazy evaluation¶
Under the hood nctoolkit relies mostly on CDO to carry out the specified manipulation of netcdf files. Each time CDO is called a new temporary file is generated. This has the potential to result in slower than necessary processing chains, as IO takes up far too much time.
I will demonstrate this using a netCDF file of sea surface temperature. To download the file we can just use wget:
[1]:
import nctoolkit as nc
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output
!wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.ltm.1981-2010.nc
clear_output()
We can then set up the dataset which we will use for manipulating the SST climatology.
[2]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
Now, let’s select the variable sst, clip the file to the northern hemisphere, calculate the mean value in each grid cell for the first half of the year, and then calculate the spatial mean.
[3]:
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
The dataset’s history is as follows:
[4]:
sst.history
[4]:
['cdo -L -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitqhgujflsnctoolkittmpipj7up1l.nc',
'cdo -L -sellonlatbox,-180,180,0,90 /tmp/nctoolkitqhgujflsnctoolkittmpipj7up1l.nc /tmp/nctoolkitqhgujflsnctoolkittmp920v1_r7.nc',
'cdo -L -selmonth,1,2,3,4,5,6 /tmp/nctoolkitqhgujflsnctoolkittmp920v1_r7.nc /tmp/nctoolkitqhgujflsnctoolkittmpbnck_dy2.nc',
'cdo -L -timmean /tmp/nctoolkitqhgujflsnctoolkittmpbnck_dy2.nc /tmp/nctoolkitqhgujflsnctoolkittmpjmzt1l67.nc',
'cdo -L -fldmean /tmp/nctoolkitqhgujflsnctoolkittmpjmzt1l67.nc /tmp/nctoolkitqhgujflsnctoolkittmpdus63y8i.nc']
In total, there are 5 operations, with a temporary file created each time. However, we only want to generate one temporary file. So, can we do that? Yes, thanks to CDO’s method chaining ability. If we want to utilize this we need to set the session’s evaluation to lazy, using options. Once this is done nctoolkit will only evaluate things when it needs to, e.g. when you call a method that cannot possibly be chained, or when you evaluate things using run. This works as follows:
[5]:
ff = "sst.mon.ltm.1981-2010.nc"
nc.options(lazy = True)
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
We can now see that the history is much cleaner, with only one command.
[6]:
sst.history
[6]:
['cdo -L -fldmean -timmean -selmonth,1,2,3,4,5,6 -sellonlatbox,-180,180,0,90 -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitqhgujflsnctoolkittmpkdkiwey2.nc']
How does this impact run time? Let’s time the original, unchained method.
[7]:
%%time
nc.options(lazy = False)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
CPU times: user 37.2 ms, sys: 61.6 ms, total: 98.7 ms
Wall time: 667 ms
[8]:
%%time
nc.options(lazy = True)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
CPU times: user 17.3 ms, sys: 4.28 ms, total: 21.6 ms
Wall time: 161 ms
This was almost 4 times faster. Exact speed improvements will, of course, depend on specific IO requirements. Sometimes using lazy evaluation will have a negligible impact, but in other cases it can make code over 10 times faster.
Processing files in parallel¶
When processing a dataset made up of multiple files, it is possible to carry out the processing in parallel for more or less all of the methods available in nctoolkit. To carry out processing in parallel with 6 cores, we would use options as follows:
[9]:
nc.options(cores = 6)
By default the number of cores in use is 1. Of course, using multiple cores can crash your computer if the total RAM in use is excessive, so it is best practice to check the RAM used with one core first.
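For example, a multi-file dataset could be processed in parallel as follows (a minimal sketch; file_list stands in for a list of netCDF files):
nc.options(cores = 6)
# each file in the dataset is processed on a separate core
data = nc.open_data(file_list)
data.annual_mean()
# revert to single-core processing afterwards
nc.options(cores = 1)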
Using thread-safe libraries¶
If the CDO installation being called by nctoolkit is compiled with threadsafe hdf5, then you can achieve potentially significant speed ups with the following command:
[10]:
nc.options(thread_safe = True)
If you are not sure whether hdf5 has been built thread-safe, a simple way to find out is to run the code below. If it fails, you can be more or less certain that it is not thread-safe.
[11]:
nc.options(lazy = True)
nc.options(thread_safe = True)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
User Guide
Datasets¶
nctoolkit works with what it calls datasets. Each dataset is made up of a single or multiple NetCDF files. Each time you apply a method to a dataset the NetCDF file or files within the dataset will be modified.
Opening datasets¶
There are 3 ways to create a dataset: open_data, open_url or open_thredds.
If the data you want to analyze is already available on your computer, use open_data. This will accept either a path to a single file or a list of files to create a dataset.
If you want to use data that can be downloaded from a url, just use open_url. This will download the NetCDF files to a temporary folder, where they can then be analyzed.
If you want to analyze data that is available from a thredds server, then use open_thredds. The file paths should end with .nc.
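A minimal sketch of the three approaches (the local file names and the example url are placeholders; the thredds address is the one used below):
import nctoolkit as nc
# a local file, or a list of local files
data = nc.open_data("infile.nc")
ensemble = nc.open_data(["infile1.nc", "infile2.nc"])
# a file downloaded from a url to a temporary folder
data = nc.open_url("https://www.example.com/infile.nc")
# a file on a thredds server; note the path ends with .nc
data = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc")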
Dataset attributes¶
We can find out key information about a dataset using its attributes. Here we will use a sea surface temperature file that is available via thredds.
[2]:
import nctoolkit as nc
sst = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc")
If we want to know a dataset’s variables:
[3]:
sst.variables
[3]:
['sst', 'valid_yr_count']
If we want to know a dataset’s vertical levels, we use the following. In this case there is only one level, because the file covers only the sea surface.
[4]:
sst.levels
[4]:
[0.0]
If we want to know where the dataset’s NetCDF files are stored we can do the following:
[5]:
sst.current
[5]:
'https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc'
If we want to find out what times are in the dataset:
[6]:
sst.times
[6]:
['0001-01-01T00:00:00',
'0001-02-01T00:00:00',
'0001-03-01T00:00:00',
'0001-04-01T00:00:00',
'0001-05-01T00:00:00',
'0001-06-01T00:00:00',
'0001-07-01T00:00:00',
'0001-08-01T00:00:00',
'0001-09-01T00:00:00',
'0001-10-01T00:00:00',
'0001-11-01T00:00:00',
'0001-12-01T00:00:00']
If we want to find out what months are in the dataset:
[7]:
sst.months
[7]:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
If we want to find out what years are in the dataset:
[8]:
sst.years
[8]:
[1]
If we do anything to the dataset, things will change. Let’s calculate the average temperature:
[9]:
sst.mean()
We can see that there is now a new temporary file associated with the dataset:
[10]:
sst.current
[10]:
'/tmp/nctoolkityjoudsyznctoolkittmpm3zaqqam.nc'
We can also access the history of operations carried out on the dataset:
[11]:
sst.history
[11]:
['cdo -L -timmean https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc /tmp/nctoolkityjoudsyznctoolkittmpm3zaqqam.nc']
Behind the scenes, nctoolkit mostly uses Climate Data Operators (CDO). If you are not familiar with CDO, you can almost certainly just ignore the operations history.
We can also see that the times have changed. The only month now available is June, which is the mid-point of the year.
[12]:
sst.months
[12]:
[6]
Lazy evaluation of datasets¶
The code below will calculate the average sea surface temperature for a region in the North Atlantic for January. It does not do it efficiently.
[13]:
sst = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc")
sst.select_months(1)
sst.clip(lon = [-80, 20], lat = [30, 70])
sst.spatial_mean()
If we look at the operation history, we see that temporary files have been created 3 times. Why not just once? We can do this by setting evaluation to lazy and then using run to evaluate everything when we need to.
[14]:
sst.history
[14]:
['cdo -L -selmonth,1 https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc /tmp/nctoolkityjoudsyznctoolkittmpwcmy57jx.nc',
'cdo -L -sellonlatbox,-80,20,30,70 /tmp/nctoolkityjoudsyznctoolkittmpwcmy57jx.nc /tmp/nctoolkityjoudsyznctoolkittmp34c020st.nc',
'cdo -L -fldmean /tmp/nctoolkityjoudsyznctoolkittmp34c020st.nc /tmp/nctoolkityjoudsyznctoolkittmpwey97q4f.nc']
[15]:
nc.options(lazy = True)
sst = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc")
sst.select_months(1)
sst.clip(lon = [-80, 20], lat = [30, 70])
sst.spatial_mean()
sst.run()
We can now see that only one temporary file was created:
[16]:
sst.history
[16]:
['cdo -L -fldmean -sellonlatbox,-80,20,30,70 -selmonth,1 https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc /tmp/nctoolkityjoudsyznctoolkittmpdixe9jo3.nc']
Visualization of datasets¶
You can visualize the contents of a dataset using the plot method. Below, we will plot January temperature for the North Atlantic:
[17]:
sst = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE/sst.mon.ltm.1981-2010.nc")
sst.select_months(1)
sst.clip(lon = [-80, 20], lat = [30, 70])
sst.plot()
[17]:
To see how to use all of nctoolkit’s methods, check out the options on the left panel.
Help & reference
An A-Z guide to nctoolkit methods¶
This guide will provide examples of how to use almost every method available in nctoolkit.
add¶
This method can add to a dataset. You can add a constant, another dataset or a NetCDF file. In the case of datasets or NetCDF files the grids etc. must be of the same structure as the original dataset.
For example, if we had a temperature dataset where temperature was in Celsius, we could convert it to Kelvin by adding 273.15.
data.add(273.15)
If we have two datasets, we add one to the other as follows:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.add(data2)
In the above example, all we are doing is adding infile2 to data1, so instead we could simply do this:
data1.add(infile2)
annual_anomaly¶
This method will calculate the annual anomaly for each variable (and in each grid cell) compared with a baseline period. This is a standard anomaly calculation: first the mean value over the baseline period is calculated, and then the difference between each year’s values and that baseline mean is calculated.
For example, if we wanted to calculate the anomalies in a dataset compared with a baseline period of 1900-1919 we would do the following:
data.annual_anomaly(baseline=[1900, 1919])
We may be more interested in the rolling anomaly, in particular when there is a lot of annual variation. In the above case, if you wanted a 20 year rolling mean anomaly, you would do the following:
data.annual_anomaly(baseline=[1900, 1919], window=20)
By default this method works out the absolute anomaly. However, in some cases the relative anomaly is more interesting. To calculate this we set the metric argument to “relative”:
data.annual_anomaly(baseline=[1900, 1919], metric = "relative")
annual_max¶
This method will calculate the maximum value in each available year and for each grid cell of a dataset.
data.annual_max()
annual_mean¶
This method will calculate the mean value in each available year and for each grid cell of a dataset.
data.annual_mean()
annual_min¶
This method will calculate the minimum value in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_min()
annual_range¶
This method will calculate the range of values in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_range()
annual_sum¶
This method will calculate the sum of values in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_sum()
append¶
This method will let you append individual or multiple files to your dataset. Usage is straightforward. Note that this will not perform any merging on the dataset.
data.append(newfile)
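For example, you might append a file with additional time steps and then merge along time (a sketch; infile and newfile are placeholders):
data = nc.open_data(infile)
data.append(newfile)
# append does not merge, so merge the files explicitly if needed
data.merge_time()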
bottom¶
This method will extract the bottom vertical level from a dataset. This is useful for some oceanographic datasets, where the method can let you select the seabed. Note that this method will not work with all data types. For example, in ocean data with fixed depth levels, the bottom cell in the NetCDF data is not the actual seabed. See bottom_mask for these cases.
data = nc.open_data(infile)
data.bottom()
bottom_mask¶
This method will identify the bottommost level in each grid cell with a non-NA value.
data = nc.open_data(infile)
data.bottom_mask()
cdo_command¶
This method lets you run a CDO command. CDO commands are generally of the form “cdo {command} infile outfile”. cdo_command therefore only requires the command portion of this. If we wanted to run the following CDO command:
cdo -timmean -selmon,4 infile outfile
we would do the following:
data = nc.open_data(infile)
data.cdo_command("-timmean -selmon,4")
cell_areas¶
This method either adds the areas of each grid cell to the dataset or converts the dataset to a new dataset showing only the grid cell areas. By default it adds the cell areas (in square metres) to the dataset.
data = nc.open_data(infile)
data.cell_areas()
If we only want the cell areas we can set join to False:
data.cell_areas(join=False)
clip¶
This method will clip a dataset to a specified longitude and latitude box. For example, if we wanted to clip a dataset to the North Atlantic, we could do this:
data = nc.open_data(infile)
data.clip(lon = [-80, 20], lat = [40, 70])
compare_all¶
This method lets us compare all variables in a dataset with a constant. If we wanted to identify the grid cells with values above 20, we could do the following:
data = nc.open_data(infile)
data.compare_all(">20")
Similarly, if we wanted to identify grid cells with negative values we would do this:
data = nc.open_data(infile)
data.compare_all("<0")
cor_space¶
This method calculates the correlation coefficients between two variables in space for each time step. So, if we wanted to work out the correlation between the variables var1 and var2, we would do this:
data = nc.open_data(infile)
data.cor_space("var1", "var2")
cor_time¶
This method calculates the correlation coefficients between two variables in time for each grid cell. If we wanted to work out the correlation between two variables var1 and var2 we would do the following:
data = nc.open_data(infile)
data.cor_time("var1", "var2")
cum_sum¶
This method will calculate the cumulative sum, over time, for all variables. Usage is simple:
data = nc.open_data(infile)
data.cum_sum()
daily_max¶
This method will calculate the maximum value in each available day and for each grid cell of a dataset.
data.daily_max()
daily_mean¶
This method will calculate the mean value in each available day and for each grid cell of a dataset.
data.daily_mean()
daily_min¶
This method will calculate the minimum value in each available day and for each grid cell of a dataset.
data = nc.open_data(infile)
data.daily_min()
daily_range¶
This method will calculate the range of values in each available day and for each grid cell of a dataset.
data = nc.open_data(infile)
data.daily_range()
daily_sum¶
This method will calculate the sum of values in each available day and for each grid cell of a dataset.
data = nc.open_data(infile)
data.daily_sum()
daily_max_climatology¶
This method will calculate the maximum value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the maximum value ever observed on each day.
data = nc.open_data(infile)
data.daily_max_climatology()
daily_mean_climatology¶
This method will calculate the mean value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the mean value observed on each day of the year.
data = nc.open_data(infile)
data.daily_mean_climatology()
daily_min_climatology¶
This method will calculate the minimum value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the minimum value ever observed on each day.
data = nc.open_data(infile)
data.daily_min_climatology()
daily_range_climatology¶
This method will calculate the value range that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the difference between the maximum and minimum observed values each day.
data = nc.open_data(infile)
data.daily_range_climatology()
divide¶
This method will divide a dataset by a constant, or by the values in another dataset or NetCDF file. If we wanted to divide everything in a dataset by 2, we would do the following:
data = nc.open_data(infile)
data.divide(2)
If we want to divide a dataset by another, we can do this easily. Note that the datasets must be comparable, i.e. they must have the same grid. The second dataset must have either the same number of variables or only one variable. In the latter case everything is divided by that variable. The same holds for vertical levels.
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.divide(data2)
ensemble_max, ensemble_min, ensemble_range and ensemble_mean¶
These methods will calculate ensemble statistics when a dataset is made up of multiple files. Two approaches are available. First, the statistic across all available time steps can be calculated. For this, ignore_time must be set to True. For example:
data = nc.open_data(file_list)
data.ensemble_max(ignore_time = True)
The second approach is to calculate the statistic for each time step across the ensemble. For example, if the ensemble was made up of 100 files where each file contains 12 months of data, ensemble_max will work out the maximum monthly value across the ensemble. By default ignore_time is False.
data = nc.open_data(file_list)
data.ensemble_max(ignore_time = False)
ensemble_percentile¶
This method works in the same way as ensemble_mean etc. above. However, it requires an additional argument p, which is the percentile. For example, if we wanted to calculate the 75th ensemble percentile, we would do the following:
data = nc.open_data(file_list)
data.ensemble_percentile(75)
invert_levels¶
This method will invert the vertical levels of a dataset.
data = nc.open_data(infile)
data.invert_levels()
mask_box¶
This method will set everything outside a specified longitude/latitude box to NA. The code below illustrates how to mask the North Atlantic in the SST dataset.
data = nc.open_data(infile)
data.mask_box(lon = [-80, 20], lat = [40, 70])
max¶
This method will calculate the maximum value of all variables in all grid cells. If we wanted to calculate the maximum observed monthly sea surface temperature in the SST dataset we would do the following:
data = nc.open_data(infile)
data.max()
mean¶
This method will calculate the mean value of all variables in all grid cells. If we wanted to calculate the mean observed monthly sea surface temperature in the SST dataset we would do the following:
data = nc.open_data(infile)
data.mean()
merge and merge_time¶
nctoolkit offers two methods for merging the files within a multi-file dataset. These methods operate in a similar way to column based joining and row-based binding in dataframes.
The merge method is suitable for merging files that have different variables, but the same time steps. The merge_time method is suitable for merging files that have the same variables, but have different time steps.
Usage for merge_time is as simple as:
data = nc.open_data(file_list)
data.merge_time()
Merging NetCDF files with different variables is potentially risky, as it is possible to merge files that have the same number of time steps but different times. nctoolkit’s merge method therefore offers some security against a major error when merging. It requires a match argument to be supplied. This ensures that the times in each file are comparable to the others. By default match = [“year”, “month”, “day”], i.e. it checks whether the times in each file all have the same year, month and day. The match argument must be some subset of [“year”, “month”, “day”]. For example, if you wanted to only make sure the files had the same year, you would do the following:
data = nc.open_data(file_list)
data.merge(match = ["year"])
meridonial statistics¶
Calculate the following meridonial statistics: mean, min, max and range:
data.meridonial_mean()
data.meridonial_min()
data.meridonial_max()
data.meridonial_range()
monthly_anomaly¶
This method will calculate the monthly anomaly compared with the mean value for a baseline period. For example, if we wanted the monthly anomaly compared with the mean for 1990-1999 we would do the below.
data = nc.open_data(infile)
data.monthly_anomaly(baseline = [1990, 1999])
monthly_max¶
This method will calculate the maximum value in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the maximum value in each month across all available years, use monthly_max_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_max()
monthly_max_climatology¶
This method will calculate, for each month, the maximum value of each variable over all time steps.
data = nc.open_data(infile)
data.monthly_max_climatology()
monthly_mean¶
This method will calculate the mean value of each variable in each month of a dataset. Note that this is calculated for each year. See monthly_mean_climatology if you want to calculate a climatological monthly mean.
data = nc.open_data(infile)
data.monthly_mean()
monthly_mean_climatology¶
This method will calculate, for each month, the mean value of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_mean_climatology()
monthly_min¶
This method will calculate the minimum value in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the minimum value in each month across all available years, use monthly_min_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_min()
monthly_min_climatology¶
This method will calculate, for each month, the minimum value of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_min_climatology()
monthly_range¶
This method will calculate the value range in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the value range in each month across all available years, use monthly_range_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_range()
monthly_range_climatology¶
This method will calculate, for each month, the value range of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_range_climatology()
multiply¶
This method will multiply a dataset by a constant, another dataset or a NetCDF file. If multiplied by a dataset or NetCDF file, the dataset must have the same grid and can only have one variable.
If you want to multiply a dataset by 2, you can do the following:
data = nc.open_data(infile)
data.multiply(2)
If you wanted to multiply a dataset data1 by another, data2, you can do the following:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.multiply(data2)
mutate¶
This method can be used to generate new variables using arithmetic expressions. New variables are added to the dataset. The method requires a dictionary, where the key-value pairs are the names of the new variables and the expressions required to generate them.
For example, if we had a temperature dataset, with temperature in Celsius, we might want to convert it to Kelvin. We can do this easily:
data = nc.open_data(infile)
data.mutate({"temperature_k":"temperature+273.15"})
percentile¶
This method will calculate a given percentile for each variable and grid cell. This will calculate the percentile using all available timesteps.
We can calculate the 75th percentile of sea surface temperature as follows:
data = nc.open_data(infile)
data.percentile(75)
phenology¶
A number of phenological indices can be calculated. These are based on the plankton metrics listed by Ji et al. 2010. These methods require datasets or the files within a dataset to only be made up of individual years, and ideally every day of year is available. At present this method can only calculate the phenology metric for a single variable.
The available metrics are:
peak - the time of year when the maximum value of a variable occurs
middle - the time of year when 50% of the annual cumulative sum of a variable is first exceeded
start - the time of year when a lower threshold (which must be defined) of the annual cumulative sum of a variable is first exceeded
end - the time of year when an upper threshold (which must be defined) of the annual cumulative sum of a variable is first exceeded
For example, if you wanted to calculate the timing of the peak, you set metric to “peak”, and define the variable to be analyzed:
data = nc.open_data(infile)
data.phenology(metric = "peak", var = "var_chosen")
plot¶
This method will plot the contents of a dataset. It will show either a map or a time series, depending on the data type. While it should work on at least 90% of NetCDF data, there are some data types that remain incompatible; support for these will be added to nctoolkit over time. Usage is simple:
data = nc.open_data(infile)
data.plot()
range¶
This method calculates the range for all variables in each grid cell across all time steps.
We can calculate the range of sea surface temperatures in the SST dataset as follows:
data = nc.open_data(infile)
data.range()
regrid¶
This method will remap a dataset to a new grid. This grid must be either a pandas data frame, a NetCDF file or a single file nctoolkit dataset.
For example, if we wanted to regrid a dataset to a single location, we could do the following:
import pandas as pd
data = nc.open_data(infile)
grid = pd.DataFrame({"lon":[-20], "lat":[50]})
data.regrid(grid, method = "nn")
If we wanted to regrid one dataset, dataset1, to the grid of another, dataset2, using bilinear interpolation, we would do the following:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.regrid(data2, method = "bil")
remove_variables¶
This method will remove variables from a dataset. Usage is simple, with the method only requiring either a single variable name or a list of variables to remove:
data = nc.open_data(infile)
data.remove_variables(vars)
rename¶
This method allows you to rename variables. It requires a dictionary, with key-value pairs representing the old and new variable names. For example, if we wanted to rename a variable old to new, we would do the following:
data = nc.open_data(infile)
data.rename({"old":"new"})
rolling_max¶
This method will calculate the rolling maximum over a specified window. For example, if you needed to calculate the rolling maximum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_max(window = 10)
rolling_mean¶
This method will calculate the rolling mean over a specified window. For example, if you needed to calculate the rolling mean with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_mean(window = 10)
rolling_min¶
This method will calculate the rolling minimum over a specified window. For example, if you needed to calculate the rolling minimum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_min(window = 10)
rolling_range¶
This method will calculate the rolling range over a specified window. For example, if you needed to calculate the rolling range with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_range(window = 10)
rolling_sum¶
This method will calculate the rolling sum over a specified window. For example, if you needed to calculate the rolling sum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_sum(window = 10)
run¶
This method will evaluate all of a dataset’s unevaluated commands. Usage is simple:
nc.options(lazy = True)
data = nc.open_data(infile)
data.select_years(1990)
data.run()
seasonal_max¶
This method will calculate the maximum value observed in each season. Note this is worked out for the seasons of each year. See seasonal_max_climatology for climatological seasonal maximums.
data.seasonal_max()
seasonal_max_climatology¶
This method calculates the maximum value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_max_climatology()
seasonal_mean¶
This method will calculate the mean value observed in each season. Note this is worked out for the seasons of each year. See seasonal_mean_climatology for climatological seasonal means.
data = nc.open_data(infile)
data.seasonal_mean()
seasonal_mean_climatology¶
This method calculates the mean value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_mean_climatology()
seasonal_min¶
This method will calculate the minimum value observed in each season. Note this is worked out for the seasons of each year. See seasonal_min_climatology for climatological seasonal minimums.
data = nc.open_data(infile)
data.seasonal_min()
seasonal_min_climatology¶
This method calculates the minimum value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_min_climatology()
seasonal_range¶
This method will calculate the value range observed in each season. Note this is worked out for the seasons of each year. See seasonal_range_climatology for climatological seasonal ranges.
data = nc.open_data(infile)
data.seasonal_range()
seasonal_range_climatology¶
This method calculates the value range observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_range_climatology()
select¶
A method to subset a dataset based on multiple criteria. This acts as a wrapper for select_variables, select_months, select_years, select_seasons, and select_timesteps, with the args used being variables, months, years, seasons, and timesteps. Subsetting will occur in the order given. For example, if you want to select the years 1990 and 1991 and months June and July, you would do the following:
data.select(years = [1990, 1991], months = [6, 7])
select_months¶
This method allows you to subset a dataset to specific months. This can either be a single month, a list of months or a range. For example, if we wanted the first half of a year, we would do the following:
data = nc.open_data(infile)
data.select_months(range(1, 7))
select_variables¶
This method allows you to subset a dataset to specific variables. This either accepts a single variable or a list of variables. For example, if you wanted two variables, var1 and var2, you would do the following:
data = nc.open_data(infile)
data.select_variables(["var1", "var2"])
select_years¶
This method subsets datasets to specified years. It will accept either a single year, a list of years, or a range. For example, if you wanted to subset a dataset to the 1990s, you would do the following:
data = nc.open_data(infile)
data.select_years(range(1990, 2000))
set_missing¶
This method allows you to set a value or range of values to missing. It accepts either a single value or a list of two values specifying the range to be set to missing. For example, if you wanted all values between 0 and 10 to be set to missing, you would do the following:
data = nc.open_data(infile)
data.set_missing([0, 10])
shift¶
This method allows you to shift time by a set number of hours, days, months or years. It acts as a wrapper for shift_hours, shift_days, shift_months and shift_years, using the args hours, days, months, or years. It takes any number of these arguments. So, if you wanted to shift time forward by 1 year, 1 month and 1 day you would do the following:
data = nc.open_data(infile)
data.shift(years = 1, months = 1, days = 1)
shift_days¶
This method allows you to shift time by a set number of days. For example, if you want time moved forward by 2 days you would do the following:
data = nc.open_data(infile)
data.shift_days(2)
shift_hours¶
This method allows you to shift time by a set number of hours. For example, if you want time moved back by 1 hour you would do the following:
data = nc.open_data(infile)
data.shift_hours(-1)
shift_months¶
This method allows you to shift time by a set number of months. For example, if you want time moved forward by 2 months you would do the following:
data = nc.open_data(infile)
data.shift_months(2)
shift_years¶
This method allows you to shift time by a set number of years. For example, if you want time moved forward by 10 years you would do the following:
data = nc.open_data(infile)
data.shift_years(10)
spatial_max¶
This method will calculate the maximum value observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_max()
spatial_mean¶
This method will calculate the spatial mean for each variable and time step. If the grid cell area can be calculated, this will be an area weighted mean. Usage is simple:
data = nc.open_data(infile)
data.spatial_mean()
spatial_min¶
This method will calculate the minimum observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_min()
spatial_percentile¶
This method will calculate the percentile of each variable across space for each time step. For example, if you wanted to calculate the 75th percentile, you would do the following:
data = nc.open_data(infile)
data.spatial_percentile(p=75)
spatial_range¶
This method will calculate the value range observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_range()
spatial_sum¶
This method will calculate the spatial sum for each variable and time step. In some cases, for example when variables are concentrations, it makes more sense to multiply the value in each grid cell by the grid cell area when calculating the spatial sum. This method therefore has an argument by_area, which defines whether to multiply the variable value by the area when doing the sum. By default by_area is False.
Usage is simple:
data = nc.open_data(infile)
data.spatial_sum()
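For a concentration-style variable, where values should be weighted by cell area, you would instead set by_area as described above:
data.spatial_sum(by_area = True)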
split¶
Except for methods that begin with merge or ensemble, all nctoolkit methods operate on the individual files within a dataset. There are therefore cases when you might want to split a dataset into separate files for analysis. This can be done using split, which lets you split a file into separate years, months or year/month combinations. For example, if you want to split a dataset into files of different years, you can do this:
data = nc.open_data(infile)
data.split("year")
subtract¶
This method can subtract from a dataset. You can subtract a constant, another dataset or a NetCDF file. In the case of datasets or NetCDF files the grids etc. must be of the same structure as the original dataset.
For example, if we had a temperature dataset where temperature was in Kelvin, we could convert it to Celsius by subtracting 273.15.
data = nc.open_data(infile)
data.subtract(273.15)
sum¶
This method will calculate the sum of values of all variables in all grid cells. Usage is simple:
data = nc.open_data(infile)
data.sum()
surface¶
This method will extract the surface level from a multi-level dataset. Usage is simple:
data = nc.open_data(infile)
data.surface()
to_dataframe¶
This method will return a pandas dataframe with the contents of the dataset. This has a decode_times argument to specify whether you want the times to be decoded. Defaults to True. Usage is simple:
data = nc.open_data(infile)
data.to_dataframe()
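If you would rather keep the raw, undecoded time values, switch off the decode_times argument described above:
data.to_dataframe(decode_times = False)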
to_latlon¶
This method will regrid a dataset to a regular latlon grid. The minimum and maximum longitudes and latitudes must be specified, along with the horizontal and vertical resolutions.
data = nc.open_data(infile)
data.to_latlon(lon = [-80, 20], lat = [30, 80], res = [1,1])
to_xarray¶
This method will return an xarray Dataset with the contents of the dataset. This has a decode_times argument to specify whether you want the times to be decoded. Defaults to True. Usage is simple:
data = nc.open_data(infile)
data.to_xarray()
transmute¶
This method can be used to generate new variables using arithmetic expressions. Existing variables will be removed from the dataset. See mutate if you want to keep existing variables. The method requires a dictionary, where the key-value pairs are the names of the new variables and the expressions required to generate them.
For example, if we had a temperature dataset, with temperature in Celsius, we might want to convert that to Kelvin. We can do this easily:
data = nc.open_data(infile)
data.transmute({"temperature_k":"temperature+273.15"})
var¶
This method calculates the variance of each variable in the dataset. This is calculated across all time steps. Usage is simple:
data = nc.open_data(infile)
data.var()
vertical_interp¶
This method interpolates variables vertically. It requires a list of vertical levels, for example depths, that you want to interpolate to. For example, if you had an ocean dataset and you wanted to interpolate to 10 and 20 metres you would do the following:
data = nc.open_data(infile)
data.vertical_interp(levels = [10, 20])
vertical_max¶
This method calculates the maximum value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_max()
vertical_mean¶
This method calculates the mean value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_mean()
vertical_min¶
This method calculates the minimum value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_min()
vertical_range¶
This method calculates the value range of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_range()
vertical_sum¶
This method calculates the sum of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_sum()
write_nc¶
This method allows you to write the contents of a dataset to a NetCDF file. If the target file exists and you want to overwrite it set overwrite to True. Usage is simple:
data.write_nc(outfile)
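If outfile already exists, overwriting must be requested explicitly, as described above:
data.write_nc(outfile, overwrite = True)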
zip¶
This method will zip the contents of a dataset. This is mostly useful in processing chains where you want to minimize the disk space used by the output. Please note this method works lazily. In the code below only one file is generated, a zipped outfile.
nc.options(lazy = True)
data = nc.open_data(infile)
data.select_years(1990)
data.zip()
data.write_nc(outfile)
zonal statistics¶
Calculate the following zonal statistics: mean, min, max and range:
data.zonal_mean()
data.zonal_min()
data.zonal_max()
data.zonal_range()
How to guide¶
This guide will show how to carry out key nctoolkit operations. We will use a sea surface temperature data set and a depth-resolved ocean temperature data set. The data set can be downloaded from here.
[1]:
import nctoolkit as nc
import os
import pandas as pd
import xarray as xr
How to select years and months¶
If we want to select specific years and months we can use the select_years and select_months methods:
[2]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(1960)
sst.select_months(1)
sst.times
[2]:
['1960-01-01T00:00:00']
How to calculate the mean, min, max etc.¶
If you want to calculate the mean value of a variable over all time steps you can use mean:
[3]:
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.plot()
[3]:
Similarly, if you want to calculate the minimum, maximum, sum and range of values over time just use min, max, sum and range.
How to copy a data set¶
If you want to make a deep copy of a data set, use the built in copy method. This method will return a new data set. This method should be used because of nctoolkit’s built in methods to automatically delete temporary files that are no longer required. Behind the scenes, using copy will result in nctoolkit registering that it needs the NetCDF file for both the original dataset and the new copied one. So if you copy a dataset, and then delete the original, nctoolkit knows to not remove any NetCDF files related to the dataset.
[4]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(1960)
sst.select_months(1)
sst1 = sst.copy()
del sst
os.path.exists(sst1.current)
[4]:
True
How to clip to a region¶
If you want to clip the data to a specific longitude and latitude box, we can use clip, with the longitude and latitude range given by lon and lat.
[5]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_months(1)
sst.select_years(1980)
sst.clip(lon = [-80, 20], lat = [40, 70])
sst.plot()
[5]:
How to rename a variable¶
If we want to rename a variable we use the rename method, and supply a dictionary where the key-value pairs are the original and new names:
[6]:
sst = nc.open_data("sst.mon.mean.nc")
sst.variables
[6]:
['sst']
The original dataset had only one variable called sst. We can now rename it, and display the new variables.
[7]:
sst.rename({"sst": "temperature"})
sst.variables
[7]:
['temperature']
How to create new variables¶
New variables can be created using arithmetic operations with either mutate or transmute. The mutate method will keep the original variables, whereas transmute will not. These methods require a dictionary, where the key-value pairs are the names of the new variables and the arithmetic operations to perform. The example below shows how to create a new variable with SST in Kelvin.
[8]:
sst = nc.open_data("sst.mon.mean.nc")
sst.mutate({"sst_k": "sst+273.15"})
sst.variables
[8]:
['sst', 'sst_k']
How to calculate a spatial average¶
You can calculate a spatial average using the spatial_mean method. There are additional methods for maximums etc.
[9]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.plot()
[9]:
How to calculate an annual mean¶
You can calculate an annual mean using the annual_mean method.
[10]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_mean()
sst.plot()
[10]:
How to calculate a rolling average¶
You can calculate a rolling mean using the rolling_mean method, with the window argument providing the number of time steps to average over. There are additional methods for rolling sums etc. The code below will calculate a rolling mean of global SST using a 20 year window.
[11]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_mean()
sst.rolling_mean(20)
sst.plot()
[11]:
How to calculate temporal anomalies¶
You can calculate annual temporal anomalies using the annual_anomaly method. This requires a baseline period.
[12]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_anomaly(baseline = [1960, 1979])
sst.plot()
[12]:
How to split data by year etc¶
Files within a dataset can be split by year, day, year and month, or season using the split method. If we wanted to split by year, we do the following:
[13]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
How to merge files in time¶
We can merge files based on time using merge_time. We can demonstrate this by merging the dataset that results from splitting the original sst dataset. If we split the dataset by year, we see that there are 169 files, one for each year.
[14]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
We can then merge them together to get a single file dataset:
[15]:
sst.merge_time()
How to do variables-based merging¶
If we have two or more files that have the same time steps, but different variables, we can merge them using merge. The code below will first create a dataset with a NetCDF file with SST in K. It will then create a new dataset with this NetCDF file and the original, and then merge them.
[16]:
sst1 = nc.open_data("sst.mon.mean.nc")
sst2 = nc.open_data("sst.mon.mean.nc")
sst2.transmute({"sst_k": "sst+273.15"})
new_sst = nc.open_data([sst1.current, sst2.current])
new_sst.current
new_sst.merge()
In some cases we will have two or more datasets we want to merge. In this case we can use the merge function as follows:
[17]:
sst1 = nc.open_data("sst.mon.mean.nc")
sst2 = nc.open_data("sst.mon.mean.nc")
sst2.transmute({"sst_k": "sst+273.15"})
new_sst = nc.merge(sst1, sst2)
new_sst.variables
[17]:
['sst', 'sst_k']
How to horizontally regrid data¶
Variables can be regridded horizontally using regrid. This method requires the new grid to be defined. This can either be a pandas data frame with lon/lat as columns, an xarray object, a NetCDF file or an nctoolkit dataset. I will demonstrate three of these approaches by regridding SST to the North Atlantic. Let’s begin by getting a grid for the North Atlantic.
[18]:
new_grid = nc.open_data("sst.mon.mean.nc")
new_grid.clip(lon = [-80, 20], lat = [30, 70])
new_grid.select_months(1)
new_grid.select_years(2000)
First, we will use the new dataset itself to do the regridding. I will calculate mean SST using the original data, and then regrid to the North Atlantic.
[19]:
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = new_grid)
sst.plot()
[19]:
We can also do this using the NetCDF file, which is new_grid.current:
[20]:
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = new_grid.current)
sst.plot()
[20]:
or we can use a pandas data frame. In this case I will convert an xarray dataset to a data frame to get the lon/lat columns.
[21]:
na_grid = xr.open_dataset(new_grid.current)
na_grid = na_grid.to_dataframe().reset_index().loc[:,["lon", "lat"]]
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = na_grid)
sst.plot()
[21]:
How to temporally interpolate¶
Temporal interpolation can be carried out using time_interp. This method requires a start date (start) of the format YYYY/MM/DD, an end date (end), and a temporal resolution (resolution), which is either 1 day (“daily”), 1 week (“weekly”), 1 month (“monthly”), or 1 year (“yearly”).
[22]:
sst = nc.open_data("sst.mon.mean.nc")
sst.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")
How to calculate a monthly average from daily data¶
If you have daily data, you can calculate a monthly average using monthly_mean. There are also methods for maximums etc.
[23]:
sst = nc.open_data("sst.mon.mean.nc")
sst.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")
sst.monthly_mean()
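For example, the monthly maximum of the daily data can be calculated in the same way (a sketch; monthly_max is assumed to mirror monthly_mean, matching the monthly maximum listed in the API reference below):
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
sst.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")
# monthly maximum of the interpolated daily data
sst.monthly_max()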
How to calculate a monthly climatology¶
If we want to calculate the mean value of variables for each month in a given dataset, we can use the monthly_mean_climatology method as follows:
[24]:
sst = nc.open_data("sst.mon.mean.nc")
sst.monthly_mean_climatology()
sst.select_months(1)
sst.plot()
[24]:
How to calculate a seasonal climatology¶
Similarly, a seasonal climatology can be calculated using the seasonal_mean_climatology method:
[25]:
sst = nc.open_data("sst.mon.mean.nc")
sst.seasonal_mean_climatology()
sst.select_timesteps(0)
sst.plot()
[25]:
How to read a dataset using pandas or xarray¶
To read the dataset in as an xarray Dataset, use to_xarray:
[27]:
sst = nc.open_data("sst.mon.mean.nc")
sst.to_xarray()
[27]:
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 2028)
Coordinates:
  * lat      (lat) float32 89.5 88.5 87.5 86.5 85.5 ... -86.5 -87.5 -88.5 -89.5
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * time     (time) datetime64[ns] 1850-01-01 1850-02-01 ... 2018-12-01
Data variables:
    sst      (time, lat, lon) float32 ...
Attributes:
    title:            created 12/2013 from data provided by JRA
    history:          Created 12/2012 from data obtained from JRA by ESRL/PSD
    platform:         Analyses
    citation:         Hirahara, S., Ishii, M., and Y. Fukuda,2014: Centennial...
    institution:      NOAA ESRL/PSD
    Conventions:      CF-1.2
    References:       http://www.esrl.noaa.gov/psd/data/gridded/cobe2.html
    dataset_title:    COBE-SST2 Sea Surface Temperature and Ice
    original_source:  https://climate.mri-jma.go.jp/pub/ocean/cobe-sst2/
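Once the data is in xarray, the usual xarray tooling applies. For example (a sketch using xarray’s built-in matplotlib plotting):
[ ]:
ds = sst.to_xarray()
# map of SST for the first time step, January 1850
ds["sst"].isel(time = 0).plot()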
To read the dataset in as a pandas data frame, use to_dataframe:
[28]:
sst.to_dataframe()
[28]:
                            sst
lat    lon    time
89.5   0.5    1850-01-01   -1.712
              1850-02-01   -1.698
              1850-03-01   -1.707
              1850-04-01   -1.742
              1850-05-01   -1.725
...           ...             ...
-89.5  359.5  2018-08-01      NaN
              2018-09-01      NaN
              2018-10-01      NaN
              2018-11-01      NaN
              2018-12-01      NaN

131414400 rows × 1 columns
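The data frame is indexed by (lat, lon, time), so standard pandas operations apply. A sketch (note this is an unweighted mean, so it is only illustrative; unlike spatial_mean it does not weight by cell area):
[ ]:
df = sst.to_dataframe()
# unweighted global mean SST per time step; cell areas are ignored,
# so this is illustrative rather than a true spatial mean
df.groupby(level = "time")["sst"].mean().head()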
How to calculate cell areas¶
If we want to calculate the area of each cell in a dataset, we use the cell_areas method. The join argument lets you choose whether to join the cell areas to the existing dataset, or to only include the cell areas in the result.
[29]:
sst = nc.open_data("sst.mon.mean.nc")
sst.cell_areas(join=False)
sst.plot()
[29]:
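To keep sst and attach the cell areas as an additional variable instead, join can be switched on (a sketch, assuming join=True joins the areas to the existing dataset as described above):
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# join the cell areas to the existing dataset rather than replacing it
sst.cell_areas(join = True)
sst.variables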
How to use urls¶
If a file is located at a url, we can send it to open_data:
[30]:
url = "ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(url)
Downloading ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.ltm.1981-2010.nc
This will download the file from the url and save it as a temp file. We can then work with it as usual. A future release of nctoolkit will have thredds support.
How to calculate an ensemble average¶
nctoolkit has built-in methods for working with ensembles. Let’s start by splitting the 1850-2019 SST dataset into an ensemble, where each file is a separate year:
[31]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
An ensemble mean can be calculated in two ways. First, we can calculate the mean for each time step across the ensemble. Here each file holds monthly temperatures for one year from 1850 onwards, so this gives the mean temperature for each month across all years; from there we can calculate the global mean:
[32]:
sst.ensemble_mean()
sst.spatial_mean()
sst.plot()
[32]:
We might want to calculate the average over all time steps, i.e. the mean temperature since 1850. We do this by setting the ignore_time argument to True:
[33]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
sst.ensemble_mean(ignore_time=True)
sst.plot()
[33]:
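If an ensemble already exists as separate NetCDF files on disk, the create_ensemble function listed in the API reference can build the file list, which open_data accepts directly. A sketch, assuming create_ensemble returns the NetCDF files found in a (hypothetical) directory:
[ ]:
# build a list of NetCDF files from a directory (hypothetical path)
ensemble = nc.create_ensemble("/path/to/ensemble")
ds = nc.open_data(ensemble)
ds.ensemble_mean()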
API Reference¶
Reading/copying data¶
Read netcdf data as a DataSet object
Read netcdf data from a url as a DataSet object
Read thredds data as a DataSet object
Make a deep copy of a DataSet object
Merging or analyzing multiple datasets¶
Merge datasets
Calculate the temporal correlation coefficient between two datasets. This is calculated for each grid cell, across time steps.
Calculate the spatial correlation coefficient between two datasets. This is calculated for each time step, across grid cells.
Accessing attributes¶
List variables contained in a dataset
List years contained in a dataset
List months contained in a dataset
List times contained in a dataset
List levels contained in a dataset
The size of an object. This will print the number of files, total size, and smallest and largest files in a DataSet object.
The current file or files in the DataSet object
The history of operations on the DataSet
The starting file or files of the DataSet object
Plotting¶
Autoplotting method
Open the current dataset’s file in ncview
Variable modification¶
Create new variables using mathematical expressions, and keep original variables
Create new variables using mathematical expressions, and drop original variables
Rename variables in a dataset
Set the missing value for a single number or a range
Calculate the sum of all variables for each time step
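As a sketch of the first two methods: transmute was shown in the tutorial above, and mutate is assumed here to be its keep-the-originals counterpart:
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# add SST in Kelvin while keeping the original variable
# (mutate is an assumed method name based on the description above)
sst.mutate({"sst_k": "sst+273.15"})
sst.variables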
NetCDF file attribute modification¶
Set the long names of variables
Set the units for variables
Vertical/level methods¶
Extract the top/surface level from a dataset. This extracts the first vertical level from each file in a dataset.
Extract the bottom level from a dataset. This extracts the bottom level from each NetCDF file.
Vertically interpolate a dataset based on given vertical levels. This is calculated for each time step and grid cell.
Calculate the depth-averaged mean for each variable. This is calculated for each time step and grid cell.
Calculate the vertical minimum of variable values. This is calculated for each time step and grid cell.
Calculate the vertical maximum of variable values. This is calculated for each time step and grid cell.
Calculate the vertical range of variable values. This is calculated for each time step and grid cell.
Calculate the vertical sum of variable values. This is calculated for each time step and grid cell.
Invert the levels of 3D variables. This is calculated for each time step and grid cell.
Create a mask identifying the deepest cell without missing values.
Rolling methods¶
Calculate a rolling mean based on a window
Calculate a rolling minimum based on a window
Calculate a rolling maximum based on a window
Calculate a rolling sum based on a window
Calculate a rolling range based on a window
Evaluation setting¶
Run all stored commands in a dataset
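Where commands are stored rather than evaluated immediately, run evaluates them; a minimal sketch (if evaluation is immediate, run simply has nothing to do):
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
# evaluate any commands stored but not yet run
sst.run()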
Cleaning functions¶
Ensemble creation¶
Generate an ensemble
Arithmetic methods¶
Create new variables using mathematical expressions, and keep original variables
Create new variables using mathematical expressions, and drop original variables
Add to a dataset. This will add a constant, another dataset or a NetCDF file to the dataset.
Subtract from a dataset. This will subtract a constant, another dataset or a NetCDF file from the dataset.
Multiply a dataset. This will multiply a dataset by a constant, another dataset or a NetCDF file.
Divide the data. This will divide the dataset by a constant, another dataset or a NetCDF file.
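A sketch of the arithmetic methods, assuming the names add, subtract, multiply and divide from the descriptions above:
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# convert from degrees Celsius to Kelvin by adding a constant
sst.add(273.15)
# and back again
sst.subtract(273.15)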
Ensemble statistics¶
Calculate an ensemble mean
Calculate an ensemble minimum
Calculate an ensemble maximum
Calculate an ensemble percentile. This will calculate the percentiles for each time step in the files.
Calculate an ensemble range. The range is calculated for each time step; for example, if each file in the ensemble has 12 months of data the statistic will be calculated for each month.
Subsetting operations¶
Clip to a rectangular longitude and latitude box
Select variables from a dataset
Remove variables. This will remove the stated variables from files in the dataset.
Select years from a dataset. This method will subset the dataset to only contain years within the list given.
Select months from a dataset. This method will subset the dataset to only contain months within the list given.
Select a season from a dataset
Select time steps from a dataset
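These can be chained to cut a dataset down quickly; a sketch using methods shown in the tutorial above:
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# North Atlantic, summer months of the 1990s
sst.clip(lon = [-80, 20], lat = [30, 70])
sst.select_years(list(range(1990, 2000)))
sst.select_months([6, 7, 8])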
Time-based methods¶
Set the date in a dataset. You should only do this if you need to fix or change a dataset with a single date, not multiple dates.
Shift times in a dataset by a number of hours
Shift times in a dataset by a number of days
Shift times in a dataset by a number of months
Shift times in a dataset by a number of years
Shift method
Interpolation methods¶
Regrid a dataset to a target grid
Regrid a dataset to a regular latlon grid
Temporally interpolate variables based on date range and time resolution
Masking methods¶
Mask a lon/lat box
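A sketch, assuming the method is mask_box and takes lon/lat extents like clip:
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# set values outside the North Atlantic box to missing
# (mask_box and its lon/lat arguments are assumptions)
sst.mask_box(lon = [-80, 20], lat = [30, 70])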
Summary methods¶
Calculate annual anomalies for each variable based on a baseline period. The anomaly is derived by first calculating the climatological annual mean for the given baseline period.
Calculate monthly anomalies based on a baseline period. The anomaly is derived by first calculating the climatological monthly mean for the given baseline period.
Calculate phenologies from a dataset. Each file in an ensemble must only cover a single year, and ideally have all days.
Statistical methods¶
Calculate the temporal mean of all variables
Calculate the temporal minimum of all variables
Calculate the temporal percentile of all variables
Calculate the temporal maximum of all variables
Calculate the temporal sum of all variables
Calculate the temporal range of all variables
Calculate the temporal variance of all variables
Calculate the temporal cumulative sum of all variables
Calculate the correlation coefficient between two variables in space. This is calculated for each time step.
Calculate the correlation coefficient in time between two variables. The correlation is calculated for each grid cell, ignoring missing values.
Calculate the area-weighted spatial mean for all variables. This is performed for each time step.
Calculate the spatial minimum for all variables. This is performed for each time step.
Calculate the spatial maximum for all variables. This is performed for each time step.
Calculate the spatial sum for all variables. This is performed for each time step.
Calculate the spatial range for all variables. This is performed for each time step.
Calculate the monthly mean for each year/month combination in files.
Calculate the monthly minimum for each year/month combination in files.
Calculate the monthly maximum for each year/month combination in files.
Calculate the monthly range for each year/month combination in files.
Calculate the daily mean for each variable
Calculate the daily minimum for each variable
Calculate the daily maximum for each variable
Calculate the daily range for each variable
Calculate a daily mean climatology
Calculate a daily minimum climatology
Calculate a daily maximum climatology
Calculate a daily range climatology
Calculate the monthly mean climatology. Defined as the mean value in each month across all years.
Calculate the monthly minimum climatology. Defined as the minimum value in each month across all years.
Calculate the monthly maximum climatology. Defined as the maximum value in each month across all years.
Calculate the monthly range climatology. Defined as the range of values in each month across all years.
Calculate the annual mean for each variable
Calculate the annual minimum for each variable
Calculate the annual maximum for each variable
Calculate the annual sum for each variable
Calculate the annual range for each variable
Calculate the seasonal mean for each year.
Calculate the seasonal minimum for each year.
Calculate the seasonal maximum for each year.
Calculate the seasonal range for each year.
Calculate a climatological seasonal mean
Calculate a climatological seasonal minimum. This is defined as the minimum value in each season across all years.
Calculate a climatological seasonal maximum. This is defined as the maximum value in each season across all years.
Calculate a climatological seasonal range. This is defined as the range of values in each season across all years.
Calculate the zonal mean for each year/month combination in files.
Calculate the zonal minimum for each year/month combination in files.
Calculate the zonal maximum for each year/month combination in files.
Calculate the zonal range for each year/month combination in files.
Calculate the meridional mean for each year/month combination in files.
Calculate the meridional minimum for each year/month combination in files.
Calculate the meridional maximum for each year/month combination in files.
Calculate the meridional range for each year/month combination in files.
Seasonal methods¶
Calculate the seasonal mean for each year.
Calculate the seasonal minimum for each year.
Calculate the seasonal maximum for each year.
Calculate the seasonal range for each year.
Calculate a climatological seasonal mean
Calculate a climatological seasonal minimum. This is defined as the minimum value in each season across all years.
Calculate a climatological seasonal maximum. This is defined as the maximum value in each season across all years.
Calculate a climatological seasonal range. This is defined as the range of values in each season across all years.
Select a season from a dataset
Merging methods¶
Merge a multi-file ensemble into a single file. Merging will occur based on the time steps in the first file.
Time-based merging of a multi-file ensemble into a single file. This method is ideal if you have the same data split over multiple files covering different time periods.
Climatology methods¶
Calculate a daily mean climatology
Calculate a daily minimum climatology
Calculate a daily maximum climatology
Calculate a daily range climatology
Calculate the monthly mean climatology. Defined as the mean value in each month across all years.
Calculate the monthly minimum climatology. Defined as the minimum value in each month across all years.
Calculate the monthly maximum climatology. Defined as the maximum value in each month across all years.
Calculate the monthly range climatology. Defined as the range of values in each month across all years.
Splitting methods¶
Split the dataset. Each file in the ensemble will be separated into new files based on the splitting argument.
Output methods¶
Save a dataset to a named file. This will only work with single-file datasets.
Open a dataset as an xarray object
Open a dataset as a pandas data frame
Zip the dataset. This will compress the files within the dataset.
Miscellaneous methods¶
Calculate the area of grid cells
Apply a cdo command
Apply an nco command
Compare all variables to a constant
Reduce dimensions of data. This will remove any dimensions with only one value.
Reduce the dataset to non-zero locations in a mask, where the mask is a single-variable dataset or the path to a NetCDF file.
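The cdo and nco pass-throughs let you apply operators nctoolkit does not wrap. A sketch, assuming the method is cdo_command and takes the operator as a string:
[ ]:
sst = nc.open_data("sst.mon.mean.nc")
# apply a raw CDO operator: monthly means
# (cdo_command is an assumed method name based on the description above)
sst.cdo_command("monmean")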