nctoolkit: Efficient and intuitive tools for analyzing netCDF data in Python¶
nctoolkit is a comprehensive Python (3.6 and above) package for analyzing netCDF data.
Core abilities include:
- Clipping to spatial regions
- Calculating climatologies
- Subsetting to specific time periods
- Calculating spatial statistics
- Creating new variables using arithmetic operations
- Calculating anomalies
- Calculating rolling and cumulative statistics
- Horizontally and vertically remapping data
- Calculating time averages
- Interactive plotting of data
- Calculating the correlations between variables
- Calculating vertical statistics for the likes of oceanic data
- Calculating ensemble statistics
- Calculating phenological metrics
Under the hood nctoolkit relies on Climate Data Operators (CDO). nctoolkit is designed as a standalone package that requires no understanding of CDO, but it also gives expert CDO users the ability to process data in Python with ease, with method chaining handled automatically.
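As a quick taste of the workflow, the sketch below opens a file as a dataset, chains a couple of manipulations and plots the result. The file name is a placeholder for any netcdf file with an sst variable:
import nctoolkit as nc
# "infile.nc" is a placeholder
data = nc.open_data("infile.nc")
data.select_variables("sst")
data.annual_mean()
data.plot()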
In addition to the guidance given here, tutorials for how to use nctoolkit are available at nctoolkit’s GitHub page.
Documentation¶
Getting Started
Installation¶
Python dependencies¶
How to install nctoolkit¶
The easiest way to install the package is using conda, which will install nctoolkit and all system dependencies:
$ conda install -c rwi nctoolkit
nctoolkit is available from the Python Packaging Index. To install nctoolkit using pip:
$ pip install nctoolkit
To install the development version from GitHub:
$ pip install git+https://github.com/r4ecology/nctoolkit.git
System dependencies¶
If you install using conda, system dependencies will be handled for you. Otherwise you must install them. There are two main system dependencies: Climate Data Operators, and NCO. The easiest way to install them is using conda:
$ conda install -c conda-forge cdo
$ conda install -c conda-forge nco
While CDO is necessary for the package to work, NCO is an optional dependency and does not have to be installed.
If you want to install CDO from source, you can use one of the bash scripts available here.
Introduction tutorial¶
The fundamental object of analysis in this package is a nctoolkit dataset. Each object is initialized with a single netcdf file or an ensemble of files, and it will keep track of any manipulations carried out.
Behind the scenes most of the manipulations are done using CDO. Datasets will keep track of all CDO and NCO commands used. However, unless you are experienced with CDO, you can ignore all of this.
Opening netcdf data¶
I will illustrate the basic usage using a climatology of global sea surface temperature from NOAA. We can download this from here. To download using wget:
wget ftp://ftp.cdc.noaa.gov/Datasets/COBE/sst.mon.ltm.1981-2010.nc
The first step in any analysis will be to import nctoolkit, which I will call nc as shorthand. Please note I am suppressing warnings to make this notebook more readable; I do not generally recommend suppressing warnings.
[1]:
import nctoolkit as nc
import warnings
warnings.filterwarnings('ignore')
Under the hood nctoolkit will generate temporary netcdf files. The package is designed to remove temp files that are no longer in use, and will automatically clean up any temporary files generated when Python closes. However, this is not 100% guaranteed to work during system crashes etc.
It is therefore recommended to do a deep_clean at the start of any session to remove any leftover netcdf files that might have existed in a previous sessions. Obviously, do not run this if you have multiple instances of nctoolkit running simultaneously.
[2]:
nc.deep_clean()
We can then set up the dataset, which we will use for manipulating the SST climatology.
[3]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
Accessing dataset attributes¶
At this point there is very little useful information in the dataset. Essentially, all it tells us is the start file. This will always remain the same.
[4]:
sst.start
[4]:
'sst.mon.ltm.1981-2010.nc'
The current state of the dataset can be found as follows.
[5]:
sst.current
[5]:
'sst.mon.ltm.1981-2010.nc'
We can access the dataset’s history as follows, which is initially empty.
[6]:
sst.history
[6]:
[]
A simple but important first task when analyzing netcdf data is knowing the variables in the file. We can find these quickly by accessing the variables attribute.
[7]:
sst.variables
[7]:
['sst', 'valid_yr_count']
Often, we will want to know the size of a dataset. This is most relevant when we are working with multiple files. We can do this by accessing the size attribute. To speed up computations, variables and size are computed lazily.
[8]:
sst.size
[8]:
'Number of files: 1\nFile size: 4.670688 MB'
In this case we can see that the file is 4 MB, and we are also told that there is only one file.
Variable selection and geographic clipping¶
We can subset netcdf files in space or time. Let's say we only cared about sea surface temperature in the North Atlantic. We can get this very easily using the following methods.
netcdf files often have variables that we are not interested in. We can therefore easily select or delete variables. If we want to select variables we can use the select_variables method, which requires either a single variable or a list of variables. Here I will select sst.
[9]:
sst.select_variables("sst")
We can now see that there is only one variable in the sst dataset.
[10]:
sst.variables
[10]:
['sst']
We can also see that a temporary file has been created containing only this variable.
[11]:
sst.current
[11]:
'/tmp/nctoolkitmexmyssrnctoolkittmp1jbp0kvg.nc'
If we want to clip the dataset geographically we can use the clip method. All we need is the longitude and latitude range. So if we wanted to clip the SST data to the North Atlantic we would do the following.
[12]:
sst.clip(lon = [-80, 20], lat = [30, 80])
We have now carried out some manipulations on the dataset. So, the current file has now changed.
Likewise, we now have a history to look at.
[13]:
sst.history
[13]:
['cdo -L -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitmexmyssrnctoolkittmp1jbp0kvg.nc',
'cdo -L -sellonlatbox,-80,20,30,80 /tmp/nctoolkitmexmyssrnctoolkittmp1jbp0kvg.nc /tmp/nctoolkitmexmyssrnctoolkittmpbp36y2p_.nc']
This will give us the list of CDO or NCO commands used under the hood. nctoolkit is designed to be usable without any prior knowledge of CDO or NCO.
Deleting an object¶
If we want to delete a dataset we simply use the standard python del approach. nctoolkit has been designed so that it is constantly cleaning up the system using a simple rule: only keep temp files created if they are among the current files of datasets in the current session. Right now, we only have one dataset, called “sst”. So if we delete “sst” it will also delete the current temp file from that dataset. We can see this by looking at what happens to the temp file related to sst when we delete sst. Right now it exists on the system.
[14]:
import os
x = sst.current
os.path.exists(x)
[14]:
True
But if we delete sst, this file will disappear.
[15]:
del sst
os.path.exists(x)
[15]:
False
Viewing a dataset using the auto plot feature¶
nctoolkit has a built-in, though slightly experimental, method for quick plotting. This will check the contents of the dataset and plot accordingly. The general approach of autoplot is very similar to ncview on the command line.
[16]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_months(1)
sst.reduce_dims()
sst.plot()
Statistical operations¶
nctoolkit has a large number of built in statistical operations, largely built around the methods available in CDO.
Time averaging¶
Averaging in time is one of the most common operations required on netcdf data. nctoolkit allows users to calculate long-term time averages, monthly climatologies, seasonal summaries and many other common statistics.
In this case we are analyzing a monthly climatology of SST. However, what we really might be interested in is the annual average. This can be calculated using the simple mean method, which will calculate the mean over all time steps.
[17]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.mean()
sst.reduce_dims()
sst.plot()
Instead of the annual mean, we might be interested in the range of temperatures during the year.
[18]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.range()
sst.reduce_dims()
sst.plot()
Other operations, such as maximum, minimum, and standard deviation are available.
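For example, the maximum over all time steps can be calculated with max, and the other statistics follow the same pattern:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
# maximum monthly SST in each grid cell
sst.max()
sst.reduce_dims()
sst.plot()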
Spatial statistics¶
Let’s move on to some more advanced methods. I will illustrate these using NOAA’s long-term monthly global data set of sea surface temperatures from 1850 to the present day. You can learn more about this data set here. This file is approximately 500 MB.
To download using wget:
wget ftp://ftp.cdc.noaa.gov/Datasets/COBE/sst.mon.mean.nc
This is a long-term data set of global sea surface temperature. So, let’s find out what has happened to average global sea surface temperature since 1850. Unsurprising spoiler: it has been going up. Let’s start by setting up the dataset.
[19]:
ff = "sst.mon.mean.nc"
sst = nc.open_data(ff)
We now need to calculate the average global SST. We can do this using the spatial_mean method. This will calculate an area weighted mean for each time step.
[20]:
sst.spatial_mean()
We can now plot the time series of monthly global mean SST since 1850.
[21]:
sst.plot()
Our time series shows that, as expected, SST increased during the 20th century. However, this figure has too much noise: we do not care about month-to-month variation. Instead, let's look at the rolling 20 year mean. To do this, we will first calculate an annual mean and then calculate the rolling mean using a window of 20 years. Alternatively, we could calculate a rolling mean on the initial monthly data using a window of 20*12 = 240 months.
[22]:
ff = "sst.mon.mean.nc"
sst = nc.open_data(ff)
sst.spatial_mean()
To calculate the annual mean we can simply use the annual_mean method.
[23]:
sst.annual_mean()
To calculate the rolling mean, we can use rolling_mean, with window set to 20.
[24]:
sst.rolling_mean(window = 20)
[25]:
sst.plot()
This looks much cleaner. Please note that at present nctoolkit does not adjust the time outputs from CDO, so in this case each rolling mean is centred in the middle of its 20 year window. As nctoolkit evolves, more windowing options will be provided to users.
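As mentioned above, the alternative is to apply the rolling mean directly to the monthly data. A quick sketch:
ff = "sst.mon.mean.nc"
sst = nc.open_data(ff)
sst.spatial_mean()
# 20 years x 12 months = 240 monthly time steps
sst.rolling_mean(window = 240)
sst.plot()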
Ensemble methods¶
Merging files with different variables¶
This notebook will outline some general methods for doing comparisons of multiple files. We will work with two different sea surface temperature data sets from NOAA and the Met Office Hadley Centre.
[1]:
import nctoolkit as nc
import pandas as pd
import xarray as xr
import numpy as np
Let’s start by downloading the files using wget. Uncomment the code below to do this (note: you will need to extract the HadISST dataset):
[2]:
# ! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
# ! wget https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz
The first step is to get the data. We will start by creating two separate datasets for each file.
[3]:
sst_noaa = nc.open_data("sst.mon.mean.nc")
sst_hadley = nc.open_data("HadISST_sst.nc")
We can see that both datasets have sea surface temperature labelled as sst. So we will need to change that.
[4]:
sst_noaa.variables
[4]:
['sst']
[5]:
sst_hadley.variables
[5]:
['time_bnds', 'sst']
[6]:
sst_noaa.rename({"sst":"noaa"})
sst_hadley.rename({"sst":"hadley"})
The data sets also cover different time periods, and they only overlap between 1870 and 2018. So we will need to select those years.
[7]:
sst_noaa.select_years(range(1870, 2019))
sst_hadley.select_years(range(1870, 2019))
We also have a problem in that there are two horizontal grids in the Hadley Centre file. We can solve this by selecting the sst variable only.
[8]:
sst_hadley.select_variables("hadley")
At this point, the datasets have the same number of time steps and months covered. However, the grids are still a bit different, so we want to unify them by regridding one dataset onto the other's grid using regrid. You could also regrid to any other grid of your choosing.
[9]:
sst_noaa.regrid(grid = sst_hadley)
We now have two separate datasets. Let's create a new dataset that holds both of them, and then merge them. When doing this we need to make sure missing values are treated properly. In this case the Hadley Centre values are not NAs where they should be, so we need to fix that. The merge method also requires a strict matching criterion for the dates in the files being merged. In this case the Hadley Centre and NOAA data sets both give monthly means, but use a different day of the month, so we will set match to ["year", "month"] to ensure there are no mis-matches.
[10]:
all_sst = nc.merge(sst_noaa, sst_hadley, match = ["year", "month"])
all_sst.set_missing([-9000, -900])
Let’s work out what the global mean SST was over the time period. Note that this will not be totally accurate as there are some missing values here and there that might bias things.
[11]:
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
[12]:
all_sst.plot()
We can also work out the difference between the two. Here we will work out the monthly bias per grid cell, then calculate the mean global difference per year, and then calculate a rolling 10 year mean.
[13]:
all_sst = nc.open_data([sst_noaa.current, sst_hadley.current])
all_sst.merge(match = ["year", "month"])
all_sst.transmute({"bias":"hadley-noaa"})
all_sst.set_missing([-9000, -900])
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
all_sst.plot()
You can see that there is a notable difference at the start of the time series.
Merging files with different times¶
TBC
Ensemble averaging¶
TBC
Speeding up code¶
Lazy evaluation¶
Under the hood nctoolkit relies mostly on CDO to carry out the specified manipulation of netcdf files. Each time CDO is called a new temporary file is generated. This has the potential to result in slower than necessary processing chains, as IO takes up far too much time.
I will demonstrate this using a netcdf file of sea surface temperature. To download the file we can just use wget:
[1]:
import nctoolkit as nc
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output
!wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.ltm.1981-2010.nc
clear_output()
We can then set up the dataset which we will use for manipulating the SST climatology.
[2]:
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
Now, let’s select the variable sst, clip the file to the northern hemisphere, calculate the mean value in each grid cell for the first half of the year, and then calculate the spatial mean.
[3]:
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
The dataset’s history is as follows:
[4]:
sst.history
[4]:
['cdo -L -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitqhgujflsnctoolkittmpipj7up1l.nc',
'cdo -L -sellonlatbox,-180,180,0,90 /tmp/nctoolkitqhgujflsnctoolkittmpipj7up1l.nc /tmp/nctoolkitqhgujflsnctoolkittmp920v1_r7.nc',
'cdo -L -selmonth,1,2,3,4,5,6 /tmp/nctoolkitqhgujflsnctoolkittmp920v1_r7.nc /tmp/nctoolkitqhgujflsnctoolkittmpbnck_dy2.nc',
'cdo -L -timmean /tmp/nctoolkitqhgujflsnctoolkittmpbnck_dy2.nc /tmp/nctoolkitqhgujflsnctoolkittmpjmzt1l67.nc',
'cdo -L -fldmean /tmp/nctoolkitqhgujflsnctoolkittmpjmzt1l67.nc /tmp/nctoolkitqhgujflsnctoolkittmpdus63y8i.nc']
In total, there are 5 operations, with a temporary file created each time. However, we only want to generate one temporary file. So, can we do that? Yes, thanks to CDO's method chaining ability. If we want to utilize this we need to set the session's evaluation to lazy using options. Once this is done, nctoolkit will only evaluate things when it needs to, e.g. when you call a method that cannot possibly be chained, or when you evaluate the dataset using run. This works as follows:
[5]:
ff = "sst.mon.ltm.1981-2010.nc"
nc.options(lazy = True)
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
We can now see that the history is much cleaner, with only one command.
[6]:
sst.history
[6]:
['cdo -L -fldmean -timmean -selmonth,1,2,3,4,5,6 -sellonlatbox,-180,180,0,90 -selname,sst sst.mon.ltm.1981-2010.nc /tmp/nctoolkitqhgujflsnctoolkittmpkdkiwey2.nc']
How does this impact run time? Let’s time the original, unchained method.
[7]:
%%time
nc.options(lazy = False)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
CPU times: user 37.2 ms, sys: 61.6 ms, total: 98.7 ms
Wall time: 667 ms
[8]:
%%time
nc.options(lazy = True)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
CPU times: user 17.3 ms, sys: 4.28 ms, total: 21.6 ms
Wall time: 161 ms
This was almost 4 times faster. Exact speed improvements will of course depend on specific IO requirements. Sometimes using lazy evaluation will have negligible impact, but in other cases it can make code over 10 times faster.
Processing files in parallel¶
When processing a dataset made up of multiple files, it is possible to carry out the processing in parallel for more or less all of the methods available in nctoolkit. To carry out processing in parallel with 6 cores, we would use options as follows:
[9]:
nc.options(cores = 6)
By default the number of cores in use is 1. Using too many cores can crash your computer if the total RAM in use is excessive, so it is best practice to check the RAM used with one core first.
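A minimal sketch of a parallel workflow, where file_list is a placeholder for a list of netcdf files making up the dataset:
nc.options(cores = 6)
# each file in the multi-file dataset can be processed on its own core
data = nc.open_data(file_list)
data.spatial_mean()
# reset to the default afterwards
nc.options(cores = 1)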
Using thread-safe libraries¶
If the CDO installation being called by nctoolkit is compiled with thread-safe hdf5, then you can achieve potentially significant speed ups with the following command:
[10]:
nc.options(thread_safe = True)
If you are not sure whether hdf5 has been built thread-safe, a simple way to find out is to run the code below. If it fails, you can be more or less certain that it is not thread-safe.
[11]:
nc.options(lazy = True)
nc.options(thread_safe = True)
ff = "sst.mon.ltm.1981-2010.nc"
sst = nc.open_data(ff)
sst.select_variables("sst")
sst.clip(lat = [0,90])
sst.select_months(list(range(1,7)))
sst.mean()
sst.spatial_mean()
sst.run()
User Guide
Help & reference
An A-Z guide to nctoolkit methods¶
This guide will provide examples of how to use almost every method available in nctoolkit.
add¶
This method can add to a dataset. You can add a constant, another dataset or a NetCDF file. In the case of datasets or NetCDF files the grids etc. must be of the same structure as the original dataset.
For example, if we had a temperature dataset where temperature was in Celsius, we could convert it to Kelvin by adding 273.15.
data = nc.open_data(infile)
data.add(273.15)
If we have two datasets, we add one to the other as follows:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.add(data2)
In the above example, all we are doing is adding infile2 to data1, so instead we could simply do this:
data1.add(infile2)
annual_anomaly¶
This method will calculate the annual anomaly for each variable (and in each grid cell) compared with a baseline. This is a standard anomaly calculation: first the mean value is calculated for the baseline period, and then the difference between each year's value and the baseline mean is calculated.
For example, if we wanted to calculate the anomalies in a dataset compared with a baseline period of 1900-1919 we would do the following:
data = nc.open_data(infile)
data.annual_anomaly(baseline=[1900, 1919])
We may be more interested in the rolling anomaly, in particular when there is a lot of annual variation. In the above case, if you wanted a 20 year rolling mean anomaly, you would do the following:
data = nc.open_data(infile)
data.annual_anomaly(baseline=[1900, 1919], window=20)
By default this method works out the absolute anomaly. However, in some cases the relative anomaly is more interesting. To calculate this we set the metric argument to “relative”:
data = nc.open_data(infile)
data.annual_anomaly(baseline=[1900, 1919], metric = "relative")
annual_max¶
This method will calculate the maximum value in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_max()
annual_mean¶
This method will calculate the mean value in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_mean()
annual_min¶
This method will calculate the minimum value in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_min()
annual_range¶
This method will calculate the range of values in each available year and for each grid cell of a dataset.
data = nc.open_data(infile)
data.annual_range()
append¶
This method lets you append individual or multiple files to your dataset. Usage is straightforward. Note that this will not perform any merging on the dataset.
data.append(newfile)
bottom¶
This method will extract the bottom vertical level from a dataset. This is useful for some oceanographic datasets, where the method can let you select the seabed. Note that this method will not work with all data types. For example, in ocean data with fixed depth levels, the bottom cell in the NetCDF data is not the actual seabed. See bottom_mask for these cases.
data = nc.open_data(infile)
data.bottom()
bottom_mask¶
This method will identify the bottommost level in each grid with a non-NA value.
data = nc.open_data(infile)
data.bottom_mask()
cdo_command¶
This method lets you run a CDO command. CDO commands are generally of the form "cdo {command} infile outfile". cdo_command therefore only requires the command portion of this. If we wanted to run the following CDO command
cdo -timmean -selmon,4 infile outfile
we would do the following:
data = nc.open_data(infile)
data.cdo_command("-timmean -selmon,4")
cell_areas¶
This method either adds the areas of each grid cell to the dataset or converts the dataset to a new dataset showing only the grid cell areas. By default it adds the cell areas (in square metres) to the dataset.
data = nc.open_data(infile)
data.cell_areas()
If we only want the cell areas we can set join to False:
data.cell_areas(join=False)
clip¶
This method will clip a region to a specified longitude and latitude box. For example, if we wanted to clip a dataset to the North Atlantic, we could do this:
data = nc.open_data(infile)
data.clip(lon = [-80, 20], lat = [40, 70])
compare_all¶
This method lets us compare all variables in a dataset with a constant. If we wanted to identify the grid cells with values above 20, we could do the following:
data = nc.open_data(infile)
data.compare_all(">20")
Similarly, if we wanted to identify grid cells with negative values we would do this:
data = nc.open_data(infile)
data.compare_all("<0")
cor_space¶
This method calculates the correlation coefficients between two variables in space for each time step. So, if we wanted to work out the correlation between the variables var1 and var2, we would do this:
data = nc.open_data(infile)
data.cor_space("var1", "var2")
cor_time¶
This method calculates the correlation coefficients between two variables in time for each grid cell. If we wanted to work out the correlation between two variables var1 and var2 we would do the following:
data = nc.open_data(infile)
data.cor_time("var1", "var2")
cum_sum¶
This method will calculate the cumulative sum, over time, for all variables. Usage is simple:
data = nc.open_data(infile)
data.cum_sum()
daily_max_climatology¶
This method will calculate the maximum value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the maximum value ever observed on each day.
data = nc.open_data(infile)
data.daily_max_climatology()
daily_mean_climatology¶
This method will calculate the mean value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the mean value observed on each day of the year.
data = nc.open_data(infile)
data.daily_mean_climatology()
daily_min_climatology¶
This method will calculate the minimum value that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the minimum value ever observed on each day.
data = nc.open_data(infile)
data.daily_min_climatology()
daily_range_climatology¶
This method will calculate the value range that is observed on each day of the year over time. So, for example, if you had 100 years of daily temperature data, it will calculate the difference between the maximum and minimum observed values each day.
data = nc.open_data(infile)
data.daily_range_climatology()
divide¶
This method will divide a dataset by a constant, or by the values in another dataset or NetCDF file. If we wanted to divide everything in a dataset by 2, we would do the following:
data = nc.open_data(infile)
data.divide(2)
If we want to divide a dataset by another, we can do this easily. Note that the datasets must be comparable, i.e. they must have the same grid. The second dataset must have either the same number of variables or only one variable. In the latter case everything is divided by that variable. The same holds for vertical levels.
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.divide(data2)
ensemble_max, ensemble_min, ensemble_range and ensemble_mean¶
These methods will calculate the ensemble statistic when a dataset is made up of multiple files. Two approaches are available. First, the statistic can be calculated across all available time steps. For this, ignore_time must be set to True. For example:
data = nc.open_data(file_list)
data.ensemble_max(ignore_time = True)
The second approach is to calculate the statistic for each given time step. For example, if the ensemble was made up of 100 files where each file contains 12 months of data, ensemble_max will work out the maximum monthly value. By default ignore_time is False.
data = nc.open_data(file_list)
data.ensemble_max(ignore_time = False)
ensemble_percentile¶
This method works in the same way as ensemble_mean etc. above. However, it requires an additional term p, which is the percentile. For example, if we wanted to calculate the 75th ensemble percentile, we would do the following:
data = nc.open_data(file_list)
data.ensemble_percentile(75)
invert_levels¶
This method will invert the vertical levels of a dataset.
data = nc.open_data(infile)
data.invert_levels()
mask_box¶
This method will set everything outside a specified longitude/latitude box to NA. The code below illustrates how to mask the North Atlantic in the SST dataset.
data = nc.open_data(infile)
data.mask_box(lon = [-80, 20], lat = [40, 70])
max¶
This method will calculate the maximum value of all variables in all grid cells. If we wanted to calculate the maximum observed monthly sea surface temperature in the SST dataset we would do the following:
data = nc.open_data(infile)
data.max()
mean¶
This method will calculate the mean value of all variables in all grid cells. If we wanted to calculate the mean observed monthly sea surface temperature in the SST dataset we would do the following:
data = nc.open_data(infile)
data.mean()
merge and merge_time¶
nctoolkit offers two methods for merging the files within a multi-file dataset. These methods operate in a similar way to column based joining and row-based binding in dataframes.
The merge method is suitable for merging files that have different variables, but the same time steps. The merge_time method is suitable for merging files that have the same variables, but have different time steps.
Usage for merge_time is as simple as:
data = nc.open_data(file_list)
data.merge_time()
Merging NetCDF files with different variables is potentially risky, as it is possible you can merge files that have the same number of time steps but have different times. nctoolkit's merge method therefore offers some security against a major error when merging. It requires a match argument to be supplied. This ensures that the times in each file are comparable to the others. By default match = ["year", "month", "day"], i.e. it checks that the times in each file all have the same year, month and day. The match argument must be some subset of ["year", "month", "day"]. For example, if you only wanted to make sure the files had the same year, you would do the following:
data = nc.open_data(file_list)
data.merge(match = ["year"])
monthly_anomaly¶
This method will calculate the monthly anomaly compared with the mean value for a baseline period. For example, if we wanted the monthly anomaly compared with the mean for 1990-1999 we would do the below.
data = nc.open_data(infile)
data.monthly_anomaly(baseline = [1990, 1999])
monthly_max¶
This method will calculate the maximum value in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the maximum value in each month across all available years, use monthly_max_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_max()
monthly_max_climatology¶
This method will calculate, for each month, the maximum value of each variable over all time steps.
data = nc.open_data(infile)
data.monthly_max_climatology()
monthly_mean¶
This method will calculate the mean value of each variable in each month of a dataset. Note that this is calculated for each year. See monthly_mean_climatology if you want to calculate a climatological monthly mean.
data = nc.open_data(infile)
data.monthly_mean()
monthly_mean_climatology¶
This method will calculate, for each month, the mean value of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_mean_climatology()
monthly_min¶
This method will calculate the minimum value in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the minimum value in each month across all available years, use monthly_min_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_min()
monthly_min_climatology¶
This method will calculate, for each month, the minimum value of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_min_climatology()
monthly_range¶
This method will calculate the value range in each month of each year of a dataset. This is useful for daily time series. If you want to calculate the value range in each month across all available years, use monthly_range_climatology. Usage is simple:
data = nc.open_data(infile)
data.monthly_range()
monthly_range_climatology¶
This method will calculate, for each month, the value range of each variable over all time steps. Usage is simple:
data = nc.open_data(infile)
data.monthly_range_climatology()
multiply¶
This method will multiply a dataset by a constant, another dataset or a NetCDF file. If multiplied by a dataset or NetCDF file, the dataset must have the same grid and can only have one variable.
If you want to multiply a dataset by 2, you can do the following:
data = nc.open_data(infile)
data.multiply(2)
If you wanted to multiply a dataset data1 by another, data2, you can do the following:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.multiply(data2)
mutate¶
This method can be used to generate new variables using arithmetic expressions. New variables are added to the dataset. The method requires a dictionary, where each key-value pair gives the name of a new variable and the expression required to generate it.
For example, if we had a temperature dataset, with temperature in Celsius, we might want to convert that to Kelvin. We can do this easily:
data = nc.open_data(infile)
data.mutate({"temperature_k":"temperature+273.15"})
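Multiple variables can be created in a single call by adding more key-value pairs. A sketch, assuming a Celsius variable named temperature:
data = nc.open_data(infile)
# add Kelvin and Fahrenheit versions of temperature in one call
data.mutate({"temperature_k": "temperature+273.15", "temperature_f": "temperature*1.8+32"})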
percentile¶
This method will calculate a given percentile for each variable and grid cell. This will calculate the percentile using all available timesteps.
We can calculate the 75th percentile of sea surface temperature as follows:
data = nc.open_data(infile)
data.percentile(75)
phenology¶
A number of phenological indices can be calculated. These are based on the plankton metrics listed by Ji et al. 2010. These methods require datasets, or the files within a dataset, to be made up of individual years, and ideally every day of the year is available. At present this method can only calculate the phenology metric for a single variable.
The available metrics are:
- peak: the time of year when the maximum value of a variable occurs.
- middle: the time of year when 50% of the annual cumulative sum of a variable is first exceeded.
- start: the time of year when a lower threshold (which must be defined) of the annual cumulative sum of a variable is first exceeded.
- end: the time of year when an upper threshold (which must be defined) of the annual cumulative sum of a variable is first exceeded.
For example, if you wanted to calculate timing of the peak, you set metric to “peak”, and define the variable to be analyzed:
data = nc.open_data(infile)
data.phenology(metric = "peak", var = "var_chosen")
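For the start and end metrics the threshold must also be supplied. The sketch below assumes the threshold is passed as the argument p; this argument name is an assumption, so check the method's documentation:
data = nc.open_data(infile)
# time of year when 25% of the annual cumulative sum is first exceeded
# (the argument name p is an assumption)
data.phenology(metric = "start", var = "var_chosen", p = 25)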
plot¶
This method will plot the contents of a dataset. It will either show a map or a time series, depending on the data type. While it should work on at least 90% of NetCDF data, there are some data types that remain incompatible, but will be added to nctoolkit over time. Usage is simple:
data = nc.open_data(infile)
data.plot()
range¶
This method calculates the range for all variables in each grid cell across all time steps.
We can calculate the range of sea surface temperatures in the SST dataset as follows:
data = nc.open_data(infile)
data.range()
regrid¶
This method will remap a dataset to a new grid. This grid must be either a pandas data frame, a NetCDF file or a single file nctoolkit dataset.
For example, if we wanted to regrid a dataset to a single location, we could do the following:
import pandas as pd
data = nc.open_data(infile)
grid = pd.DataFrame({"lon":[-20], "lat":[50]})
data.regrid(grid, method = "nn")
If we wanted to regrid one dataset, dataset1, to the grid of another, dataset2, using bilinear interpolation, we would do the following:
data1 = nc.open_data(infile1)
data2 = nc.open_data(infile2)
data1.regrid(data2, method = "bil")
remove_variables¶
This method will remove variables from a dataset. Usage is simple, with the method only requiring either a single variable name or a list of variables to remove:
data = nc.open_data(infile)
data.remove_variables(vars)
rename¶
This method allows you to rename variables. It requires a dictionary, with key-value pairs representing the old and new variable names. For example, if we wanted to rename a variable old to new, we would do the following:
data = nc.open_data(infile)
data.rename({"old":"new"})
rolling_max¶
This method will calculate the rolling maximum over a specified window. For example, if you needed to calculate the rolling maximum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_max(window = 10)
rolling_mean¶
This method will calculate the rolling mean over a specified window. For example, if you needed to calculate the rolling mean with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_mean(window = 10)
rolling_min¶
This method will calculate the rolling minimum over a specified window. For example, if you needed to calculate the rolling minimum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_min(window = 10)
rolling_range¶
This method will calculate the rolling range over a specified window. For example, if you needed to calculate the rolling range with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_range(window = 10)
rolling_sum¶
This method will calculate the rolling sum over a specified window. For example, if you needed to calculate the rolling sum with a window of 10, you would do the following:
data = nc.open_data(infile)
data.rolling_sum(window = 10)
run¶
This method will evaluate all of a dataset’s unevaluated commands. Usage is simple:
nc.options(lazy = True)
data = nc.open_data(infile)
data.select_years(1990)
data.run()
seasonal_max¶
This method will calculate the maximum value observed in each season. Note this is worked out for the seasons of each year. See seasonal_max_climatology for climatological seasonal maximums.
data = nc.open_data(infile)
data.seasonal_max()
seasonal_max_climatology¶
This method calculates the maximum value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_max_climatology()
seasonal_mean¶
This method will calculate the mean value observed in each season. Note this is worked out for the seasons of each year. See seasonal_mean_climatology for climatological seasonal means.
data = nc.open_data(infile)
data.seasonal_mean()
seasonal_mean_climatology¶
This method calculates the mean value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_mean_climatology()
seasonal_min¶
This method will calculate the minimum value observed in each season. Note this is worked out for the seasons of each year. See seasonal_min_climatology for climatological seasonal minimums.
data = nc.open_data(infile)
data.seasonal_min()
seasonal_min_climatology¶
This method calculates the minimum value observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_min_climatology()
seasonal_range¶
This method will calculate the value range observed in each season. Note this is worked out for the seasons of each year. See seasonal_range_climatology for climatological seasonal ranges.
data = nc.open_data(infile)
data.seasonal_range()
seasonal_range_climatology¶
This method calculates the value range observed in each season across all years. Usage is simple:
data = nc.open_data(infile)
data.seasonal_range_climatology()
select_months¶
This method allows you to subset a dataset to specific months. This can either be a single month, a list of months or a range. For example, if we wanted the first half of a year, we would do the following:
data = nc.open_data(infile)
data.select_months(range(1, 7))
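A single month or a list works in the same way. For example, to select December, January and February:
data = nc.open_data(infile)
data.select_months([12, 1, 2])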
select_variables¶
This method allows you to subset a dataset to specific variables. This either accepts a single variable or a list of variables. For example, if you wanted two variables, var1 and var2, you would do the following:
data = nc.open_data(infile)
data.select_variables(["var1", "var2"])
select_years¶
This method subsets datasets to specified years. It will accept either a single year, a list of years, or a range. For example, if you wanted to subset a dataset to the 1990s, you would do the following:
data = nc.open_data(infile)
data.select_years(range(1990, 2000))
set_missing¶
This method allows you to set values to missing. It accepts either a single value or two values specifying the range to be set to missing. For example, if you wanted all values between 0 and 10 to be set to missing, you would do the following:
data = nc.open_data(infile)
data.set_missing([0, 10])
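A single value works similarly. For example, to set a typical fill value (used here as a placeholder) to missing:
data = nc.open_data(infile)
data.set_missing(-9999)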
spatial_max¶
This method will calculate the maximum value observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_max()
spatial_mean¶
This method will calculate the spatial mean for each variable and time step. If the grid cell area can be calculated, this will be an area weighted mean. Usage is simple:
data = nc.open_data(infile)
data.spatial_mean()
spatial_min¶
This method will calculate the minimum observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_min()
spatial_percentile¶
This method will calculate a given percentile of each variable across space for each time step. For example, if you wanted to calculate the 75th percentile, you would do the following:
data = nc.open_data(infile)
data.spatial_percentile(p=75)
spatial_range¶
This method will calculate the value range observed in space for each variable and time step. Usage is simple:
data = nc.open_data(infile)
data.spatial_range()
spatial_sum¶
This method will calculate the spatial sum for each variable and time step. In some cases, for example when variables are concentrations, it makes more sense to multiply the value in each grid cell by the grid cell area, when doing a spatial sum. This method therefore has an argument by_area which defines whether to multiply the variable value by the area when doing the sum. By default by_area is False.
Usage is simple:
data = nc.open_data(infile)
data.spatial_sum()
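To multiply each grid cell by its area before summing, set by_area to True:
data = nc.open_data(infile)
data.spatial_sum(by_area = True)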
split¶
Except for methods that begin with merge or ensemble, all nctoolkit methods operate on individual files within a dataset. There are therefore cases when you might want to be able to split a dataset into separate files for analysis. This can be done using split, which lets you split a file into separate years, months or year/month combinations. For example, if you want to split a dataset into files of different years, you can do this:
data = nc.open_data(infile)
data.split("year")
subtract¶
This method can subtract from a dataset. You can subtract a constant, another dataset or a NetCDF file. In the case of datasets or NetCDF files, the grids etc. must be of the same structure as the original dataset.
For example, if we had a temperature dataset where temperature was in Kelvin, we could convert it to Celsius by subtracting 273.15.
data = nc.open_data(infile)
data.subtract(273.15)
sum¶
This method will calculate the sum of values of all variables in all grid cells. Usage is simple:
data = nc.open_data(infile)
data.sum()
surface¶
This method will extract the surface level from a multi-level dataset. Usage is simple:
data = nc.open_data(infile)
data.surface()
to_dataframe¶
This method will return a pandas dataframe with the contents of the dataset. This has a decode_times argument to specify whether you want the times to be decoded. Defaults to True. Usage is simple:
data = nc.open_data(infile)
data.to_dataframe()
to_latlon¶
This method will regrid a dataset to a regular latlon grid. The minimum and maximum longitudes and latitudes must be specified, along with the longitude and latitude resolutions.
data = nc.open_data(infile)
data.to_latlon(lon = [-80, 20], lat = [30, 80], res = [1,1])
to_xarray¶
This method will return an xarray dataset with the contents of the dataset. This has a decode_times argument to specify whether you want the times to be decoded. Defaults to True. Usage is simple:
data = nc.open_data(infile)
data.to_xarray()
transmute¶
This method can be used to generate new variables using arithmetic expressions. Existing variables will be removed from the dataset. See mutate if you want to keep the existing variables. The method requires a dictionary, where each key-value pair gives the name of a new variable and the expression required to generate it.
For example, if we had a temperature dataset, with temperature in Celsius, we might want to convert that to Kelvin. We can do this easily:
data = nc.open_data(infile)
data.transmute({"temperature_k":"temperature+273.15"})
var¶
This method calculates the variance of each variable in the dataset. This is calculated across all time steps. Usage is simple:
data = nc.open_data(infile)
data.var()
vertical_interp¶
This method interpolates variables vertically. It requires a list of vertical levels, for example depths, you want to interpolate. For example, if you had an ocean dataset and you wanted to interpolate to 10 and 20 metres you would do the following:
data = nc.open_data(infile)
data.vertical_interp(levels = [10, 20])
vertical_max¶
This method calculates the maximum value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_max()
vertical_mean¶
This method calculates the mean value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_mean()
vertical_min¶
This method calculates the minimum value of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_min()
vertical_range¶
This method calculates the value range of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_range()
vertical_sum¶
This method calculates the sum of each variable across all vertical levels. Usage is simple:
data = nc.open_data(infile)
data.vertical_sum()
write_nc¶
This method allows you to write the contents of a dataset to a NetCDF file. If the target file exists and you want to overwrite it set overwrite to True. Usage is simple:
data.write_nc(outfile)
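For example, to write out a subsetted dataset and overwrite any existing file with the same name:
data = nc.open_data(infile)
data.select_years(1990)
data.write_nc(outfile, overwrite = True)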
zip¶
This method will zip the contents of a dataset. This is mostly useful for processing chains where you want to minimize the disk space used by the output. Please note this method works lazily; in the code below only one file is generated, a zipped "outfile".
nc.options(lazy = True)
data = nc.open_data(infile)
data.select_years(1990)
data.zip()
data.write_nc(outfile)
How to guide¶
This guide will show how to carry out key nctoolkit operations. We will use a sea surface temperature data set and a depth-resolved ocean temperature data set. The data sets can be downloaded from here.
[1]:
import nctoolkit as nc
import os
import pandas as pd
import xarray as xr
How to select years and months¶
If we want to select specific years and months we can use the select_years and select_months methods.
[2]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(1960)
sst.select_months(1)
sst.times()
[2]:
['1960-01-01T00:00:00']
How to copy a data set¶
If you want to make a deep copy of a data set, use the built-in copy method. This method will return a new data set. This method should be used because of nctoolkit's built-in methods to automatically delete temporary files that are no longer required. Behind the scenes, using copy will result in nctoolkit registering that it needs the NetCDF file for both the original dataset and the new copied one. So if you copy a dataset, and then delete the original, nctoolkit knows not to remove any NetCDF files related to the copied dataset.
[3]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(1960)
sst.select_months(1)
sst1 = sst.copy()
del sst
os.path.exists(sst1.current)
[3]:
True
How to clip to a region¶
If you want to clip the data to a specific longitude and latitude box, we can use clip, with the longitude and latitude range given by lon and lat.
[4]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_months(1)
sst.select_years(1980)
sst.clip(lon = [-80, 20], lat = [40, 70])
sst.plot()
How to rename a variable¶
If we want to rename a variable we use the rename method, and supply a dictionary where the key-value pairs are the original and new names.
[5]:
sst = nc.open_data("sst.mon.mean.nc")
sst.variables
[5]:
['sst']
The original dataset had only one variable called sst. We can now rename it, and display the new variables.
[6]:
sst.rename({"sst": "temperature"})
sst.variables
[6]:
['temperature']
How to create new variables¶
New variables can be created using arithmetic operations with either mutate or transmute. The mutate method will maintain the original variables, whereas transmute will not. These methods require a dictionary, where the key-value pairs are the names of the new variables and the arithmetic operations to perform. The example below shows how to create a new variable, sst_k, with SST in Kelvin.
[7]:
sst = nc.open_data("sst.mon.mean.nc")
sst.mutate({"sst_k": "sst+273.15"})
sst.variables
[7]:
['sst', 'sst_k']
How to calculate a spatial average¶
You can calculate a spatial average using the spatial_mean method. There are additional methods for maximums etc.
[8]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.plot()
How to calculate an annual mean¶
You can calculate an annual mean using the annual_mean method.
[9]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_mean()
sst.plot()
How to calculate a rolling average¶
You can calculate a rolling mean using the rolling_mean method, with the window argument providing the number of time steps to average over. There are additional methods for rolling sums etc. The code below will calculate a rolling mean of global SST using a 20 year window.
[10]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_mean()
sst.rolling_mean(20)
sst.plot()
How to calculate temporal anomalies¶
You can calculate annual temporal anomalies using the annual_anomaly method. This requires a baseline period.
[11]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.annual_anomaly(baseline = [1960, 1979])
sst.plot()
How to split data by year etc¶
Files within a dataset can be split by year, day, year and month, or season using the split method. If we wanted to split by year, we would do the following:
[12]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
sst.size
[12]:
'Number of files in ensemble: 169\nEnsemble size: 530.445201 MB\nSmallest file: /tmp/nctoolkitayrhmwtbnctoolkittmp72u9tn3o.1898.nc has size 3.1387289999999997 MB\nLargest file: /tmp/nctoolkitayrhmwtbnctoolkittmp72u9tn3o.1898.nc has size 3.1387289999999997 MB'
How to merge files in time¶
We can merge files based on time using merge_time. We can do this by merging the dataset that results from splitting the original sst dataset. If we split the dataset by year, we see that there are 169 files, one for each year.
[13]:
sst = nc.open_data("sst.mon.mean.nc")
sst.split("year")
sst.size
[13]:
'Number of files in ensemble: 169\nEnsemble size: 530.445201 MB\nSmallest file: /tmp/nctoolkitayrhmwtbnctoolkittmp58y4ytyj.1998.nc has size 3.1387289999999997 MB\nLargest file: /tmp/nctoolkitayrhmwtbnctoolkittmp58y4ytyj.1998.nc has size 3.1387289999999997 MB'
We can then merge them together to get a single file dataset:
[14]:
sst.merge_time()
sst.size
[14]:
'Number of files: 1\nFile size: 525.828237 MB'
How to do variable-based merging¶
If we have two or more files that have the same time steps, but different variables, we can merge them using merge. The code below will first create a dataset with a netcdf file with sst in K. It will then create a new dataset with this netcdf file and the original, and then merge them.
[15]:
sst1 = nc.open_data("sst.mon.mean.nc")
sst2 = nc.open_data("sst.mon.mean.nc")
sst2.transmute({"sst_k": "sst+273.15"})
new_sst = nc.open_data([sst1.current, sst2.current])
new_sst.current
new_sst.merge()
new_sst.variables
[15]:
['sst.mon.mean.nc', '/tmp/nctoolkitayrhmwtbnctoolkittmp8vgtaa28.nc']
[15]:
['sst', 'sst_k']
In some cases we will have two or more datasets we want to merge. In this case we can use the merge function as follows:
[16]:
sst1 = nc.open_data("sst.mon.mean.nc")
sst2 = nc.open_data("sst.mon.mean.nc")
sst2.transmute({"sst_k": "sst+273.15"})
new_sst = nc.merge(sst1, sst2)
new_sst.variables
[16]:
['sst', 'sst_k']
How to horizontally regrid data¶
Variables can be regridded horizontally using regrid. This method requires the new grid to be defined. This can either be a pandas data frame with lon/lat as columns, an xarray object, a NetCDF file, or a dataset. I will demonstrate each of these methods by regridding SST to the North Atlantic. Let's begin by getting a grid for the North Atlantic.
[17]:
new_grid = nc.open_data("sst.mon.mean.nc")
new_grid.clip(lon = [-80, 20], lat = [30, 70])
new_grid.select_months(1)
new_grid.select_years(2000)
First, we will use the new dataset itself to do the regridding. I will calculate mean SST using the original data, and then regrid to the North Atlantic.
[18]:
%%time
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = new_grid)
sst.plot()
CPU times: user 56.4 ms, sys: 94.2 ms, total: 151 ms
Wall time: 1.38 s
We can also do this using the netcdf file, which is new_grid.current.
[19]:
%%time
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = new_grid.current)
sst.plot()
CPU times: user 60.1 ms, sys: 38.8 ms, total: 99 ms
Wall time: 1.48 s
In a similar way we can read the new_grid in as an xarray data set.
[20]:
%%time
na_grid = xr.open_dataset(new_grid.current)
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = na_grid)
sst.plot()
CPU times: user 72.7 ms, sys: 44.4 ms, total: 117 ms
Wall time: 1.49 s
or we can use a pandas data frame. In this case I will convert the xarray data set to a data frame.
[21]:
%%time
na_grid = xr.open_dataset(new_grid.current)
na_grid = na_grid.to_dataframe().reset_index().loc[:,["lon", "lat"]]
sst = nc.open_data("sst.mon.mean.nc")
sst.mean()
sst.regrid(grid = na_grid)
sst.plot()
CPU times: user 72.6 ms, sys: 39 ms, total: 112 ms
Wall time: 1.46 s
How to temporally interpolate¶
Temporal interpolation can be carried out using time_interp. This method requires a start date (start) of the format YYYY/MM/DD and an end date (end), and a temporal resolution (resolution), which is either 1 day (“daily”), 1 week (“weekly”), 1 month (“monthly”), or 1 year (“yearly”).
[22]:
sst = nc.open_data("sst.mon.mean.nc")
sst.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")
How to calculate a monthly average from daily data¶
If you have daily data, you can calculate a monthly average using monthly_mean. There are also methods for maximums etc.
[23]:
sst = nc.open_data("sst.mon.mean.nc")
sst.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")
sst.monthly_mean()
How to calculate a monthly climatology¶
You can calculate a monthly climatology using monthly_mean_climatology. Note that CDO outputs the date of the final month.
[24]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(list(range(1990, 2000)))
sst.monthly_mean_climatology()
sst.select_months(1)
sst.plot()
How to calculate a seasonal climatology¶
You can calculate a seasonal climatology using seasonal_mean_climatology.
[25]:
sst = nc.open_data("sst.mon.mean.nc")
sst.seasonal_mean_climatology()
sst.select_timestep(0)
sst.plot()
API Reference¶
Reading/copying data¶
- Read netcdf data as a DataSet object
- Make a deep copy of a DataSet object
Merging or analyzing multiple datasets¶
- Merge datasets
- Calculate the temporal correlation coefficient between two datasets. This will calculate the temporal correlation coefficient, in each grid cell, between two datasets.
- Calculate the spatial correlation coefficient between two datasets. This will calculate the spatial correlation coefficient, for each time step, between two datasets.
Accessing attributes¶
- List variables contained in a dataset
- List years contained in a dataset
- List months contained in a dataset
- List times contained in a dataset
- List levels contained in a dataset
- The size of an object. This will print the number of files, total size, and smallest and largest files in a DataSet object.
- The current file or files in the DataSet object
- The history of operations on the DataSet
- The starting file or files of the DataSet object
Plotting¶
- Autoplotting method
- Open the current dataset’s file in ncview
Variable modification¶
- Create new variables using mathematical expressions, and keep original variables
- Create new variables using mathematical expressions, and drop original variables
- Rename variables in a dataset
- Set the missing value for a single number or a range
- Calculate the sum of all variables for each time step
NetCDF file attribute modification¶
- Set the long names of variables
- Set the units for variables
Vertical/level methods¶
Extract the top/surface level from a dataset, i.e. the first vertical level in each file
Extract the bottom level from each file in a dataset
Vertically interpolate a dataset to given vertical levels, for each time step and grid cell
Calculate the depth-averaged mean for each variable, for each time step and grid cell
Calculate the vertical minimum of variable values, for each time step and grid cell
Calculate the vertical maximum of variable values, for each time step and grid cell
Calculate the vertical range of variable values, for each time step and grid cell
Calculate the vertical sum of variable values, for each time step and grid cell
Invert the levels of 3D variables
Create a mask identifying the deepest cell without missing values
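A sketch of a vertical statistic; "ocean_temperature.nc" is a hypothetical file with a depth dimension, since the SST data used in this tutorial is surface-only, and the method name vertical_mean is assumed:
import nctoolkit as nc

# hypothetical multi-level ocean file
ds = nc.open_data("ocean_temperature.nc")
# depth-averaged mean, calculated for each time step and grid cell
ds.vertical_mean()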
Rolling methods¶
Calculate a rolling mean based on a window
Calculate a rolling minimum based on a window
Calculate a rolling maximum based on a window
Calculate a rolling sum based on a window
Calculate a rolling range based on a window
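For example, assuming the rolling mean method is called rolling_mean and takes a window argument:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# 12-month rolling mean of monthly data
ds.rolling_mean(window = 12)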
Evaluation setting¶
Run all stored commands in a dataset
Cleaning functions¶
Ensemble creation¶
Generate an ensemble
Arithmetic methods¶
Create new variables using mathematical expressions, and keep original variables
Create new variables using mathematical expressions, and drop original variables
Add a constant, another dataset or a netCDF file to the dataset
Subtract a constant, another dataset or a netCDF file from the dataset
Multiply the dataset by a constant, another dataset or a netCDF file
Divide the dataset by a constant, another dataset or a netCDF file
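A minimal sketch, assuming the arithmetic methods are called add, subtract, multiply and divide:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# convert degrees Celsius to Kelvin
ds.add(273.15)
# subtract, multiply and divide accept constants, datasets or netCDF files in the same way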
Ensemble statistics¶
Calculate an ensemble mean
Calculate an ensemble minimum
Calculate an ensemble maximum
Calculate an ensemble percentile, i.e. the percentiles for each time step across the files
Calculate an ensemble range; the range is calculated for each time step, so if each file in the ensemble has 12 months of data the statistic will be calculated for each month
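As a sketch, assuming the ensemble mean method is called ensemble_mean; the member file names are hypothetical:
import nctoolkit as nc

# open several files as a single ensemble dataset
ds = nc.open_data(["member_1.nc", "member_2.nc", "member_3.nc"])
# mean across the ensemble members for each time step
ds.ensemble_mean()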
Subsetting operations¶
Clip to a rectangular longitude and latitude box
Select variables from a dataset
Remove stated variables from the files in a dataset
Subset the dataset to only contain years within a given list
Subset the dataset to only contain months within a given list
Select a season from a dataset
Select timesteps from a dataset
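For example, combining subsetting operations; select_years and select_months appear earlier in this tutorial, while the clip argument names lon and lat are assumed:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.select_years(list(range(1990, 2000)))
ds.select_months([6, 7, 8])
# clip to a North Atlantic box
ds.clip(lon = [-80, 20], lat = [30, 70])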
Time-based methods¶
Set the date in a dataset; you should only do this if you have to fix or change a dataset with a single date, not multiple dates
Subset the dataset to only contain months within a given list
Select a season from a dataset
Subset the dataset to only contain years within a given list
Shift the times in a dataset by a number of hours
Shift the times in a dataset by a number of days
Interpolation methods¶
Regrid a dataset to a target grid
Regrid a dataset to a regular latlon grid
Temporally interpolate variables based on a date range and time resolution
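A sketch combining horizontal and temporal interpolation; time_interp is shown earlier in this tutorial, while the to_latlon argument names are assumed:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# regrid to a regular 1 x 1 degree lon/lat grid
ds.to_latlon(lon = [-80, 20], lat = [30, 70], res = [1, 1])
# then interpolate the monthly data to daily resolution
ds.time_interp(start = "1990/01/01", end = "1990/12/31", resolution = "daily")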
Masking methods¶
Mask a lon/lat box
Summary methods¶
Calculate annual anomalies for each variable based on a baseline period; the anomaly is derived by first calculating the climatological annual mean for the given baseline period
Calculate monthly anomalies based on a baseline period; the anomaly is derived by first calculating the climatological monthly mean for the given baseline period
Calculate phenologies from a dataset; each file in an ensemble must only cover a single year, and ideally have all days
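For instance, assuming the annual anomaly method is called annual_anomaly and takes a baseline argument giving the first and last year of the baseline period:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# anomalies relative to the 1981-2010 climatological annual mean
ds.annual_anomaly(baseline = [1981, 2010])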
Statistical methods¶
Calculate the temporal mean of all variables
Calculate the temporal minimum of all variables
Calculate the temporal percentile of all variables
Calculate the temporal maximum of all variables
Calculate the temporal sum of all variables
Calculate the temporal range of all variables
Calculate the temporal variance of all variables
Calculate the temporal cumulative sum of all variables
Calculate the correlation coefficient between two variables in space, for each time step
Calculate the correlation coefficient between two variables in time, for each grid cell, ignoring missing values
Calculate the area-weighted spatial mean for all variables, for each time step
Calculate the spatial minimum for all variables, for each time step
Calculate the spatial maximum for all variables, for each time step
Calculate the spatial sum for all variables, for each time step
Calculate the spatial range for all variables, for each time step
Calculate the monthly mean for each year/month combination in the files
Calculate the monthly minimum for each year/month combination in the files
Calculate the monthly maximum for each year/month combination in the files
Calculate the monthly range for each year/month combination in the files
Calculate a daily mean climatology
Calculate a daily minimum climatology
Calculate a daily maximum climatology
Calculate a daily range climatology
Calculate a monthly mean climatology, defined as the mean value in each month across all years
Calculate a monthly minimum climatology, defined as the minimum value in each month across all years
Calculate a monthly maximum climatology, defined as the maximum value in each month across all years
Calculate a monthly range climatology, defined as the range of values in each month across all years
Calculate the annual mean for each variable
Calculate the annual minimum for each variable
Calculate the annual maximum for each variable
Calculate the annual range for each variable
Calculate the seasonal mean for each year
Calculate the seasonal minimum for each year
Calculate the seasonal maximum for each year
Calculate the seasonal range for each year
Calculate a climatological seasonal mean
Calculate a climatological seasonal minimum, defined as the minimum value in each season across all years
Calculate a climatological seasonal maximum, defined as the maximum value in each season across all years
Calculate a climatological seasonal range, defined as the range of values in each season across all years
Calculate the zonal mean for each time step in the files
Calculate the zonal minimum for each time step in the files
Calculate the zonal maximum for each time step in the files
Calculate the zonal range for each time step in the files
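As a sketch of a spatial statistic, assuming the area-weighted spatial mean method is called spatial_mean and that results can be pulled into pandas with to_dataframe (listed under the output methods below):
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# area-weighted global mean SST for each time step
ds.spatial_mean()
df = ds.to_dataframe()  # inspect the resulting time series in pandas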
Seasonal methods¶
Calculate the seasonal mean for each year
Calculate the seasonal minimum for each year
Calculate the seasonal maximum for each year
Calculate the seasonal range for each year
Calculate a climatological seasonal mean
Calculate a climatological seasonal minimum, defined as the minimum value in each season across all years
Calculate a climatological seasonal maximum, defined as the maximum value in each season across all years
Calculate a climatological seasonal range, defined as the range of values in each season across all years
Select a season from a dataset
Merging methods¶
Merge a multi-file ensemble into a single file; merging will occur based on the time steps in the first file
Time-based merging of a multi-file ensemble into a single file; this method is ideal if you have the same data split over multiple files covering different time periods
Climatology methods¶
Calculate a daily mean climatology
Calculate a daily minimum climatology
Calculate a daily maximum climatology
Calculate a daily range climatology
Calculate a monthly mean climatology, defined as the mean value in each month across all years
Calculate a monthly minimum climatology, defined as the minimum value in each month across all years
Calculate a monthly maximum climatology, defined as the maximum value in each month across all years
Calculate a monthly range climatology, defined as the range of values in each month across all years
Splitting methods¶
Split the dataset: each file in the ensemble will be separated into new files based on the splitting argument
Output methods¶
Save a dataset to a named file; this will only work with single-file datasets
Open a dataset as an xarray object
Open a dataset as a pandas data frame
Zip the dataset, compressing the files within it
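For example, assuming the conversion methods are called to_xarray and to_dataframe:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
ds.mean()
ds_xr = ds.to_xarray()   # hand the result to xarray
df = ds.to_dataframe()   # or to a pandas data frame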
Miscellaneous methods¶
Calculate the area of grid cells
Apply a CDO command
Apply an NCO command
Compare all variables to a constant
Reduce the dimensions of the data, removing any dimension with only one value
Reduce the dataset to non-zero locations in a mask (a single-variable dataset or the path to a .nc file)
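Finally, a sketch of passing a raw command through, assuming the method is called cdo_command and accepts a CDO operator string:
import nctoolkit as nc

ds = nc.open_data("sst.mon.mean.nc")
# apply a CDO operator directly; here the monthly mean
ds.cdo_command("monmean")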