Datasets¶

Data format requirements¶

nctoolkit requires NetCDF data that follow the GDT, COARDS or CF Conventions. Its computational backend is CDO, which is able to carry out most operations regardless of whether the data are compliant with those conventions. In general, most data producers follow CF Conventions when generating NetCDF files; however, if you are unsure whether you are working with compliant files, you can check here.

Opening datasets¶

There are three ways to create a dataset: open_data, open_url or open_thredds.

If the data you want to analyze is available on your computer use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards.
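
The wildcard behaviour can be illustrated with the standard library's glob module, which performs the same style of pattern expansion. This is a conceptual sketch, and the file names here are hypothetical:

```python
import glob

# open_data accepts a single path, a list of paths, or a wildcard pattern.
single = "sst.mon.ltm.1981-2010.nc"        # one file
several = ["sst_1990.nc", "sst_1991.nc"]   # an explicit list of files
pattern = "hypothetical_data/sst_*.nc"     # a wildcard

# A wildcard is expanded to the matching files, much as glob does:
matches = sorted(glob.glob(pattern))
print(matches)   # [] here, since these hypothetical files do not exist
```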

If you want to use data that can be downloaded from a url, just use open_url. This will download the netCDF files to a temporary folder, and they can then be analyzed.

If you want to analyze data that is available from a thredds server or OPeNDAP, then use open_thredds. The file paths should end with .nc.

[1]:

import nctoolkit as nc

nctoolkit is using the latest version of Climate Data Operators version: 2.0.5


If you want to get a quick overview of the contents of a dataset, you can use the contents attribute. This will display a dataframe showing the variables available in the dataset and details about each variable, such as its units and long name. The example below opens a sea-surface temperature dataset and displays the contents.

[2]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds

[2]:

<nctoolkit.DataSet>:
Number of files: 1
File contents:
         variable  ntimes  npoints  nlevels                                                        long_name  unit data_type
0             sst      12    64800        1  Long Term Mean Monthly Means of Global Sea Surface Temperature  degC       F32
1  valid_yr_count      12    64800        1                         count of non-missing values used in mean  None       I16


Checking validity of source data¶

nctoolkit should work out of the box with most NetCDF data. However, it is possible that the format of the data is incompatible with the system libraries used by nctoolkit, or that the files are corrupt. To carry out a general check on the data, use the check method as follows:

[ ]:

ds.check()

*****************************************
Checking data types
*****************************************
The variable I16 has integer data type. Consider setting data type to float 'F64' or 'F32' using set_precision.
*****************************************
Checking time data type
*****************************************
*****************************************
Running CF-compliance checks
*****************************************
Issue with variable: sst
------------------
ERROR: Invalid attribute name: _ChunkSizes

------------------
*****************************************
Checking grid consistency
*****************************************


This will carry out some basic checks on data format compatibility. You should install the cfchecker package if you want the check method to test for CF-compliance.

If you want to check whether the files in a dataset are corrupt, the following should tell you. It simply reads and writes the data in the source files to a temporary file, which should be sufficient to ensure the files are not corrupt.

[ ]:

ds.is_corrupt()
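
The idea behind a read-and-rewrite check can be sketched generically with the standard library. This is only a byte-level illustration of the principle; nctoolkit itself performs the check at the NetCDF level via CDO, and the file name below is hypothetical:

```python
import shutil
import tempfile

def appears_readable(path):
    """Copy a file's bytes to a temporary file; an IO error while
    reading or writing suggests the source file is unreadable."""
    try:
        with tempfile.NamedTemporaryFile() as tmp:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, tmp)
        return True
    except OSError:
        return False

print(appears_readable("no_such_file.nc"))   # False: the file cannot be opened
```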


Modifying datasets¶

If you want to modify a dataset, you just need to use nctoolkit’s built in methods. These methods operate directly on the dataset itself. The example below selects the first time step in a sea surface temperature dataset and plots the result.

[ ]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.subset(time = 0)
ds.plot()


Underlying each dataset are temporary files representing its current state. We can access these using the current attribute:

[ ]:

ds.current


In this case, we have a single temporary file. Temporary files are generated and deleted as needed, so there should be no need to manage them yourself.

Lazy evaluation by default¶

Look at the processing chain below.

[ ]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.subset(months = 1)
ds.subset(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()


What is potentially wrong with this? It carries out four operations, so we absolutely do not want to create a temporary file at each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them either when you tell it to or when it has to. In the code example above we have told nctoolkit what to do to the dataset, but we have not told it to actually do any of it.

We can see this if we look at the current state of the dataset. It is still the starting point:

[ ]:

ds.current


If we want to evaluate this we can use the run method or methods such as plot that require commands to be evaluated.

[ ]:

ds.run()
ds.current


This method chaining ability within nctoolkit comes from Climate Data Operators (CDO), which is the backend computational engine for nctoolkit. nctoolkit does not require you to understand CDO, but if you want to see the underlying CDO commands used, just use the history attribute. In the example below, you can see that four lines of Python code have been converted to a single CDO command.

[ ]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.subset(months = 1)
ds.subset(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.history


Then if we run this, we can see the full command used:

[ ]:

ds.run()
ds.history
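
The way several deferred operations collapse into one command can be sketched generically: each method appends a CDO operator to a pending queue, and run() assembles them into a single piped call. The operator names below are real CDO operators, but the class is a simplified sketch, not nctoolkit's actual code:

```python
class LazyDataset:
    """Minimal sketch of deferred evaluation: operations accumulate as
    CDO operators and are only assembled into one command by run()."""

    def __init__(self, path):
        self.current = path
        self.pending = []

    def assign_kelvin(self):
        self.pending.append("-expr,'sst=sst+273.15'")

    def subset_month(self, month):
        self.pending.append(f"-selmonth,{month}")

    def subset_box(self, lon, lat):
        self.pending.append(f"-sellonlatbox,{lon[0]},{lon[1]},{lat[0]},{lat[1]}")

    def spatial_mean(self):
        self.pending.append("-fldmean")

    def run(self):
        # CDO applies chained operators right to left, so reverse the queue.
        ops = " ".join(reversed(self.pending))
        command = f"cdo {ops} {self.current} output.nc"
        self.pending = []
        return command   # a real implementation would execute this

ds = LazyDataset("sst.mon.ltm.1981-2010.nc")
ds.assign_kelvin()
ds.subset_month(1)
ds.subset_box([-80, 20], [30, 70])
ds.spatial_mean()
print(ds.run())
# cdo -fldmean -sellonlatbox,-80,20,30,70 -selmonth,1 -expr,'sst=sst+273.15' sst.mon.ltm.1981-2010.nc output.nc
```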


If you want to visualize a dataset, you just need to use plot:

[ ]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.subset(time = 0)
ds.plot()


Method chaining¶

When you start to use nctoolkit, it is important to realize that it does not allow method chaining in the way pandas and xarray do. So the following will not work:

[ ]:

(
ds
.tmean()
.spatial_mean()
)


This is because this type of method chaining requires each method to return an object. However, nctoolkit’s methods in general do not return objects; instead, they modify the dataset in place.

You would need to do the following instead:

[ ]:

ds.tmean()
ds.spatial_mean()
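
The reason chaining fails can be shown with a toy class whose methods modify state and return None, as nctoolkit's do. This is a sketch, not nctoolkit's actual classes:

```python
class InPlaceDataset:
    """Methods mutate the object and return None, so chained calls break."""

    def __init__(self):
        self.history = []

    def tmean(self):
        self.history.append("tmean")           # modifies in place, returns None

    def spatial_mean(self):
        self.history.append("spatial_mean")    # likewise returns None

ds = InPlaceDataset()

# Chaining fails: ds.tmean() evaluates to None, which has no spatial_mean().
try:
    ds.tmean().spatial_mean()
except AttributeError as err:
    print(err)   # 'NoneType' object has no attribute 'spatial_mean'

# Calling the methods one statement at a time works as intended.
ds.spatial_mean()
print(ds.history)   # ['tmean', 'spatial_mean']
```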


Dataset attributes¶

You can find out key information about a dataset using its attributes. If you want to know the variables available in a dataset called ds, do the following:

[ ]:

ds.variables


If you want more details about the variables, access the contents attribute. This will tell you details such as long names, units, number of time steps etc. for each variable.

[ ]:

ds.contents


If you want to know the vertical levels available in the dataset, use the following:

[ ]:

ds.levels


If you want to know the files in a dataset, do the following. nctoolkit works by generating temporary files, so if you have carried out any operations this will show a list of temporary files.

[ ]:

ds.current


If you want to find out what times are in the dataset:

[ ]:

ds.times


If you want to find out what months are in the dataset:

[ ]:

ds.months


If you want to find out what years are in the dataset:

[ ]:

ds.years


We can also access the history of operations carried out on the dataset. This will show the operations carried out by nctoolkit’s computational back-end CDO:

[ ]:

ds.history