Datasets

Data format requirements

nctoolkit requires NetCDF data that follow the GDT, COARDS or CF Conventions. Its computational backend is CDO, which is able to carry out most operations regardless of whether the data are compliant with those conventions. In general, most data producers follow the CF Conventions when generating NetCDF files. However, if you are unsure whether you are working with compliant files, you can check here.

Opening datasets

There are three ways to create a dataset: open_data, open_url or open_thredds.

If the data you want to analyze is available on your computer use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards.
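
As a minimal sketch of the list-of-files route, the helpers below gather the files matching a wildcard into a sorted list before handing them to open_data. The names list_netcdf_files and open_folder and the default pattern are hypothetical, and nctoolkit is only imported when open_folder is actually called.

```python
import glob
import os


def list_netcdf_files(folder, pattern="*.nc"):
    """Return a sorted list of files in `folder` matching `pattern` (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(folder, pattern)))


def open_folder(folder, pattern="*.nc"):
    """Open every matching file in `folder` as one dataset (hypothetical helper)."""
    import nctoolkit as nc  # imported here so the listing helper works without nctoolkit

    files = list_netcdf_files(folder, pattern)
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    # open_data accepts a list of file paths, as described above
    return nc.open_data(files)
```

Sorting the list keeps the file order reproducible, since glob does not guarantee any particular order.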

If you want to use data that can be downloaded from a URL, use open_url. This will download the NetCDF files to a temporary folder, where they can then be analyzed.

If you want to analyze data that is available from a THREDDS server or via OPeNDAP, use open_thredds. The file paths should end with .nc.
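
The choice between the three functions can be sketched as a small dispatch rule. The function choose_opener is a hypothetical helper, not part of nctoolkit. Treating any URL that contains "/dodsC/" as an OPeNDAP endpoint is an assumption based on the usual THREDDS URL layout, and treating a list as local files is likewise an assumption, so adjust both for your own sources.

```python
def choose_opener(source):
    """Suggest which nctoolkit function to use for `source` (hypothetical helper)."""
    if isinstance(source, (list, tuple)):
        # assumption: a collection of paths refers to local files
        return "open_data"
    if source.startswith(("http://", "https://")):
        # assumption: THREDDS OPeNDAP URLs contain "/dodsC/"; other URLs are plain downloads
        return "open_thredds" if "/dodsC/" in source else "open_url"
    # anything else is treated as a local path or wildcard
    return "open_data"


print(choose_opener("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc"))
```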

[1]:
import nctoolkit as nc
nctoolkit is using Climate Data Operators version 1.9.10

If you want a quick overview of the contents of a dataset, use the contents attribute. This will display a dataframe showing the variables available in the dataset and details about each variable, such as units and long names. The example below opens a sea-surface temperature dataset and displays the contents.

[2]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.contents
[2]:
   variable        ntimes  npoints  nlevels  long_name                                          unit  data_type
0  sst             12      64800    1        Long Term Mean Monthly Means of Global Sea Sur...  degC  F32
1  valid_yr_count  12      64800    1        count of non-missing values used in mean           None  I16

Checking validity of source data

nctoolkit should work out of the box with most NetCDF data. However, it is possible that the format of the data is incompatible with the system libraries used by nctoolkit, or that the files are corrupt. To carry out a general check on the data, use the check method as follows:

[ ]:
ds.check()

This will carry out some basic checks on data format compatibility. You should install the cfchecker package if you want check to test for CF-compliance.

If you want to check whether the files in a dataset are corrupt, use the is_corrupt method. This will simply read the data in the source files and write it to a temporary file, which should be sufficient to show whether the files are corrupt.

[ ]:
ds.is_corrupt()
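
Before running these checks on a large collection of files, a cheap pre-filter can catch files that are plainly not NetCDF, such as an HTML error page saved with a .nc extension. The sketch below only inspects the first bytes of a file: classic NetCDF files start with the bytes CDF followed by a version byte, and NetCDF-4 files are HDF5 files that normally start with the HDF5 signature. This is not what check or is_corrupt do internally, and looks_like_netcdf is a hypothetical helper name.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # NetCDF-4 files are HDF5 files
CLASSIC_SIGNATURES = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # classic, 64-bit offset, CDF-5


def looks_like_netcdf(path):
    """Return True if the first bytes of `path` match a known NetCDF signature."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    # Assumption: the HDF5 signature sits at offset 0, i.e. the file has no HDF5 user block
    return header.startswith(CLASSIC_SIGNATURES) or header == HDF5_SIGNATURE
```

A file that passes this test can still be damaged further in, so it complements is_corrupt rather than replacing it.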

Modifying datasets

If you want to modify a dataset, you just need to use nctoolkit’s built-in methods. These methods operate directly on the dataset itself. The example below selects the first time step in a sea surface temperature dataset and plots the result.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()

Underlying each dataset are files representing its current state, which are temporary files once operations have been evaluated. We can access them using the current attribute:

[3]:
ds.current
[3]:
['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']

In this case, the dataset consists of a single file. Any temporary files will be generated and deleted as needed, so there should be no need to manage them yourself.

Lazy evaluation by default

Look at the processing chain below.

[4]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()

What is potentially wrong with this? It carries out four operations, and we do not want to create a temporary file at each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them when you tell it to or when it has to. In the code example above, we have told nctoolkit what to do to the dataset, but have not told it to actually do any of it.

We can see this if we look at the current state of the dataset. It is still the starting point:

[5]:
ds.current
[5]:
['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']

If we want to evaluate this, we can use the run method, or methods such as plot that require the commands to be evaluated.

[6]:
ds.run()
ds.current
[6]:
['/tmp/nctoolkitrhzlwgvwnctoolkittmpzs2xhydf.nc']
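
The idea of deferring work until run is called can be illustrated with a toy class. This is a pure-Python sketch of the general pattern, not nctoolkit's implementation. LazyPipeline and its queue and run methods are hypothetical names, and a short list of numbers stands in for the data.

```python
class LazyPipeline:
    """Toy illustration of lazy evaluation: operations are queued, then run together."""

    def __init__(self, source):
        self.current = source  # state of the data, unchanged until run() is called
        self._pending = []     # operations recorded but not yet evaluated

    def queue(self, func):
        """Record an operation without evaluating it."""
        self._pending.append(func)

    def run(self):
        """Evaluate all pending operations in order and update the current state."""
        value = self.current
        for func in self._pending:
            value = func(value)
        self.current = value
        self._pending = []


pipe = LazyPipeline([10.0, 12.0, 14.0])
pipe.queue(lambda xs: [x + 273.15 for x in xs])  # unit conversion, not yet applied
pipe.queue(lambda xs: sum(xs) / len(xs))         # mean, not yet applied
print(pipe.current)  # still the starting data
pipe.run()
print(pipe.current)  # now the evaluated result
```

Because nothing is computed until run, no intermediate result has to be stored between the queued steps.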

This ability to combine operations into a single chain comes from Climate Data Operators (CDO), which is the backend computational engine for nctoolkit. nctoolkit does not require you to understand CDO, but if you want to see the underlying CDO commands used, just use the history attribute. In the example below, you can see that four lines of Python code have been converted to a single CDO command.

[7]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.history
[7]:
["cdo -fldmean -L -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15'"]

Then if we run this, we can see the full command used:

[8]:
ds.run()
ds.history
[8]:
["cdo -L -fldmean  -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15' https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc /tmp/nctoolkitrhzlwgvwnctoolkittmpzuyxvisb.nc"]
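
The shape of these commands follows from how CDO chains operators: each operator is prefixed with a hyphen and the chain is applied from right to left, so the first Python call ends up closest to the input file. The sketch below reproduces that ordering as plain string handling. build_cdo_command is a hypothetical helper, not the way nctoolkit builds its commands, and the file names are placeholders.

```python
def build_cdo_command(operators, infile, outfile):
    """Join CDO operators, given in the order they should be applied, into one command."""
    # CDO applies a chain right to left, so the first operation goes last in the chain
    chain = " ".join(f"-{op}" for op in reversed(operators))
    return f"cdo -L {chain} {infile} {outfile}"


steps = [
    "aexpr,'sst=sst+273.15'",     # ds.assign(sst = lambda x: x.sst + 273.15)
    "selmonth,1",                 # ds.select(months = 1)
    "sellonlatbox,-80,20,30,70",  # ds.crop(lon = [-80, 20], lat = [30, 70])
    "fldmean",                    # ds.spatial_mean()
]
print(build_cdo_command(steps, "in.nc", "out.nc"))
```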

If you want to visualize a dataset, you just need to use plot:

[9]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()
Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
[9]:

Method chaining

When you start to use nctoolkit, it is important to realize that it does not allow method chaining in the way pandas and xarray do. So the following will not work:

[10]:
(
    ds
    .tmean()
    .spatial_mean()
    .add(1)
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 (
----> 2     ds
      3     .tmean()
      4     .spatial_mean()
      5     .add(1)
      6 )

AttributeError: 'NoneType' object has no attribute 'spatial_mean'

This is because this type of method chaining requires each method to return an object. However, nctoolkit’s methods in general do not return anything. Instead, they modify the dataset in place, so the first call returns None and the next call in the chain fails.

You would need to do the following instead:

[11]:
ds.tmean()
ds.spatial_mean()
ds.add(1)
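
The difference between the two styles can be reproduced with a toy class whose methods modify the object and return None. Tally and apply_steps are hypothetical names used only for this illustration. apply_steps shows one way to keep a pipeline as a compact list while still calling the methods one at a time.

```python
class Tally:
    """Toy object whose methods modify it in place and return None."""

    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount  # no return statement, so the call returns None

    def double(self):
        self.value *= 2


def apply_steps(obj, steps):
    """Call each (method_name, args) pair on `obj` in turn, ignoring return values."""
    for name, args in steps:
        getattr(obj, name)(*args)
    return obj


tally = Tally()
try:
    tally.add(1).double()  # add() runs, then the chain fails because it returned None
except AttributeError as error:
    print("chaining failed:", error)

apply_steps(tally, [("add", (2,)), ("double", ())])
print(tally.value)
```

With a dataset, the equivalent call under the same assumptions would be apply_steps(ds, [("tmean", ()), ("spatial_mean", ()), ("add", (1,))]).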

Dataset attributes

You can find out key information about a dataset using its attributes. If you want to know the variables available in a dataset called ds, use the variables attribute:

[12]:
ds.variables
[12]:
['sst', 'valid_yr_count']

If you want more details about the variables, access the contents attribute. This will tell you details such as long names, units, number of time steps etc. for each variable.

[13]:
ds.contents
[13]:
   variable        ntimes  npoints  nlevels  long_name                                          unit  data_type
0  sst             1       64800    1        Long Term Mean Monthly Means of Global Sea Sur...  degC  F32
1  valid_yr_count  1       64800    1        count of non-missing values used in mean           None  I16

If you want to know the vertical levels available in the dataset, use the levels attribute:

[14]:
ds.levels
[14]:
[0.0]

If you want to know the files in a dataset, use the current attribute. nctoolkit works by generating temporary files, so if you have carried out any operations, this will show a list of temporary files.

[15]:
ds.current
[15]:
['/tmp/nctoolkitrhzlwgvwnctoolkittmprfua3zvb.nc']

If you want to find out what times are in the dataset, use the times attribute:

[16]:
ds.times
[16]:
[datetime.datetime(1, 1, 1, 0, 0)]

If you want to find out what months are in the dataset:

[17]:
ds.months
[17]:
[1]

If you want to find out what years are in the dataset:

[18]:
ds.years
[18]:
[1]

We can also access the history of operations carried out on the dataset. This will show the operations carried out by nctoolkit’s computational back-end CDO:

[19]:
ds.history
[19]:
['cdo -L -seltimestep,1 https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc /tmp/nctoolkitrhzlwgvwnctoolkittmprfua3zvb.nc',
 'cdo -addc,1 -fldmean -timmean -timmean']
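
To log the state of a dataset in one place, the attributes covered in this section can be gathered into a dictionary. This is only a sketch: summarize is a hypothetical helper, and the stand-in object below simply mimics the attribute values shown above so that the example runs without opening a dataset.

```python
import datetime
from types import SimpleNamespace

SUMMARY_ATTRIBUTES = ("variables", "levels", "times", "months", "years", "current")


def summarize(ds):
    """Collect the listed attributes of a dataset-like object into a dictionary."""
    return {name: getattr(ds, name) for name in SUMMARY_ATTRIBUTES}


# Stand-in object mimicking the attribute values shown in this section
fake_ds = SimpleNamespace(
    variables=["sst", "valid_yr_count"],
    levels=[0.0],
    times=[datetime.datetime(1, 1, 1, 0, 0)],
    months=[1],
    years=[1],
    current=["/tmp/nctoolkitrhzlwgvwnctoolkittmprfua3zvb.nc"],
)
for name, value in summarize(fake_ds).items():
    print(f"{name}: {value}")
```

Passing a real dataset in place of fake_ds should work in the same way, since the helper only reads the attributes by name.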