Datasets

Opening datasets

There are 3 ways to create a dataset: open_data, open_url or open_thredds.

If the data you want to analyze is available on your computer use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards.

If you want to use data that can be downloaded from a url, just use open_url. This will download the netCDF files to a temporary folder, and it can then be analyzed.

If you want to analyze data that is available from a thredds server or OPeNDAP, then user open_thredds. The file paths should end with .nc.

[1]:
import nctoolkit as nc
nctoolkit is using Climate Data Operators version 1.9.10

If you want to get a quick overview of the contents of a dataset, we can use the contents attribute. This will display a dataframe showing the variables available in the dataset and details about the variable, such as the units and long names. The example below opens a sea-surface temperature dataset and displays the contents.

[2]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.contents
[2]:
variable ntimes npoints nlevels long_name unit
0 sst 12 64800 1 Long Term Mean Monthly Means of Global Sea Sur... degC
1 valid_yr_count 12 64800 1 count of non-missing values used in mean None

Modifying datasets

If you want to modify a dataset, you just need to use nctoolkit’s built in methods. These methods operate directly on the dataset itself. The example below selects the first time step in a sea surface temperature dataset and plots the result.

ds = nc.open_thredds(“https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc”) ds.select(time = 0) ds.plot()

Underlying datasets are temporary files representing the current state of the dataset. We can access this using the current attribute:

[3]:
ds.current
[3]:
['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']

In this case, we have a single temporary file. Any temporary files will be generated and deleted, as needed, so there should be no need to manage them yourself.

Lazy evaluation by default

Look at the processing chain below.

[4]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()

What is potentially wrong with this? It carries out four operations, so we absolutely do not want to create temporary file in each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them either when you tell it to or it has to. So in the code example above we have told, nctoolkit what to do to that dataset, but have not told it to actually do any of it.

We can see this if we look at the current state of the dataset. It is still the starting point:

[5]:
ds.current
[5]:
['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']

If we want to evaluate this we can use the run method or methods such as plot that require commands to be evaluated.

[6]:
ds.run()
ds.current
[6]:
['/tmp/nctoolkitaetxyejlnctoolkittmp7qcccy6y.nc']

This method chaining ability within nctoolkit comes from Climate Data Operators (CDO), which is the backend computational engine for nctoolkit. nctoolkit does not require you to understand CDO, but if you want to see the underlying CDO commands used, just use the history attribute. In the example, below, you can see that 4 lines of Python code have been converted to a single CDO command.

[7]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.history
[7]:
["cdo -fldmean -L -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15'"]

Then if we run this, we can see the full command used:

[8]:
ds.run()
ds.history
[8]:
["cdo -L -fldmean  -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15' https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc /tmp/nctoolkitaetxyejlnctoolkittmpclo1evge.nc"]

If you want to visualize a dataset, you just need to use plot:

[9]:
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()
Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
[9]:

Dataset attributes

You can find out key information about a dataset using its attributes. If you want to know the variables available in a dataset called ds, we would do:

[ ]:
ds.variables

If you want more details about the variables, access the contents attribute. This will tell you details such as long names, units, number of time steps etc. for each variable.

[ ]:
ds.contents

If you want to know the vertical levels available in the dataset, we use the following.

[ ]:
ds.levels

If you want to know the files in a dataset, we would do this. nctoolkit works by generating temporary files, so if you have carried out any operations, this will show a list of temporary files.

[ ]:
ds.current

If you want to find out what times are in the dataset we do this:

[ ]:
ds.times

If you want to find out what months are in the dataset:

[ ]:
ds.months

If you want to find out what years are in the dataset:

[ ]:
ds.years

We can also access the history of operations carried out on the dataset. This will show the operations carried out by nctoolkit’s computational back-end CDO:

[ ]:
ds.history