# Datasets

## Opening datasets

There are three ways to create a dataset: open_data, open_url, or open_thredds.

If the data you want to analyze is available on your computer, use open_data. It accepts either a path to a single file or a list of files, and it also accepts wildcards.

If you want to use data that can be downloaded from a URL, use open_url. This downloads the netCDF files to a temporary folder, where they can then be analyzed.

If you want to analyze data that is available from a THREDDS server or via OPeNDAP, use open_thredds. The file paths should end with .nc.
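The choice between the three functions can be summarized as a small decision rule. The sketch below uses only the standard library and returns the name of the matching function. `pick_opener` is a hypothetical helper, not part of nctoolkit, and the test for `/dodsC/` in the URL is an assumption that holds for many THREDDS OPeNDAP endpoints but is not guaranteed.

```python
from urllib.parse import urlparse


def pick_opener(source):
    """Return the name of the nctoolkit opener that matches `source`.

    Hypothetical helper, not part of nctoolkit. Assumes OPeNDAP/THREDDS
    URLs contain "/dodsC/", which is common but not guaranteed.
    """
    # A list of paths can only refer to local files.
    if not isinstance(source, str):
        return "open_data"
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Remote data: either read from a THREDDS/OPeNDAP server
        # or downloaded to a temporary folder first.
        return "open_thredds" if "/dodsC/" in source else "open_url"
    # Anything else is treated as a local path or wildcard pattern.
    return "open_data"


# The first two sources are made-up examples; the third is the URL used on this page.
print(pick_opener("data/sst_*.nc"))
print(pick_opener("https://example.org/files/sst.nc"))
print(pick_opener("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc"))
```

With nctoolkit imported as nc, the returned name could then be looked up with `getattr(nc, pick_opener(source))(source)`.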

[1]:

import nctoolkit as nc

nctoolkit is using Climate Data Operators version 1.9.10


To get a quick overview of a dataset, use the contents attribute. This displays a dataframe listing the variables available in the dataset, along with details such as their units and long names. The example below opens a sea-surface temperature dataset and displays the contents.

[2]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.contents

[2]:

| | variable | ntimes | npoints | nlevels | long_name | unit |
|---|---|---|---|---|---|---|
| 0 | sst | 12 | 64800 | 1 | Long Term Mean Monthly Means of Global Sea Sur... | degC |
| 1 | valid_yr_count | 12 | 64800 | 1 | count of non-missing values used in mean | None |

## Modifying datasets

To modify a dataset, use nctoolkit's built-in methods. These methods operate directly on the dataset itself, so there is no need to assign the result to a new variable. The example below selects the first time step in a sea-surface temperature dataset and plots the result.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()

Behind each dataset are the files that represent its current state; once operations have been evaluated, these are temporary files. We can list them using the current attribute:

[3]:

ds.current

[3]:

['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']


In this case, the dataset is represented by a single file, which is still the original source because nothing has been evaluated yet. Any temporary files are generated and deleted as needed, so there should be no need to manage them yourself.

## Lazy evaluation by default

Look at the processing chain below.

[4]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()


What is potentially wrong with this? It carries out four operations, and we do not want to create a temporary file at each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them when you tell it to or when it has to. In the code example above, we have told nctoolkit what to do to the dataset, but have not told it to actually do any of it.

We can see this if we look at the current state of the dataset. It is still the starting point:

[5]:

ds.current

[5]:

['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']


To evaluate the queued operations, we can use the run method, or a method such as plot that requires the operations to be evaluated.

[6]:

ds.run()
ds.current

[6]:

['/tmp/nctoolkitaetxyejlnctoolkittmp7qcccy6y.nc']
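The behaviour above can be mimicked with a toy class that queues operations and only changes its state when run is called. This is a minimal sketch of the idea, not nctoolkit's implementation: `LazyDataset`, its `_pending` list, the placeholder output name and the input file name `sst.nc` are all hypothetical.

```python
class LazyDataset:
    """Toy model of deferred evaluation; not nctoolkit's implementation."""

    def __init__(self, source):
        self.current = [source]  # file(s) representing the evaluated state
        self._pending = []       # operations queued but not yet applied

    def select(self, **kwargs):
        # Record the request; nothing is computed here.
        self._pending.append(("select", kwargs))

    def spatial_mean(self):
        self._pending.append(("spatial_mean", {}))

    def run(self):
        if not self._pending:
            return  # nothing queued, so the state is unchanged
        # A real backend would apply all queued operations in one pass and
        # write a temporary file; here we only record a placeholder name.
        n_ops = len(self._pending)
        self.current = ["<temporary file after {} operations>".format(n_ops)]
        self._pending = []


ds = LazyDataset("sst.nc")  # hypothetical input file name
ds.select(months=1)
ds.spatial_mean()
print(ds.current)  # still the starting file
ds.run()
print(ds.current)  # now a placeholder for the single output file
```

Because the queue is only flushed in run, any number of method calls produces at most one new file.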


This method-chaining ability comes from Climate Data Operators (CDO), which is the backend computational engine for nctoolkit. nctoolkit does not require you to understand CDO, but if you want to see the underlying CDO commands used, just use the history attribute. In the example below, you can see that four lines of Python code have been converted to a single CDO command.

[7]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.select(months = 1)
ds.crop(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.history

[7]:

["cdo -fldmean -L -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15'"]


If we then run this, the history shows the full command used, including the input and output files:

[8]:

ds.run()
ds.history

[8]:

["cdo -L -fldmean  -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15' https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc /tmp/nctoolkitaetxyejlnctoolkittmpclo1evge.nc"]
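To make the structure of that command easier to read, the sketch below assembles an equivalent string from the four operators in the order the Python methods were called. CDO applies chained operators from right to left, which is why the first method called ends up next to the input file. `build_cdo_command` is a hypothetical helper written for illustration, not how nctoolkit builds its commands, and `in.nc` and `out.nc` are placeholder file names.

```python
def build_cdo_command(operators, infile, outfile):
    """Join operators, given in call order, into one chained CDO command."""
    # CDO applies chained operators from right to left, so the first
    # operation called must sit closest to the input file.
    chain = " ".join("-" + op for op in reversed(operators))
    return "cdo -L {} {} {}".format(chain, infile, outfile)


steps = [
    "aexpr,'sst=sst+273.15'",     # ds.assign(sst = lambda x: x.sst + 273.15)
    "selmonth,1",                 # ds.select(months = 1)
    "sellonlatbox,-80,20,30,70",  # ds.crop(lon = [-80, 20], lat = [30, 70])
    "fldmean",                    # ds.spatial_mean()
]
print(build_cdo_command(steps, "in.nc", "out.nc"))
```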


If you want to visualize a dataset, you just need to use plot:

[9]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.select(time = 0)
ds.plot()