# Datasets¶

## Data format requirements¶

nctoolkit requires NetCDF data that follow the GDT, COARDS or CF conventions. Its computational backend is CDO, which can carry out most operations even if a file is not fully compliant with those conventions. In general, most data producers follow the CF conventions when generating NetCDF files; if you are unsure whether you are working with compliant files, you can check them using the check method described below.

## Opening datasets¶

There are 3 ways to create a dataset: open_data, open_url or open_thredds.

If the data you want to analyze is available on your computer use open_data. This will accept either a path to a single file or a list of files. It will also accept wildcards.
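Wildcard handling amounts to expanding the pattern into a list of matching files, which can be sketched with the standard library's glob module (an illustration of the idea, not nctoolkit's actual code; the directory and file names below are hypothetical):

```python
import glob
import os
import tempfile

# Create a few hypothetical NetCDF files to match against
tmp_dir = tempfile.mkdtemp()
for name in ["sst_2000.nc", "sst_2001.nc", "notes.txt"]:
    open(os.path.join(tmp_dir, name), "w").close()

# A wildcard such as "sst_*.nc" expands to the matching paths
paths = sorted(glob.glob(os.path.join(tmp_dir, "sst_*.nc")))
print([os.path.basename(p) for p in paths])  # only the .nc files match
```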

If you want to use data that can be downloaded from a URL, just use open_url. This will download the netCDF files to a temporary folder, where they can then be analyzed.

If you want to analyze data that is available from a thredds server or OPeNDAP, then use open_thredds. The file paths should end with .nc.

[1]:

import nctoolkit as nc

nctoolkit is using Climate Data Operators version 2.0.5


If you want to get a quick overview of the contents of a dataset, we can use the contents attribute. This will display a dataframe showing the variables available in the dataset and details about the variable, such as the units and long names. The example below opens a sea-surface temperature dataset and displays the contents.

[2]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds

[2]:

<nctoolkit.DataSet>:
Number of files: 1
File contents:
         variable  ntimes  npoints  nlevels                                                       long_name  unit data_type
0             sst      12    64800        1  Long Term Mean Monthly Means of Global Sea Surface Temperature  degC       F32
1  valid_yr_count      12    64800        1                        count of non-missing values used in mean  None       I16


## Checking validity of source data¶

nctoolkit should work out of the box with most NetCDF data. However, it is possible that the format of the data is incompatible with the system libraries used by nctoolkit, or that the files are corrupt. To carry out a general check on the data, use the check method as follows:

[3]:

ds.check()

*****************************************
Checking data types
*****************************************
The variable I16 has integer data type. Consider setting data type to float 'F64' or 'F32' using set_precision.
*****************************************
Running CF-compliance checks
*****************************************
Issue with variable: sst
------------------
ERROR: Invalid attribute name: _ChunkSizes

------------------
*****************************************
Checking grid consistency
*****************************************


This will carry out some basic checks on data format compatibility. You should install the cfchecker package if you want check to run CF-compliance checks.

If you want to check whether the files in a dataset are corrupt, the following should tell you. This simply reads the data in the source files and writes it to a temporary file, which should be sufficient to confirm the files are not corrupt.
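The idea behind this check, read the source data and write it back out, failing if either step errors, can be sketched in plain Python (a simplified byte-level illustration; nctoolkit itself performs the round trip through CDO on the NetCDF data):

```python
import os
import shutil
import tempfile

def roundtrip_ok(path):
    """Return True if the file can be read and rewritten without error."""
    try:
        with open(path, "rb") as src, tempfile.NamedTemporaryFile() as tmp:
            shutil.copyfileobj(src, tmp)  # read + write in one pass
        return True
    except OSError:
        return False

# Hypothetical example: a small readable file passes, a missing file fails
fd, demo = tempfile.mkstemp()
os.write(fd, b"dummy data")
os.close(fd)
print(roundtrip_ok(demo))              # True
print(roundtrip_ok("/no/such/file"))   # False
```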

[4]:

ds.is_corrupt()

[4]:

False


## Modifying datasets¶

If you want to modify a dataset, you just need to use nctoolkit’s built in methods. These methods operate directly on the dataset itself. The example below selects the first time step in a sea surface temperature dataset and plots the result.

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.subset(time = 0)
ds.plot()

Under the hood, a dataset points to one or more files, temporary or otherwise, representing its current state. We can access these using the current attribute:

[5]:

ds.current

[5]:

['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']


In this case, the dataset still points at a single file: the original source. Any temporary files are generated and deleted as needed, so there should be no need to manage them yourself.

## Lazy evaluation by default¶

Look at the processing chain below.

[6]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.subset(months = 1)
ds.subset(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()


What is potentially wrong with this? It carries out four operations, and we absolutely do not want to create a temporary file at each step. So instead of evaluating the operations line by line, nctoolkit only evaluates them either when you tell it to or when it has to. In the code example above we have told nctoolkit what to do to the dataset, but have not told it to actually do any of it.
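The deferred-evaluation pattern can be sketched in plain Python: each method call appends an operation to a queue, and nothing is computed until run() is called (a toy illustration of the idea, not nctoolkit's internals):

```python
class LazyDataset:
    """Toy dataset that queues operations instead of running them."""

    def __init__(self, source):
        self.current = source     # state only changes when run() is called
        self._pending = []        # queued operation descriptions

    def subset(self, **kwargs):
        self._pending.append(("subset", kwargs))

    def spatial_mean(self):
        self._pending.append(("spatial_mean", {}))

    def run(self):
        # Evaluate all queued operations in one go, then clear the queue.
        for name, args in self._pending:
            self.current = f"{self.current} | {name}({args})"
        self._pending = []

ds_toy = LazyDataset("sst.nc")
ds_toy.subset(months=1)
ds_toy.spatial_mean()
print(ds_toy.current)   # still "sst.nc": nothing evaluated yet
before = ds_toy.current
ds_toy.run()
print(ds_toy.current)   # now reflects both operations
```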

We can see this if we look at the current state of the dataset. It is still the starting point:

[7]:

ds.current

[7]:

['https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc']


If we want to evaluate this, we can use the run method, or methods such as plot that require the commands to be evaluated.

[8]:

ds.run()
ds.current

[8]:

['/tmp/nctoolkitcvbxvolgnctoolkittmp_p_p_7rk.nc']


This method chaining ability within nctoolkit comes from Climate Data Operators (CDO), which is the backend computational engine for nctoolkit. nctoolkit does not require you to understand CDO, but if you want to see the underlying CDO commands used, just use the history attribute. In the example below, you can see that four lines of Python code have been converted to a single CDO command.

[9]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.assign(sst = lambda x: x.sst + 273.15)
ds.subset(months = 1)
ds.subset(lon = [-80, 20], lat = [30, 70])
ds.spatial_mean()
ds.history

[9]:

["cdo -fldmean -L -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15'"]


Then if we run this, we can see the full command used:

[10]:

ds.run()
ds.history

[10]:

["cdo -L -fldmean  -sellonlatbox,-80,20,30,70 -selmonth,1 -aexpr,'sst=sst+273.15' https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc /tmp/nctoolkitcvbxvolgnctoolkittmpi3vm6otf.nc"]

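The way several Python calls collapse into one CDO call can be sketched as simple operator chaining: each method contributes one CDO operator, and the operators are joined into a single command string (a simplified illustration with hypothetical file names; the real command construction is handled by nctoolkit and CDO):

```python
def build_cdo_command(operators, infile, outfile):
    """Join a list of CDO operators into one chained command string."""
    # CDO applies chained operators right-to-left, so the last-queued
    # operation (e.g. fldmean) appears first on the command line.
    chain = " ".join(reversed(operators))
    return f"cdo -L {chain} {infile} {outfile}"

ops = [
    "-aexpr,'sst=sst+273.15'",        # assign: convert sst to Kelvin
    "-selmonth,1",                    # subset to January
    "-sellonlatbox,-80,20,30,70",     # subset lon/lat box
    "-fldmean",                       # spatial mean
]
cmd = build_cdo_command(ops, "sst.mon.ltm.1981-2010.nc", "out.nc")
print(cmd)
```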

If you want to visualize a dataset, you just need to use plot:

[11]:

ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.subset(time = 0)
ds.plot()