Matchups with point data

A common challenge when working with netCDF data is matching up with point data. This is often difficult because point data is sparse both spatially and temporally, and when working in the ocean this data can be at varying depths. From version 0.4.7 on, nctoolkit includes the ability to match datasets to spatiotemporal dataframes. Here we will provide an overview of how to do this.

Matching data at specific locations

First, we will illustrate how match_points works for data at specific spatial locations and depths. After this we will deal with different times. The data will be ocean nitrate from NOAA’s World Ocean Atlas.

We can download part of it as follows:

[1]:
import nctoolkit as nc
ds = nc.open_thredds('https://data.nodc.noaa.gov/thredds/dodsC/ncei/woa/nitrate/all/1.00/woa18_all_n01_01.nc', checks = False)
ds.crop(lon =  [-40, 20], lat = [40, 70], nco = True)
ds.subset(variables = "n_an")
ds.run()
nctoolkit is using Climate Data Operators version 1.9.10

This is a subset of the data covering a large part of the North Atlantic, and it has nitrate values from the sea surface to the sea floor.

[2]:
ds.plot()
/home/robert/miniconda3/envs/notebook/lib/python3.9/site-packages/ncplot/plot.py:181: UserWarning: Warning: xarray could not decode times!
[2]:
(plot output not shown)

Now, let’s say we had the following dataframe of 4 coordinates and depths. How would we identify the nitrate values using nctoolkit?

[3]:
import pandas as pd
df = pd.DataFrame({"lon":[-10, -12, -14, -16], "lat":[45, 50, 53, 55], "depth":[4, 2, 30, 40]})
df
[3]:
lon lat depth
0 -10 45 4
1 -12 50 2
2 -14 53 30
3 -16 55 40

Note: when matching a dataset to a dataframe, every column in the dataframe must be named one of the following: ‘lon’, ‘lat’, ‘depth’, ‘year’, ‘month’ or ‘day’.
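A quick check before calling match_points can catch misnamed columns early. The sketch below is not part of nctoolkit: check_matchup_columns is a hypothetical helper, and the ‘longitude’ and ‘latitude’ names are only an assumed example of how a raw dataframe might be labelled.

```python
import pandas as pd

# Column names listed in the note above.
ALLOWED_COLUMNS = {"lon", "lat", "depth", "year", "month", "day"}


def check_matchup_columns(df):
    """Hypothetical helper: return the column names that are not in the allowed set."""
    return sorted(set(df.columns) - ALLOWED_COLUMNS)


# Assumed example of a dataframe with non-standard column names.
raw = pd.DataFrame({"longitude": [-10, -12], "latitude": [45, 50], "depth": [4, 2]})
bad_before = check_matchup_columns(raw)

# Rename to the expected names, then check again.
clean = raw.rename(columns={"longitude": "lon", "latitude": "lat"})
bad_after = check_matchup_columns(clean)
print(bad_before, bad_after)
```

An empty list from the helper means the dataframe only uses the accepted names.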

If we want to match our dataset to this dataframe we use the match_points method as follows:

[4]:
ds.match_points(df)
Depths assumed to be [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 125.0, 150.0, 175.0, 200.0, 225.0, 250.0, 275.0, 300.0, 325.0, 350.0, 375.0, 400.0, 425.0, 450.0, 475.0, 500.0, 550.0, 600.0, 650.0, 700.0, 750.0, 800.0]
All variables will be used
Points will be matched for all time steps
[4]:
lon lat depth n_an day month year
0 -10 45 4 5.661312 16 1 1958
1 -12 50 2 8.932839 16 1 1958
2 -14 53 30 8.672163 16 1 1958
3 -16 55 40 6.973096 16 1 1958

We now have the matchups required. The match_points method returns a pandas dataframe with the desired matchups.

You will get messages from nctoolkit confirming some of the assumptions made when matching up. In most cases these can be ignored. The only exception is depths. nctoolkit will derive these from the dataset, but sometimes the derived values will not be appropriate. Keep an eye out for the message and explicitly provide depths if necessary.

Spatial matchup approach

The approach taken to matching up data spatially is as follows. First, data is regridded horizontally using bilinear interpolation to the lon/lat pairs provided. If depths are provided, the data is then interpolated vertically with 1d interpolation using scipy.
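The two steps can be illustrated on a toy grid. This is a minimal pure-Python sketch of the idea, not nctoolkit's implementation: the function names, the grid and its values are all made up for illustration, and linear interpolation is assumed for the vertical step.

```python
from bisect import bisect_right


def bracket(axis, x):
    """Return (i, w): x is placed on the segment axis[i]..axis[i + 1] and w is
    the weight of axis[i + 1]. Points outside the axis are extrapolated
    linearly from the end segment."""
    i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
    w = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, w


def bilinear(lons, lats, level, lon, lat):
    """Bilinear interpolation of one depth level, indexed as level[lat_index][lon_index]."""
    i, wx = bracket(lons, lon)
    j, wy = bracket(lats, lat)
    south = (1 - wx) * level[j][i] + wx * level[j][i + 1]
    north = (1 - wx) * level[j + 1][i] + wx * level[j + 1][i + 1]
    return (1 - wy) * south + wy * north


def match_point(lons, lats, depths, field, lon, lat, depth):
    """Horizontal step on every level first, then 1d linear interpolation in depth."""
    profile = [bilinear(lons, lats, level, lon, lat) for level in field]
    k, wz = bracket(depths, depth)
    return (1 - wz) * profile[k] + wz * profile[k + 1]


# Toy grid (made-up values): 2 depth levels, 2 latitudes, 2 longitudes.
lons, lats, depths = [-10.0, -9.0], [45.0, 46.0], [0.0, 5.0]
field = [
    [[1.0, 2.0], [3.0, 4.0]],  # values at 0 m
    [[5.0, 6.0], [7.0, 8.0]],  # values at 5 m
]
value = match_point(lons, lats, depths, field, lon=-9.5, lat=45.5, depth=4.0)
print(value)
```

Interpolating horizontally first reduces each point to a single vertical profile, so the depth step only has to work in one dimension.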

Spatiotemporal matchups

We will now illustrate how to do spatiotemporal matchups. This will be done with air temperature from the CMIP6 climate model GFDL-CM4. This is a large file, but it can be downloaded by clicking here. The dataset contains gridded daily air temperature for the earth between 1850 and 1869.

Let’s start by matching it up with the following dataframe:

[5]:
import nctoolkit as nc
import pandas as pd
df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3], "day":[2, 3]})
df
[5]:
lon lat year month day
0 50 50 1850 1 2
1 60 45 1852 3 3

This only contains two data points, but for different times. We can match up our dataset as before:

[6]:
ds = nc.open_data("tas_day_GFDL-CM4_historical_r1i1p1f1_gr1_18500101-18691231.nc", checks = False)
df_match = ds.match_points(df)
df_match
All variables will be used
[6]:
lon lat tas year month day
0 50.0 50.0 252.836945 1850 1 2
1 60.0 45.0 271.053040 1852 3 3

As expected, we now have a pandas dataframe with the surface air temperature for the locations and times specified.

The match_points method works in a similar way to the pandas merge method. So, if we specify only year and month and omit day, we get every day in those years and months, as follows:

[7]:
df_match = ds.match_points(df.drop(columns = "day"))
df_match
All variables will be used
[7]:
lon lat tas year month day
0 50.0 50.0 254.234680 1850 1 1
1 50.0 50.0 252.836945 1850 1 2
2 50.0 50.0 252.467865 1850 1 3
3 50.0 50.0 253.731049 1850 1 4
4 50.0 50.0 245.843506 1850 1 5
... ... ... ... ... ... ...
57 60.0 45.0 276.231720 1852 3 27
58 60.0 45.0 277.647888 1852 3 28
59 60.0 45.0 275.756226 1852 3 29
60 60.0 45.0 274.968018 1852 3 30
61 60.0 45.0 277.621979 1852 3 31

62 rows × 6 columns

We now have each day for the given times.
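The merge analogy can be reproduced with pandas alone. In this sketch the daily table is a made-up stand-in for the gridded dataset at a single location, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for daily values at one location (made-up numbers).
daily = pd.DataFrame({
    "year": [1850] * 4,
    "month": [1, 1, 1, 2],
    "day": [1, 2, 3, 1],
    "tas": [254.2, 252.8, 252.5, 250.0],
})

# Point data with a day column: the merge keys are year, month and day.
points = pd.DataFrame({"year": [1850], "month": [1], "day": [2]})
exact = points.merge(daily, on=["year", "month", "day"])

# Without the day column the keys are year and month only,
# so every day in the matching month is returned.
whole_month = points.drop(columns="day").merge(daily, on=["year", "month"])
print(len(exact), len(whole_month))
```

Dropping a time column from the point data widens the match in the same way that dropping a key widens a merge.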

Optional arguments

The match_points method provides optional arguments that can refine the matchup process. These arguments are variables, tmean, top and nan.

They work as follows. If you only want to match a subset of the variables in the dataset, use the variables argument:

[8]:
df_match = ds.match_points(df, variables = "tas")

In some cases, you have monthly point data, but your dataset has daily resolution. In this case you might want a monthly mean output. You can do this using the tmean argument:

[9]:
df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3]})
df_match = ds.match_points(df, tmean = True)
df_match
All variables will be used
[9]:
lon lat tas year month day
0 50.0 50.0 256.112976 1850 1 16
1 60.0 45.0 271.545959 1852 3 16

This works by applying the dataset tmean method to the dataset with the temporal grouping in df. In this case this is the equivalent of running ds.tmean(["year", "month"]) on the dataset.
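The grouping behind this can be pictured with a pandas groupby. The sketch below uses a made-up daily series at one location and only illustrates averaging over year and month. It does not call nctoolkit.

```python
import pandas as pd

# Toy daily series at a single location (made-up numbers).
daily = pd.DataFrame({
    "year": [1850, 1850, 1850, 1852, 1852],
    "month": [1, 1, 1, 3, 3],
    "day": [1, 2, 3, 1, 2],
    "tas": [254.0, 252.0, 253.0, 270.0, 272.0],
})

# Average over the time columns present in the point dataframe.
monthly = daily.groupby(["year", "month"], as_index=False)["tas"].mean()
print(monthly)
```

Each year and month combination collapses to a single mean value, which is then what gets matched to the monthly point data.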

When you have a multi-level dataset but only want the top level, you can set top=True in match_points. Similarly, if the dataset contains values that should be treated as missing, you can specify them using the nan argument.