Matchups with point data

A common challenge when working with netCDF data is matching it up with point data. This is often difficult because point data is sparse both spatially and temporally, and ocean observations can also be at varying depths. From version 0.4.7 onwards, nctoolkit can match datasets to spatiotemporal dataframes. This section gives an overview of how to do this.

Matching data at specific locations

First, we will illustrate how match_points works for data at specific spatial locations and depths. After this we will deal with different times. The data will be ocean nitrate from NOAA’s World Ocean Atlas.

We can download part of it as follows:

[1]:

import nctoolkit as nc
ds = nc.open_thredds('https://data.nodc.noaa.gov/thredds/dodsC/ncei/woa/nitrate/all/1.00/woa18_all_n01_01.nc', checks = False)
ds.crop(lon =  [-40, 20], lat = [40, 70], nco = True)
ds.subset(variables = "n_an")
ds.run()

nctoolkit is using Climate Data Operators version 1.9.10

This is a subset of the data covering a large part of the North Atlantic, and it has nitrate values from the sea surface to the sea floor.

[2]:

ds.plot()

/home/robert/miniconda3/envs/notebook/lib/python3.9/site-packages/ncplot/plot.py:181: UserWarning: Warning: xarray could not decode times!

[2]:

Now, let’s say we had the following dataframe of 4 coordinates and depths. How would we identify the nitrate values using nctoolkit?

[3]:

import pandas as pd
df = pd.DataFrame({"lon":[-10, -12, -14, -16], "lat":[45, 50, 53, 55], "depth":[4, 2, 30, 40]})
df

[3]:

	lon	lat	depth
0	-10	45	4
1	-12	50	2
2	-14	53	30
3	-16	55	40

Note: if we are matching datasets to dataframes, the dataframe columns must be named one of the following: ‘lon’, ‘lat’, ‘depth’, ‘year’, ‘month’ or ‘day’.
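A quick check before matching can catch misnamed columns early. The helper below is a hypothetical sketch and not part of nctoolkit; the set of allowed names is taken directly from the note above.

```python
# Column names accepted for matchups, per the note above.
ALLOWED_COLUMNS = {"lon", "lat", "depth", "year", "month", "day"}


def unexpected_columns(columns):
    """Return the names that are not in ALLOWED_COLUMNS, in their original order.

    `columns` can be any iterable of names, e.g. `df.columns` of a pandas dataframe.
    This is a hypothetical helper, not part of nctoolkit.
    """
    return [name for name in columns if name not in ALLOWED_COLUMNS]


# "longitude" and "Depth" would need renaming before matching.
print(unexpected_columns(["longitude", "lat", "Depth"]))
```

An empty list means every column name is one of the accepted names. The check is case-sensitive because the note lists the names in lower case.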

If we want to match our dataset to this dataframe we use the match_points method as follows:

[4]:

ds.match_points(df)

Depths assumed to be [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 125.0, 150.0, 175.0, 200.0, 225.0, 250.0, 275.0, 300.0, 325.0, 350.0, 375.0, 400.0, 425.0, 450.0, 475.0, 500.0, 550.0, 600.0, 650.0, 700.0, 750.0, 800.0]
All variables will be used
Points will be matched for all time steps

[4]:

	lon	lat	depth	n_an	day	month	year
0	-10	45	4	5.661312	16	1	1958
1	-12	50	2	8.932839	16	1	1958
2	-14	53	30	8.672163	16	1	1958
3	-16	55	40	6.973096	16	1	1958

We now have the matchups required. The match_points method returns a pandas dataframe with the desired matchups.

You will get messages from nctoolkit confirming some of the assumptions made when matching up. In most cases these can be ignored. The only exception is depths. nctoolkit will derive these from the dataset, but sometimes this will not be appropriate. Keep an eye out for the message and explicitly provide depths if necessary.

Spatial matchup approach

The approach taken to matching up data spatially is as follows. First, data is regridded horizontally using bilinear interpolation to the lon/lat pairs provided. If depths are provided, the data is then interpolated vertically with 1d interpolation using scipy.
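The two-step idea can be illustrated with a small pure-Python sketch. This is not nctoolkit's implementation; it only shows the arithmetic of interpolating horizontally on each depth level and then linearly in depth. The function names (bracket, bilinear, match_point) and the toy field are assumptions made for illustration.

```python
from bisect import bisect_right


def bracket(axis, x):
    """Return (i, w): x lies between axis[i] and axis[i + 1]; w is the weight of axis[i + 1]."""
    if not axis[0] <= x <= axis[-1]:
        raise ValueError(f"{x} is outside the axis range")
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    w = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, w


def bilinear(level, lons, lats, lon, lat):
    """Bilinear interpolation on one depth level; level[j][i] is the value at lats[j], lons[i]."""
    i, wx = bracket(lons, lon)
    j, wy = bracket(lats, lat)
    south = (1 - wx) * level[j][i] + wx * level[j][i + 1]
    north = (1 - wx) * level[j + 1][i] + wx * level[j + 1][i + 1]
    return (1 - wy) * south + wy * north


def match_point(field, lons, lats, depths, lon, lat, depth):
    """Interpolate horizontally on every level, then linearly in depth."""
    profile = [bilinear(level, lons, lats, lon, lat) for level in field]
    k, wz = bracket(depths, depth)
    return (1 - wz) * profile[k] + wz * profile[k + 1]


# Toy field: value = lon + lat + depth, so the exact answer is known.
lons, lats, depths = [-12.0, -11.0, -10.0], [44.0, 45.0, 46.0], [0.0, 5.0, 10.0]
field = [[[x + y + z for x in lons] for y in lats] for z in depths]
print(match_point(field, lons, lats, depths, lon=-10.5, lat=45.25, depth=4.0))
```

Because the toy field is linear in all three coordinates, the result should equal lon + lat + depth (38.75 here) up to floating-point rounding. Real fields are not linear, so the choice of depth levels matters, which is why the "Depths assumed to be" message is worth checking.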

Spatiotemporal matchups

We will now illustrate how to do spatiotemporal matchups. This will be done with air temperature from the CMIP6 climate model GFDL-CM4. This is a large file, but it can be downloaded by clicking here. As the file name used below indicates, the dataset contains gridded daily air temperature for the earth between 1850 and 1869.

Let’s start by matching it up with the following dataframe:

[5]:

import nctoolkit as nc
import pandas as pd
df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3], "day":[2, 3]})
df

[5]:

	lon	lat	year	month	day
0	50	50	1850	1	2
1	60	45	1852	3	3

This only contains two data points, but for different times. We can match up our dataset as before:

[6]:

ds = nc.open_data("tas_day_GFDL-CM4_historical_r1i1p1f1_gr1_18500101-18691231.nc", checks = False)
df_match = ds.match_points(df)
df_match

All variables will be used

[6]:

	lon	lat	tas	year	month	day
0	50.0	50.0	252.836945	1850	1	2
1	60.0	45.0	271.053040	1852	3	3

As expected, we now have a pandas dataframe with the surface air temperature for the locations and times specified.

The match_points method works in a similar way to the pandas merge method. So, if we only specify year and month, and leave out day, we get every day for those years and months, as follows:

[7]:

df_match = ds.match_points(df.drop(columns = "day"))
df_match

All variables will be used

[7]:

	lon	lat	tas	year	month	day
0	50.0	50.0	254.234680	1850	1	1
1	50.0	50.0	252.836945	1850	1	2
2	50.0	50.0	252.467865	1850	1	3
3	50.0	50.0	253.731049	1850	1	4
4	50.0	50.0	245.843506	1850	1	5
...	...	...	...	...	...	...
57	60.0	45.0	276.231720	1852	3	27
58	60.0	45.0	277.647888	1852	3	28
59	60.0	45.0	275.756226	1852	3	29
60	60.0	45.0	274.968018	1852	3	30
61	60.0	45.0	277.621979	1852	3	31

62 rows × 6 columns

We now have each day for the given times.
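The merge analogy can be seen with pandas alone. In this sketch a small dataframe stands in for the dataset values at a single location; the temperatures are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for the gridded data extracted at one location: one row per day.
# The values are made up for illustration.
model = pd.DataFrame({
    "year": [1850, 1850, 1850, 1850],
    "month": [1, 1, 1, 2],
    "day": [1, 2, 3, 1],
    "tas": [254.2, 252.8, 252.5, 250.0],
})

# Point data with a day column: the merge keeps only the exact date.
points_daily = pd.DataFrame({"year": [1850], "month": [1], "day": [2]})
exact = points_daily.merge(model, on=["year", "month", "day"])

# Point data without a day column: every day in that year and month matches.
points_monthly = points_daily.drop(columns="day")
expanded = points_monthly.merge(model, on=["year", "month"])

print(len(exact), len(expanded))
```

Dropping the day column removes it from the join keys, so one point row expands to one row per matching day, which mirrors the 62-row result above.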

Optional arguments

The match_points method provides optional arguments that can refine the matchup process. These arguments are variables, tmean, top and nan.

They work as follows. If you only wanted to select a subset of variables you would use variables, as follows:

[8]:

df_match = ds.match_points(df, variables = "tas")

In some cases, you have monthly point data, but your dataset has daily resolution. In this case you might want a monthly mean output. You can do this using the tmean argument:

[9]:

df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3]})
df_match = ds.match_points(df, tmean = True)
df_match

All variables will be used

[9]:

	lon	lat	tas	year	month	day
0	50.0	50.0	256.112976	1850	1	16
1	60.0	45.0	271.545959	1852	3	16

This works by applying the dataset's tmean method using the temporal grouping found in df. In this case it is equivalent to running ds.tmean(["year", "month"]) on the dataset.
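The grouping itself can be illustrated without any netCDF data. The sketch below averages daily values within each (year, month) pair, which is the grouping implied by a dataframe that has year and month columns but no day column. The numbers are made up, and this is only an illustration of the averaging step, not nctoolkit's code.

```python
from collections import defaultdict
from statistics import fmean

# Daily values as (year, month, day, tas); made-up numbers for illustration.
daily = [
    (1850, 1, 1, 254.0),
    (1850, 1, 2, 252.0),
    (1850, 2, 1, 250.0),
    (1850, 2, 2, 248.0),
]

# Group by the temporal columns present in the point data: year and month.
groups = defaultdict(list)
for year, month, _day, tas in daily:
    groups[(year, month)].append(tas)

# One mean per (year, month) group.
monthly_mean = {key: fmean(values) for key, values in groups.items()}
print(monthly_mean)
```

Each group collapses to a single value, which is why the tmean output above has one row per point rather than one row per day.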

When you have a multi-level dataset, but only want the top level, you can set top=True in match_points. Similarly, if you have values in the dataset that should be set to missing values, you set them using the nan argument.