Matchups with point data
A common challenge when working with netCDF data is matching gridded fields to point data. This is often difficult because point data is sparse both spatially and temporally, and ocean observations can also be at varying depths. From version 0.4.7 onwards, nctoolkit can match datasets to spatiotemporal dataframes. This section gives an overview of how to do this.
Matching data at specific locations
First, we will illustrate how match_points works for data at specific spatial locations and depths. After this we will deal with different times. The data will be ocean nitrate from NOAA’s World Ocean Atlas.
We can download part of it as follows:
[1]:
import nctoolkit as nc
ds = nc.open_thredds('https://data.nodc.noaa.gov/thredds/dodsC/ncei/woa/nitrate/all/1.00/woa18_all_n01_01.nc', checks = False)
ds.crop(lon = [-40, 20], lat = [40, 70], nco = True)
ds.subset(variables = "n_an")
ds.run()
nctoolkit is using Climate Data Operators version 1.9.10
This is a subset of the data covering a large part of the North Atlantic, and it has nitrate values from the sea surface to the sea floor.
[2]:
ds.plot()
/home/robert/miniconda3/envs/notebook/lib/python3.9/site-packages/ncplot/plot.py:181: UserWarning: Warning: xarray could not decode times!
[2]:
Now, let’s say we had the following dataframe of 4 coordinates and depths. How would we identify the nitrate values using nctoolkit?
[3]:
import pandas as pd
df = pd.DataFrame({"lon":[-10, -12, -14, -16], "lat":[45, 50, 53, 55], "depth":[4, 2, 30, 40]})
df
[3]:
|   | lon | lat | depth |
|---|---|---|---|
| 0 | -10 | 45 | 4 |
| 1 | -12 | 50 | 2 |
| 2 | -14 | 53 | 30 |
| 3 | -16 | 55 | 40 |
Note: if we are matching datasets to dataframes, each dataframe column must have one of the following names: ‘lon’, ‘lat’, ‘depth’, ‘year’, ‘month’ or ‘day’.
If we want to match our dataset to this dataframe we use the match_points method as follows:
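Before calling match_points it can help to check the dataframe's column names against the list in the note above. The helper below is a minimal sketch: `ALLOWED_COLUMNS` and `unexpected_columns` are hypothetical names introduced here for illustration and are not part of nctoolkit.

```python
# Column names that the note above lists as valid for matchups.
ALLOWED_COLUMNS = {"lon", "lat", "depth", "year", "month", "day"}


def unexpected_columns(columns):
    """Return, in sorted order, the names that are not in the documented list."""
    return sorted(set(columns) - ALLOWED_COLUMNS)


# With a pandas dataframe you could pass df.columns; plain lists work too.
print(unexpected_columns(["lon", "lat", "depth"]))        # nothing to rename
print(unexpected_columns(["longitude", "lat", "Depth"]))  # two names to rename
```

Any names returned would need to be renamed, for example with the pandas `rename` method, before matching.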
[4]:
ds.match_points(df)
Depths assumed to be [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 125.0, 150.0, 175.0, 200.0, 225.0, 250.0, 275.0, 300.0, 325.0, 350.0, 375.0, 400.0, 425.0, 450.0, 475.0, 500.0, 550.0, 600.0, 650.0, 700.0, 750.0, 800.0]
All variables will be used
Points will be matched for all time steps
[4]:
|   | lon | lat | depth | n_an | day | month | year |
|---|---|---|---|---|---|---|---|
| 0 | -10 | 45 | 4 | 5.661312 | 16 | 1 | 1958 |
| 1 | -12 | 50 | 2 | 8.932839 | 16 | 1 | 1958 |
| 2 | -14 | 53 | 30 | 8.672163 | 16 | 1 | 1958 |
| 3 | -16 | 55 | 40 | 6.973096 | 16 | 1 | 1958 |
We now have the matchups required. The match_points method returns a pandas dataframe with the desired matchups.
You will get messages from nctoolkit confirming some of the assumptions made when matching up. In most cases these can be ignored. The only exception is with depths. nctoolkit will derive these from the dataset, but sometimes this will not be appropriate. Keep an eye out for the message and explicitly provide depths if necessary.
Spatial matchup approach
The approach taken to matching up data spatially is as follows. First, data is regridded horizontally using bilinear interpolation to the lon/lat pairs provided. If depths are provided, the data is then interpolated vertically with 1d interpolation using scipy.
Spatiotemporal matchups
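The two steps described above can be sketched on a small synthetic grid. This is only an illustration of the idea, not nctoolkit's implementation: the functions `bilinear` and `match_point` are hypothetical, the grid is made up, and `np.interp` stands in for the scipy 1d interpolation mentioned above.

```python
import numpy as np


def bilinear(lons, lats, field, lon, lat):
    """Bilinearly interpolate a 2D field[lat, lon] to a single lon/lat point."""
    # Find the grid cell that contains the point (axes assumed ascending).
    i = np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2)
    j = np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2)
    # Fractional position of the point inside that cell.
    tx = (lon - lons[i]) / (lons[i + 1] - lons[i])
    ty = (lat - lats[j]) / (lats[j + 1] - lats[j])
    south = (1 - tx) * field[j, i] + tx * field[j, i + 1]
    north = (1 - tx) * field[j + 1, i] + tx * field[j + 1, i + 1]
    return (1 - ty) * south + ty * north


def match_point(lons, lats, depths, data, lon, lat, depth):
    """Horizontal interpolation on every level, then 1D vertical interpolation."""
    profile = [bilinear(lons, lats, data[k], lon, lat) for k in range(len(depths))]
    return float(np.interp(depth, depths, profile))


# Synthetic field: value = lon + 2 * lat + 0.1 * depth, stored as data[depth, lat, lon].
lons = np.array([-12.0, -11.0, -10.0])
lats = np.array([44.0, 45.0, 46.0])
depths = np.array([0.0, 5.0, 10.0])
D, LA, LO = np.meshgrid(depths, lats, lons, indexing="ij")
data = LO + 2 * LA + 0.1 * D

value = match_point(lons, lats, depths, data, lon=-10.5, lat=45.25, depth=4.0)
print(value)
```

Because the synthetic field is linear in each coordinate, the interpolated value should agree with the formula evaluated at the point itself, which makes the sketch easy to check by hand.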
We will now illustrate how to do spatiotemporal matchups. This will be done with air temperature from the CMIP6 climate model GFDL-CM4. This is a large file, but it can be downloaded by clicking here. The dataset contains gridded daily air temperature for the earth between 1850 and 1869.
Let’s start by matching it up with the following dataframe:
[5]:
import nctoolkit as nc
import pandas as pd
df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3], "day":[2, 3]})
df
[5]:
|   | lon | lat | year | month | day |
|---|---|---|---|---|---|
| 0 | 50 | 50 | 1850 | 1 | 2 |
| 1 | 60 | 45 | 1852 | 3 | 3 |
This only contains two data points, but for different times. We can match up our dataset as before:
[6]:
ds = nc.open_data("tas_day_GFDL-CM4_historical_r1i1p1f1_gr1_18500101-18691231.nc", checks = False)
df_match = ds.match_points(df)
df_match
All variables will be used
[6]:
|   | lon | lat | tas | year | month | day |
|---|---|---|---|---|---|---|
| 0 | 50.0 | 50.0 | 252.836945 | 1850 | 1 | 2 |
| 1 | 60.0 | 45.0 | 271.053040 | 1852 | 3 | 3 |
As expected, we now have a pandas dataframe with the surface air temperature for the locations and times specified.
The match_points method works in a similar way to the pandas merge method. So, if we specified only year and month and left out day, we would get every day in those years and months, as follows:
[7]:
df_match = ds.match_points(df.drop(columns = "day"))
df_match
All variables will be used
[7]:
|   | lon | lat | tas | year | month | day |
|---|---|---|---|---|---|---|
| 0 | 50.0 | 50.0 | 254.234680 | 1850 | 1 | 1 |
| 1 | 50.0 | 50.0 | 252.836945 | 1850 | 1 | 2 |
| 2 | 50.0 | 50.0 | 252.467865 | 1850 | 1 | 3 |
| 3 | 50.0 | 50.0 | 253.731049 | 1850 | 1 | 4 |
| 4 | 50.0 | 50.0 | 245.843506 | 1850 | 1 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 57 | 60.0 | 45.0 | 276.231720 | 1852 | 3 | 27 |
| 58 | 60.0 | 45.0 | 277.647888 | 1852 | 3 | 28 |
| 59 | 60.0 | 45.0 | 275.756226 | 1852 | 3 | 29 |
| 60 | 60.0 | 45.0 | 274.968018 | 1852 | 3 | 30 |
| 61 | 60.0 | 45.0 | 277.621979 | 1852 | 3 | 31 |
62 rows × 6 columns
We now have each day for the given times.
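The merge analogy can be reproduced with plain pandas. In this sketch, `times` is a hypothetical stand-in for the time steps of a daily dataset (no netCDF file is read, and a standard calendar is assumed), and `points` mirrors the dataframe with the day column dropped.

```python
import pandas as pd

# Points with year and month only, as in df.drop(columns="day") above.
points = pd.DataFrame({"lon": [50, 60], "lat": [50, 45],
                       "year": [1850, 1852], "month": [1, 3]})

# Stand-in for the time steps of a daily dataset covering 1850 to 1852.
dates = pd.date_range("1850-01-01", "1852-12-31", freq="D")
times = pd.DataFrame({"year": dates.year, "month": dates.month, "day": dates.day})

# Merging on the shared columns keeps every day in the requested months.
matched = points.merge(times, on=["year", "month"])
print(len(matched))
```

January and March each have 31 days, so the merge yields the same 62 rows as the matchup above.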
Optional arguments
The match_points method provides optional arguments that can refine the matchup process. These arguments are variables, tmean, top and nan.
They work as follows. If you only want a subset of variables, use variables, as follows:
[8]:
df_match = ds.match_points(df, variables = "tas")
In some cases, you have monthly point data, but your dataset has daily resolution. In this case you might want a monthly mean output. You can do this using the tmean argument:
[9]:
df = pd.DataFrame({"lon": [50, 60], "lat": [50, 45], "year":[1850, 1852], "month":[1, 3]})
df_match = ds.match_points(df, tmean = True)
df_match
All variables will be used
[9]:
|   | lon | lat | tas | year | month | day |
|---|---|---|---|---|---|---|
| 0 | 50.0 | 50.0 | 256.112976 | 1850 | 1 | 16 |
| 1 | 60.0 | 45.0 | 271.545959 | 1852 | 3 | 16 |
This works by applying the dataset tmean method to the dataset with the temporal grouping in df. In this case this is the equivalent of running ds.tmean(["year", "month"]) on the dataset.
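The effect of grouping by year and month can be illustrated with a pandas groupby. This is a sketch on made-up values at a single location, not the code nctoolkit runs: the `daily` dataframe and its `value` column are hypothetical.

```python
import pandas as pd

# Hypothetical daily series at one location; the value is just the day of the month.
dates = pd.date_range("1850-01-01", "1850-02-28", freq="D")
daily = pd.DataFrame({"year": dates.year, "month": dates.month,
                      "day": dates.day, "value": dates.day.astype(float)})

# Average over the grouping columns present in the point dataframe (year and month).
monthly = daily.groupby(["year", "month"], as_index=False)["value"].mean()
print(monthly)
```

Each year and month combination collapses to a single mean value, which is what a monthly point observation would be compared against.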
When you have a multi-level dataset, but only want the top level, you can set top=True in match_points. Similarly, if you have values in the dataset that should be set to missing values, you set them using the nan argument.
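To see why marking values as missing matters, consider a fill value left in the data. The sketch below uses numpy on made-up numbers, with -999.0 as an assumed sentinel. It illustrates the idea only and does not call nctoolkit.

```python
import numpy as np

# Hypothetical matched values where -999.0 marks "no data".
values = np.array([5.7, -999.0, 8.9, 7.0])
sentinel = -999.0

# Replace the sentinel with NaN so it cannot contaminate statistics.
masked = np.where(values == sentinel, np.nan, values)

print(values.mean())       # dragged far below zero by the sentinel
print(np.nanmean(masked))  # mean of the three valid values
```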