Ensemble methods

Merging files with different variables

This notebook outlines some general methods for comparing multiple files. We will work with two sea surface temperature data sets, one from NOAA and one from the Met Office Hadley Centre.

[1]:
import nctoolkit as nc
import pandas as pd
import xarray as xr
import numpy as np

Let’s start by downloading the files using wget. Uncomment the code below to do this (note: the HadISST file is gzip-compressed, so you will need to decompress it before opening it):

[2]:
# ! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
# ! wget https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz

With the files downloaded, the first step is to open the data. We will start by creating a separate dataset for each file.

[3]:
sst_noaa = nc.open_data("sst.mon.mean.nc")
sst_hadley = nc.open_data("HadISST_sst.nc")

We can see that both datasets label sea surface temperature as sst. We will need to rename the variable in each so that the two can be told apart once they are merged.

[4]:
sst_noaa.variables
[4]:
['sst']
[5]:
sst_hadley.variables
[5]:
['time_bnds', 'sst']
[6]:
sst_noaa.rename({"sst":"noaa"})
sst_hadley.rename({"sst":"hadley"})

The data sets also cover different time periods, and only overlap between 1870 and 2018, so we will need to select those years.

[7]:
sst_noaa.select_years(range(1870, 2019))
sst_hadley.select_years(range(1870, 2019))
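The calls above pass 2019 as the end of the range because Python's range stops before its end value. A quick check in plain Python, with no nctoolkit involved, confirms which years are kept:

```python
# The same range passed to select_years above; range() excludes its end value.
years = list(range(1870, 2019))

print(years[0], years[-1])  # first and last year kept
print(len(years))           # number of calendar years in the overlap
```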

We also have a problem in that there are two horizontal grids in the Hadley Centre file. We can solve this by selecting only the sea surface temperature variable, which is now named hadley. This drops time_bnds.

[8]:
sst_hadley.select_variables("hadley")

At this point, the datasets cover the same months and have the same number of time steps. However, their grids still differ, so we want to unify them by regridding one dataset onto the other’s grid. This can be done using regrid, which accepts another dataset, as here, or any grid of your choosing.

[9]:
sst_noaa.regrid(grid = sst_hadley)
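To give a feel for what regridding involves, here is a minimal one-dimensional sketch using NumPy. It is not how nctoolkit performs the operation internally. The coordinates and temperatures are made up, and linear interpolation is only one of several possible methods.

```python
import numpy as np

# Hypothetical source grid: cell centres at whole degrees of latitude.
src_lat = np.array([0.0, 1.0, 2.0, 3.0])
src_sst = np.array([28.0, 27.0, 25.0, 24.0])  # made-up temperatures in deg C

# Hypothetical target grid: cell centres offset by half a degree.
dst_lat = np.array([0.5, 1.5, 2.5])

# Linear interpolation of the source values at the target coordinates.
dst_sst = np.interp(dst_lat, src_lat, src_sst)
print(dst_sst)
```

Each target value is a blend of its two nearest source neighbours, which is why regridded fields tend to be slightly smoother than the originals.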

We now have two separate datasets. Let’s merge them into a single new dataset. When doing this we need to make sure missing values (NAs) are treated properly. In this case, some Hadley Centre values are not stored as NAs as they should be, so we fix that by setting values between -9000 and -900 to missing. The merge method also requires strict matching of the dates in the files being merged. The Hadley Centre and NOAA data sets both give monthly means, but use a different day of the month. So we set match to ["year", "month"], which ensures there are no mismatches.

[10]:
all_sst = nc.merge(sst_noaa, sst_hadley, match = ["year", "month"])
all_sst.set_missing([-9000, -900])
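The reason for matching on year and month only can be sketched in plain Python. The time stamps below are hypothetical, and month_key is an illustrative helper rather than part of nctoolkit.

```python
from datetime import date

# Hypothetical time stamps: both products are monthly means, but each
# labels a month with a different day (these days are made up).
noaa_times = [date(1870, 1, 1), date(1870, 2, 1), date(1870, 3, 1)]
hadley_times = [date(1870, 1, 16), date(1870, 2, 14), date(1870, 3, 16)]


def month_key(d):
    """Reduce a date to the (year, month) pair used for matching."""
    return (d.year, d.month)


# Exact dates never line up, so a strict comparison finds nothing in common.
exact_matches = set(noaa_times) & set(hadley_times)

# Comparing only year and month pairs every time step with its partner.
keyed_matches = [
    (n, h)
    for n, h in zip(noaa_times, hadley_times)
    if month_key(n) == month_key(h)
]

print(len(exact_matches), len(keyed_matches))
```

With full dates no time steps agree, while the year and month keys pair every step with its counterpart.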

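Let’s work out how the global mean SST changed over the time period. We take the spatial mean at each time step, average it by year, and then smooth it with a 10-year rolling mean. Note that this will not be totally accurate, as there are some missing values here and there that might bias things.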

[11]:
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
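The three reductions above can be sketched with NumPy on synthetic data. This is an illustration only. The grid, the random values, and the cos(latitude) weighting are assumptions, and the weighting is a stand-in for an area-weighted mean rather than a description of what nctoolkit does internally.

```python
import numpy as np

# Synthetic data: 30 years of monthly values on a tiny 3 x 4 lat-lon grid.
rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 30, 3, 4
lat = np.array([-60.0, 0.0, 60.0])
sst = 15.0 + rng.normal(size=(n_years * 12, n_lat, n_lon))

# 1. Spatial mean: weight each latitude row by cos(latitude) as a stand-in
#    for cell area, then average over the grid at every time step.
weights = np.cos(np.deg2rad(lat))[:, None] * np.ones((n_lat, n_lon))
spatial = (sst * weights).sum(axis=(1, 2)) / weights.sum()

# 2. Annual mean: group the monthly series into rows of 12 and average.
annual = spatial.reshape(n_years, 12).mean(axis=1)

# 3. Rolling mean: a 10-year window leaves n_years - 10 + 1 values.
window = 10
rolling = np.convolve(annual, np.ones(window) / window, mode="valid")

print(spatial.shape, annual.shape, rolling.shape)
```

Each step shortens the time axis: from one value per month, to one per year, to one per complete 10-year window.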
[12]:
all_sst.plot()


We can also work out the difference between the two. Here we will work out the monthly bias per cell, then calculate the mean global difference per year, and finally calculate a rolling 10-year mean.

[13]:
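# Build a multi-file dataset from the two processed datasets, then merge them
all_sst = nc.open_data([sst_noaa.current, sst_hadley.current])
all_sst.merge(match = ["year", "month"])
# Monthly bias per cell
all_sst.transmute({"bias":"hadley-noaa"})
# Treat values between -9000 and -900 as missing
all_sst.set_missing([-9000, -900])
# Global mean per time step, then per year, then a 10-year rolling mean
all_sst.spatial_mean()
all_sst.annual_mean()
all_sst.rolling_mean(10)
all_sst.plot()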


You can see that there is a notable difference at the start of the time series.
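The bias calculation masks missing values after the subtraction rather than before. A small NumPy sketch with made-up numbers shows why that can still work. The value -1000.0 is a hypothetical sentinel standing in for a Hadley Centre cell that should be missing.

```python
import numpy as np

# Made-up values for four grid cells in one month (deg C). The -1000.0 in
# the Hadley array is a hypothetical sentinel standing in for a missing cell.
noaa = np.array([10.0, 12.0, -1.8, 20.0])
hadley = np.array([10.5, 11.0, -1000.0, 20.5])

# Cell-by-cell bias, the same expression passed to transmute above.
bias = hadley - noaa

# The sentinel survives the subtraction as a large negative number, so a
# range mask applied afterwards still catches it.
bias = np.where((bias >= -9000) & (bias <= -900), np.nan, bias)

# Average over the cells that are left, ignoring the masked one.
mean_bias = np.nanmean(bias)
print(bias, mean_bias)
```

This only holds while the sentinel is far larger in magnitude than any real temperature difference. Masking before the subtraction would be the safer order in general.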

Merging files with different times

TBC

Ensemble averaging

TBC