Reading MDS Data¶
open_mdsdataset¶
All loading of data in xmitgcm occurs through the function open_mdsdataset
.
Its full documentation below enumerates all the possible options.
- xmitgcm.open_mdsdataset(data_dir, grid_dir=None, iters='all', prefix=None, read_grid=True, delta_t=1, ref_date=None, calendar='gregorian', levels=None, geometry='sphericalpolar', grid_vars_to_coords=True, swap_dims=None, endian='>', chunks=None, ignore_unknown_vars=False, default_dtype=None, nx=None, ny=None, nz=None, llc_method='smallchunks', extra_metadata=None, extra_variables=None)[source]¶
Open MITgcm-style mds (.data / .meta) file output as an xarray Dataset.
- Parameters
data_dir : string
Path to the directory where the mds .data and .meta files are stored
grid_dir : string, optional
Path to the directory where the grid files are stored, if different from data_dir
iters : list, optional
The iteration numbers of the files to be read. If None, no data files will be read. If 'all' (default), all iterations will be read.
prefix : list, optional
List of different filename prefixes to read. Default (None) is to read all available files.
read_grid : bool, optional
Whether to read the grid data
delta_t : number, optional
The timestep used in the model. (Can't be inferred.)
ref_date : string, optional
An ISO date string corresponding to the zero timestep, e.g. "1990-1-1 0:0:0" (see CF conventions [R1])
calendar : string, optional
A calendar allowed by CF conventions [R1]
levels : list or slice, optional
A list or slice of the indexes of the grid levels to read. Same syntax as in the data.diagnostics file.
geometry : {'sphericalpolar', 'cartesian', 'llc', 'curvilinear', 'cs'}
MITgcm grid geometry specifier
grid_vars_to_coords : boolean, optional
Whether to promote grid variables to coordinate status
swap_dims : boolean, optional
Whether to swap the logical dimensions for physical ones. If None, will be set to False for geometry=='llc' and True otherwise.
endian : {'=', '>', '<'}, optional
Endianness of variables. Default for MITgcm is '>' (big endian)
chunks : int or dict, optional
If chunks is provided, it is used to load the new dataset into dask arrays.
ignore_unknown_vars : boolean, optional
Don't raise an error if unknown variables are encountered while reading the dataset.
default_dtype : numpy.dtype, optional
A datatype to fall back on if the metadata can't be read.
nx, ny, nz : int, optional
The numerical dimensions of the model. These will be inferred from XC.meta and RC.meta if they are not specified. If geometry=='llc', ny does not have to be specified.
llc_method : {'smallchunks', 'bigchunks'}, optional
Which routine to use for reading LLC data. 'smallchunks' splits the file into an individual dask chunk of size (nx x nx) for each face of each level (i.e. the total number of chunks is 13 * nz). 'bigchunks' loads the whole raw data file (either into memory or as a numpy.memmap), splits it into faces, and concatenates those faces together using dask.array.concatenate. The two methods have different memory and I/O performance depending on the details of the system configuration.
extra_metadata : dict, optional
Allows passing information about an llc-type grid (global or regional). The additional metadata typically looks like:
- aste = {'has_faces': True, 'ny': 1350, 'nx': 270,
'ny_facets': [450, 0, 270, 180, 450],
'pad_before_y': [90, 0, 0, 0, 0],
'pad_after_y': [0, 0, 0, 90, 90],
'face_facets': [0, 0, 2, 3, 4, 4],
'facet_orders': ['C', 'C', 'C', 'F', 'F'],
'face_offsets': [0, 1, 0, 0, 0, 1],
'transpose_face': [False, False, False, True, True, True]}
For global llc grids, no extra metadata is required and the code defaults to the global llc configuration.
extra_variables : dict, optional
Allows passing variables not listed in variables.py or in available_diagnostics.log. extra_variables must be a dict with the variable names as keys; the corresponding values are dicts with the keys dims and attrs.
Syntax: extra_variables = dict(varname=dict(dims=list_of_dims, attrs=dict(optional_attrs))) where optional_attrs can contain standard_name, long_name, and units as keys.
Example: extra_variables = dict(ADJtheta=dict(dims=['k', 'j', 'i'], attrs=dict(standard_name='Sensitivity_to_theta', long_name='Sensitivity of cost function to theta', units='[J]/degC')))
- Returns
dset : xarray.Dataset
Dataset object containing all coordinates and variables.
The optional arguments are explained in more detail in the following sections and examples.
Lazy Evaluation and Dask Chunking¶
open_mdsdataset does not actually load all of the data into memory when it is invoked. Rather, it opens the files using numpy.memmap, which means that the data is not read until it is actually needed for a computation. This is a cheap form of lazy evaluation.
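The memmap mechanism can be illustrated with plain numpy. The following sketch writes a fake big-endian field to disk and opens it lazily; the file name is only an example of the MDS naming scheme, not real model output:

```python
import numpy as np
import os
import tempfile

# Write a fake big-endian 2 x 3 field to disk. MITgcm MDS .data files
# are raw big-endian binary; this file name is purely illustrative.
data = np.arange(6, dtype='>f4').reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), 'T.0000000090.data')
data.tofile(path)

# np.memmap opens the file without reading it; bytes are only fetched
# from disk when the array is actually sliced or computed on.
lazy = np.memmap(path, dtype='>f4', mode='r', shape=(2, 3))
print(float(lazy[1, 2]))  # 5.0 -- the read happens here, not at open time
```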
Additional performance benefits can be achieved with xarray dask chunking:
ds_chunked = open_mdsdataset(data_dir, chunks={'Z':1, 'Zl':1})
In the example above, each horizontal slice of the model is assigned to its own chunk; dask will automatically try to parallelize operations across these chunks using all your CPU cores. For this small example dataset, no performance boost is expected (due to the overhead of parallelization), but very large simulations will certainly benefit from chunking.
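The effect of the chunks argument can be sketched with a synthetic dataset in plain xarray (the dimensions and variable name below are stand-ins, not real model output):

```python
import numpy as np
import xarray as xr

# A stand-in for model output: one 3-D variable on a small grid.
ds = xr.Dataset({'U': (('Z', 'YC', 'XC'), np.random.rand(4, 8, 8))})

# Chunk along the vertical axis: each level becomes its own dask chunk,
# which dask can process in parallel.
ds_chunked = ds.chunk({'Z': 1})
print(ds_chunked.U.data.chunksize)  # (1, 8, 8)
```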
When chunking is applied, the data are represented by dask arrays, and all operations are evaluated lazily. No computation actually takes place until you call .load():
# take the mean of the squared zonal velocity
(ds_chunked.U**2).mean()
>>> <xarray.DataArray 'U' ()>
dask.array<mean_ag..., shape=(), dtype=float32, chunksize=()>
# load the value and execute the dask computational graph
(ds_chunked.U**2).mean().load()
>>> <xarray.DataArray 'U' ()>
array(0.00020325234800111502, dtype=float32)
Note
In certain cases, chunking is applied automatically. These cases are:
- If there is more than one timestep to be read (see Expected Files), the time dimension is automatically chunked.
- In llc simulations (see Grid Geometries), the face dimension is automatically chunked.
Expected Files¶
MITgcm writes MDS output as pairs of files with the suffixes *.data and *.meta. The model timestep iteration number is represented by a ten-digit number at the end of the filename, e.g. T.0000000090.data and T.0000000090.meta. MDS files without an iteration number are grid files.
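The naming convention can be sketched with a regular expression (the pattern below is an illustration of the convention, not xmitgcm's internal code):

```python
import re

# Data files end in a ten-digit iteration number; grid files do not.
mds_re = re.compile(r'^(?P<prefix>.+)\.(?P<iter>\d{10})\.data$')

m = mds_re.match('T.0000000090.data')
print(m.group('prefix'), int(m.group('iter')))  # T 90

# A grid file has no iteration number, so it does not match.
print(mds_re.match('XC.data'))  # None
```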
xmitgcm has certain expectations regarding the files that will be present in data_dir. In particular, it assumes data_dir is an MITgcm "run" directory.
By default, open_mdsdataset will read the grid files which describe the geometry of the computational domain. If these files are not present in data_dir, this behavior can be turned off by setting read_grid=False.
In order to determine the dimensions of the model domain, open_mdsdataset needs to peek at the metadata in two grid files: XC.meta and RC.meta (even when read_grid=False). If these files are not available, you have the option to manually specify the parameters nx, ny, and nz as keyword arguments to open_mdsdataset. (ny is not required for geometry='llc'.)
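As a rough sketch of what this "peek" involves: a .meta file is plain text whose dimList entry describes the array dimensions. The snippet below uses a hand-written stand-in for such a file and assumes each dimension is given as a (size, first index, last index) triple; it is not xmitgcm's actual parser:

```python
import re

# Hand-written stand-in for the contents of a .meta file.
meta_text = """
 nDims = [   2 ];
 dimList = [
    90,    1,   90,
    40,    1,   40
 ];
 dataprec = [ 'float32' ];
"""

# Grab everything between the dimList brackets, then take every third
# number, assuming (size, start index, end index) triples per dimension.
inner = re.search(r'dimList\s*=\s*\[(.*?)\]', meta_text, re.S).group(1)
nums = [int(n) for n in re.findall(r'\d+', inner)]
sizes = nums[0::3]
print(sizes)  # [90, 40]
```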
By default, open_mdsdataset attempts to read all the data files in data_dir. The files and iteration numbers to read are determined in the following way:
- First, data_dir is scanned to determine all iteration numbers present in the directory. To override this behavior and manually specify the iteration numbers to read, use the iters keyword argument, e.g. iters=[10,20,30].
- To determine the file prefixes to read, open_mdsdataset looks for all *.data filenames which match the first iteration number. To override this behavior, manually specify the file prefixes via the prefix keyword argument, e.g. prefix=['UTave', 'VTave'].
- open_mdsdataset then looks for each file prefix at each iteration number.
This approach works for the test examples, but perhaps it does not suit your model configuration. Suggestions are welcome on how to improve the discovery of files in the form of issues and pull requests.
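The discovery steps above can be sketched in a few lines (a simplified illustration against a toy directory, not xmitgcm's actual implementation):

```python
import os
import re
import tempfile

pat = re.compile(r'^(.+)\.(\d{10})\.data$')

# Build a toy "run" directory with two prefixes and two iterations.
run_dir = tempfile.mkdtemp()
for name in ['T.0000000000.data', 'S.0000000000.data', 'T.0000000010.data']:
    open(os.path.join(run_dir, name), 'w').close()

# Step 1: scan the directory for all iteration numbers present.
matches = [m for m in (pat.match(f) for f in os.listdir(run_dir)) if m]
iters = sorted({int(m.group(2)) for m in matches})

# Step 2: take the prefixes found at the first iteration number.
prefixes = sorted({m.group(1) for m in matches if int(m.group(2)) == iters[0]})
print(iters, prefixes)  # [0, 10] ['S', 'T']
```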
Warning
If you have certain file prefixes that are present at the first iteration (e.g. T.0000000000.data) but not at later iterations (e.g. iters=[0,10] but there is no T.0000000010.data file), open_mdsdataset will raise an error because it can't find the expected files. To overcome this you need to manually specify iters and / or prefix.
To determine the variable metadata, xmitgcm is able to parse the model's available_diagnostics.log file. If you use diagnostic output, the available_diagnostics.log file corresponding with your model run should be present in data_dir.
Note
If the available_diagnostics.log file can't be found, a default version will be used. This could lead to problems, since you may have custom diagnostics enabled in your run that are not present in the default. The default available_diagnostics.log file was taken from the ECCOv4 global_oce_llc90 experiment.
For non-diagnostic output (e.g. default "state" or "timeave" output), xmitgcm assigns the variable metadata based on filenames. The additional metadata makes the internal representation of the model variables more verbose and ensures compliance with CF Conventions.
Dimensions and Coordinates¶
One major advantage of using xarray to represent data is that the variable
dimensions are labeled, much like netCDF data structures. This labeling
enables much clearer code. For example, to take a time average of a Dataset,
one just says ds.mean(dim='time')
without having to remember which logical
axis is the time dimension.
xmitgcm distinguishes between logical dimensions and physical dimensions or coordinates. open_mdsdataset will attempt to assign physical dimensions to the data. The physical dimensions correspond to the axes of the MITgcm grids in cartesian or sphericalpolar coordinates. The standard names have been assigned according to CF Conventions.
| name | standard_name |
|---|---|
| time | time |
| XC | longitude |
| YC | latitude |
| XG | longitude_at_f_location |
| YG | latitude_at_f_location |
| Zl | depth_at_lower_w_location |
| Zu | depth_at_upper_w_location |
| Z | depth |
| Zp1 | depth_at_w_location |
| layer_1RHO_center | ocean_layer_coordinate_1RHO_center |
| layer_1RHO_interface | ocean_layer_coordinate_1RHO_interface |
| layer_1RHO_bounds | ocean_layer_coordinate_1RHO_bounds |
The physical dimensions of typical variables are:
print(ds.THETA.dims)
>>> ('time', 'Z', 'YC', 'XC')
print(ds.UVEL.dims)
>>> ('time', 'Z', 'YC', 'XG')
print(ds.VVEL.dims)
>>> ('time', 'Z', 'YG', 'XC')
print(ds.WVEL.dims)
>>> ('time', 'Zl', 'YC', 'XC')
In order for physical dimensions to be assigned, open_mdsdataset must be invoked with read_grid=True (default). For a more minimalistic approach, one can use read_grid=False and work with logical dimensions. Logical dimensions can also be chosen by explicitly setting swap_dims=False, even with read_grid=True.
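In plain xarray terms, the swap amounts to xarray's own swap_dims operation, replacing a logical index dimension with a physical coordinate defined along it. A schematic sketch with made-up values:

```python
import numpy as np
import xarray as xr

# A 1-D tracer on the logical dimension 'i', with the physical
# longitude coordinate XC defined along that dimension.
ds = xr.Dataset(
    {'THETA': (('i',), np.arange(4.0))},
    coords={'XC': ('i', np.linspace(0.5, 3.5, 4))})

# Swap the logical dimension for the physical one.
ds_phys = ds.swap_dims({'i': 'XC'})
print(ds_phys.THETA.dims)  # ('XC',)
```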
Physical dimensions only work with geometry=='cartesian' or geometry=='sphericalpolar'. For geometry=='llc' or geometry=='curvilinear', it is not possible to replace the logical dimensions with physical dimensions, and setting swap_dims=True will raise an error.
Logical dimensions follow the naming conventions of the MITgcm numerical grid. The dimension names are augmented with metadata attributes from the Comodo conventions. These logical spatial dimensions are
| name | standard_name | axis | c_grid_axis_shift |
|---|---|---|---|
| i | x_grid_index | X | |
| i_g | x_grid_index_at_u_location | X | -0.5 |
| j | y_grid_index | Y | |
| j_g | y_grid_index_at_v_location | Y | -0.5 |
| k | z_grid_index | Z | |
| k_l | z_grid_index_at_lower_w_location | Z | -0.5 |
| k_u | z_grid_index_at_upper_w_location | Z | 0.5 |
| k_p1 | z_grid_index_at_w_location | Z | (-0.5, 0.5) |
As explained in the Comodo documentation, the use of different dimensions is necessary to represent the fact that, in c-grid ocean models, different variables are staggered in different ways with respect to the model cells. For example, tracers and the velocity components all have different logical dimensions:
print(ds.THETA.dims)
>>> ('time', 'k', 'j', 'i')
print(ds.UVEL.dims)
>>> ('time', 'k', 'j', 'i_g')
print(ds.VVEL.dims)
>>> ('time', 'k', 'j_g', 'i')
print(ds.WVEL.dims)
>>> ('time', 'k_l', 'j', 'i')
xarray distinguishes between "coordinates" and "data_vars". By default, open_mdsdataset will promote all grid variables to coordinates. To turn off this behavior and treat grid variables as data_vars, use grid_vars_to_coords=False.
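The distinction corresponds to xarray's set_coords machinery; a small sketch with a made-up grid variable:

```python
import numpy as np
import xarray as xr

# 'rA' stands in for a grid variable (cell area); 'THETA' for model data.
ds = xr.Dataset({'THETA': (('i',), np.arange(3.0)),
                 'rA': (('i',), np.ones(3))})
print('rA' in ds.data_vars)  # True: initially an ordinary data variable

# Promote the grid variable to coordinate status.
ds2 = ds.set_coords('rA')
print('rA' in ds2.coords)    # True
```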
Warning
For vertical coordinates (Zl, k_l, etc.) the notion of "lower" and "upper" always refers to the logical grid position, NOT the physical grid position. This can be a major point of confusion. In array index space, the point k_l is indeed lower than k_u; however, for typical MITgcm ocean model configurations, in physical space the point k_l is physically above k_u. This is because the Z axis starts from the ocean surface and increases downward. In contrast, for atmospheric configurations, which start from the ground and increase upwards, the logical and physical positions are self-consistent.
Time¶
open_mdsdataset attempts to determine the time dimension based on iters. However, additional input is required from the user to fully exploit this capability. If the user specifies delta_t, the numerical timestep used for the MITgcm simulation, it is used to multiply iters to determine the time in seconds. Additionally, if the user specifies ref_date (an ISO date string, e.g. "1990-1-1 0:0:0"), the time dimension will be converted into a datetime index, exposing all sorts of useful timeseries functionality within xarray.
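The decoding described above amounts to an offset of iters * delta_t seconds from ref_date; a sketch in plain pandas with made-up numbers:

```python
import numpy as np
import pandas as pd

iters = np.array([0, 10, 20])   # iteration numbers (illustrative)
delta_t = 3600                  # model timestep in seconds (illustrative)
ref_date = '1990-1-1 0:0:0'     # the zero-timestep date

# time in seconds = iteration number * timestep, offset from ref_date
time = pd.to_datetime(ref_date) + pd.to_timedelta(iters * delta_t, unit='s')
print(time[1])  # 1990-01-01 10:00:00
```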
Grid Geometries¶
The grid geometry is not inferred; it must be specified via the geometry keyword. xmitgcm currently supports four MITgcm grid geometries: cartesian, sphericalpolar, curvilinear, and llc. The first two are straightforward. The curvilinear geometry is used for curvilinear cartesian grids. The llc ("lat-lon-cap") geometry is more complicated. This grid consists of four distinct faces of the same size plus a smaller north-polar cap. Each face has a distinct relationship between its logical dimensions and its physical coordinates. Because netCDF and xarray.Dataset data structures do not support this sort of complex geometry (multiple faces of different sizes), our approach, inspired by nc-tiles, is to split the domain into 13 equal-sized "faces". face then becomes an additional dimension of the data.
To download an example llc dataset, run the following shell commands:
$ curl -L -J -O https://ndownloader.figshare.com/files/6494721
$ tar -xvzf global_oce_llc90.tar.gz
And to read it, in python:
ds_llc = open_mdsdataset('./global_oce_llc90/', iters=8, geometry='llc')
print(ds_llc['S'].dims)
>>> ('time', 'k', 'face', 'j', 'i')
xmitgcm is not nearly as comprehensive as gcmfaces. It does not offer sophisticated operations involving exchanges at face boundaries, integrals across sections, etc. The goal of this package is simply to read the mds data. However, by outputting an xarray data structure, we can use all of xarray's usual capabilities on the llc data, for example:
# calculate the mean squared salinity as a function of depth
(ds_llc.S**2).mean(dim=['face', 'j', 'i'])
>>> <xarray.DataArray 'S' (time: 1, k: 50)>
dask.array<mean_ag..., shape=(1, 50), dtype=float32, chunksize=(1, 50)>
Coordinates:
* k (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
iter (time) int64 8
* time (time) int64 8
Z (k) >f4 -5.0 -15.0 -25.0 -35.0 -45.0 -55.0 -65.0 -75.005 ...
PHrefC (k) >f4 49.05 147.15 245.25 343.35 441.45 539.55 637.65 735.799 ...
drF (k) >f4 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.01 10.03 10.11 ...
(Note that this simple example does not perform correct volume weighting or land masking in the average.)
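For reference, a volume-weighted mean can be written in plain xarray; the weights below are made-up stand-ins (in a real calculation they would be built from grid variables such as drF, rA, and hFacC):

```python
import numpy as np
import xarray as xr

# Made-up salinity-like field and cell volumes on a tiny (k, i) grid.
S = xr.DataArray(np.array([[1., 2., 3.], [4., 5., 6.]]), dims=['k', 'i'])
vol = xr.DataArray(np.array([[1., 1., 1.], [2., 2., 2.]]), dims=['k', 'i'])

# Weighted mean over the horizontal dimension, one value per level.
weighted_mean = (S * vol).sum(dim='i') / vol.sum(dim='i')
print(weighted_mean.values)  # [2. 5.]
```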
open_mdsdataset offers two different strategies for reading LLC data, llc_method='smallchunks' (the default) and llc_method='bigchunks'. The details and tradeoffs of these strategies are described in Performance Issues.