In massively parallel environments, an often difficult problem is
the reading and writing of data to files on disk. MPI-IO and MPI-2 I/O
are moving toward providing this capability, but are not yet
widely implemented. Further, the API is rather abstruse.
mpp_io_mod is an attempt at a simple API encompassing a
certain variety of the I/O tasks that will be required. It does not
attempt to be an all-encompassing standard such as MPI; however, it
can be implemented in MPI if so desired. It is equally simple to add
parallel I/O capability to mpp_io_mod based on vendor-specific
APIs while providing a layer of insulation for user codes.
The mpp_io_mod parallel I/O API is built on top of the
mpp_domains_mod and mpp_mod APIs for domain decomposition and
message passing. Features of mpp_io_mod include:
1) Simple, minimal API, with free access to underlying API for more
complicated stuff.
2) Self-describing files: comprehensive header information
(metadata) in the file itself.
3) Strong focus on performance of parallel write: the climate models
for which it is designed typically read a minimal amount of data
(typically at the beginning of the run), but tend to write copious
amounts of data during the run. An interface for reading is also
supplied, but its performance has not yet been optimized.
4) Integrated netCDF capability: netCDF is a
data format widely used in the climate/weather modeling
community. netCDF is considered the principal medium of data storage
for mpp_io_mod. But I provide a raw unformatted
Fortran I/O capability in case netCDF is not an option, either due to
unavailability, inappropriateness, or poor performance.
5) May require off-line post-processing: a tool for this purpose,
mppnccombine, is available. GFDL users may use
~hnv/pub/mppnccombine. Outside users may obtain the source
here. It can be compiled with any C compiler and linked with the
netCDF library. The program is free and is covered by the
GPL license.
The internal representation of the data being written out is
assumed to be the default real type, which can be 4-byte or 8-byte.
Time data is always written in 8 bytes to avoid overflow on climatic
time scales in units of seconds.
I/O modes in mpp_io_mod
The I/O activity critical to performance in the models for which
mpp_io_mod is designed is typically the writing of large
datasets on a model grid volume produced at intervals during
a run. Consider a 3D grid volume, where model arrays are stored as
(i,j,k). The domain decomposition is typically along
i or j: thus, to store the data to disk as a global
volume, the distributed chunks of data have to be seen as
non-contiguous. If we attempt to have all PEs write this data into a
single file, performance can be seriously compromised because of the
data reordering that will be required. Alternatives are to have
one PE acquire all the data and write it out, or to have all the PEs
write independent files, which are recombined offline. These three
modes of operation are described in the mpp_io_mod terminology
in terms of two parameters, threading and fileset,
as follows:
1) Single-threaded I/O: a single PE acquires all the data
and writes it out.
2) Multi-threaded, single-fileset I/O: many PEs write to a
single file.
3) Multi-threaded, multi-fileset I/O: many PEs write to
independent files. This is also called distributed I/O.
The middle option is the one for which it is most difficult to achieve
good performance. The choice of one of these modes is made when a file
is opened for I/O, in mpp_open.
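As an illustration, the mode might be selected at the time the file is
opened, roughly as follows. This is only a sketch: it assumes the
MPP_WRONLY, MPP_NETCDF, MPP_SINGLE and MPP_MULTI flags and the
threading/fileset arguments of mpp_open described in its call syntax
below, and the file name is hypothetical.
! single-threaded I/O: one PE acquires and writes all the data
call mpp_open( unit, 'fields.nc', action=MPP_WRONLY, form=MPP_NETCDF, &
     threading=MPP_SINGLE )
! multi-threaded, single-fileset I/O: many PEs write to one file
call mpp_open( unit, 'fields.nc', action=MPP_WRONLY, form=MPP_NETCDF, &
     threading=MPP_MULTI, fileset=MPP_SINGLE )
! multi-threaded, multi-fileset (distributed) I/O: each PE writes its own file
call mpp_open( unit, 'fields.nc', action=MPP_WRONLY, form=MPP_NETCDF, &
     threading=MPP_MULTI, fileset=MPP_MULTI )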
Metadata in mpp_io_mod
A requirement of the design of mpp_io_mod is that the file must
be entirely self-describing: comprehensive header information
describing its contents is present in the header of every file. The
header information follows the model of netCDF. Variables in the file
are divided into axes and fields. An axis describes a
co-ordinate variable, e.g. x, y, z or t. A field consists of data in
the space described by the axes. An axis is described in
mpp_io_mod using the defined type axistype:
type, public :: axistype
   sequence
   character(len=128) :: name
   character(len=128) :: units
   character(len=256) :: longname
   character(len=8) :: cartesian
   integer :: len
   integer :: sense            !+/-1, depth or height?
   type(domain1D), pointer :: domain
   real, dimension(:), pointer :: data
   integer :: id, did
   integer :: type             ! external NetCDF type format for axis data
   integer :: natt
   type(atttype), pointer :: Att(:) ! axis attributes
end type axistype
A field is described using the type
fieldtype:
type, public :: fieldtype
   sequence
   character(len=128) :: name
   character(len=128) :: units
   character(len=256) :: longname
   real :: min, max, missing, fill, scale, add
   integer :: pack
   type(axistype), dimension(:), pointer :: axes
   integer, dimension(:), pointer :: size
   integer :: time_axis_index
   integer :: id
   integer :: type             ! external NetCDF format for field data
   integer :: natt, ndim
   type(atttype), pointer :: Att(:) ! field metadata
end type fieldtype
An attribute (global, field or axis) is described using the
atttype:
type, public :: atttype
   sequence
   integer :: type, len
   character(len=128) :: name
   character(len=256) :: catt
   real(FLOAT_KIND), pointer :: fatt(:)
end type atttype
This default set of field attributes corresponds
closely to various conventions established for netCDF files. The
pack attribute of a field defines whether or not a
field is to be packed on output. Allowed values of
pack are 1, 2, 4 and 8. The value of
pack is the number of variables written into 8
bytes. In typical use, we write 4-byte reals to netCDF output; thus
the default value of pack is 2. For pack = 4 or 8, packing uses a
simple-minded linear scaling scheme using the scale and add
attributes. There is thus likely to be a significant loss of dynamic
range with packing. When a field is declared to be packed, the
missing and fill attributes, if supplied, are packed also.
Please note that the pack values have the same meaning even if the
default real is 4 bytes, i.e. pack=1 still follows the definition
above and writes out 8 bytes.
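As an illustration, packing might be controlled per field when its
metadata are declared with mpp_write_meta (described below), as in the
following sketch. It assumes the optional pack, scale and add arguments
of mpp_write_meta; u and v are hypothetical variables of type
fieldtype, the scale/add values are arbitrary, and x, y, z, t are axes
as in the example further below.
! write field 'u' unpacked, as 8-byte reals (pack=1)
call mpp_write_meta( unit, u, (/x,y,z,t/), 'u', 'm/s', 'zonal velocity', &
     missing=-1e36, pack=1 )
! write field 'v' packed into 2-byte values (pack=4) via linear scale/add
call mpp_write_meta( unit, v, (/x,y,z,t/), 'v', 'm/s', 'meridional velocity', &
     missing=-1e36, scale=0.01, add=0., pack=4 )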
A set of attributes for each variable is also available. The
variable definitions and attribute information are written/read by
calling mpp_write_meta or mpp_read_meta. A typical calling
sequence for writing data might be:
...
type(domain2D), dimension(:), allocatable, target :: domain
type(fieldtype) :: field
type(axistype) :: x, y, z, t
...
call mpp_define_domains( (/1,nx,1,ny/), domain )
allocate( a(domain(pe)%x%data%start_index:domain(pe)%x%data%end_index, &
            domain(pe)%y%data%start_index:domain(pe)%y%data%end_index,nz) )
...
call mpp_write_meta( unit, x, 'X', 'km', 'X distance', &
     domain=domain(pe)%x, data=(/(float(i),i=1,nx)/) )
call mpp_write_meta( unit, y, 'Y', 'km', 'Y distance', &
     domain=domain(pe)%y, data=(/(float(i),i=1,ny)/) )
call mpp_write_meta( unit, z, 'Z', 'km', 'Z distance', &
     data=(/(float(i),i=1,nz)/) )
call mpp_write_meta( unit, t, 'Time', 'second', 'Time' )
call mpp_write_meta( unit, field, (/x,y,z,t/), 'a', '(m/s)', 'AAA', &
     missing=-1e36 )
...
call mpp_write( unit, x )
call mpp_write( unit, y )
call mpp_write( unit, z )
...
In this example, x and y have been
declared as distributed axes, since a domain decomposition has been
associated. z and t are undistributed axes. t is known to be a
record axis (netCDF terminology) since we do not allocate the
data element of the axistype. Only one record axis may be
associated with a file. The call to mpp_write_meta initializes
the axes, and associates a unique variable ID with each axis. The call
to mpp_write_meta with argument field declares
field to be a 4D variable that is a function of (x,y,z,t), and a
unique variable ID is associated with it. A 3D field will be written
at each call to mpp_write(field).
The data for any variable, including axes, are written by mpp_write.
Any additional attributes of variables can be added through
subsequent mpp_write_meta calls, using the variable ID as a
handle. Global attributes, associated with the dataset as a
whole, can also be written thus. See the mpp_write_meta call syntax
below for further details.
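For example, an additional attribute might be attached to the field
declared above via its ID, and a global attribute written with no ID
at all. This is a sketch: it assumes the attribute forms of
mpp_write_meta (the rval/cval keyword arguments) given in the call
syntax below, and the attribute names and values are hypothetical.
! an extra attribute on the field, using its variable ID as a handle
call mpp_write_meta( unit, field%id, 'valid_range', rval=(/-50.,50./) )
! a global attribute, associated with the file as a whole
call mpp_write_meta( unit, 'title', cval='example mpp_io output' )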
You cannot interleave calls to mpp_write and
mpp_write_meta: the first call to mpp_write implies that metadata
specification is complete.
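To complete the writing example, the distributed field data might then
be written once per output interval, and the file closed at the end of
the run. This is a sketch: nsteps is a hypothetical loop count, and
tstamp is assumed to be the model time in the units declared for the
record axis t.
do n = 1,nsteps
   ! ...compute a and advance tstamp...
   call mpp_write( unit, field, domain(pe), a, tstamp )
enddo
call mpp_close(unit)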
A typical calling sequence for reading data might be:
...
integer :: unit, natt, nvar, ntime
type(domain2D), dimension(:), allocatable, target :: domain
type(fieldtype), allocatable, dimension(:) :: fields
type(atttype), allocatable, dimension(:) :: global_atts
real, allocatable, dimension(:) :: times
...
call mpp_define_domains( (/1,nx,1,ny/), domain )
call mpp_read_meta(unit)
call mpp_get_info(unit,natt,nvar,ntime)
allocate(global_atts(natt))
call mpp_get_atts(unit,global_atts)
allocate(fields(nvar))
call mpp_get_vars(unit, fields)
allocate(times(ntime))
call mpp_get_times(unit, times)
allocate( a(domain(pe)%x%data%start_index:domain(pe)%x%data%end_index, &
            domain(pe)%y%data%start_index:domain(pe)%y%data%end_index,nz) )
...
do i=1, nvar
   if (fields(i)%name == 'a') &
      call mpp_read(unit, fields(i), domain(pe), a, tindex)
enddo
...
In this example, the data are distributed as in the previous
example. The call to mpp_read_meta initializes
all of the metadata associated with the file, including global
attributes, variable attributes and non-record dimension data. The
call to mpp_get_info returns the number of global
attributes (natt), variables (nvar) and
time levels (ntime) associated with the file
identified by a unique ID (unit).
mpp_get_atts returns all global attributes for
the file in an array of derived type atttype of length natt.
mpp_get_vars returns the variable types in an array of
fieldtype of length nvar. Since the record dimension data are not
allocated for calls to mpp_write, a separate call to
mpp_get_times is required to access record dimension data. Subsequent
calls to mpp_read return the field data arrays corresponding to
the fieldtype. The domain type is an optional
argument. If domain is omitted, the incoming field
array should be dimensioned for the global domain; otherwise, the
field data are assigned to the computational domain of a local array.
Multi-fileset reads are not supported with mpp_read.
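To complete the reading example, successive time levels of the field
might be read by passing the record index explicitly, and the file
closed when done. This is a sketch; it assumes the optional tindex
argument of mpp_read described in its call syntax below.
do i=1, nvar
   if (fields(i)%name == 'a') then
      do n = 1,ntime
         call mpp_read( unit, fields(i), domain(pe), a, tindex=n )
         ! ...process a for time level times(n)...
      enddo
   endif
enddo
call mpp_close(unit)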