The data transfer between a processor and its own memory is based on load and store operations upon memory. Shared-memory systems (including distributed shared-memory systems) have a single address space, and any processor can acquire any data within the memory by load and store. The situation is different for distributed parallel systems. Specialized MPP systems such as the T3E can simulate shared memory by direct data acquisition from remote memory. But if the parallel code is distributed across a cluster, or across the Net, messages must be sent and received using the protocols for long-distance communication, such as TCP/IP. This requires a ``handshaking'' between nodes of the distributed system. One can think of the two different methods as involving puts or gets (e.g. the SHMEM library), or, in the case of negotiated communication (e.g. MPI), sends and recvs.
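To make the distinction concrete, the following is a minimal sketch (not part of mpp_mod) of a two-sided exchange under MPI in Fortran: the transfer completes only after the sender posts a send and the receiver posts the matching receive, which is the ``handshaking'' referred to above. A one-sided put or get, by contrast, names the remote PE directly and requires no matching call on that PE.

    program two_sided_sketch
      use mpi
      implicit none
      integer :: pe, npes, ierr, status(MPI_STATUS_SIZE)
      real    :: a(100)

      call MPI_Init(ierr)
      call MPI_Comm_rank( MPI_COMM_WORLD, pe, ierr )
      call MPI_Comm_size( MPI_COMM_WORLD, npes, ierr )

      a = real(pe)
      if( pe.EQ.0 )then
          !PE 0 posts the send...
          call MPI_Send( a, 100, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr )
      else if( pe.EQ.1 )then
          !...and the transfer completes only when PE 1 posts the matching recv
          call MPI_Recv( a, 100, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr )
      end if

      call MPI_Finalize(ierr)
    end program two_sided_sketch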
The difference between SHMEM and MPI is that SHMEM uses one-sided communication, which can have very low-latency, high-bandwidth implementations on tightly coupled systems. MPI is a standard developed for distributed computing across loosely coupled systems, and therefore incurs a software penalty for negotiating the communication. It is, however, an open industry standard, whereas SHMEM is a proprietary interface. Besides, the puts and gets on which SHMEM is based cannot currently be implemented in a cluster environment (though there are recent announcements from Compaq that occasion hope).
The message-passing requirements of climate and weather codes can be reduced to a fairly simple minimal set, which is easily implemented in any message-passing API. mpp_mod provides this API. Features of mpp_mod include:
1) A simple, minimal API, with free access to the underlying API for more complicated operations.
2) A design oriented toward typical use in climate/weather CFD codes.
3) Performance not significantly lower than that of any native API.
This module is used to develop higher-level calls for domain decomposition and parallel I/O.
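A sketch of typical usage follows (interfaces abbreviated; the exact argument lists are documented with the individual calls below):

    program mpp_usage_sketch
      use mpp_mod, only: mpp_init, mpp_exit, mpp_pe, mpp_npes, mpp_root_pe
      implicit none
      integer :: pe, npes

      call mpp_init()                  !called by all PEs before any other mpp call
      pe   = mpp_pe()                  !this PE's ID: differs on each execution stream
      npes = mpp_npes()                !number of PEs in the current pelist
      if( pe.EQ.mpp_root_pe() )write(*,*) 'Running on ', npes, ' PEs.'
      call mpp_exit()                  !also called by all PEs
    end program mpp_usage_sketch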
Parallel computing is initially daunting, but it soon becomes
second nature, much the way many of us can now write vector code
without much effort. The key insight required while reading and
writing parallel code is in arriving at a mental grasp of several
independent parallel execution streams through the same code (the SPMD
model). Each variable you examine may have different values for each
stream, the processor ID being an obvious example. Subroutines and
function calls are particularly subtle, since it is not always obvious
from looking at a call what synchronization between execution streams
it implies. An example of erroneous code would be a global barrier call (see mpp_sync below) placed within a code block that not all PEs will execute, e.g.:
if( pe.EQ.0 )call mpp_sync()
Here only PE 0 reaches the barrier, where it will wait
indefinitely. While this is a particularly egregious example to
illustrate the coding flaw, more subtle versions of the same are
among the most common errors in parallel code.
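A corrected sketch of the above: the barrier is placed where every PE will reach it, and only the conditional work is restricted to PE 0 (do_root_work here is a hypothetical user routine):

    if( pe.EQ.0 )call do_root_work()    !only PE 0 does the extra work (hypothetical routine)
    call mpp_sync()                     !every PE reaches the barrier, so none waits forever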
It is therefore important to be conscious of the context of a subroutine or function call, and the implied synchronization. There are certain calls here (e.g. mpp_declare_pelist, mpp_init, mpp_malloc, mpp_set_stack_size) which must be called by all PEs. There are others which must be called by all the PEs in a given subset of PEs (here called a pelist), e.g. mpp_max, mpp_sum, mpp_sync. Still others imply no synchronization at all. I will make every effort to highlight the context of each call in the MPP modules, so that the implicit synchronization is spelt out.
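For example, mpp_sum implies synchronization across the pelist: every PE in the pelist must make the call, and each receives the reduced result. A sketch (scalar form assumed; see the mpp_sum documentation for the full interface):

    real :: a
    a = real( mpp_pe() )      !each PE contributes its own partial value
    call mpp_sum( a )         !collective over the current pelist; all PEs get the total
    !If any PE in the pelist skips this call, the others will hang, just as with mpp_sync.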
For performance it is necessary to keep synchronization as limited as the algorithm being implemented will allow. For instance, a single message between two PEs should only imply synchronization across the PEs in question. A global synchronization (or barrier) is likely to be slow, and is best avoided. But codes first parallelized on a Cray T3E tend to have many global syncs, as very fast barriers were implemented there in hardware.
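As an illustration, here is a sketch of a point-to-point transfer that delays only the two PEs involved (the argument order data, length, remote PE is assumed; mpp_sync_self is used to wait for completion of this PE's own sends):

    real :: buf(100)
    if( pe.EQ.0 )then
        call mpp_send( buf, 100, 1 )    !send 100 words to PE 1
        call mpp_sync_self()            !wait only for our own outstanding sends
    else if( pe.EQ.1 )then
        call mpp_recv( buf, 100, 0 )    !receive 100 words from PE 0
    end if
    !No other PE is held up: a global mpp_sync() here would be wasteful.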
Another reason to use pelists is to run a single program in MPMD mode, where different PE subsets work on different portions of the code. A typical example is to assign an ocean model and an atmosphere model to different PE subsets, and couple them concurrently instead of running them serially. The MPP module provides the notion of a current pelist, which is set when a group of PEs branch off into a subset. Subsequent calls that omit the pelist optional argument (seen below in many of the individual calls) assume that the implied synchronization is across the current pelist. The calls mpp_root_pe and mpp_npes also return the values appropriate to the current pelist. The mpp_set_current_pelist call is provided to set the current pelist.
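A sketch of such an MPMD split follows. The 50/50 division, the PE numbering from 0, and the component drivers are illustrative assumptions, and calling mpp_set_current_pelist with no argument is assumed to revert to the full pelist:

    integer :: i, pe, npes
    integer, allocatable :: ocean_pelist(:), atmos_pelist(:)

    pe   = mpp_pe()
    npes = mpp_npes()
    allocate( ocean_pelist(npes/2), atmos_pelist(npes-npes/2) )
    ocean_pelist = (/ (i, i=0,npes/2-1) /)             !first half of the PEs
    atmos_pelist = (/ (i, i=npes/2,npes-1) /)          !second half

    call mpp_declare_pelist( ocean_pelist )            !declared by ALL PEs, not just members
    call mpp_declare_pelist( atmos_pelist )

    if( ANY(ocean_pelist.EQ.pe) )then
        call mpp_set_current_pelist( ocean_pelist )    !collectives now span ocean PEs only
        !call ocean_model()                            !hypothetical component driver
    else
        call mpp_set_current_pelist( atmos_pelist )
        !call atmos_model()                            !hypothetical component driver
    end if
    call mpp_set_current_pelist()                      !revert to the full pelist

Any subsequent call that omits the pelist argument will then synchronize across whichever pelist is current.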