The data transfer between a processor and its own memory is based on load and store operations upon memory. Shared-memory systems (including distributed shared-memory systems) have a single address space, and any processor can acquire any data within the memory by load and store. The situation is different for distributed parallel systems. Specialized MPP systems such as the T3E can simulate shared memory by direct data acquisition from remote memory. But if the parallel code is distributed across a cluster, or across the Net, messages must be sent and received using the protocols for long-distance communication, such as TCP/IP. This requires a ``handshaking'' between nodes of the distributed system. One can think of the two different methods as involving puts or gets (e.g. the SHMEM library), or, in the case of negotiated communication (e.g. MPI), sends and recvs.
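To make the distinction concrete, the following is a minimal sketch (not part of mpp_mod) of a two-sided exchange under MPI in Fortran: the transfer completes only after the sender posts a send and the receiver posts the matching receive, which is the ``handshaking'' referred to above. A one-sided put or get, by contrast, names the remote PE directly and requires no matching call on that PE.

    program two_sided_sketch
      use mpi
      implicit none
      integer :: pe, npes, ierr, status(MPI_STATUS_SIZE)
      real    :: a(100)

      call MPI_Init(ierr)
      call MPI_Comm_rank( MPI_COMM_WORLD, pe, ierr )
      call MPI_Comm_size( MPI_COMM_WORLD, npes, ierr )

      a = real(pe)
      if( pe.EQ.0 )then
          !PE 0 posts the send...
          call MPI_Send( a, 100, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr )
      else if( pe.EQ.1 )then
          !...and the transfer completes only when PE 1 posts the matching recv
          call MPI_Recv( a, 100, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr )
      end if

      call MPI_Finalize(ierr)
    end program two_sided_sketch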
The difference between SHMEM and MPI is that SHMEM uses one-sided communication, which can have very low-latency, high-bandwidth implementations on tightly coupled systems. MPI is a standard developed for distributed computing across loosely coupled systems, and therefore incurs a software penalty for negotiating the communication. It is, however, an open industry standard, whereas SHMEM is a proprietary interface. Besides, the puts and gets on which SHMEM is based cannot currently be implemented in a cluster environment (though there are recent announcements from Compaq that occasion hope).
The message-passing requirements of climate and weather codes can be reduced to a fairly simple minimal set, which is easily implemented in any message-passing API. mpp_mod provides this API. Features of mpp_mod include:
1) A simple, minimal API, with free access to the underlying API for more complicated operations.
2) A design oriented toward typical use in climate/weather CFD codes.
3) Performance not significantly lower than that of any native API.
This module is used to develop higher-level calls for domain decomposition and parallel I/O.
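A sketch of typical usage follows (interfaces abbreviated; the exact argument lists are documented with the individual calls below):

    program mpp_usage_sketch
      use mpp_mod, only: mpp_init, mpp_exit, mpp_pe, mpp_npes, mpp_root_pe
      implicit none
      integer :: pe, npes

      call mpp_init()                  !called by all PEs before any other mpp call
      pe   = mpp_pe()                  !this PE's ID: differs on each execution stream
      npes = mpp_npes()                !number of PEs in the current pelist
      if( pe.EQ.mpp_root_pe() )write(*,*) 'Running on ', npes, ' PEs.'
      call mpp_exit()                  !also called by all PEs
    end program mpp_usage_sketch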
Parallel computing is initially daunting, but it soon becomes
second nature, much the way many of us can now write vector code
without much effort. The key insight required while reading and
writing parallel code is in arriving at a mental grasp of several
independent parallel execution streams through the same code (the SPMD
model). Each variable you examine may have different values for each
stream, the processor ID being an obvious example. Subroutines and
function calls are particularly subtle, since it is not always obvious
from looking at a call what synchronization between execution streams
it implies. An example of erroneous code would be a global barrier call (see mpp_sync below) placed within a code block that not all PEs will execute, e.g.:
if( pe.EQ.0 )call mpp_sync()
Here only PE 0 reaches the barrier, where it will wait
indefinitely. While this is a particularly egregious example to
illustrate the coding flaw, more subtle versions of the same are
among the most common errors in parallel code.
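A corrected sketch of the above: the barrier is placed where every PE will reach it, and only the conditional work is restricted to PE 0 (do_root_work here is a hypothetical user routine):

    if( pe.EQ.0 )call do_root_work()    !only PE 0 does the extra work (hypothetical routine)
    call mpp_sync()                     !every PE reaches the barrier, so none waits forever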
It is therefore important to be conscious of the context of a subroutine or function call, and the implied synchronization. There are certain calls here (e.g. mpp_declare_pelist, mpp_init, mpp_malloc, mpp_set_stack_size) which must be called by all PEs. There are others which must be called by all the PEs in a given subset of PEs (here called a pelist), e.g. mpp_max, mpp_sum, mpp_sync. Still others imply no synchronization at all. I will make every effort to highlight the context of each call in the MPP modules, so that the implicit synchronization is spelt out.
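For example, mpp_sum implies synchronization across the pelist: every PE in the pelist must make the call, and each receives the reduced result. A sketch (scalar form assumed; see the mpp_sum documentation for the full interface):

    real :: a
    a = real( mpp_pe() )      !each PE contributes its own partial value
    call mpp_sum( a )         !collective over the current pelist; all PEs get the total
    !If any PE in the pelist skips this call, the others will hang, just as with mpp_sync.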
For performance it is necessary to keep synchronization as limited as the algorithm being implemented will allow. For instance, a single message between two PEs should only imply synchronization across the PEs in question. A global synchronization (or barrier) is likely to be slow, and is best avoided. But codes first parallelized on a Cray T3E tend to have many global syncs, as very fast barriers were implemented there in hardware.
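As an illustration, here is a sketch of a point-to-point transfer that delays only the two PEs involved (the argument order data, length, remote PE is assumed; mpp_sync_self is used to wait for completion of this PE's own sends):

    real :: buf(100)
    if( pe.EQ.0 )then
        call mpp_send( buf, 100, 1 )    !send 100 words to PE 1
        call mpp_sync_self()            !wait only for our own outstanding sends
    else if( pe.EQ.1 )then
        call mpp_recv( buf, 100, 0 )    !receive 100 words from PE 0
    end if
    !No other PE is held up: a global mpp_sync() here would be wasteful.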
Another reason to use pelists is to run a single program in MPMD mode, where different PE subsets work on different portions of the code. A typical example is to assign an ocean model and an atmosphere model to different PE subsets, and couple them concurrently instead of running them serially. The MPP module provides the notion of a current pelist, which is set when a group of PEs branch off into a subset. Subsequent calls that omit the pelist optional argument (seen below in many of the individual calls) assume that the implied synchronization is across the current pelist. The calls mpp_root_pe and mpp_npes also return the values appropriate to the current pelist. The mpp_set_current_pelist call is provided to set the current pelist.
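A sketch of such an MPMD split follows. The 50/50 division, the PE numbering from 0, and the component drivers are illustrative assumptions, and calling mpp_set_current_pelist with no argument is assumed to revert to the full pelist:

    integer :: i, pe, npes
    integer, allocatable :: ocean_pelist(:), atmos_pelist(:)

    pe   = mpp_pe()
    npes = mpp_npes()
    allocate( ocean_pelist(npes/2), atmos_pelist(npes-npes/2) )
    ocean_pelist = (/ (i, i=0,npes/2-1) /)             !first half of the PEs
    atmos_pelist = (/ (i, i=npes/2,npes-1) /)          !second half

    call mpp_declare_pelist( ocean_pelist )            !declared by ALL PEs, not just members
    call mpp_declare_pelist( atmos_pelist )

    if( ANY(ocean_pelist.EQ.pe) )then
        call mpp_set_current_pelist( ocean_pelist )    !collectives now span ocean PEs only
        !call ocean_model()                            !hypothetical component driver
    else
        call mpp_set_current_pelist( atmos_pelist )
        !call atmos_model()                            !hypothetical component driver
    end if
    call mpp_set_current_pelist()                      !revert to the full pelist

Any subsequent call that omits the pelist argument will then synchronize across whichever pelist is current.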