Next: 1.5 Summary Up: 1 Parallel Computers and Computation Previous: 1.3 A Parallel Programming Model

1.4 Parallel Algorithm Examples

We conclude this chapter by presenting four examples of parallel algorithms. We do not concern ourselves here with the process by which these algorithms are derived or with their efficiency; these issues are discussed in Chapters 2 and 3, respectively. The goal is simply to introduce parallel algorithms and their description in terms of tasks and channels.

The first two algorithms described have an SPMD structure, the third creates tasks dynamically during program execution, and the fourth uses a fixed number of tasks but has different tasks perform different functions.

1.4.1 Finite Differences

Figure 1.11: A parallel algorithm for the one-dimensional finite difference problem. From top to bottom: the one-dimensional vector X , where N=8 ; the task structure, showing the 8 tasks, each encapsulating a single data value and connected to left and right neighbors via channels; and the structure of a single task, showing its two inports and outports.

We first consider a one-dimensional finite difference problem, in which we have a vector of size N and must compute , where

That is, we must repeatedly update each element of X , with no element being updated in step t+1 until its neighbors have been updated in step t .

A parallel algorithm for this problem creates N tasks, one for each point in X . The i th task is given the value and is responsible for computing, in T steps, the values . Hence, at step t , it must obtain the values and from tasks i-1 and i+1 . We specify this data transfer by defining channels that link each task with ``left'' and ``right'' neighbors, as shown in Figure 1.11, and requiring that at step t , each task i other than task 0 and task N-1

sends its data on its left and right outports,
receives and from its left and right inports, and
uses these values to compute .

Notice that the N tasks can execute independently, with the only constraint on execution order being the synchronization enforced by the receive operations. This synchronization ensures that no data value is updated at step t+1 until the data values in neighboring tasks have been updated at step t . Hence, execution is deterministic.

1.4.2 Pairwise Interactions

Figure 1.12: Task structures for computing pairwise interactions for N=5 . (a) The unidirectional ring used in the simple, nonsymmetric algorithm. (b) The unidirectional ring with additional channels used to return accumulated values in the symmetric algorithm; the path taken by the accumulator used for task 0 is shown as a solid line.

Our second example uses a similar channel structure but requires a more complex communication algorithm. Many problems require the computation of all N(N-1) pairwise interactions , , between N data, . Interactions may be symmetric, in which case and only N(N-1)/2 interactions need be computed. For example, in molecular dynamics we may require the total force vector acting on each atom , defined as follows:

Each atom is represented by its mass and Cartesian coordinates. denotes the mutual attraction or repulsion between atoms and ; in this example, , so interactions are symmetric.

A simple parallel algorithm for the general pairwise interactions problem might create N tasks. Task i is given the datum and is responsible for computing the interactions . One might think that as each task needs a datum from every other task, N(N-1) channels would be needed to perform the necessary communications. However, a more economical structure is possible that uses only N channels. These channels are used to connect the N tasks in a unidirectional ring (Figure 1.12(a)). Hence, each task has one inport and one outport. Each task first initializes both a buffer (with the value of its local datum) and an accumulator that will maintain the result of the computation. It then repeatedly

sends the value contained in its buffer on its outport,
receives a datum on its inport into its buffer,
computes the interaction between this datum and its local datum, and
uses the computed interaction to update its local accumulator.

This send-receive-compute cycle is repeated N-1 times, causing the N data to flow around the ring. Every task sees every datum and is able to compute all N-1 interactions involving its datum. The algorithm involves N-1 communications per task.

It turns out that if interactions are symmetric, we can halve both the number of interactions computed and the number of communications by refining the communication structure. Assume for simplicity that N is odd. An additional N communication channels are created, linking each task to the task offset around the ring (Figure 1.12(b)). Each time an interaction is computed between a local datum and an incoming datum , this value is accumulated not only in the accumulator for but also in another accumulator that is circulated with . After steps, the accumulators associated with the circulated values are returned to their home task using the new channels and combined with the local accumulators. Hence, each symmetric interaction is computed only once: either as on the node that holds or as on the node that holds .

1.4.3 Search

The next example illustrates the dynamic creation of tasks and channels during program execution. Algorithm 1.1 explores a search tree looking for nodes that correspond to ``solutions.'' A parallel algorithm for this problem can be structured as follows. Initially, a single task is created for the root of the tree. A task evaluates its node and then, if that node is not a solution, creates a new task for each search call (subtree). A channel created for each new task is used to return to the new task's parent any solutions located in its subtree. Hence, new tasks and channels are created in a wavefront as the search progresses down the search tree (Figure 1.13).

Figure 1.13: Task structure for the search example. Each circle represents a node in the search tree and hence a call to the search procedure. A task is created for each node in the tree as it is explored. At any one time, some tasks are actively engaged in expanding the tree further (these are shaded in the figure); others have reached solution nodes and are terminating, or are waiting for their offspring to report back with solutions. The lines represent the channels used to return solutions.

1.4.4 Parameter Study

In so-called embarrassingly parallel problems, a computation consists of a number of tasks that can execute more or less independently, without communication. These problems are usually easy to adapt for parallel execution. An example is a parameter study, in which the same computation must be performed using a range of different input parameters. The parameter values are read from an input file, and the results of the different computations are written to an output file.

Figure 1.14: Task structure for parameter study problem. Workers (W) request parameters from the input task (I) and send results to the output task (O). Note the many-to-one connections: this program is nondeterministic in that the input and output tasks receive data from workers in whatever order the data are generated. Reply channels, represented as dashed lines, are used to communicate parameters from the input task to workers.

If the execution time per problem is constant and each processor has the same computational power, then it suffices to partition available problems into equal-sized sets and allocate one such set to each processor. In other situations, we may choose to use the task structure illustrated in Figure 1.14. The input and output tasks are responsible for reading and writing the input and output files, respectively. Each worker task (typically one per processor) repeatedly requests parameter values from the input task, computes using these values, and sends results to the output task. Because execution times vary, the input and output tasks cannot expect to receive messages from the various workers in any particular order. Instead, a many-to-one communication structure is used that allows them to receive messages from the various workers in arrival order.

The input task responds to a worker request by sending a parameter to that worker. Hence, a worker that has sent a request to the input task simply waits for the parameter to arrive on its reply channel. In some cases, efficiency can be improved by prefetching , that is, requesting the next parameter before it is needed. The worker can then perform computation while its request is being processed by the input task.

Because this program uses many-to-one communication structures, the order in which computations are performed is not necessarily determined. However, this nondeterminism affects only the allocation of problems to workers and the ordering of results in the output file, not the actual results computed.

Next: 1.5 Summary Up: 1 Parallel Computers and Computation Previous: 1.3 A Parallel Programming Model