next up previous contents
Next: Visualizing the Performance Scaling Up: Estimating the Speedup: Amdahl's Previous: Amdahl's Law   Contents

Better Estimates for the Speedup

Of course, reality is generally worse than the highly optimistic upper bound predicted by Amdahl's Law4.10. After all, imagine how long it would take you to actually build that model airplane if the entire population of the world WERE in your kitchen trying to help. ``Forever'' might be a reasonable answer as you (and your kitchen) are crushed beneath the weight of all that help. ``Forever'' is the generally correct answer for parallel processing, too.

There are lots of ways to divvy up the work, and the process of divvying up the work itself takes time. It is also entirely possible to reach fundamental limitations on how far you can subdivide a finite task made of discrete parts (imagine a billion hands on one model airplane that has, after all, only forty or fifty parts that are typically connected by glue bonds made between two objects at a time). There are technical details concerning the order in which subtasks have to be completed that can prevent the parallel work from being cleanly divisible among nodes4.11. And so on. The following analysis works through a few of the simpler and more obvious corrections that we have to consider when parallelizing a task.

One way of looking at all of these corrections is that accounting for all the extra time spent setting up and executing a parallelized task changes the serial and parallel fractions! We might expect to see new terms being added to or taken away from the $S$ fraction. Alternatively, thinking about the time it takes to complete the various chores in a team effort, we left out a bunch of times that may well be important. Indeed, in many cases (or scales) these additional times may be dominant - the most important thing that determines the rate or relative speedup.

To understand some of the times we continue with our example of building a model airplane. Note that in our original speedup estimate we ignored the time that it took you to give your friend the wing part of the kit and take back the completed wing. Well, if it takes your friend twenty minutes to build a wing and twenty seconds total to receive the parts and hand back a finished wing, that's probably ok. Your final time estimate is twenty seconds longer than you thought, but compared to twenty minutes that's not too big an error.

What if your friend lives next door and is not in the kitchen with you? You have to get up, take him the wing parts and a tube of glue, come back, and go to work. As soon as he finishes, he has to get up and bring you the finished wing. Now it might take five minutes to take him the wing parts and five minutes to come back and get to work and suddenly you've spent ten minutes setting up a parallel process designed to save you twenty minutes. That ten minute net savings might also be relative to an hour's total labor if you did it all yourself. Not too good.

At least you're still in the black - you finish your airplane faster than you would have without your friend. On the other hand, if he lived on the other side of town (a ten minute drive either way) it would take you much longer to complete the airplane with his help than it would without it. Whoa! We can actually lose ground and slow a program down by parallelizing it! Amdahl's Law (which at least permitted all speedups strictly greater than 1 for any value of $P$) is way too optimistic.

We've discovered a couple of new ideas in our corrected model airplane example, and we'll now proceed to incorporate them into our algebraic discussion of times and rates and speedups and such. The most important is the analogy of ``Inter Processor Communications'' (IPCs) - this is the communications step where one processor (you) sends part of the program and/or data (the wing parts and glue and instructions for assembly) to another processor (your friend), and later get back a finished wing. In all parallel code, SOME sort of IPC's are necessary. They can take a long time compared to the time actually required to do the parallel work or they can take a short time compared to the time to do the parallel work.

In some very important cases this communication process can be done only one or twice and may take a very short time compared to the time working in parallel, for example at the beginning and the end of a calculation. In all cases, though, the program itself and initial data has to somehow get to the processors working in parallel and the results of their parallel work have to be reunited into some finished whole. IPC's are definitely essential to the notion of parallel processes.

And4.12they take time. And4.13 for us to correctly estimate the speedup of a parallelized program, we have to insert the time required, since it may well be significant.

In fact, if we think about it, it costs us time at least TWICE, in fundamentally different ways. Parallelizing a program, we generally increase by a bit the time required to complete the serial fraction. If we have $P$ friends waiting to take various airplane parts that are originally in our possession and build them, we may well have to carry the parts and instructions and glue to each one, one after the other. The more friends, the longer this takes. Overall, this time thus scales like the number of friends (or worse) and, if you are also responsible for doing the serial work it adds directly to the serial time because you're not working on the airplane at all while you're carrying parts around and collecting finished sub-assemblies.

We also have to increase the time required to execute the parallel task on each node a bit over what it would have taken serially. It takes you a certain amount of time (already accounted for in the original serial time estimate) to put newspapers on the table4.14, get the glue open4.15, and read the overall instructions - your friend also needs to spread his own newspapers on his own table 4.16, open the glue, and read the general instructions for himself before he can get started on building his wing from the wing-specific instructions. It might take him a few minutes to do these things and we have to increase the time it takes for ``wing building in parallel by a friend'' over the time it takes for ``building each wing as part of sequentially building the whole airplane yourself''. This time usually does not scale up with the number of friends helping as they can all be spreading newspaper, etc. at the same time. At the same time, it doesn't scale down - your friends can't set up their worksites ``in parallel'' in less time than it takes to set up a worksite.

Finally, you and your friend will almost certainly have to wait for each other from time to time, as already illustrated above. If you finish one part that gets glued to a part he's working on as the next step, you have to wait for him. Or, you may well be sharing resources. You may have to wait while he uses the glue, for example, and a bit later he may have to wait for you to give it back. In the meantime, you each may have to wait idle, although this often depends on how the task is organized.

All this introduces new times and fractions and rates into our earlier statement of Amdahl's Law. Here's a table to help you keep track of this menagerie of variables:

${\bf T_s}$ The original single-processor serial time.
${\bf T_{is}}$ The (average) additional serial time spent doing things like IPC's, setup, and so forth, per processor, in all parallelized tasks.
${\bf T_p}$ The original single-processor parallizable time.
${\bf T_{ip}}$ The (average) additional time spent by each processor doing just the setup and work that it does in parallel. This may well include idle time, which is often important enough to be accounted for separately.
The $i$ subscript just reminds us that in many cases the bulk of $T_{is}$ or $T_{ip}$ are due to the burden of either IPC's or idle time, although a lot of $T_{ip}$ can also come from local subtask setup and other non-communications overhead. Note that this is a simplified description of the times - in many cases practical discussions of parallel task design split times a bit more finely and specifically and separately count communication, computation, and idle time for all processors.

Using these definitions, we can write our modified task completion time when using $P$ processors:

T_{\rm tot}(P) = T_s + P*T_{is} + T_p/P + T_{ip}.
\end{displaymath} (4.8)

In English (for the equation impaired): ``The total time required to complete a task that is parallelized on $P$ nodes is the sum of the original serial time, the average additional serial time per node times the number of nodes, the original parallelizable time divided by the number of nodes doing the parallized work, and the average additional time spent on each node to do its piece of the parallelized task''. Nothin' to it. We can do this.

To find the Amdahlian4.17speedup, we again have to evaluate $R(P)/R(1)$. This time, being lazy, we'll note that the $W$ always cancels so we're really just evaluating $(T_s + T_p)/T_{\rm tot}(P)$:

\frac{R(P)}{R(1)} = \frac{T_s + T_p}{T_s + P*T_{is} + T_p/P +
\end{displaymath} (4.9)

Now, if we were to be picky (and let's be, just this once) this result, however useful and marvelous, is still way too general (and hence incorrect) to be truly useful and marvelous. We are in a bit of a quandry, though. Every time we add a bit of detail, our speedup expression gets a bit more complex. This cannot be helped, it can only be understood, however much work it takes to understand it for your particular numerical task. In many cases, this expression will suffice to get at least a general feel for the scaling properties of a task that might be parallelized on a typical beowulf. In others, it won't, and you'll have to work much harder.

As just a single (but very important) example of the latter, it is well-known that certain numerical tasks require a pairwise exchange of information between all nodes between parallel steps. Each pairwise communication might take a time $T_c$ (where I have no idea what the $c$ subscript stands for, but it is different from $i,s$, and $p$). If there are $P$ nodes and each node can thus talk to $P-1$ other nodes and we make the unhappy assumption that they are connected by a hub that permits only one pair to talk in one direction at a time (no broadcasting allowed), we find that the total serial IPC step requires $P(P-1)$ individual pairwise communications each costing $T_c$, or

T_{c,{\rm tot}} = P(P-1)T_c = (P^2 - P)T_c
\end{displaymath} (4.10)

which scales quadratically in the number of nodes, not linearly!

This, alas, goes in the denominator of our relative rate expression, which now contains terms with powers of $P$ that range from -1 to 2. Oooo4.18. Suddenly:

\frac{R(P)}{R(1)} = \frac{T_s + T_p}{T_s + P*T_{is} + T_p/P + T_{ip}
+ (P^2 - P)T_c)}
\end{displaymath} (4.11)

and I can just hear the math-challenged among you starting to whimper4.19

This is by no means the only kind of scaling behavior possible. If $N$ represents your problem ``size'' (for example, the length of a lattice side or the number of items in a list to be sorted) then the work being split up can depend strongly on $N$ as well. There may well be (work and/or communication) times that scale like $N$, other times that scale like $N^2$, and so forth. If you parallelize the part that scales like $N$ but not the part that scales like $N^2$, you might get decent speedup for smallish $N$ but at some value of $N$ the $N^2$ will overwhelm it.

Unfortunately, we're just getting warmed up. Imagine what we might have to write if we account for all of the times spent in all the parallelized subroutines of a complex piece of code, including the effects of nonlinear determinants like cache size, memory speed, memory size, context switching, communications speed, communications latency, communications pattern, the need for points of parallel code resynchronization (called ``barriers''), and a whole lot more4.20. Each little piece cranks a new term into our relative rate equation, or modifies a term that is already there.

The final insult is that all of these times are totally algorithm dependent and completely different algorithms are often ``best'' for parallel computations than for serial computations. There are Clever Tricks that can often be used to change the communications pattern and that produce quite different scalings of the communication times and idle times. I told you that this sort of thing carries over into the realm of Real Computer Science$_{\rm tm}$ really quickly. Trying to calculate all these terms and times, in detail, a priori for complex pieces of code is well-nigh impossible, and few beowulfers (or other kinds of parallel computer programmers) do it. Real Computer Scientists do, it appears (often less for the result than for the papers they can publish describing the result), and bless them for it since then we don't have to.

However, there is no escaping the need to perform a few basic and practical steps. The Wise Beowulfer will determine the $P$-scaling and $N$-scaling of the times of at least the most important blocks of parallelizable code in your task(s) (hopefully, your task will fall into some ``generic'' category discussed in detail below, but you at least have to identify the category). The Wise Beowulfer will at least think about the interaction between your hardware and networking design and algorithm and these powers and times.

Let me conclude this chapter by showing you why you should bother with a couple of practical - if somewhat contrived - examples.

Suppose you are charged with building a beowulf to carry out an ``all pairs communicate every step'' task like the one described above that led to the (somewhat naive) $P(P-1)T_c$ time scaling for ``all pairs one at a time'' hub based communications. Studying the problem, you rapidly learn that you can achieve the same result (all pairs exchanging data) in $(P-1)T_c$ time if a switch that can support $P/2$ simultaneous bidirectional communications is used instead of a hub (and $P$ is even). Or you could try using a broadcast on the hub (if your software parallel communications library supports a true hardware broadcast, a thing that might require a prototype to validate since some libraries might implement their group ``broadcast'' function as a series of pairwise communications to the hosts in the group list), which would yield $P*T_c$ time. You also must consider that $T_c$ might be ten times smaller for a 100 Mbps hub than for a 10 Mbps switch that costs about the same amount. Then there is a 100 Mbps switch to consider, which costs a bit more. Then, we haven't really considered the algorithm. Depending on the message sizes, the latency, and a few very esoteric things, there are algorithms that might reduce the $P^2$ to (e.g.) $P\log P$ even for a hub (or parallel library) with no broadcast. Finally, we haven't worried about how to program a synchronized transfer like the one required to obtain optimum time through a switch. All the nodes have to be talking at the same time and to just the right node, in both directions, to get the improved node scaling, and this is not at all easy to arrange.

Your job, your future, your health and your happiness all ride on making the right choice here. Which one do you buy?

One is tempted to say ``the 100 Mbps switch'', and for many problems (possibly most) this would be the correct answer. That's why it is in the ``recipe'' given at the very beginning. It is relatively cheap and adequate to give decent scaling (both power and base time) for many problems. However, if you can only afford eight nodes, and you're doing a problem where $8*7*T_c \ll T_p/8$ for a 10 Mbps hub, the correct answer might be to get the cheapest damn thing you can lay your hands on. There are times when the answer could even be ``who needs a hub'' - early PC-based ``beowulfish'' computer efforts not infrequently involved a floppy disk carried around to a bunch of computers by hand4.21. Pop in the floppy, merge the data, carry it around, and when it is all shared start another week-long run on all the computers involved.

In quite a few cases, though, the correct answer would be a far more expensive gigabit (per second) switch, like Myrinet or perhaps gigabit ethernet. If you want to be able to scale up to ``lots'' of nodes (say 64 or more) it may be crucial to reduce $T_c$ by an order of magnitude and obtain the most favorable $P$-scaling!

This illustrates just a few of the realities confronting the would-be beowulf designer. In some cases you can derive an approximate scaling form for the relative rate equation appropriate for your particular algorithm and communication pattern. In other cases (most cases, really) you are far better off looking it up (along with a whole lot of other stuff) in a real book on parallel computation4.22. The reason you should consider looking things up is because there are smart and stupid ways to do simple little things like multiply matrices in parallel or send a message between all nodes in between program steps. Some of them are very non-intuitive - you'll never invent them on your own4.23. This is what Computer Scientists (the real variety) live for. C'mon, give them their moment of glory.

Once you have the scaling form of the relative speedup appropriate to your algorithm and the various network media types you are considering you can use it (and some measurements) to make estimates for the speedup possible for a parallelized chunk of code. This is less difficult than it sounds. In practice, all this mathematical work isn't so daunting - usually most of the parallelizable work is done in just a few program blocks and all the surrounding serial code can be added up at once into the irreducible serial work and the irreducible serial time. In a lot of cases, in fact, there will be just one parallelizable block. In the best cases the whole program can be converted to a parallel block, where the only required serial code is something to start a lot of programs in parallel and collect the results. These are called ``embarrassingly4.24 parallel computations''.

Although embarrassingly parallel computations are important enough to be given their own acronym (EPC) and to be considered in detail later, we'll take a moment to think about them here as well as they have an important lesson for us to learn before we leave the discussion of mathematical estimates of rates and so forth behind. To understand them we can return to Our Favorite Metaphor by thinking of building lots of identical model airplanes with our friends. One can get great parallel efficiency (another way of saying ``a speedup like $R(P)/R(1)
\approx P$'') by just getting $P$ friends into a room4.25 and giving each one their own kit and glue. If it takes you ten minutes to distribute 100 kits, and your ``nodes'' build 100 airplanes in one hour more, you've built 100 airplanes in and hour and ten minutes, for a speedup a hair less than 100. Not bad compared to the 100 hours it would have taken you to build them all one at a time, and you didn't even need to get glue on your own fingers. If you have 1000 friends4.26 and can still distribute all the kits in an hour or so (good luck), your gains get even better.

The model airplane construction, in this case, is being run as an embarrassingly parallel task. Now you know what the phrase means. One processor starts $P$ essentially identical jobs (on other processors) all at once, then kicks back for a relatively long time (perhaps sipping a metaphorical brew or two, perhaps doing a job itself) until they all complete, and then collecting the results. Repeat until finished, with near perfect scaling. Technically, we've arranged things so that $T_s$, $P*T_{is}$, and $T_{ip}$ are all much much less than $T_p/P$ (with no particularly strong additional constraints on the way tasks are started or finish) so that $R(P)/R(1)
\approx P$ as required. This is the way a compute cluster of nearly any sort can be used to get fabulous amounts of work done in parallel. Later we'll talk about the SETI project and how to turn the entire internet into a cluster supercomputer.

This embarrassingly parallel example also gives us a hint of how to improve our speedup for parallel operation, all things being equal. Suppose we can distribute one airplane kit per minute and need to build ten airplanes. Suppose it also takes only one minute to build an airplane one at a time (perhaps they are the cheap balsa ones your dentist gives to your kids as a ``reward'' for not biting her fingers). Hmmm, pretty lousy gain4.27. Now, think about the speedup if it takes two minutes to build an airplane4.28. Better, but not spectacular. What about a hundred minutes4.29? Aha! In a lot of cases we can go from pretty shabby parallel speedup scaling to spectacular astounding parallel speedup scaling by just increasing the amount of parallel work done while, of course, keeping a lid on the additional serial fraction associated with doing the additional parallel work.

As a parenthetical aside (in a work that a cruel person might consider a huge conglomeration of parenthetical asides [some including nested parenthetical comments of their own] arranged non-parenthetically), one could also be tempted to reorganize the task completely from its serial arrangement by setting up an assembly line where each friend just adds one part to a model airplane and hands it to the next person in line. As Henry Ford discovered, such an arrangement requires considerably greater effort (and capital) to set up, but actually can allow the model airplanes to be completed even faster than in the embarrassingly coarse grained parallel implementation of the serial work by actively reducing the time required to complete the parallelized work.

Naturally, similar arrangements can occur in parallel programming, especially when considering the additional costs of e.g. flushing and reloading a cache or performing a context switch (which can make it more expensive to do a series of different things in parallel than to do one thing many times). We might even discuss a few later. This is one of several circumstances where Amdahl's Law might be wrong, or at least (as previously noted) irrelevant, as there is no useful analog of an assembly line in a non-parallel work situation.

Anyway, I've now completed most of the formal algebraic analysis that I'm going to do. That's the good news. The bad news is that I didn't even try to do a complete or detailed job of the formal analysis - I've only taught you enough4.30 that you should be able to figure out how to do what you need to do for your own particular task of beowulf design. If your task is complicated enough to be beyond the power of this simple analysis to elucidate (and isn't similar to one of the ones I consider in detail later) then I guess you'll have some work to do, including obtaining and learning from more advanced resources.

There is still one important step to complete before leaving scaling laws completely. Many of you probably looked at equations like (4.9) with the patient, somewhat quizzical expression that I might have if suddenly confronted with a pair of Tibetan monks asking directions to the nearest mall (in Tibetan, of course). I'm so glad you managed to hold on (out of sheer politeness, I'm sure). In the next section we'll actually show the pictures.

next up previous contents
Next: Visualizing the Performance Scaling Up: Estimating the Speedup: Amdahl's Previous: Amdahl's Law   Contents
Robert G. Brown 2004-05-24