If you're the sort of person who is thinking that all the algebraic
analysis was just great to read about, but so confusing, fear not.
There is indeed quite a lot to understand in all of those equations
above. For example, examining our basic parallel rate equation
(4.9), we see that if $T_{is}$ is nonzero it will ALWAYS prevent us
from profitably reaching arbitrarily large $P$
for a fixed amount of work.
What, you say? You can't see that even if $T_s$ goes to zero, the
$T_{is}P$ term diverges, and for some value of $P$ (which we could
easily find from calculus, if we hadn't taken a sacred oath not to put
any calculus in this discussion) the overall rate begins to degrade?
Well, then. Let's look at it. A figure is often worth a thousand
equations. A figure can launch a thousand equations. A figure in time
saves nine equations. Grown fat on a diet of equations, gotta watch my
figure. We'll therefore examine below a whole lot of figures generated
from equation (4.9) for various values of $T_p$, $T_{is}$, and so
forth.
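Since every figure that follows is nothing more than the rate equation
evaluated over a range of $P$, it may help to see the arithmetic
spelled out in code. The fragment below is a minimal sketch only: it
assumes that equation (4.9) has the form
$R(P) = 1/(T_s + T_{is}P + T_p/P + T_{ip})$ used in the discussion
above, and the particular values plugged into it are made up purely
for illustration, not taken from the figures.

\begin{verbatim}
/* speedup_sketch.c -- tabulate a parallel speedup curve.
 *
 * Sketch only: assumes the rate equation has the form
 *    R(P) = 1/(Ts + Tis*P + Tp/P + Tip)
 * and takes the serial baseline rate to be 1/(Ts + Tp).
 * The sample values of Ts, Tp, Tis, Tip are illustrative.
 */
#include <stdio.h>

double rate(double P, double Ts, double Tis, double Tp, double Tip)
{
   return 1.0 / (Ts + Tis * P + Tp / P + Tip);
}

int main(void)
{
   double Ts = 10.0, Tp = 1000.0, Tis = 1.0, Tip = 0.0;
   double serial_rate = 1.0 / (Ts + Tp);
   double best_s = 0.0;
   int P, best_p = 1;

   /* Print P and the speedup S(P) = R(P)/R(1), remembering the peak. */
   for (P = 1; P <= 128; P++) {
      double s = rate((double) P, Ts, Tis, Tp, Tip) / serial_rate;
      printf("%4d %10.3f\n", P, s);
      if (s > best_s) { best_s = s; best_p = P; }
   }
   printf("peak speedup %.3f at P = %d\n", best_s, best_p);
   return 0;
}
\end{verbatim}

Compile it with any C compiler, feed the two columns to your favorite
plotting program, and you can reproduce curves qualitatively like the
ones below for whatever combination of times you care about.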
Let's begin to get a feel for real-world speedup by plotting
(4.9) for various relative values of the times. I say relative
because everything is effectively scaled to fractions when one divides
by $(T_s + T_p)$ as shown above. As we'll see, the critical gain
parameter for parallel work is $T_p/T_{is}$. When $T_p/T_{is}$ is
large (compared to everything else), life could be good. When it is
not so large, in many cases we needn't bother building a beowulf at
all as it just won't be worth it.
In all the figures below, $T_s$ is held fixed (which sets our basic
scale, if you like) and $T_p$ takes on a range of values. In the first
three figures we just vary $T_{is}$ for fixed $T_{ip}$. Note that
$T_{ip}$ is rather boring, as it just adds a fixed amount to the time
regardless of $P$, but it can be important in marginal cases.
The first figure, with $T_{is}$ set to (effectively) zero, shows the
kind of scaling one sees when communication times are negligible
compared to computation. This set of curves (with increasing $T_p$
ascending on the figure) is roughly what one expects from Amdahl's
Law, which was derived with no consideration of IPC overhead. Note
that the dashed line in all figures is perfectly linear speedup. More
processors, more speed. Note also that we never get this over the
entire range of $P$, but we can often come close for small $P$.
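For reference, the saturation value these overhead-free curves
approach can be read off directly. Setting $T_{is} = T_{ip} = 0$ (a
simplification, using the presumed form of (4.9) discussed above),
\begin{displaymath}
S(P) = \frac{T_s + T_p}{T_s + T_p/P} \longrightarrow 1 + \frac{T_p}{T_s}
\quad \mbox{as } P \to \infty ,
\end{displaymath}
which is just Amdahl's Law: the curve bends away from the dashed line
as soon as $T_p/P$ stops dominating $T_s$.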
The second figure, with a small but nonzero $T_{is}$, shows a fairly
typical curve for a ``real'' beowulf. In it one can see the competing
effects of cranking up the parallel fraction ($T_p$ relative to $T_s$)
and can also see how even a relatively small serial communications
overhead causes the gain curves to peak well short of the saturation
predicted by Amdahl's Law in the first figure. Adding processors past
this point costs one speedup. In many cases this can occur for quite
small $P$. For many problems there is no point in trying to get
hundreds of processors, as one will never be able to use them.
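To see roughly where that peak falls without violating our no-calculus
oath, note that the denominator of the rate contains two competing
terms, $T_{is}P$ (which grows with $P$) and $T_p/P$ (which shrinks).
Their sum is smallest where they balance:
\begin{displaymath}
T_{is}P + \frac{T_p}{P} \;\ge\; 2\sqrt{T_{is}T_p},
\qquad \mbox{with equality when } P = \sqrt{T_p/T_{is}} .
\end{displaymath}
As a made-up numerical illustration (not one of the parameter sets
plotted), $T_p = 100$ and $T_{is} = 1$ put the peak in the
neighborhood of $P = 10$; fifty or a hundred processors thrown at such
a job buy nothing and eventually cost speed.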
The third figure is more of the same. Even with the largest $T_p$
shown, the relatively large $T_{is}$ causes the gain to peak well
before 128 processors.
Finally, the last figure repeats this, but this time with a quadratic
dependence of the additional serial overhead on $P$ (a $T_{is}P^2$
term in place of $T_{is}P$). This might result if the communications
required between processors are long range and constrained in some
way, as our ``all nodes communicate every step'' example above showed.
There are other ways to get nonlinear dependences of the additional
serial time on $P$, and they can have a profound effect on the
per-processor scaling of the speedup.
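The same balancing argument (hedged the same way, as an
order-of-magnitude estimate rather than an exact optimum) shows why a
quadratic term hurts so much more: a $T_{is}P^2$ overhead overtakes
the shrinking $T_p/P$ term when
\begin{displaymath}
T_{is}P^2 \sim \frac{T_p}{P}
\quad\Longrightarrow\quad
P \sim \left(\frac{T_p}{T_{is}}\right)^{1/3} ,
\end{displaymath}
a cube root instead of a square root, so the peak arrives at far fewer
processors for the same amount of parallel work.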
What to get from these figures? Relatively big $T_p$ is goooood.
Relatively big $T_{is}$ is baaaad. High powers of $P$ multiplying
things like $T_{is}$ or $T_{ip}$ are baaaaad. ``Real world'' speedup
curves typically have a peak, and there is no point in designing and
purchasing a beowulf with a hundred nodes if your calculation reaches
that scaling peak at ten nodes (and actually runs much, much slower on
a hundred). Simple stuff, really.
With these figures under your belt, you are now ready to start
estimating performance given various design decisions. To summarize
what we've learned, to get benefit from any beowulf or cluster
design we require a priori that $T_p > T_s$. If this isn't true
we are probably wasting our time. If $T_p$ is much bigger than $T_s$
(as often can be arranged in a real calculation) we are in luck,
but not out of the woods. Next we have to consider $T_{is}$ and/or
$T_{ip}$ and the power law(s) satisfied by the additional
$P$-dependent serial overhead for our algorithm, as a function of
problem size. If there are problem sizes where these communications
times are small compared to $T_p$, we can likely get good success from
a beowulf or cluster design in those ranges of problem size.
The point to emphasize is that these parameters are at least partially
under your control. In many cases you can change the ratio of $T_p$ to
$T_{is}$ by making the problem bigger. You can change $T_{is}$ by
using a faster communications medium. You can change the power scaling
of $P$ in the communications time by changing the topology, using a
switch, or sometimes just by rewriting the code to use a ``smarter''
algorithm.
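As a schematic illustration of the first of these levers (with $N$
standing for the linear size of the computational domain, a symbol
introduced here only for this example): in many domain-decomposed
codes the parallelizable work grows with the volume of the domain
while the data that must be communicated grows only with its surface,
\begin{displaymath}
T_p \propto N^3, \qquad T_{is} \propto N^2
\quad\Longrightarrow\quad
\frac{T_p}{T_{is}} \propto N ,
\end{displaymath}
so simply running a bigger problem on the same number of nodes pushes
the speedup peak out to larger $P$.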
These are some of the parameters we're going to juggle in beowulf design.
There are, however, additional parameters that we have to learn about. These parameters may or may not have nice, simple scaling laws associated with them. Many of them are extremely nonlinear or even discrete in their effect on your code. Truthfully, a lot of them affect even serial code in much the same way that they affect parallel code, but a parallel environment has the potential to amplify their effect and degrade performance far faster than one expects from the scaling curves above. These are the parameters associated with bottlenecks and barriers.