Bottlenecks

Your computer has lots and lots of ``moving'' parts^5.1. They are interlocked in strange and complicated ways. The work that they do for the central processing unit (CPU) proceeds at different rates. Sometimes the CPU has to wait on these processing subsystems. Sometimes the processing subsystems have to wait on the CPU.

You can see that I'm already using the CPU as the lowest common denominator of times and speeds in a system. This is generally the correct thing to do. The CPU clock is generally the fastest one available in the system, and although there is considerable variation in CPU architectures and just how fast they accomplish things, I'm going to assume that on a good day a ``typical'' CPU executes a single instruction in a single ``clock'' (cycle) (which is the inverse of the clock frequency). This won't always be true (some instructions may require several clocks, others may complete in parallel in a single clock.

Given CPU speeds that these days range from 300 MHz to 1 GHz, I'm going to assume that typical times required to execute ``an instruction'' range from 1 to 3 nanoseconds. Linux calls this the ``bogus'' rate of instruction execution and measures it as a given number of ``bogomips'' (millions of bogus instructions per second) early in the boot process. It is as good a measure of average processor speed as any other for the purposes of this chapter^5.2.

Most of the time the CPU (being the fastest gun in the west, so to speak) is the thing that ends up waiting. Whenever we have one part of the computer waiting on another to complete something before it can proceed, we have trouble. If the computer (CPU) is really only trying to do just one thing, it ends up twiddling its metaphorical thumbs until it can go forth and calculate again. If it has other things it can do it can try to improve the shining hour by doing them while it waits. Making all this work efficiently is what multitasking, multiuser operating systems are all about.

When we make an entire network of systems into a computer, this problem is amplified beyond belief. If any one CPU is slowed down for any reason (such as waiting on some resource) it can slow down the whole calculation distributed on all the nodes. Resources that are likely to be constrained and hence form rate-limiting features of a given calculation are generically called ``bottlenecks''. It's like having a four lane road that suddenly narrows down to just one lane - the ``neck'' - traffic often goes even slower than it might have if the road had been one lane all along^5.3.

Let's learn about some of the classic bottlenecks associated with computer calculations - the CPU-memory bus, the CPU's cache (size and speed), the disk and the network itself. We'll also think a bit about how a given parallelized program might have to be written to deal with bottlenecks and (in the next chapter) the related desynchronizing of lots of program elements that can occur if a job is asymmetrically distributed (where one node is faster than another node for whatever reason). The key thing to understand here is that it makes no sense to invest in a figurative ferrari to drive in bottlenecked traffic when kid in a pair of rusty rollerblades might well be able to make faster progress, especially when the money you spent on the ferrari could have been used to widen the road.

To understand bottlenecks on a computer system, we have to first learn the meaning of two words: latency and bandwidth. Latency is the delay between the instant a CPU requests a piece of information and the time the information starts to become available. The word ``starts'' is key here, as the information requested could be quite lengthy and take a long time to deliver. Bandwidth is the rate at which the information is actually delivered once the delivery has begun. Latency is measured in units of time (typically seconds to nanoseconds) while bandwidth is measured in units of memory size/time (typically megabytes/second).

To give you a really meaningful metaphor, latency is the time it takes from right now to go to the refrigerator and get a beer. Bandwidth is the rate at which (beer in hand) you drink it, as in one beer/hour, ten beers/hour, or so forth. Go on, experiment. Measure the latency and bandwidth of the beer transfer process. Think a bit about the tradeoffs between getting one beer at a time and only chugging it when getting back to your desk (paying the latency of a trip to the fridge over and over) versus getting a whole sixpack in a single trip.

Exciting concept isn't it? Hopefully by now the hangover induced by your experimentation last night has subsided and we can focus once again on our main topic which is beowulfery and not beer (however much they have in common^5.4 ^5.5).

To start with, let's (as is our habit) take a really obvious example so that you see why we're doing this. Suppose that you have a cluster of diskless machines available and that each diskless machine has 32 MB of main ``core''^5.6 memory. On this cluster, you wish to run a job that (when running) occupies 64 MB of core (or more) but that partitions nicely into parallel segments that are (say) 20 MB each when running on four nodes.

Running on a single system, the job has to swap constantly, and in this case the only way to swap is over the network to a remote disk on the diskless server. Swapping is very, very bad for performance - disks are many powers of ten slower than direct memory access - and swapping over a network compounds the injury. If one has to constantly go to the remote disk to load/unload the contents of memory it can very, very significantly increase the time required to run the program (as I'll discuss below in considerable detail). What is more, it can increase the time required to run the serial fraction of your code as much as it increases the time required to run the parallelizable fraction, because the parallelizable fraction has to swap in and out where it is interleaved with the serial fraction.

On the other hand, when running on a four node cluster one has to pay the IPC penalty (which we'll assume is fairly small and scales linearly with

) but the node jobs no longer swap - they fit into memory with room to spare for the operating system and libraries and buffers and caches and so forth. It is entirely possible that

is reduced so much by this that Amdahl's Law is violated by the speedup^5.7.

This shouldn't be too surprising. Our derivation of Amdahl's law assumed a certain smoothness of the execution times over the serial/parallel division of a problem; in particular, we assumed that

and

themselves don't fundamentally change except by the division of the parallel work. This not at all unreasonable example shows that this assumption may be false in the real world because the system in question has finite resources or resources that are accessible on a variety of timescales.

It also illustrates an important class of jobs for which beowulfs are ideal. One does not always consider building a beowulf to achieve a ``speedup'' in the ordinary sense of the word. One can also build one to enable a job to be done at all. More realistically (since in the previous example an obvious solution is to invest $30 in another 32 MB of memory) if one has a job that runs in $\sim 5$ GB of memory it may be far cheaper to purchase ten 512 MB systems than one 5 GB system, presuming that one can find a system at any price that holds 5 GB of main memory^5.8

Speedups that violate the simple notions that went into Amdahl's Law or the slightly more realistic speedup equation (4.9) are called superlinear speedups, and a vast literature has developed on the subject. Basically, a superlinear speedup is what we call it when for any reason parallelizing a program results in a speedup that scales faster than

for any part of its range. In nearly all cases, these will occur because of the wide range of timescales available within ``a system'' for accessing data or code.

Bottlenecks (in this case the bottleneck associated with disk-based virtual memory) can clearly wreak havoc on our beautifully derived speedup expressions, for good or for ill. Let's take a quick look at the primary bottlenecks that can significantly impact our parallel performance. I'll try to include some simple code fragments that either illustrate the points or permit you to estimate their impact on your code.

We'll begin with a simple table of the bottlenecks and the sorts of (both practical and theoretical) limits associated with them:

Bottleneck	Description	Latency	Bandwidth
L1 Cache	CPU to L1 Cache	1-5 ns	-
	read/write	(1 clock cycle)
L2 Cache	L1 to L2 Cache	4-10 ns	400-1000 MB/sec
	read/write
Memory	L2 Cache to memory	40-80 ns	100-400 MB/sec
	read/write
Disk (local)	CPU to disk	5-15 ms	1-80 MB/sec
	read/write
Disk (NFS)	CPU to NFS disk	5-20 ms	0.5-70 MB/sec
	read/write
Network	CPU to remote CPU	5-50 $\mu$ s	0.5-100 MB/sec
	write

The first thing to note in this table is the times therein differ by seven or more orders of magnitude for different devices. Compare 1 clock tick (as little as 1 nanosecond for a 1 GHz CPU) latency accessing a particular address in the L1 cache to 10 or more millisecond latency accessing a particular address on a hard disk. Compare bandwidths of a megabyte per second for streaming data transfer over a slow network (often degraded to half that or even less) to rates on the order of a gigabyte per second from the L1 cache. Even this fails to encompass the full range - we don't even consider floppy drives, serial networks, or tape devices on the slow end or the internal rates of register activity within the CPU itself on the high end.

The second thing to note is the the entire reason for having a full hierarchy of ``memory'' access rates is to hide the longest times and slowest access rates from you. Your average data access rate and latency is determined by how often the data you need is available in each different kind of memory (where we will consider disk to be a kind of memory, and will often think about network reads or writes in terms of memory as well). Time for an equation. Suppose

is the probability that your program finds the next piece of data or code it is looking for in the

th kind of ``memory'' in the table above. Then the average latency (the average time the system has to wait before the data or code becomes available) is:

For example, if $p_{L1} = 0.05$ , $p_{L2} = 0.9$ , and $p_{M} = 0.05$ for a problem that fits into main memory, the average latency might be something like:

$\begin{displaymath} <L> = 0.05*1 + 0.9*8 + 0.05*50 = 9.75 \end{displaymath}$

(5.2)

The following is my own, strictly editorial opinion^5.10. To put it bluntly, the advantages of a large L2 cache are often overblown by CPU manufacturers interested in selling you larger, more expensive (and profitable!) CPUs. It is amusing to compare the actual execution times for given pieces of code on similar clock processors from the Intel Celeron/PII/Xeon/PIII family. The Celeron, Xeon and PIII have caches that run at the full speed of the CPU, so that on an (e.g.) 500 MHz CPU a clock tick is 2 nanoseconds. The PII has a cache that runs at half the speed of the CPU clock. The Celeron L2 cache is 128 KB in size; the PII and PIII caches are 512 KB in size (four times larger) and the Xeon cache comes in several sizes, but can be as much as 2 MB in size if you're willing to pay an absurd amount of money for it.

For ``most code'' (where once again I risk flames with such a fuzzy term) there is little benefit to be seen in having a CPU with the larger cache sizes. Given a factor of ten or so cost differential between the small-cache Celeron and a very large cache Xeon at equivalent clock, one can afford to buy three complete Celeron systems for what it costs to buy one 2MB cache Xeon processor and only rarely does one see as much as a 20% speed advantage associated with the larger cache. However, there are exceptions. Code/data sets that fit within a cache clearly will execute far faster than code/data that has to go to main memory (see table above).

The reason that caches tend to work so well is that in a lot of cases a relatively few instructions are executed sequentially in loops, so that loading a whole block into cache from memory just one time suffices to allow the program to run out of cache for extended periods. If one is very lucky, one's core code can live in the L1 cache and runs at ``full speed'' all the time. Similarly, a lot of time the data one works on tends to be ``localized'' in the memory space of the program so that one load works for an extended period. If you like this makes $p_{L1}+p_{L2} \gg p_M$ so that, on average, the CPU finds what it needs already in the cache with occasional long delays when it doesn't and has to reload. As we've seen, this tends to yield and average latency and access speed not too far from the bare speed of the L2 cache which keeps your CPU trucking right along doing useful things instead of waiting for data.

However, if your program does things like add up randomly selected bytes of data from a one megabyte dataspace, it may not find the next byte that it needs in cache. With a 128 KB L2 cache, the probability may be no greater than 1/10 that it does, so 90% of the time it will take some 60 nanoseconds to get the next byte to add, and 10% of the time it will only take 10 nanoseconds (or less) for an average rate of one byte in 55 nanoseconds (plus a tick or so for the add). With a 512 KB L2 cache, it might find the byte it needs (after the program has been running a while) 1/2 the time, and its average rate of access goes up to one byte in 35 nanoseconds. With a 1 GB L2 cache, the data lives entirely without the cache and the program can get the next byte in 10 nanoseconds, nearly six times faster than with a 128 KB cache.

Some tasks are ``like'' this and have a very nonlocal pattern of memory access. These tasks benefit from a large cache. As you can see from the numbers above, though, if one finds 90% or more of what one needs in L1 or L2 cache already, there is little marginal return paying quite a lot of money for having a bigger one.

Note that this kind of problem can easily exhibit a nice superlinear speedup on a beowulf. If one breaks one's 1 MB into ten pieces 100 KB each in size, they will run in cache on Celeron nodes for a speed advantage (per node) of nearly six in addition to the factor of 10 for parallelizing the sums. If the blocks can be independently sampled and summed for a long time (to minimize the relative IPC cost of transferring the blocks and collecting the partial sums at the end) one may see a speedup far greater than one might expect from just using 10 nodes to do something you were before doing on one and ignoring the cache interaction. A silly example, in that I can think of no useful purpose to summing random bytes from a given memory space, but there are likely useful things (perhaps in a simulation of some sort) with a similar access pattern.

Now, with my editorial comment completed, I do not mean at all to suggest that the speeds and latencies of the memory subsystems are unimportant to the beowulf designer - quite the contrary. There are many jobs that people run on computers (beowulf or not) that are ``memory bound'' (which just means that their speed is primarily determined by the speed with which things are retrieved from memory, not the speed of the CPU per se). Multiplying very large matrices and other sequential operations involving large blocks of memory are perfect examples^5.11. In many of these operations the system is ``always'' getting new blocks of data from memory and putting them into cache (or vice versa) so that the memory subsystem is more or less continuously busy.

In that case an important new bottleneck surfaces that is a frequent topic of discussion on the beowulf list. It is well known that the cheapest way to get CPU is in a dual packaging. Dual CPU motherboards tend to be only a tiny bit cheaper than single CPU motherboards, and a dual can share all other resources (case, memory, disk, network) so you only have to buy one set for two CPUs. In one direction, the marginal cost of a dual over a single is the cost of a second CPU plus perhaps $50-100 (in the case of Intel processors - YMMV). In the other direction, the marginal cost of a second single CPU node compared to a dual node is the cost of a case, memory, disk and network (less $50-100) or perhaps $100-300, depending on how much memory and disk you get.

If your calculation is ``CPU bound'' then a dual is optimal and your beowulf design should likely be a pile of duals. In many cases EPC's will be CPU bound - more CPUs means more work done. If it is memory bound, however, it is a true fact on Intel systems that duals more than saturate the memory subsystem. If two CPUs are trying to get things from memory at the same time as fast as they can, one CPU has to wait at least a fraction of its time. This can impact memory bound performance significantly so that instead of getting 200% the performance of a single CPU system, one gets only 140-160%. In this case, one is usually better off getting two singles (which can yield the full 200%).

If your calculation is network bound (a possibility discussed in detail in the next chapter) life becomes far more complicated. In that case, there are lots of possibilities to consider including communication pattern, putting two NIC's in one case, being effectively memory bound (one generally talks to the NICs through the memory subsystem) and the fact that a dual can in some circumstances use a network more efficiently than a single because receiving a network transmission turns out to proceed much, much more slowly if the receiver's CPU is busy running code. I therefore hesitate to give a general rule for singles versus duals in situations where your code is network bound - you're better off prototyping your code and recycling the losing hardware on desktops or as servers.

All of the arguments and discussion concerning the L1-L2-main memory bottlenecks hold true when extended to jobs that swap, only everything becomes much, much worse. When a job swaps, an even slower ``memory'' (the hard disk) is used to store part of the (virtual) memory image of the running job. Under the assumption that the code and data are reasonably local, pieces of them are loaded into real memory on demand (whenever a virtual address is requested that is on the disk instead of in real memory) usually in fairly big chunks (pages). In fact, the system does this even when it doesn't ``have'' to and typically keeps only what it is actually working with in memory to conserve memory for all sorts of buffering and caching optimizations that the operating system handles for you behind the scenes.

The reason for big chunks is that if you have to pay that hideous 5-10 millisecond latency (often twice!) to get any chunk at all, you have to get a ``big'' one to keep the average transfer rate from being absurdly low. You're also betting (literally) that your next data or code requirement is more likely to come from the big block you just read in than not.

You pay the penalty twice when you have to store some of the pages in memory to disk to liberate the space for new pages coming in from disk. This generally happens for data, hence the term ``swap'', where two data pages are exchanged. Code, on the other hand, tends to be read in single pages from the single fixed disk image of the binary and its associated libraries, where the system works hard to cache frequently accessed pages to avoid having to actually use the disk. The two kinds of virtual memory operation (page and swap) are accounted separately in /proc/stat - lots of paging is normal, lots of swapping (or even any swapping at all) is dark and evil. Even reading in large chunks, the many orders of magnitude difference in writing to and reading from the disk instead of memory is very, very costly to a program's speed.

As (5.1) shows, if one is lucky and the code and data references are indeed mostly clustered, a job can swap or page and still complete in a reasonable amount of time. The calculation looks very similar to the example above except now one has an additional term where one has to multiply the probability of having to go to swap times the rate (including the combined effects of latency, transfer bandwidth, and the size of the block requested) to the other terms. As long as the vast bulk (as in more than 99%) of requests are satisfied from CPU cache or main memory, an occasional swap or page isn't too painful.

Life starts to really suck, however, when this is not true. If we extend the random access example to swap-based virtual memory, we can arrange things to deliberately defeat the best efforts of the paging and swap algorithms and force the system to disk again and again. For example, on a system with only

bytes of RAM, one can create a job that occupies a

virtual memory space,

of which necessarily reside on disk. Sequentially adding randomly selected bytes of data from this long vector will force the system to memory (instead of cache) on almost every call and on to disk 90% of the time, paying approximately (0.9*10 = 9) milliseconds per add. Adding a mere

numbers would require some

seconds, or almost two weeks. Compare that to adding

numbers selected from a vector that fits in L1 (on the order of a few seconds) or L2 (a bit less than a minute) or memory (a couple of minutes). You can see why understanding the bottlenecks associated with the different speeds of the various memory subsystems is so important when engineering a standalone workstation, let alone a beowulf cluster.

Again you can see how a 10 node beowulf design that permits the task to execute out of main memory could yield a millionfold improvement in time to completion, which is a rather profound nonlinear speedup. This also justifies the earlier observation that a job that might run in some reasonable fraction of a day on such a beowulf could be stretched to 365000 days and end in early 3000 A.D. on a single node with swap. Disk is very, very slow compared to any sort of ``real'' memory subsystem and both tasks and the beowulfs they run on should always be designed and tuned to avoid requiring swap at all costs.

However silly this example appears at first glance, there are a number of tasks with a similar lack of locality that do, in fact, occupy very large virtual memory spaces. Sorting very long lists, database operations, certain kinds of simulations, all might perform operations on very widely separated or even randomly selected elements in a list. Here is another place where algorithm becomes almost as important as architecture - some algorithms for a sort, for example, might be far more local than others and although they may scale worse in terms of number of operations, by avoiding a killer bottleneck (tending to run in cache, for example) they may complete in far less time.

It is interesting to note in the context of beowulfery that (following the table above) it is some two to three orders of magnitude faster to transfer data from the memory of a remote system over the network than it is to transfer it from a disk (whether local or remote). Even if one's job isn't capable of being partitioned among nodes, if it requires four GB of virtual memory (and all you have is 512 MB nodes) one can obtain nearly a thousandfold speedup compared to running out of swap by putting swap spaces on remote ramdisks on the nodes that are then served to the single-threaded task execution unit over the network. In principle this can be done with current linux kernels (make a big ramdisk on a node, build a swap space on it that is provided via NFS to the execution node) but I haven't tried it. It is likely that (if it works at all) it isn't really very efficient, however much it improves on a disk based swap.

This is an area of current research by real computer scientists. The Trapeze project being conducted (in some cases by friends of mine) at Duke^5.12 is one such effort (based, alas, on FreeBSD) that uses e.g. Myrinet as the network layer. With its $\sim 5$ microsecond latency and gigabit per second bandwidth, Myrinet is fast enough to form the basis to an intermediate layer in the memory hierarchy directly integrated with the kernel, rather than operating through the usual swap or paging mechanism. Again, the point is that beowulfish architectures can provide tremendous nonlinear speedups and enable new work to be accomplished at far lower cost than (for example) buying a system equipped with four or five gigabytes of main memory.

The network is such an important bottleneck in a traditional ``real'' beowulf calculation that it deserves a section all its own. In the next chapter we'll examine the network and IPC's in a beowulfish layout. We'll also show how IPC's and the not infrequent requirement that your code proceed by parallel steps synchronously can combine to push your optimal design towards, or away, from a ``true beowulf'' configuration as opposed to a generic cluster.

There are several bottlenecks that I haven't discussed in this section that may or may not be important to your code. For example, I've only barely mentioned context switches^5.13 without telling you what they are or why they are ``bad''. They are what happens when your code bounces around in certain ways and forces the cache to be reloaded with new code, and they can occur when you call lots of widely separated subroutines or when you access the I/O subsystems a lot, among other places. So try to avoid doing this sort of thing inside your main loops, all right? I also haven't talked much about interrupts per se, partly because interrupts and context switches live in the Deep Kernel and are Not Meant for Mere Mortals to Know. Or something like that.

Actually, interrupts are pretty important to beowulfery, especially those associated with the network, again because there are all sorts of nasty latencies and bandwidths and contention issues to deal with in exotic circumstances. However, most of this will be largely beyond one's control unless one happens to be a kernel hacking kind of person with an I.Q. of around 170^5.14. So a really proper treatment of this will have to wait until I write a book on the kernel, which, given my truly astounding lack of detailed knowledge of how the kernel works, could be forever^5.15.