Expensive but Simple

If you are a real neophyte in all senses of the word (a linux neophyte, a parallel computing neophyte, a beowulf neophyte, a network manager neophyte) and need to pretty much learn everything as you go along, you're going to want to stick to the following recipe. Even if you did your homework very carefully and know that you really need to build one of the cheaper designs discussed next, you should still build yourself a very small beowulf (perhaps four nodes) according to this plan, just to learn what you're doing. Build it out of systems and parts at hand, or out of systems you plan to recycle into stripped nodes or server nodes.

That is, the following design isn't really very cost-beneficial, since you buy some things your nodes don't really need and do a lot of work by hand, but it is still reasonable for small beowulfs or while you're learning. What you learn here will help you implement the next design, which scales a lot better in all ways.

This design has you install and configure each node basically by hand using the standard installation tools that come with whatever linux distribution you selected to use. You then put the nodes onto the chosen network(s) and configure them for parallel operation.

The other place where this design works as a prototype is if you are setting up a beowulf-style cluster that isn't really a true beowulf. In that case you'd actually configure each node to be a fully functioning standalone workstation, setting up X and sound and all that, and providing each node with a monitor and keyboard and room on a table or desk. You'd still install most of the ``beowulf'' software - PVM, MPI, MOSIX and so forth, and you'd still configure it for parallel operation.

In this ``hand crafted beowulf'' design, your nodes have to be configured to install independently. These days, that means that they probably need the following hardware:

A floppy drive.
A cheap, small (4 GB is small these days) IDE hard drive.
A CD-ROM drive.
A generic SVGA card (I usually get $30 S3-Virge cards).

plus of course your NIC(s). Each node is then attached to your choice of the following:

A KVM (Keyboard, Video, Mouse) switch, which in turn is connected to a single keyboard, monitor and mouse. KVM switches are available that are cheap (but fuzz a high resolution monitor a bit and don't work for PS/2 mice) or expensive (but keep the monitor clear and can manage all kinds of mice). The latter can be purchased to support all the way up to some 64 nodes, although they might add almost as much to the marginal cost of your nodes as a monitor, keyboard and mouse for each.
A monitor, keyboard and mouse for each. That is, you're building a NOW (network of workstations) or COW (cluster of workstations) as opposed to building a "true beowulf". Big deal. It will still work like a beowulf for anything but moderately fine grained synchronous parallel code and you can use the workstations for all sorts of useful (but not particularly CPU or network intensive) things while it is doing parallel computations.
A moderately portable monitor, keyboard and mouse, perhaps on a cart. You plug this into the nodes only one at a time of course, installing one, then the next one, then the next and so on.
One of several moderately expensive specialty cards that let you use (e.g.) a serial console for the original install. Expect to pay three or four times the cost of a cheap SVGA card.

The installation procedure is then very simple. You plug your distribution CD into the CD-Rom drive, the boot floppy into the floppy drive, (if necessary attach the portable monitor and keyboard to the appropriate ports) and boot. You will generally find yourself in your distribution's standard install program.

From there, install a more or less standard linux according to the distribution instructions. You probably have more than enough hard disk space to install everything as it is hard to buy a disk nowadays with less than 4 gigabytes (which is way plenty) so don't waste too much time picking and choosing - if it looks like it might be useful install it, or just install ``everything'' if that is an option. Be moderately careful to install all the nodes the same way as you really want them to be as ``identical'' as possible.

Be sure to include general programming support (compilers, libraries, editors, debuggers, and documentation). Be sure to include the full kernel, including sources and documentation (a lot of distributions won't install the kernel source unless you ask them to). Be sure to install all networking support, including things like NFS server packages. Sure, a lot of these things will never be needed on a node (at least if you do things correctly overall), but if they are ever needed it will be a total pain in the rear to put them on later, and space is cheap (your time later is expensive).

Be sure to install enough swap space to handle the node's memory if you can possibly spare the disk. A rule of thumb to follow might be to install 1-2x main memory. Again, if you are sensible (and read the chapter on the utter evil of swapping) you will avoid running the nodes so that they swap. However, in the real world memory leaks (MPI is legendary for leaking in real live beowulfs!), Joe runs his job at the same time as Mary without telling her, a forking daemon goes forking nuts and spawns a few thousand instances of itself, netscape goes berserk on a NOW workstation, and you'd just LOVE to have a tiny bit of slack to try to kill off the offending processes without wasting Mary's two week run. A system without swap that runs out of memory generally dies a ghastly death soon thereafter. It's one of the few ways to crash even linux. Be warned.
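To make the rule of thumb concrete, here is a minimal sketch in Python. The function names swap_range_mb and fits_on_disk and the megabyte units are assumptions made up for illustration; nothing like them ships with your distribution's installer.

```python
def swap_range_mb(ram_mb):
    """Return (low, high) swap sizes in MB using the 1-2x main memory rule of thumb."""
    if ram_mb <= 0:
        raise ValueError("ram_mb must be positive")
    return ram_mb, 2 * ram_mb


def fits_on_disk(ram_mb, spare_disk_mb):
    """True if at least the low end of the rule (1x RAM) fits in the spare disk space."""
    low, _high = swap_range_mb(ram_mb)
    return spare_disk_mb >= low
```

For a node with 512 MB of memory this suggests somewhere between 512 MB and 1024 MB of swap, which is a small bite out of even a 4 GB disk.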

Finally, install your beowulf specific software off of a homemade CD or the net (when the network is up) or perhaps the CD that came with this book (if a CD came with this book). If you installed a distribution that uses RPM's (like Red Hat, SuSE, Caldera) this should be straightforward. Debian users will firebomb my house if I don't extend this to Debian packages as well, so I will. At this point in my life, I'd tend to avoid Slackware although we were very happy together for years. Good packaging systems scale well to lots of nodes, and scalability is key to manageability.

With all the software installed, it is time to do the system configuration. Here I cannot possibly walk you through a full course in linux systems management, and most of what you do is common to all linux or unix systems, things like installing a root password (you probably did this during the install, actually, and hopefully picked the same password for all nodes), setting up the network, setting up security and network services, setting up NFS mounts, and so forth. To learn how to do all this, you can use the documentation that came with your distribution or head on down to Barnes and Noble (or over to amazon.com) and get a few books on the subject. Be warned that the ``administration tools'' that come with most linux distributions suck wildly in so many ways^11.11 so even if you use them to get started you need to learn how to do things by hand.

There are a few things you need to do a bit differently than the out-of-the-box configuration, and I'll focus on just these.

Be sure that the latest version of the openssh package is installed on all the nodes^11.12. Keep this revision up to date as aggressively as you can manage, as there are occasional security holes found in ssh and you want to be sure you are working with the latest patched release. The latest releases of ssh are also much easier to debug when something goes wrong with your setup.
When you set up networking on a ``true beowulf'' node (one that is isolated from the main network of your organization by some sort of gateway node), use an IP number for a private internal network. Private internal networks are described in RFC 1918 (if you know what an RFC is or care). They are also described in the HOWTO on IP-Masquerading. I personally like the 192.168.x.x addresses, but you can also use the 10.x.x.x addresses (if you want to be lavish) or the 172.[16-31].x.x addresses, which I can never remember. Remember not to assign the 0 address or the 255 address to nodes - that is, use only something like 192.168.1.[1-254] as a range. On a network like 192.168.1.x, 0 is the network address and 255 is the broadcast address; both are ``special'' and can break things if used.
Set up a common /etc/hosts or some sort of nameservice. There are good things and bad things about using NIS to manage system databases like this. It is likely that the bad outweighs the good - NIS can significantly increase the overhead of certain kinds of network traffic and network traffic is the last thing that you want to slow down in a beowulf. On a ``true beowulf'' most people tend to use a tool like rsync or an scp script to distribute identical copies of /etc/passwd, /etc/group, /etc/hosts, and so forth. However, in a NOW-type cluster with lots of users (and not particularly fine grained parallel code) NIS is a reasonable enough solution.
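As a rough illustration of the addressing advice and the common /etc/hosts, the following Python sketch generates host entries on a private network. The node names (b1, b2, ... and bhead) follow the naming used later in this section; the function build_hosts, the choice of 192.168.1.0/24, and putting the head node on the highest usable address are assumptions made for the example, not a required layout.

```python
import ipaddress

# The three private IPv4 blocks: 10.x.x.x, 172.[16-31].x.x, 192.168.x.x
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def build_hosts(network="192.168.1.0/24", node_count=4, head="bhead", prefix="b"):
    """Return /etc/hosts text for a head node plus node_count compute nodes."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or not any(net.subnet_of(b) for b in PRIVATE_BLOCKS):
        raise ValueError(f"{network} is not inside a private IPv4 block")
    # net[0] is the network address and net[-1] the broadcast address;
    # neither may be assigned, and one usable address goes to the head node.
    if node_count > net.num_addresses - 3:
        raise ValueError("network too small for that many nodes")
    lines = ["127.0.0.1\tlocalhost", f"{net[-2]}\t{head}"]
    lines += [f"{net[i]}\t{prefix}{i}" for i in range(1, node_count + 1)]
    return "\n".join(lines) + "\n"
```

Calling build_hosts(node_count=16) and writing the result to a file gives you one master copy to push to every node with rsync or scp, rather than sixteen hand-typed ones that slowly drift apart.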

When you are done and have rebooted the node, it should come up accessible (via ssh) over the network. Once you can login as root over the net (test this) you can move or switch the monitor and keyboard to the next node.
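Before you wheel the monitor away, a quick check that the node's ssh port is answering can save a trip back. The Python sketch below assumes sshd listens on its default port 22; the function name port_open is hypothetical. It only shows that something is listening on the port, not that root login works, so you should still test a real login.

```python
import socket


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

From the head node, something like port_open("b1") tells you whether the freshly installed node is at least reachable over the network.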

With all of this established, and with ssh set up to permit root access to the nodes from the head node without a password, it is time to distribute common copies of things like /etc/hosts, /etc/hosts.[allow,deny], /etc/passwd, and your preferred /root home directory (I tend to like to customize mine in various ways and miss the customizations when they aren't there).

To do this, one can use something like rsync (with the underlying shell set to ssh, of course) or just an scp. Either way, you will find it very useful to have a set of scripts on the head node that permit commands to be executed on all nodes (one at a time, in order) or files copied to all nodes (one after another, in order). Some simple scripts for doing this sort of thing are in the Software appendix (and available on the web, since I doubt that you want to type them in).
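To give a flavor of what such scripts do, here is a minimal Python sketch. It is not the set of scripts from the Software appendix; the node list, the function names, and the injectable runner argument are all assumptions made for illustration. It visits the nodes one at a time, in order, either running a command over ssh or copying a file with rsync using ssh as the underlying shell.

```python
import subprocess

# Hypothetical node names; substitute whatever is in your /etc/hosts.
NODES = [f"b{i}" for i in range(1, 5)]


def ssh_command(node, command):
    """Argument list that runs `command` on `node` via ssh."""
    return ["ssh", node, command]


def rsync_command(node, path):
    """Argument list that copies `path` to the same path on `node` over ssh."""
    return ["rsync", "-a", "-e", "ssh", path, f"{node}:{path}"]


def for_all_nodes(build, nodes, *args, runner=subprocess.run):
    """Run build(node, *args) on each node in order; return {node: exit status}."""
    results = {}
    for node in nodes:
        results[node] = runner(build(node, *args)).returncode
    return results


# Example use from the head node (not executed here):
#   for_all_nodes(ssh_command, NODES, "uptime")
#   for_all_nodes(rsync_command, NODES, "/etc/hosts")
```

Doing the nodes serially is slow on a big cluster but makes failures easy to spot: the returned dictionary shows exactly which node gave a nonzero exit status.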

I'd strongly recommend that you arrange for all nodes to do all their logging on your head node to make it as easy as possible to monitor the nodes and to make it as easy as possible to reinstall or upgrade the nodes. If all they contain is a distribution plus some simple post-install configuration files, you don't need to back them up as reinstalling them according to your recipe will generally be faster. This is a good reason to set things up so that the nodes provide at most scratch space on disk for running calculations with the full understanding that this space is volatile and will go away if a node dies.

When you are finished with this general configuration, you should have a head node (mywulf outside and bhead inside) that is also an NFS server exporting home directory space and project space to all the nodes. You should have a common password file (and possibly /etc/shadow file) on all the nodes containing all your expected users. You should have ssh set up so all your users (and root) can transparently execute ssh commands on all nodes from the head node or each other (root might only work from the head node). That is, ``ssh b12 ls /'' should show you the contents of the root directory without a password. You should have PVM and MPI (and possibly other things like MOSIX or a queuing system) installed on all nodes (probably via an NFS mount - there is little reason to maintain N copies of the binary installation, although with RPM or a decent package manager there isn't too much reason not to).
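One way to check the ``without a password'' part on every node without sitting at a prompt is to run ssh in batch mode, where it fails instead of asking for a password. The Python sketch below assumes an OpenSSH client (it uses the BatchMode and ConnectTimeout options); the function names and the five second timeout are arbitrary choices for illustration.

```python
import subprocess


def passwordless_argv(node, command="ls /"):
    """ssh argument list that fails rather than prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", node, command]


def passwordless_ok(node, runner=subprocess.run):
    """True if `ssh node ls /` succeeds with no password prompt."""
    result = runner(
        passwordless_argv(node),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Looping passwordless_ok over your node names, once as root and once as an ordinary user, quickly turns up the node whose keys you forgot to install.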

PVM or MPI should be configured so that they can utilize all the nodes. How to do this is beyond the scope of this book - there are lots of nice references on both of them and one can usually succeed even if one only follows the instructions provided with both of them. With PVM, for example, you'll have to tell it to use ssh instead of rsh and decide whether you want to run pvmd as root (with a preconfigured virtual machine) or let users build their own virtual machine for any given calculation, which in turn may depend on who your users are and what sort of usage policy you have. Similar decisions are required for MPI. It is a very good idea to run a few of the test examples that are provided with PVM and MPI to verify that your beowulf is functioning.

From this point on, you can declare your beowulf open for business. Your work is probably not done, as I've only described a very minimalist beginning, but from this beginning you can learn and add bells and whistles as you need them.

This approach, as we've seen, more or less builds your beowulf nodes by hand. This teaches you the most about how to build them and configure them, but it doesn't scale too well. It might take you as long as half a day to install a new node using the approach above, even after you have mastered it (the first few nodes might take you days or weeks to get ``just right''). There has to be a better way.

Of course there is. There are several, and I'll proceed to cover at least two. The next example will be a bit Red Hat-centric in my description. This is not to endorse Red Hat over any other linux but simply because I'm most familiar with Red Hat and too lazy to experiment with alternatives (at least to the point of becoming moderately ``expert''). It is certain that a very similar solution is possible with other distributions, if you take the time to figure out how to make it work.


Robert G. Brown 2004-05-24