One lovely thing about cluster computing, including beowulfery per se, is that there is a natural life cycle for cluster compute nodes. Let us meditate upon this life cycle for a moment.
Your grant is approved, your company agrees: You can Build a Beowulf. You collect quotes, select near-bleeding edge hardware, and put the whole thing together. It works! You do all sorts of fabulous research, or invent a new drug, or solve some critical problem, over the next two or three years.
Suddenly your once-new cluster looks pretty shabby. Ten percent or so of the nodes have given up the ghost altogether and been cannibalized for parts or been repaired at modest expense. Worse, Moore's Law has continued its inexorable march and it is getting hard to find nodes as slow and ill-equipped with memory as yours any more, even for a paltry $500 each.
What do you do?
Well, an obvious thing to do is to buy shiny new nodes from current technology, and replace your old cluster with a new one some eight times faster at equivalent cost, but that leaves one with the problem of what to do with all the old nodes.
Welcome to the food chain. As systems age out (in any LAN or cluster environment) they gradually ``lose value'' compared to current technology, because
Consideration of the above cruel facts may, in fact, convince you that it is better to upgrade your cluster more often than every three years. A lot of folks (myself included) try to arrange to upgrade their clusters once a year, with an explicit line item in each year's budget for a new set of nodes based on the technology du jour, skimming along near the crest of Moore's law instead of being lifted up to the top of the wave every three years only to wipe out in the troughs in between.
A totally dispassionate review of the Total Cost of Ownership (TCO) of the nodes in an associated Cost-Benefit Analysis (CBA) might well dictate throwing the nodes away every twelve to eighteen months rather than operating them until they die of old age. After this period, new technology is typically roughly 2x faster at equivalent cost, the overhead for operating the older nodes is 2x as great (per unit of work done), and the human cost of waiting for (presumably valuable) work to complete is often far greater than any of the hardware or operational costs. I have seen Real Live CBAs that prove this to be the case in at least some environments.
However, the proof depends to some extent upon the assumptions made (to include the infrastructure costs or not, to include the cost of the human time spent waiting for results or not). Given a set of assumptions and an assignment of costs and benefits, I can do no better than quote Dr. Josip Loncaric, a venerable and respected beowulfer (footnote 13.1):
Picking the best hardware replacement interval is an analytically solvable problem. Assuming that performance per $ doubles every N months, the most cost effective policy is to buy replacements whenever you can get 4.92155 times the performance for the same money. The Moore's law says that N=18, so the best replacement interval works out to be 3.44867 years. Using intervals of 3-4 years is almost as good.
This is the general view - most people view three years plus to be the ideal replacement cycle, and as Josip points out this is analytically justifiable. Note that this does not mean that replacing your cluster every three years is ideal - in general it will usually be better to replace 1/4-1/3 of your cluster (all three year old machines) every year, not replace the whole thing every three years. However, an equally good argument for a much shorter replacement cycle has been sent to me by a very competent list person who accounts for things like hardware reliability and so forth ignored by Josip. As always in cluster engineering, your mileage may vary according to your particular needs and cost/benefit landscape.
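The arithmetic in Josip's rule is easy to check. The following is only a minimal sketch: it does not derive the 4.92155 threshold (that number is taken as given from the quote), it merely converts a target performance ratio into a waiting time under the stated assumption that performance per dollar doubles every N months. The function names are made up for illustration.

```python
import math


def performance_ratio(months, doubling_months=18.0):
    """Performance per dollar of new hardware relative to hardware bought
    `months` ago, assuming a clean doubling every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def replacement_interval_years(target_ratio, doubling_months=18.0):
    """Years one must wait until new hardware offers `target_ratio` times
    the performance for the same money, under the same doubling assumption."""
    return doubling_months * math.log2(target_ratio) / 12.0


if __name__ == "__main__":
    # Threshold quoted from Josip; N = 18 months as in the quote.
    years = replacement_interval_years(4.92155, doubling_months=18.0)
    print(f"replace after about {years:.2f} years")

    # What a three year old cluster is up against, for two doubling times.
    for n in (12.0, 18.0):
        ratio = performance_ratio(36.0, doubling_months=n)
        print(f"N = {n:.0f} months: new nodes are {ratio:.1f}x faster per dollar")
```

With N=18 this lands on the roughly three and a half years Josip quotes. Note how sensitive the three-year speedup is to the assumed doubling time: a factor of eight for N=12 versus a factor of four for N=18, which is one reason different people reach different conclusions about the ideal cycle.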
This still leaves one with the question of what to do with all the nodes one accumulates as they gradually age out, whether they age out in one year or five. The following are some very generic suggestions:
Note well that computers contain a variety of toxic materials. There is typically mercury in the little battery that backs up the BIOS. There is arsenic in the doped silicon in the IC wafers. There may be lead, cadmium and a number of other heavy metals used in various sub-assemblies. Computers also contain some valuable metals. There is gold on the contacts, for example, and plenty of copper everywhere.
There are good sides and bad sides to all of this. Node ``recycling'' often involves third world child labor and toxic materials (such as mercury) to extract the gold, and frequently ignores the rest of the toxic metals that build up wherever the parts are ultimately disposed of once the gold is mined out of them. We don't have the technology to disassemble nodes into reusable micro components, and even the reusable macro components (such as the case and power supply, the drives, and so forth) tend not to be reusable for more than three to five years before they no longer work with current technology at all.
It doesn't do any good to recycle nodes ``properly'' where properly means sending them off to India to provide short term jobs and a toxic future for small Indian children. However, dumping them in landfills here isn't terribly wise either. Perhaps the best approach is to recycle the mercury-laden components (the battery) by hand, and landfill the rest, accepting that the arsenic and so forth will eventually show up in the water table. I'd be happy to hear better suggestions as this document reaches more people, and will cheerfully update this chapter as better ideas emerge. [email protected], people.