Copyright Michael Karbo and ELI Aps., Denmark, Europe.
Chapter 10. The cache
In the previous chapter, I described two aspects of the ongoing development of new CPU’s – increased clock frequencies and the increasing number of transistors being used. Now it is time to look at a very different yet related technology – the processor’s connection to the RAM, and the use of the L1 and L2 caches.
The CPU works internally at very high clock frequencies (like 3200 MHz), and no RAM can keep up with these.
The most common RAM speeds are between 266 and 533 MHz. And these are just a fraction of the CPU’s working speed. So there is a great chasm between the machine (the CPU) which slaves away at perhaps 3200 MHz, and the “conveyor belt”, which might only work at 333 MHz, and which has to ship the data to and from the RAM. These two subsystems are simply poorly matched to each other.
If nothing could be done about this problem, there would be no reason to develop faster CPU’s. If the CPU had to wait for a bus, which worked at one sixth of its speed, the CPU would be idle five sixths of the time. And that would be pure waste.
The solution is to insert small, intermediate stores of high-speed RAM. These buffers (cache RAM) provide a much more efficient transition between the fast CPU and the slow RAM. Cache RAM operates at higher clock frequencies than normal RAM. Data can therefore be read more quickly from the cache.
The cache delivers its data to the CPU registers. These are tiny storage units which are placed right inside the processor core, and they are the absolute fastest RAM there is. The size and number of the registers is designed very specifically for each type of CPU.Fig. 68. Cache RAM is much faster than normal RAM.
The CPU can move data in different sized packets, such as bytes (8 bits), words (16 bits), dwords (32 bits) or blocks (larger groups of bits), and this often involves the registers. The different data packets are constantly moving back and forth:
The cache stores are a central bridge between the RAM and the registers which exchange data with the processor’s execution units.
The optimal situation is if the CPU is able to constantly work and fully utilize all clock ticks. This would mean that the registers would have to always be able to fetch the data which the execution units require. But this it not the reality, as the CPU typically only utilizes 35% of its clock ticks. However, without a cache, this utilization would be even lower.
CPU caches are a remedy against a very specific set of “bottleneck” problems. There are lots of “bottlenecks” in the PC – transitions between fast and slower systems, where the fast device has to wait before it can deliver or receive its data. These bottle necks can have a very detrimental effect on the PC’s total performance, so they must be minimised.Fig. 69. A cache increases the CPU’s capacity to fetch the right data from RAM.
The absolute worst bottleneck exists between the CPU and RAM. It is here that we have the heaviest data traffic, and it is in this area that PC manufacturers are expending a lot of energy on new development. Every new generation of CPU brings improvements relating to the front side bus.
The CPU’s cache is “intelligent”, so that it can reduce the data traffic on the front side bus. The cache controller constantly monitors the CPU’s work, and always tries to read in precisely the data the CPU needs. When it is successful, this is called a cache hit. When the cache does not contain the desired data, this is called a cache miss.
The idea behind cache is that it should function as a “near store” of fast RAM. A store which the CPU can always be supplied from.
In practise there are always at least two close stores. They are called Level 1, Level 2, and (if applicable) Level 3 cache. Some processors (like the Intel Itanium) have three levels of cache, but these are only used for very special server applications. In standard PC’s we find processors with L1 and L2 cache.
Fig. 70. The cache system tries to ensure that relevant data is constantly being fetched from RAM, so that the CPU (ideally) never has to wait for data.
Level 1 cache is built into the actual processor core. It is a piece of RAM, typically 8, 16, 20, 32, 64 or 128 Kbytes, which operates at the same clock frequency as the rest of the CPU. Thus you could say the L1 cache is part of the processor.
L1 cache is normally divided into two sections, one for data and one for instructions. For example, an Athlon processor may have a 32 KB data cache and a 32 KB instruction cache. If the cache is common for both data and instructions, it is called a unified cache.