The level 2 cache is normally much bigger (and unified), such as 256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly larger quantities of data from RAM, so that these are available to the L1 cache.
In the earlier processor generations, the L2 cache was placed outside the chip: either on the motherboard (as in the original Pentium processors), or on a special module together with the CPU (as in the first Pentium IIs).
Fig. 71. An old Pentium II module. The CPU is mounted on a rectangular printed circuit board, together with the L2 cache, which is two chips here. The whole module is installed in a slot (Slot 1) on the motherboard. But this design is no longer used.
As process technology has developed, it has become possible to make room for the L2 cache inside the actual processor chip. Thus the L2 cache has been integrated, and that makes it function much better in relation to the L1 cache and the processor core.
The L2 cache is not as fast as the L1 cache, but it is still much faster than normal RAM.
| CPU | L2 cache |
|---|---|
| Pentium, K5, K6 | External, on the motherboard |
| Pentium Pro | Internal, in the CPU |
| Pentium II, Athlon | External, in a module close to the CPU |
| Celeron (1st generation) | None |
| Celeron (later gen.), Pentium III, Athlon XP, Duron, Pentium 4 | Internal, in the CPU |
Fig. 72. It has only been during the last few CPU generations that the level 2 cache has found its place, integrated into the actual CPU.
Traditionally the L2 cache is connected to the front side bus. Through it, it connects to the chipset's north bridge and RAM:
Fig. 73. The way the processor uses the L1 and L2 cache has crucial significance for its utilisation of the high clock frequencies.
The level 2 cache takes up a lot of the chip's die, as millions of transistors are needed to make a large cache. The integrated cache is made using SRAM (static RAM), as opposed to normal RAM which is dynamic (DRAM).
While DRAM can be made using one transistor per bit (plus a capacitor), it costs 6 transistors (or more) to make one bit of SRAM. Thus 256 KB of L2 cache requires more than 12 million transistors (256 × 1024 × 8 bits × 6 transistors ≈ 12.6 million). That is why it only became feasible to integrate a large L2 cache into the actual CPU once fine process technology (such as 0.13 and 0.09 microns) had been developed. In Fig. 66 on page 27, the number of transistors includes the CPU's integrated cache.
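The transistor estimate above can be reproduced with a few lines of Python. This is only a sketch: it uses the 6-transistors-per-bit figure from the text and counts the data bits alone, ignoring tag storage and control logic, so the results are lower bounds. The function name `sram_transistors` is made up for this illustration.

```python
BITS_PER_BYTE = 8
TRANSISTORS_PER_SRAM_BIT = 6  # figure quoted in the text ("6 transistors (or more)")


def sram_transistors(cache_kb: int, per_bit: int = TRANSISTORS_PER_SRAM_BIT) -> int:
    """Lower-bound transistor count for the data array of an SRAM cache."""
    bits = cache_kb * 1024 * BITS_PER_BYTE
    return bits * per_bit


if __name__ == "__main__":
    # Cache sizes mentioned in the text for typical L2 caches.
    for size_kb in (256, 512, 1024):
        millions = sram_transistors(size_kb) / 1e6
        print(f"{size_kb:5d} KB -> {millions:.1f} million transistors")
```

For 256 KB this gives roughly 12.6 million transistors, which matches the "more than 12 million" stated above.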
Powerful bus
The bus between the L1 and L2 cache is presumably THE place in the processor architecture which has the greatest need for high bandwidth. We can calculate the theoretical maximum bandwidth by multiplying the bus width by the clock frequency. Here are some examples:
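| CPU | Bus width | Clock frequency | Theoretical bandwidth |
|---|---|---|---|
| Intel Pentium III | 64 bits | 1400 MHz | 11.2 GB/s |
| AMD Athlon XP+ | 64 bits | 2167 MHz | 17.3 GB/s |
| AMD Athlon 64 | 64 bits | 2200 MHz | 17.6 GB/s |
| AMD Athlon 64 FX | 128 bits | 2200 MHz | 35.2 GB/s |
| Intel Pentium 4 | 256 bits | 3200 MHz | 102 GB/s |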
Fig. 74. Theoretical calculations of the bandwidth between the L1 and L2 cache.
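The figures in Fig. 74 follow directly from the rule "bus width times clock frequency". The sketch below recomputes them. It assumes one transfer per clock cycle and 1 GB = 10^9 bytes, which is what the table's numbers imply; the function name `bandwidth_gb_per_s` is chosen for this example only.

```python
def bandwidth_gb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak bandwidth in GB/s, assuming one transfer per clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 / 1e9


# (CPU, bus width in bits, clock frequency in MHz), taken from Fig. 74.
EXAMPLES = [
    ("Intel Pentium III", 64, 1400),
    ("AMD Athlon XP+", 64, 2167),
    ("AMD Athlon 64", 64, 2200),
    ("AMD Athlon 64 FX", 128, 2200),
    ("Intel Pentium 4", 256, 3200),
]

if __name__ == "__main__":
    for name, width, clock in EXAMPLES:
        print(f"{name:18s} {bandwidth_gb_per_s(width, clock):6.1f} GB/s")
```

Doubling the bus width at the same clock doubles the bandwidth, which is the difference between the Athlon 64 and the Athlon 64 FX in the table.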
Different systems
There are a number of different ways of using caches. Both Intel and AMD have saved on L2 cache in some series, in order to make cheaper products. But there is no doubt that the better the cache (both L1 and L2), the more efficient the CPU will be and the higher its performance.
AMD have settled on a fairly large L1 cache of 128 KB, while Intel continue to use relatively small (but efficient) L1 caches.
On the other hand, Intel uses a 256-bit wide bus on the "inside edge" of the L2 cache in the Pentium 4, while AMD only has a 64-bit bus (see Fig. 74).
Fig. 75. Competing CPUs with very different designs.
AMD uses exclusive caches in all their CPUs. That means that the same data can't be present in both caches at the same time. This is a clear advantage, because no cache space is spent on duplicate copies. Intel's caches do not work that way.
However, the Pentium 4 has a more advanced cache design, with the Execution Trace Cache making up 12 KB of the 20 KB Level 1 cache (strictly speaking it holds about 12,000 decoded micro-operations). This instruction cache works with decoded instructions, as described on page 35.
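| CPU | L1 cache | L2 cache |
|---|---|---|
| Athlon XP | 128 KB | 256 KB |
| Athlon XP+ | 128 KB | 512 KB |
| Pentium 4 (I) | 20 KB | 256 KB |
| Pentium 4 (II, "Northwood") | 20 KB | 512 KB |
| Athlon 64 | 128 KB | 512 KB |
| Athlon 64 FX | 128 KB | 1024 KB |
| Pentium 4 (III, "Prescott") | 28 KB | 1024 KB |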
Fig. 76. The most common processors and their caches.
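To see why exclusive caches are described as an advantage, it can help to compare how much distinct data the two levels can hold together. The sketch below is a simplification with loudly stated assumptions: an exclusive design never duplicates data, so the capacities add up, while a strictly inclusive design (a hypothetical comparison case, not a claim about any particular Intel chip) keeps a copy of everything in L1 inside L2 as well. The function name `effective_capacity_kb` is invented for this illustration.

```python
def effective_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Distinct data (in KB) that L1 and L2 can hold together.

    Assumption: in a strictly inclusive hierarchy every L1 line is also
    stored in L2, so L1 adds no unique capacity (requires l2_kb >= l1_kb).
    """
    if exclusive:
        return l1_kb + l2_kb
    return l2_kb


if __name__ == "__main__":
    # Athlon XP sizes from Fig. 76: 128 KB L1, 256 KB L2.
    l1, l2 = 128, 256
    print("exclusive:", effective_capacity_kb(l1, l2, exclusive=True), "KB")
    print("inclusive:", effective_capacity_kb(l1, l2, exclusive=False), "KB")
```

With a large L1 cache such as AMD's 128 KB, the difference is substantial; with a small L1 cache the gain from an exclusive design would be much smaller.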
Latency
A very important aspect of all RAM, cache included, is latency. All RAM storage has a certain latency, which means that a certain number of clock ticks (cycles) must pass between, for example, two reads. L1 cache has lower latency than L2, which is why it is so efficient.
When the cache is bypassed to read directly from RAM, the latency is many times greater. Fig. 77 shows the number of wasted clock ticks for various CPUs. Note that when the processor core has to fetch data from the actual RAM (when both L1 and L2 have missed), it costs around 150 clock ticks. This situation is called stalling and needs to be avoided.
Note that the Pentium 4 has a much smaller L1 cache than the Athlon XP, but it is significantly faster. It simply takes fewer clock ticks (cycles) to fetch data:
| Latency | Pentium II | Athlon | Pentium 4 |
|---|---|---|---|
| L1 cache: | 3 cycles | 3 cycles | 2 cycles |
| L2 cache: | 18 cycles | 6 cycles | 5 cycles |
Fig. 77. Latency leads to wasted clock ticks; the fewer there are of these, the faster the processor will appear to be.
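The effect of these latencies can be illustrated with a rough average-access-time estimate. The sketch below combines the cycle counts from Fig. 77 with the 150-tick RAM penalty mentioned in the text. The hit rates (95% in L1, 90% in L2) are purely hypothetical values chosen for illustration, and the model assumes that each level's latency is simply added when the level above misses.

```python
RAM_LATENCY_CYCLES = 150  # "around 150 clock ticks", from the text


def average_access_cycles(l1_latency: float, l2_latency: float,
                          l1_hit_rate: float, l2_hit_rate: float,
                          ram_latency: float = RAM_LATENCY_CYCLES) -> float:
    """Average cycles per memory access for a simple two-level cache model."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays the L1 latency; misses add L2, and L2 misses add RAM.
    return l1_latency + l1_miss_rate * (l2_latency + l2_miss_rate * ram_latency)


# (CPU, L1 latency, L2 latency) in cycles, taken from Fig. 77.
CPUS = [("Pentium II", 3, 18), ("Athlon", 3, 6), ("Pentium 4", 2, 5)]

if __name__ == "__main__":
    for name, l1, l2 in CPUS:
        avg = average_access_cycles(l1, l2, l1_hit_rate=0.95, l2_hit_rate=0.90)
        print(f"{name:10s} {avg:.2f} cycles per access (hypothetical hit rates)")
```

Even with high hit rates, the rare trips to RAM contribute noticeably to the average, which is why stalling needs to be avoided.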