
Question

Assume the instruction cache miss rate for a program is 1% and the data cache miss rate is 5%. The processor has an ideal CPI of 2 without any memory stalls, and the miss penalty is 100 cycles for all misses. Assume that (i) 30% of the instructions are loads and stores, (ii) 50% of the instructions are ALU operations, and (iii) 20% of the instructions are control instructions. (i) What is the actual CPI with the possible cache misses? (ii) Assuming we have perfect data and instruction caches (no cache misses), what is the performance speedup of the CPU with the ideal perfect cache vs. the CPU with the actual cache? What if the processor is faster, but the memory system is not? Suppose we speed up the ideal CPI without any memory stalls from 2 to 1 without changing the clock rate. (i) What is the CPI of the system with cache misses? (ii) What is the performance speedup of the CPU with the ideal perfect cache vs. the CPU with the actual cache? What is your observation on the effect of the memory system on overall performance?
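As a quick check on the question as posed, a minimal C sketch of the standard memory-stall CPI model (stall cycles per instruction = accesses per instruction × miss rate × miss penalty); the ALU and control instructions make no data accesses, so only the 30% loads/stores touch the D-cache:

#include <stdio.h>

int main(void) {
    double i_miss = 0.01, d_miss = 0.05;  /* I- and D-cache miss rates     */
    double penalty = 100.0;               /* miss penalty, in cycles       */
    double ls = 0.30;                     /* fraction of loads and stores  */

    /* Every instruction is fetched once; only loads/stores access data.  */
    double stalls = i_miss * penalty + ls * d_miss * penalty; /* 1 + 1.5   */

    double cpi2 = 2.0 + stalls;           /* ideal CPI 2 -> 4.5            */
    double cpi1 = 1.0 + stalls;           /* ideal CPI 1 -> 3.5            */
    printf("CPI %.1f, speedup vs. perfect cache %.2f\n", cpi2, cpi2 / 2.0);
    printf("CPI %.1f, speedup vs. perfect cache %.2f\n", cpi1, cpi1 / 1.0);
    return 0;
}

The 2.5 memory-stall cycles per instruction are fixed by the memory system, so cutting the base CPI from 2 to 1 raises the speedup a perfect cache would give from 4.5/2 = 2.25 to 3.5/1 = 3.5: the faster the processor, the larger the share of execution time lost to memory.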

Explanation / Answer

A useful tool for solving this type of problem is to extract all of the available information from the
problem description. It is possible that not all of the information will be necessary to solve the problem,
but having it in summary form makes it easier to think about. Here is a summary:
CPU: 1.1 GHz (0.909 ns cycle time), CPI of 0.7 (excluding memory accesses)
Instruction mix: 75% non-memory-access instructions, 20% loads, 5% stores
Caches: split L1 with no hit penalty (i.e., the access time is the time it takes to execute the load/store instruction)
L1 I-cache: 2% miss rate, 32-byte blocks (requires 2 bus cycles to fill); miss penalty is 15 ns + 2 bus cycles
L1 D-cache: 5% miss rate, write-through (no write-allocate), 95% of all writes do not stall because of a write buffer, 16-byte blocks (requires 1 bus cycle to fill); miss penalty is 15 ns + 1 bus cycle
L1/L2 bus: 128-bit, 266 MHz bus between the L1 and L2 caches
L2 (unified) cache: 512 KB, write-back (write-allocate), 80% hit rate, 50% of replaced blocks are dirty (must go to main memory), 64-byte blocks (requires 4 bus cycles to fill); miss penalty is 60 ns + 4 × 7.5 ns = 90 ns
Memory: 128 bits (16 bytes) wide, first access takes 60 ns, subsequent accesses take 1 cycle on the 133 MHz, 128-bit bus
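The bus-cycle times and miss penalties above follow from the bus clocks and widths; a small C sketch of that arithmetic, rounding one 266 MHz cycle (3.76 ns) to 3.75 ns and one 133 MHz cycle (7.52 ns) to 7.5 ns as the solution below does:

#include <stdio.h>

int main(void) {
    /* One bus cycle: 1/266 MHz ~= 3.76 ns (rounded to 3.75 below),
       1/133 MHz ~= 7.52 ns (rounded to 7.5 below).                       */
    double l2_cycle = 3.75, mem_cycle = 7.5;

    /* Cycles per block fill = block size / 16-byte bus width.            */
    printf("I-miss penalty:  %.2f ns\n", 15.0 + (32.0 / 16.0) * l2_cycle);  /* 22.5  */
    printf("D-miss penalty:  %.2f ns\n", 15.0 + (16.0 / 16.0) * l2_cycle);  /* 18.75 */
    printf("L2 miss penalty: %.2f ns\n", 60.0 + (64.0 / 16.0) * mem_cycle); /* 90    */
    return 0;
}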
a. The average memory access time for instruction accesses:
L1 (inst) miss time in L2: 15 ns access time plus two L2 bus cycles (two = 32 bytes per inst. cache line / 16-byte L2 bus width) = 15 + 2 × 3.75 = 22.5 ns (3.75 ns is one 266 MHz L2 bus cycle).
L2 miss time in memory: 60 ns plus four memory bus cycles (four = 64 bytes per L2 cache line / 16-byte memory bus width) = 60 + 4 × 7.5 = 90 ns (7.5 ns is one 133 MHz memory bus cycle).
Avg. memory access time for inst = avg. access time in L2 cache + avg. access time in memory + avg. access time for L2 write-back
= 0.02 × 22.5 + 0.02 × (1 – 0.8) × 90 + 0.02 × (1 – 0.8) × 0.5 × 90 = 0.99 ns (1.09 CPU cycles)
b. The average memory access time for data reads:
Similar to the above formula with one difference: the data cache block is 16 bytes, which takes one L2 bus cycle to transfer (versus two for the inst. cache), and the D-cache miss rate is 5%, so
L1 (read) miss time in L2: 15 + 3.75 = 18.75 ns
L2 miss time in memory: 90 ns
Avg. memory access time for read = 0.05 × 18.75 + 0.05 × (1 – 0.8) × 90 + 0.05 × (1 – 0.8) × 0.5 × 90 = 2.29 ns (2.52 CPU cycles)
c. The average memory access time for data writes:
Assume that write misses are not allocated in L1; hence all writes go through the write buffer. Also assume the write buffer is as wide as the L1 data cache. The 95% of writes absorbed by the write buffer do not stall, so only the remaining 5% of writes pay the access times below.
L1 (write) time to L2: 15 + 3.75 = 18.75 ns
L2 miss time in memory: 90 ns
Avg. memory access time for data writes = 0.05 × 18.75 + 0.05 × (1 – 0.8) × 90 + 0.05 × (1 – 0.8) × 0.5 × 90 = 2.29 ns (2.52 CPU cycles)
d. What is the overall CPI, including memory accesses?
Components: base CPI, inst fetch CPI, and read or write CPI. Inst fetch time applies to every instruction; data read or write time is added for load/store instructions.
CPI = 0.7 + 1.09 + 0.2 × 2.52 + 0.05 × 2.52 = 2.42 CPI.
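Putting parts a through d together, a short C sketch of the same arithmetic (one CPU cycle is 1/1.1 GHz ≈ 0.909 ns); the 0.5 factor is the 50% of replaced L2 blocks that are dirty and must be written back:

#include <stdio.h>

int main(void) {
    double cycle = 1.0 / 1.1;       /* ns per CPU cycle at 1.1 GHz (~0.909) */
    double l2m = 1.0 - 0.8;         /* 20% of L1 misses also miss in L2     */

    /* ns per access = rate x (L2 time + L2-miss memory time + dirty
       write-back half the time). Rates: 2% I-miss, 5% D-read miss, and
       the 5% of writes not absorbed by the write buffer.                   */
    double inst  = 0.02 * (22.5  + l2m * 90 + l2m * 0.5 * 90);
    double read  = 0.05 * (18.75 + l2m * 90 + l2m * 0.5 * 90);
    double write = 0.05 * (18.75 + l2m * 90 + l2m * 0.5 * 90);

    /* Every instruction fetches; 20% also read and 5% also write.          */
    double cpi = 0.7 + inst / cycle + 0.2 * read / cycle + 0.05 * write / cycle;
    printf("inst %.2f ns, read %.2f ns, write %.2f ns, CPI %.2f\n",
           inst, read, write, cpi);  /* 0.99, 2.29, 2.29, 2.42 */
    return 0;
}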
B.10
a. L1 cache miss behavior when the caches are organized in an inclusive hierarchy and the two caches have identical block size (a toy simulation follows part c):
Access the L2 cache.
If the L2 cache hits, supply the block to L1 from L2; the evicted L1 block can be stored in L2 if it is not already there.
If the L2 cache misses, supply the block to both L1 and L2 from memory; the evicted L1 block can be stored in L2 if it is not already there.
In both cases (hit, miss), if storing an L1 evicted block in L2 causes a block to be evicted from L2, then L1 has to be checked, and if the evicted L2 block is found in L1, it has to be invalidated (to preserve inclusion).
b. L1 cache miss behavior when the caches are organized in an exclusive hierarchy and the two caches have identical block size:
Access the L2 cache.
If the L2 cache hits, supply the block to L1 from L2 and invalidate it in L2; write the evicted block from L1 to L2 (it cannot have been there).
If the L2 cache misses, supply the block to L1 from memory; write the evicted block from L1 to L2 (it cannot have been there).
c. When the evicted L1 block is dirty, it must be written back to L2 even if an earlier copy was there (inclusive case). Nothing changes in the exclusive case, since the victim is always written to L2.
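To make the two flows concrete, a toy C simulation under heavy simplifying assumptions (a one-block L1, a two-block direct-mapped L2, integer tags standing in for blocks, no dirty bits); the layout and function names are hypothetical, purely to pin down the ordering of steps:

#include <stdio.h>

/* Toy hierarchy: 1-block L1 and a 2-block direct-mapped L2 (tag % 2 picks
   the set); -1 means empty. Data and dirty bits are omitted.              */
static int l1 = -1, l2[2] = {-1, -1};

static void inclusive_miss(int tag) {
    int set = tag % 2;
    if (l2[set] != tag) {                   /* L2 miss: fetch from memory  */
        if (l2[set] == l1) l1 = -1;         /* L2 eviction leaves L1 too   */
        l2[set] = tag;                      /* block enters both levels    */
    }
    int victim = l1;
    l1 = tag;                               /* fill L1, possibly evicting  */
    if (victim != -1 && l2[victim % 2] != victim) {
        if (l2[victim % 2] == l1) l1 = -1;  /* preserve inclusion          */
        l2[victim % 2] = victim;            /* store L1 victim back in L2  */
    }
}

static void exclusive_miss(int tag) {
    int set = tag % 2;
    if (l2[set] == tag) l2[set] = -1;       /* L2 hit: move block out      */
    int victim = l1;                        /* L2 miss: memory -> L1 only  */
    l1 = tag;
    if (victim != -1) l2[victim % 2] = victim; /* victim cannot be in L2   */
}

int main(void) {
    exclusive_miss(0); exclusive_miss(1); exclusive_miss(0);
    printf("exclusive: L1=%d L2={%d,%d}\n", l1, l2[0], l2[1]); /* 0, {-1,1} */
    l1 = -1; l2[0] = l2[1] = -1;
    inclusive_miss(0); inclusive_miss(1);
    printf("inclusive: L1=%d L2={%d,%d}\n", l1, l2[0], l2[1]); /* 1, {0,1}  */
    return 0;
}

Note how the exclusive run ends with tag 0 in L1 only and tag 1 in L2 only: the two levels never duplicate a block, which is the capacity advantage of exclusion.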
2.1
a. Each element is 8 B. Since a 64 B cache line holds 8 elements, and each column access will fetch a new line for the non-ideally-laid-out matrix, we need to keep a minimum of an 8 × 8 block (64 elements) for each matrix in the cache. Hence, the minimum cache size is 2 × 64 × 8 B = 128 × 8 B = 1 KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64 B / 8 B = 8 row elements. Each column requires 64 B × 256 of storage, or 16 KB, so column elements will be replaced in the cache before they can be used again. Hence the unblocked version will have 9 misses (1 row miss and 8 column misses) for every 2 in the blocked version.
c. A blocked transpose with block size B (a self-contained, runnable version follows part d):

for (i = 0; i < 256; i = i + B) {
    for (j = 0; j < 256; j = j + B) {
        for (m = 0; m < B; m++) {
            for (n = 0; n < B; n++) {
                output[j + n][i + m] = input[i + m][j + n];
            }
        }
    }
}
d. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache, and the input and output blocks would then keep evicting each other's lines.
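A self-contained version of the blocked transpose from part c, assuming 8-byte double elements so that B = 8 matches one 64-byte cache line per block row:

#include <stdio.h>

#define N 256
#define B 8   /* one 64-byte line holds eight 8-byte doubles */

static double input[N][N], output[N][N];

int main(void) {
    for (int i = 0; i < N; i++)         /* fill input with distinct values */
        for (int j = 0; j < N; j++)
            input[i][j] = i * N + j;

    for (int i = 0; i < N; i += B)      /* blocked transpose from part c   */
        for (int j = 0; j < N; j += B)
            for (int m = 0; m < B; m++)
                for (int n = 0; n < B; n++)
                    output[j + n][i + m] = input[i + m][j + n];

    printf("output[3][200]=%.0f, expect input[200][3]=%.0f\n",
           output[3][200], input[200][3]);
    return 0;
}

With B = 8, the working set of the two outer loops is exactly the two 8 × 8 blocks from part a, so each 64-byte line is fetched once.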
