Consider the following loop. Assume that perfect branch prediction is used, and

ID: 669611 • Letter: C

Question

Consider the following loop.

Assume that perfect branch prediction is used, and the pipeline has full forwarding support. Also, assume that many iterations of this loop are executed before the loop exists. Show the pipeline execution diagram for the second iteration of the loop, from the cycle we fetch the first instruction of that iteration up to (but not including) the cycle in which we can fetch the first instruction of the next iteration. Show all instructions that are in the pipeline during these cycles (not just those from the third iteration).

loop: lw r1, 0(r1) and r1, r1,r2 lw r1, 0(r1) lw r1, 0(r1) beq r1, r0, loop

Explanation / Answer

documentclass[paper=a4, fontsize=11pt]{scrartcl} % A4 paper and 11pt font size

usepackage{./../usfassignment}

usepackage{./../mips}

settitle{Assignment 11}

setauthor{Wanzhang Sheng}

setcourse{CS315: Computer Architecture}

egin{document}

maketitle % Print the title

% -----------------------------------------------------------------------------

% PROBLEM 1

% -----------------------------------------------------------------------------

section{4.10.3}

egin{fancyquotes}

Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

end{fancyquotes}

egin{lstlisting}[language={[mips]Assembler}]

sw r16, 12(r6)

lw r16, 8(r6)

beq r5, r4, Label # Assume r5!=r4

add r5, r1, r4

slt r5, r15, r4

end{lstlisting}

The MIPS program has $5$ instructions and will finished in $9$ cycles if no stall nor delay.

When the outcomes are determined in ID stage, the branch instruction will cause next instruction to delay $1$ cycles and finish in $10$ cycles.

When the outcomes are determined in EX stage, the branch instruction will cause next instruction to delay $2$ cycles and finish in $11$ cycles.

So the speedup will be $(11-10)/10=10%$.

% -----------------------------------------------------------------------------

% PROBLEM 2

% -----------------------------------------------------------------------------

section{4.11}

egin{fancyquotes}

Consider the following loop.

egin{lstlisting}[language={[mips]Assembler}]

loop:

lw r1, 0(r1)

and r1, r1, r2

lw r1, 0(r1)

beq r1, r0, loop

end{lstlisting}

Assume that perfect branch prediction is used (no stalls due to control hazards), that there are no delay slots, and that the pipeline has full forwarding support. Also assume that many iterations of this loop are executed before the loop exits.

end{fancyquotes}

subsection{4.11.1}

egin{fancyquotes}

Show a pipeline execution diagram for the third iteration of this loop, from the cycle in which we fetch the first instruction of that iteration up to (but not including) the cycle in which we can fetch the first instruction of the next iteration. Show all instructions that are in the pipeline during these cycles (not just those from the third iteration).

end{fancyquotes}

egin{table}[hp]

caption{Third Iteration}label{tab:iteration}

egin{center}

egin{tabular}{cccccccccccccc}

oprule

extbf{Iter} & extbf{1} & extbf{2} & extbf{3} & extbf{4} & extbf{5}

& extbf{6} & extbf{7} & extbf{8} & extbf{9} & extbf{10} & extbf{11}

& extbf{12} & extbf{13} \

midrule

lw3 & WB \

beq & ID & EX & MEM & WB \

lw1 & IF & ID & EX & MEM & WB \

and & & & IF & ID & EX & MEM & WB \

lw2 & & & & IF & ID & EX & MEM & WB \

lw3 & & & & & & IF & ID & EX & MEM & WB \

beq & & & & & & & & & IF & ID & EX & MEM & WB \

lw1 & & & & & & & & & & IF & ID & EX & MEM \

and & & & & & & & & & & & & IF & ID \

lw2 & & & & & & & & & & & & & IF \

ottomrule

end{tabular}

end{center}

end{table}

subsection{2}

egin{fancyquotes}

How often (as a percentage of all cycles) do we have a cycle in which all five pipeline stages are doing useful work?

end{fancyquotes}

Only three extit{lw} instructions of these five instructions use all five stages.

So, if we consider that the loop will run a lot of times, then the useful cycles is $3/5=60%$ of all.

% -----------------------------------------------------------------------------

% PROBLEM 3

% -----------------------------------------------------------------------------

section{4.14}

egin{fancyquotes}

This exercise is intended to help you understand the relationship between delay slots, control hazards, and branch execution in a pipelined processor. In this exercise, we assume that the following MIPS code is executed on a pipelined processor with a 5-stage pipeline, full forwarding, and a predict-taken branch predictor:

egin{lstlisting}[language={[mips]Assembler}]

lw r2, 0(r1)

label1:

beq r2, r0, label2 # not taken once, then taken

lw r3, 0(r2)

beq r3, r0, label1 # taken

add r1, r3, r1

label2: sw r1, 0(r2)

end{lstlisting}

end{fancyquotes}

subsection{4.14.1}

egin{fancyquotes}

Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage.

end{fancyquotes}

egin{table}[hp]

caption{Iteration1}label{tab:iteration1}

egin{center}

scriptsize

egin{tabular}{ccccccccccccccccc}

oprule

extbf{Iter} & extbf{1} & extbf{2} & extbf{3} & extbf{4} & extbf{5}

& extbf{6} & extbf{7} & extbf{8} & extbf{9} & extbf{10} & extbf{11}

& extbf{12} & extbf{13} & extbf{14} & extbf{15} & extbf{16} \

midrule

lw1 & IF & ID & EX & MEM & WB \ % forward MEM

beq1 & & & IF & ID & EX & MEM & WB \ % branch delay

sw & & & & & IF \

lw2 & & & & & & IF & ID & EX & MEM & WB \ % forward MEM

beq2 & & & & & & & & IF & ID & EX & MEM & WB \ % branch delay

beq1 & & & & & & & & & & IF & ID & EX & MEM & WB \ % branch delay

sw & & & & & & & & & & & & IF & ID & EX & MEM & WB \

ottomrule

end{tabular}

end{center}

end{table}

subsection{4.14.2}

egin{fancyquotes}

Repeat part a, but assume that delay slots are used. In the given code, the instruction that follows the branch is now the delay slot instruction for that branch.

end{fancyquotes}

egin{table}[H]

caption{Iteration2}label{tab:iteration2}

egin{center}

scriptsize

egin{tabular}{cccccccccccccc}

oprule

extbf{Iter} & extbf{1} & extbf{2} & extbf{3} & extbf{4} & extbf{5}

& extbf{6} & extbf{7} & extbf{8} & extbf{9} & extbf{10} & extbf{11}

& extbf{12} & extbf{13} \

midrule

lw1 & IF & ID & EX & MEM & WB \ % forward MEM

beq1 & & & IF & ID & EX & MEM & WB \ % branch delay

sw & & & & IF \

lw2 & & & & & IF & ID & EX & MEM & WB \ % forward MEM

beq2 & & & & & & & IF & ID & EX & MEM & WB \ % branch delay

beq1 & & & & & & & & IF & ID & EX & MEM & WB \ % branch delay

sw & & & & & & & & & IF & ID & EX & MEM & WB \

ottomrule

end{tabular}

end{center}

end{table}

subsection{4.14.5}

egin{fancyquotes}

For the given code, what is the speedup achieved by moving branch execution into the ID stage? Explain your answer. In your speedup calculation, assume that the additional comparison in the ID stage does not affect clock cycle time.

end{fancyquotes}

All branch will save one more cycle, the speedup will be $3/16=18.75%$.

pagebreak

% -----------------------------------------------------------------------------

% PROBLEM 4

% -----------------------------------------------------------------------------

section{5.2}

egin{fancyquotes}

Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.

$$3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253$$

end{fancyquotes}

32bit memory address has 4 bytes, so need 2 byte offset.

subsection{5.2.1}

egin{fancyquotes}

For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

end{fancyquotes}

We have 16 blocks, so index is 4 bits long. Then tag is $32-2-4=26$ bits.

% [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253].each {|i| puts "#{i.to_s.rjust(3)} & #{i.to_s(2).rjust(8, "0")}00 & #{(i/16).to_s(2).rjust(4,"0")} & #{(i%16).to_s(2).rjust(4,"0")} & miss \\" }

egin{table}[H]

caption{Cache Table}label{tab:cache_table}

egin{center}

egin{tabular}{rrrrc}

oprule

extbf{address} & extbf{binary} & extbf{tag} & extbf{index} & extbf{hit} \

midrule

3 & 0000001100 & 0000 & 0011 & miss \

180 & 1011010000 & 1011 & 0100 & miss \

43 & 0010101100 & 0010 & 1011 & miss \

2 & 0000001000 & 0000 & 0010 & miss \

191 & 1011111100 & 1011 & 1111 & miss \

88 & 0101100000 & 0101 & 1000 & miss \

190 & 1011111000 & 1011 & 1110 & miss \

14 & 0000111000 & 0000 & 1110 & miss \

181 & 1011010100 & 1011 & 0101 & miss \

44 & 0010110000 & 0010 & 1100 & miss \

186 & 1011101000 & 1011 & 1010 & miss \

253 & 1111110100 & 1111 & 1101 & miss \

ottomrule

end{tabular}

end{center}

end{table}

pagebreak

subsection{5.2.2}

egin{fancyquotes}

For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

end{fancyquotes}

We have 8 blocks, so index is 3 bits long. And we need 1 bit for word offset. Then tag is $32-2-1-3=26$ bits.

% [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253].each {|i| puts "#{i.to_s.rjust(3)} & #{i.to_s(2).rjust(8, "0")}00 & #{(i/16).to_s(2).rjust(4,"0")} & #{((i/2)%8).to_s(2).rjust(3,"0")} & miss \\" }

egin{table}[H]

caption{Cache Table 2}label{tab:cache_table2}

egin{center}

egin{tabular}{rrrrc}

oprule

extbf{address} & extbf{binary} & extbf{tag} & extbf{index} & extbf{hit} \

midrule

3 & 0000001100 & 0000 & 001 & miss \

180 & 1011010000 & 1011 & 010 & miss \

43 & 0010101100 & 0010 & 101 & miss \

2 & 0000001000 & 0000 & 001 & hit \

191 & 1011111100 & 1011 & 111 & miss \

88 & 0101100000 & 0101 & 100 & miss \

190 & 1011111000 & 1011 & 111 & hit \

14 & 0000111000 & 0000 & 111 & miss \

181 & 1011010100 & 1011 & 010 & hit \

44 & 0010110000 & 0010 & 110 & miss \

186 & 1011101000 & 1011 & 101 & miss \

253 & 1111110100 & 1111 & 110 & miss \

ottomrule

end{tabular}

end{center}

end{table}

subsection{5.2.4}

egin{fancyquotes}

Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.

end{fancyquotes}

egin{enumerate}

item Cache Data Size: 32 KiB

item Cache Block Size: 2 words

item Cache Access Time: 1 cycle

end{enumerate}

2-word cache: $32*1024/4/2*(1+26)/8/1024 + 32=SI{45.5}{KiB}$

16-word cache: $32*1024/4/16*(1+23)/8/1024 + 32=SI{33.5}{KiB}$

In the second case, every time when it accquire data from memory, it costs more time to transfer larger data block.

pagebreak

% -----------------------------------------------------------------------------

% PROBLEM 5

% -----------------------------------------------------------------------------

section{5.3}

egin{fancyquotes}

For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

Tag: 31--10, Index: 9--5, Offset: 4--0.

end{fancyquotes}

subsection{5.3.1} % (fold)

label{sub:5_3_1}

egin{fancyquotes}

What is the cache block size (in words)?

end{fancyquotes}

The $32$-bit system has $4$ byte in a word.

So the cache block size is $2^{(4-0+1)}/4=8$ words.

% subsection 5_3_1 (end)

subsection{5.3.2} % (fold)

label{sub:5_3_2}

egin{fancyquotes}

How many entries does the cache have?

end{fancyquotes}

The cache has $2^{(9-5+1)}=32$ entries.

% subsection 5_3_2 (end)

subsection{5.3.3} % (fold)

label{sub:5_3_3}

egin{fancyquotes}

What is the ratio between total bits required for such a cache implementation over the data storage bits?

Starting from power on, the following byte-addressed cache references are recorded.

Address: 0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180

end{fancyquotes}

The total bits required are $32 imes(1+(31-10+1)+8 imes4 imes8)=8928$ bits.

And the data storage are $32 imes8 imes4 imes8=8192$ bits.

So the ratio is $8928/8192=1.08984375$.

% [0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180].each {|i| puts "#{i.to_s.rjust(4)} & #{(i/4/8/32).to_s(2).rjust(5,"0")} & #{((i/4/8)%32).to_s(2).rjust(5,"0")} & miss & \\" }

egin{table}[H]

caption{Byte addressed cache}label{tab:byte_cache}

egin{center}

egin{tabular}{rcccc}

oprule

extbf{Address} & extbf{Tag} & extbf{Index} & extbf{Hit} & extbf{Replace} \

midrule

0 & 00000 & 00000 & miss & \

4 & 00000 & 00000 & hit & \

16 & 00000 & 00000 & hit & \

132 & 00000 & 00100 & miss & \

232 & 00000 & 00111 & miss & \

160 & 00000 & 00101 & miss & \

1024 & 00001 & 00000 & miss & replaced \

30 & 00000 & 00000 & miss & replaced \

140 & 00000 & 00100 & hit & \

3100 & 00011 & 00000 & miss & replaced \

180 & 00000 & 00101 & hit & \

2180 & 00010 & 00100 & miss & replaced \

ottomrule

end{tabular}

end{center}

end{table}

% subsection 5_3_3 (end)

subsection{5.3.4} % (fold)

label{sub:5_3_4}

egin{fancyquotes}

How many blocks are replaced?

end{fancyquotes}

Two blocks are replaced, $00000$ be replaced three times, and $00100$ be replaced once.

Totally $4$ times.

% subsection 5_3_4 (end)

subsection{5.3.5} % (fold)

label{sub:5_3_5}

egin{fancyquotes}

What is the hit ratio?

end{fancyquotes}

The hit ratio is $4/13=30.77%$.

% subsection 5_3_5 (end)

subsection{5.3.6} % (fold)

label{sub:5_3_6}

egin{fancyquotes}

List the final state of the cache, with each valid entry represented as a record of <index, tag, data>.

end{fancyquotes}

% 32.times{|i| puts "#{i.to_s(2).rjust(5, "0")} & 00000 & 00000 \\" }

egin{table}[H]

caption{Final Cache}label{tab:final_cache}

egin{center}

egin{tabular}{ccc}

oprule

extbf{index} & extbf{tag} & extbf{data} \

midrule

00000 & 00011 & 0(3100) \

00001 & 00000 & \

00010 & 00000 & \

00011 & 00000 & \

00100 & 00010 & 0(2180) \

00101 & 00000 & 0(160) \

00110 & 00000 & \

00111 & 00000 & 0(232) \

01000 & 00000 & \

01001 & 00000 & \

01010 & 00000 & \

01011 & 00000 & \

01100 & 00000 & \

01101 & 00000 & \

01110 & 00000 & \

01111 & 00000 & \

10000 & 00000 & \

10001 & 00000 & \

10010 & 00000 & \

10011 & 00000 & \

10100 & 00000 & \

10101 & 00000 & \

10110 & 00000 & \

10111 & 00000 & \

11000 & 00000 & \

11001 & 00000 & \

11010 & 00000 & \

11011 & 00000 & \

11100 & 00000 & \

11101 & 00000 & \

11110 & 00000 & \

11111 & 00000 & \

ottomrule

end{tabular}

end{center}

end{table}

% subsection 5_3_6 (end)

end{document}

Navigate

Consider the following loop which is an inefficient (at best) attempt to increme

Consider the following loop. loop: lw r1, 0(r1) and r1, r1, r2 1 w r1, 0(r1) 1 w

Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.

Consider the following loop. Assume that perfect branch prediction is used, and

Question

Explanation / Answer

Related Questions

Navigate