

Question

1. Give a code snippet of 3 Y86 instructions in which the third instruction has a dependency on the first instruction that would cause a stall if the pipeline had no forwarding paths, but would require no stall if the pipeline did have forwarding paths. (10 points)

2. Given the following information about the instruction mix of a program, determine the average cycles per instruction of that program. Think back to the way the pipeline handles these different types of instructions and the stalls that might occur. (10 points)

40% of the instructions are loads, and 30% of these are followed by an instruction that uses the data just loaded

20% of the instructions are conditional branches, and 40% of the branches are not taken

5% of the instructions are call instructions
5% of the instructions are ret instructions
30% of the instructions are arithmetic instructions

3. Give Y86 code that is equivalent to the following x86 code. (10 points)

leaq (%rax, %rax, 2), %rcx

4. Consider a 5-stage pipelined processor with forwarding paths. Describe in your own words how the value that appears on the B bus for the execute unit is determined. Hint: consider all the places the value to be manipulated can come from. (10 points)

5. Detail what happens during the 5 stages of the pipeline discussed in class for the ret instruction: (10 points)

6. Consider a nonpipelined processor A that requires 100 units of time to complete an instruction, and a pipelined processor B that requires the 5 stages shown in the table below to complete an instruction but allows for overlapping execution (assume any pipeline overhead has already been added to the stage times, so they add up to more than 100 units of time). Detail how many units of time processor A and processor B require to execute 100 instructions, assuming no stalls. (10 points)

Stage:       1    2    3    4    5
Time Units:  19   23   25   18   20

7. Given the following Y86 instruction stream, draw the pipeline stage diagram (showing in which cycle each instruction enters each stage of the pipeline). Assume the base pipelined processor with only forwarding paths and the 5-stage pipeline discussed in class, with stages F, D, E, M, W. Be sure to be clear about how the jge is handled by the pipeline. (10 points)

1: mrmovq 8(%rbp), %rbx

2: rmmovq %rbx, (%rsp)

3: subq %rax, %rbx

4: jge tag1

5: rmmovq %rbx, 8(%rbp)

tag1:
6: rmmovq %rax, -8(%rbp)

8. Give two examples of improvements (other than loop unrolling; see problem 9) that could be performed on the following code snippet to improve its overall performance. (10 points)
// assume that vector a uses the vector data type described in chapter 5 of the book
// (the structure with int len and ptr data to array of len elements)

void weighted_sum(vec_ptr a, data_t *dest) {
    long int i;
    data_t vala;

    *dest = 1;
    for (i = 0; i < vec_length(a); i++) {
        get_vec_element(a, i, &vala);
        *dest = *dest + vala * i;
    }
}

9. Give C code that implements the optimizations you mentioned in problem 8 AND unrolls the loop from problem 8 two times using 2 accumulators. (20 points)


Explanation / Answer

5. The 5 stages of the pipeline discussed in class, as they apply to the ret instruction:

1. Instruction fetch
2. Instruction decode
3. Execute
4. Memory access
5. Writeback

1. Instruction fetch:

The instruction cache typically has a latency of one cycle: if the instruction is in the cache, it is ready on the next clock cycle. During the instruction fetch stage, the instruction is read from the cache.

The program counter (PC) is a register that holds the address of the current instruction. It feeds the PC predictor, which sends the PC to the instruction cache to read the current instruction. For ret, the next PC cannot be determined from the instruction itself, so the fetch stage must stall (inserting bubbles) until the return address has been read from memory.

2. Instruction decode:

If the decoded instruction is a branch or jump, the target address is computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken, or if the instruction is a jump, the PC predictor in the first stage is given the branch target rather than the incremented PC. (Some architectures instead compute the target with the ALU in the execute stage, at the cost of slightly decreased instruction throughput.) For ret, this is the stage where the stack pointer %rsp is read from the register file.

3. Execute:

The execute stage is where the actual computation occurs. Typically this stage consists of an arithmetic and logic unit (ALU) and a bit shifter; it may also include a multi-cycle multiplier and divider. For ret, the ALU increments the stack pointer by the word size (%rsp + 8 in Y86-64) to pop the return address off the stack.

4. Memory access:

If data memory needs to be accessed, it is done in this stage. For ret, the return address is read from the top of the stack here.

During this stage, single-cycle-latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single-cycle and two-cycle instructions always write their results in the same stage of the pipeline, so a single write port to the register file suffices and is always available.

For a direct-mapped and virtually tagged data cache, by far the simplest of the numerous data cache organizations, two SRAMs are used: one stores data and the other stores tags.

5. Writeback:

During this stage, both single-cycle and two-cycle instructions write their results into the register file. For ret, the incremented stack pointer is written back to %rsp, and the return address read from memory becomes the next fetch address.