Question
You are the chief architect at a major processor manufacturing company. You are in charge of extending the current ISA for a new line of processors targeted primarily at multimedia applications. You are considering adding a fused multiply/add instruction to the ISA. The semantics of the instruction is given as d = a + b * c, where the values a, b, c, and d all reside in registers. The advantage of this new instruction is that by merging adds and multiplies it can reduce the overall instruction count. The actual reduction will depend on the number of fusible adds and multiplications present in the application. The disadvantage is that its implementation requires extending the clock cycle time by 10% and increases the CPI by 15%.
(a) Assuming 32-bit instructions and 32 registers in the ISA, show the binary encoding for the fused multiply-add instruction. Explain the use of each field. Discuss potential pitfalls.
(b) Under what conditions does adding this instruction to the ISA lead to an overall performance increase (show calculation)?
(c) If we are using MIPS (millions of insts/sec) as a metric for system performance, what impact would the above change have on that metric? (You can give a qualitative answer).
(d) Are there other issues to be considered in implementing this instruction?
Explanation / Answer
a) The ISA defines the format of an instruction (its syntax) and its meaning (its semantics).
• The OPCODE field says what the instruction does (e.g., ADD), and the OPERAND fields denote where to find the inputs and outputs of the instruction.
• R-type (register) is the appropriate format, since our instruction operates only on registers.
The fields of an R-type instruction are as follows (a concrete example encoding is given after the list):
• op is the operation code (opcode) that selects a specific operation.
• rs and rt are the first and second source registers.
• rd is the destination register.
• shamt is the "shift amount" and is used only by shift instructions.
• funct is used together with op to select among arithmetic instructions.
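For example (an illustrative encoding, not part of the question): the ordinary MIPS instruction add $t0, $t1, $t2 (i.e. add $8, $9, $10) is encoded as op = 000000, rs = 01001, rt = 01010, rd = 01000, shamt = 00000, funct = 100000, which is the 32-bit word 0x012A4020.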
The semantics of the instruction is given as d = a + b * c, where the values a, b, c, and d all reside in registers.
First, let us encode the multiplication part and then the addition part of the instruction.
Multiply (without overflow) pseudoinstruction in MIPS:
mul rd, rs, rt
Fields: op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
It puts the low-order 32 bits of the product of rs and rt into register rd.
Our instruction is d = a + b * c; the multiplication part is b * c.
Multiply:
mult rs, rt
Fields: op (6) | rs (5) | rt (5) | 0 (10) | funct (6)
Next, we come to the addition of the product of b and c (call it rd1) with register a.
Addition (without overflow) instruction format:
addu rd, rs, rt
Fields: op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
Finally, because mult leaves its result in the special hi and lo registers rather than in a general-purpose register, move instructions are needed to retrieve the product.
move rdest, rsrc is the pseudoinstruction for copying register rsrc into rdest.
Move from hi: mfhi rd (rd is the destination register)
Fields: op (6) | 0 (10) | rd (5) | 0 (5) | funct (6)
Move from lo: mflo rd (rd is the destination register)
Fields: op (6) | 0 (10) | rd (5) | 0 (5) | funct (6)
The multiply and divide unit produces its results in two additional registers, hi and lo; these instructions move values to and from those registers.
The above operations describe normal execution without a fused multiply-add: it takes four separate instructions (mult, mflo, mfhi, and addu, where mfhi is only needed if the upper 32 bits of the product matter) to produce the final result.
If we introduce a fused multiply-add instruction, this whole sequence collapses into a single instruction, which reduces the dynamic instruction count.
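One possible 32-bit encoding for the new instruction (an illustrative sketch, not the only option): keep the R-type layout but reinterpret the shamt field as a third source-register specifier, giving four 5-bit register fields:
op (6) | rs = b (5) | rt = c (5) | rd = d (5) | rx = a (5) | funct (6)
Here op and funct together select the fused multiply-add, rs and rt name the multiplicands b and c, the reused field rx names the addend a, and rd names the destination d. The main pitfall is that a four-operand instruction only just fits in 32 bits: it consumes the shamt field, needs an unused opcode/funct combination, and forces the register file to supply three read ports and the datapath to route three source operands, all of which complicate the implementation.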
b)
If we consider the separate multiply, add, and store instruction sequence, it involves the following steps:
1. Create the partial-product matrix for the two binary integers.
2. Reduce the partial products through a series of bit additions and compress the result to N bits (the result should have the same bit width as the two multiplied numbers in the registers).
3. Store the result in a register; this product is then used as an input to the addition.
A fused multiply-add can skip steps 2 and 3: it feeds the addend into the same reduction and adds all of the numbers (partial products plus addend) simultaneously. It is like building a single carry-save adder tree for the whole operation, instead of completing the additions for the multiplication first and then adding again to reach the final result.
For floating-point operands, a few additional steps are required beyond calculating the partial products of the operands' significands.
Given the large range and limited precision of floating-point multiplication, various rounding and denormal-number cases need to be detected after the operation is complete.
In a computer, if we were to multiply two floating-point numbers and then add a floating-point number to the result, the IEEE floating-point specification requires that the result of the multiplication first be rounded and checked for denormals and/or exceptions (for example, infinity or divide by zero) before the addition is performed.
A fused multiply-add skips this step and simply tacks the addition on before any rounding or exception checking occurs. This saves significant circuit time and therefore speeds up floating-point operations as well.
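As a concrete illustration of the difference (an assumed example in double precision, not taken from the question): let x = 1 + 2^-30, which is exactly representable. The exact product x · x = 1 + 2^-29 + 2^-60 does not fit in the 53-bit significand, so a separate multiply rounds it to 1 + 2^-29, and the subsequent subtraction (x · x) - (1 + 2^-29) returns 0. A fused multiply-add uses the unrounded product, so fma(x, x, -(1 + 2^-29)) returns 2^-60. Both results are correctly rounded for the operations actually performed, but they differ.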
For example, suppose that in a sample processor supporting fused multiply-add, a 64-bit floating-point fused multiply-add takes a total of 6 cycles, while a normal 64-bit floating-point multiply takes 5 cycles and a 64-bit floating-point add takes 3 cycles. Using the FMA operation would then take 6 cycles, compared to 8 cycles without it.
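Putting numbers on the condition asked for in part (b) (a sketch, assuming the 15% CPI increase and the 10% longer cycle time apply to the whole workload once the instruction is added):
Execution time = Instruction count × CPI × Clock cycle time
Time_old = IC_old × CPI × T
Time_new = IC_new × (1.15 × CPI) × (1.10 × T)
Adding the instruction is a win when Time_new < Time_old, i.e. when IC_new / IC_old < 1 / (1.15 × 1.10) ≈ 0.79.
Since each fused multiply/add pair eliminates one instruction, the change pays off only if fusion removes more than about 21% of the dynamically executed instructions.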
c)
The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions in the x86 instruction set that performs fused multiply-add (FMA) operations. There are two variants: FMA3 and FMA4.
For our current instruction d = a + b * c, the FMA4 instruction format would be suitable, since it can name four distinct registers.
Mnemonic: VFMADDPD (FMA4)
Operands: xmm, xmm, xmm/m128, xmm/m128
Operation: a = b · c + d
Here, in our case, the operation is d = b · c + a.
Consider two implementations of the same instruction set: P1, the base machine that uses the separate multiply/add sequence, and P2, the machine with the fused multiply-add.
The FMA implementation requires extending the clock cycle time by 10% and increases the CPI by 15%.
Let P1 have clock rate R1 and CPI C1. Then P2 has CPI C2 = 1.15 × C1 and clock rate R2 = R1 / 1.10, since its cycle time is 10% longer.
MIPS = Instruction count / (Execution time × 10^6)
Execution time = Instruction count × CPI / Clock rate
Hence, MIPS = Clock rate / (CPI × 10^6)
                                   P1                              P2
CPI                                C1                              1.15 × C1
Clock rate                         R1                              R1 / 1.10
Instructions for d = a + b * c     4 (mult, mflo, mfhi, addu)      1 (fused)
MIPS for P1 = R1 / (C1 × 10^6)
MIPS for P2 = (R1 / 1.10) / (1.15 × C1 × 10^6) = (MIPS for P1) / 1.265 ≈ 0.79 × (MIPS for P1)
So, qualitatively, the MIPS rating of the FMA machine is about 21% lower, even though the program may actually run faster because the multiply-add sequence collapses into a single instruction and the total instruction count falls. MIPS counts only instructions per second and ignores how much work each instruction does, so by this metric the change looks like a slowdown even when real performance improves, which is exactly why MIPS is a misleading measure of system performance.
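As a quick numeric cross-check, here is a short Python sketch (the base CPI of 1.0, the 1 GHz clock, and the assumption that fusion removes 30% of the instructions are all arbitrary illustrative numbers, not part of the question):

# Sketch: effect of the FMA design on the MIPS metric vs. on real performance.
# Assumed numbers: base CPI = 1.0, base clock = 1 GHz, and fusion removes
# 30% of the dynamically executed instructions (all hypothetical).
base_cpi, base_clock, removed_frac = 1.0, 1e9, 0.30

new_cpi = base_cpi * 1.15        # CPI rises by 15%
new_clock = base_clock / 1.10    # cycle time 10% longer, so the clock rate drops

mips_old = base_clock / (base_cpi * 1e6)
mips_new = new_clock / (new_cpi * 1e6)
print(f"MIPS rating: {mips_old:.0f} -> {mips_new:.0f}")          # falls by about 21%

ic = 1e9                         # arbitrary original instruction count
time_old = ic * base_cpi / base_clock
time_new = ic * (1 - removed_frac) * new_cpi / new_clock
print(f"Execution time ratio (new/old): {time_new/time_old:.2f}")  # about 0.89: faster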
d)
One issue is the difference in results for floating-point multiply-add operations. Since we skip the rounding and denormal-handling (normalization) step after the multiply, a floating-point FMA will in general give a slightly different result than a floating-point multiply followed by a floating-point add executed in sequence. However, if such small differences are acceptable, an FMA is significantly faster. Other implementation issues include the encoding and datapath costs noted under (a): the instruction needs a fourth register specifier and an extra register-file read port.