Given the following C code: low = 0; VL = (n % MVL); /*find odd-size piece using

ID: 3863092 • Letter: G

Question

Given the following C code:

low = 0;
VL = (n % MVL);                          /* find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) {       /* outer loop */
    for (i = low; i < (low+VL); i=i+1)   /* runs for length VL */
        Y[i] = a * X[i] + Y[i];          /* main operation */
    low = low + VL;                      /* start of next vector */
    VL = MVL;                            /* reset the length to maximum vector length */
}
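
To see that the strip-mined loop touches every element exactly once, it can help to run it next to a plain scalar loop. The following is only a sketch: the function names `daxpy_scalar` and `daxpy_strip_mined` and the returned strip count are illustrative assumptions, not part of the question.

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar DAXPY, used as the reference result. */
void daxpy_scalar(size_t n, double a, const double *X, double *Y)
{
    for (size_t i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}

/* Strip-mined DAXPY following the loop structure in the question.
   Returns the number of non-empty strips (hypothetical helper value). */
size_t daxpy_strip_mined(size_t n, size_t mvl, double a,
                         const double *X, double *Y)
{
    assert(mvl > 0);
    size_t low = 0;
    size_t vl = n % mvl;          /* odd-size piece comes first */
    size_t strips = 0;

    for (size_t j = 0; j <= n / mvl; j++) {
        for (size_t i = low; i < low + vl; i++)
            Y[i] = a * X[i] + Y[i];
        if (vl > 0)
            strips++;
        low += vl;                /* start of next strip */
        vl = mvl;                 /* all later strips are full length */
    }
    return strips;
}
```

With n = 19 and MVL = 8 this gives one strip of 3 elements followed by two strips of 8.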

Translate the code using our DLX vector instruction set. Assume:

Vector registers of length 8

Load unit has a startup of L clocks

Adder unit has a startup of A clocks

Multiplier unit has a startup of M clocks

For vectors of length N, compute the number of clock cycles needed to execute the inner loop (the vector operations), first for normal execution and then when chaining of loads, stores, addition, and multiplication is allowed. How much speedup do we achieve with chaining?

Explanation / Answer

The vector-mask register (VMR) is part of the architectural state, and vector architectures rely on the compiler to manipulate it explicitly. GPUs get the same effect using hardware that is invisible to software. Both GPU and vector architectures spend time on masking.

Density-Time Implementation

Scan the mask vector and execute only the elements with nonzero masks.
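
The density-time idea can be sketched in software as a masked vector add that skips elements whose mask entry is zero. The function name `masked_add_density_time` and the byte-per-element mask layout are assumptions made for this illustration; real hardware keeps the mask in the VMR.

```c
#include <stddef.h>

/* Masked vector add: dst[e] = a[e] + b[e] only where mask[e] is nonzero.
   Elements with a zero mask are skipped and dst[e] is left unchanged.
   Returns how many element operations were actually executed. */
size_t masked_add_density_time(double *dst, const double *a, const double *b,
                               const unsigned char *mask, size_t vl)
{
    size_t executed = 0;
    for (size_t e = 0; e < vl; e++) {
        if (mask[e]) {            /* scan the mask vector */
            dst[e] = a[e] + b[e];
            executed++;
        }
    }
    return executed;
}
```

The returned count is a rough stand-in for the time spent: with half of the mask bits clear, only half of the element operations are performed.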

Start-up time is the pipeline latency, that is, the depth of the functional-unit (FU) pipeline. It is another source of overhead.

Operation            Start-up penalty in clocks (from CRAY-1)
Vector load/store    12
Vector multiply      7
Vector add           6
Assume convoys don't overlap; vector length = n:

Convoy      Start     1st result   Last result      Comment
1. LV       0         12           11+n (=12+n-1)
2. MULV     12+n      12+n+7       18+2n            Multiply start-up
   LV       12+n+1    12+n+13      24+2n            Load start-up
3. ADDV     25+2n     25+2n+6      30+3n            Wait for convoy 2
4. SV       31+3n     31+3n+12     42+4n            Wait for convoy 3
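
The entries in the table above can be reproduced with a few lines of C. This is a sketch of the timing model the table appears to use (first result at start plus start-up, then one result per clock, and the second instruction of convoy 2 issuing one clock after the first); the `Timing` struct and the function names are hypothetical. Passing the symbolic L, M, and A from the question as `load`, `mul`, and `add` gives the corresponding count for other start-up values.

```c
#include <assert.h>

/* Start, first-result and last-result clock of one vector instruction. */
typedef struct { long start, first, last; } Timing;

/* One instruction issued at `start`: the first result appears after the
   start-up penalty, then one result per clock for n elements. */
static Timing issue(long start, long startup, long n)
{
    Timing t = { start, start + startup, start + startup + n - 1 };
    return t;
}

/* Clock of the last SV result when convoys do not overlap. */
long unchained_last_result(long n, long load, long mul, long add)
{
    assert(n > 0);
    Timing lv1  = issue(0, load, n);                 /* convoy 1 */
    Timing mulv = issue(lv1.last + 1, mul, n);       /* convoy 2 */
    Timing lv2  = issue(lv1.last + 2, load, n);      /* one clock later */
    long convoy2_last = mulv.last > lv2.last ? mulv.last : lv2.last;
    Timing addv = issue(convoy2_last + 1, add, n);   /* convoy 3 */
    Timing sv   = issue(addv.last + 1, load, n);     /* convoy 4 */
    return sv.last;
}
```

With the CRAY-1 penalties (12, 7, 6) the function returns 42+4n, matching the last row of the table.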

Suppose we have

MULV   V1,V2,V3
ADDV   V4,V1,V5

With chaining, the vector register (V1) is treated not as a single entity but as a group of individual registers, so pipeline forwarding can work on individual elements of a vector.
Flexible chaining allows a vector to chain to any other active vector operation, which requires more read/write ports.
As long as there is enough hardware, chaining increases the convoy size.
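
The effect of chaining on the MULV/ADDV pair can be estimated with a simple cost model. The formulas below are an assumption taken from the usual textbook treatment, not from the answer itself: without chaining the ADDV waits until all n elements of V1 are written, and with chaining it starts as soon as the first element is available. The function names are illustrative.

```c
#include <assert.h>

/* Unchained: MULV runs to completion, then ADDV starts. */
long unchained_cycles(long n, long mul, long add)
{
    assert(n > 0);
    return (mul + n) + (add + n);
}

/* Chained: ADDV consumes V1 elements as MULV produces them,
   so only the two start-up penalties are added to n. */
long chained_cycles(long n, long mul, long add)
{
    assert(n > 0);
    return mul + add + n;
}

/* Speedup of chained over unchained execution. */
double chaining_speedup(long n, long mul, long add)
{
    return (double)unchained_cycles(n, mul, add)
         / (double)chained_cycles(n, mul, add);
}
```

For n = 64 with the CRAY-1 penalties (multiply 7, add 6), this model gives 141 clocks unchained and 77 clocks chained, a speedup of roughly 1.8.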
Suppose adjacent elements are not sequential in memory:

      do 10 i = 1,10
        do 10 j = 1,100
          A(i,j) = 0.0
          do 10 k = 1,100
10          A(i,j) = A(i,j)+B(i,k)*C(k,j)

Either the B or the C accesses are not adjacent (800 bytes between successive elements).
Stride: the distance separating elements that are to be merged into a single vector (caches do unit stride). This is handled by the LVWS (load vector with stride) instruction.
Think of it as one address per vector element.
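
A strided load can be emulated in C to make the 800-byte figure concrete. The sketch assumes Fortran's column-major layout, a leading dimension of 100, and 8-byte doubles; under those assumptions it is the B(i,k) accesses along k that are 800 bytes apart. The names `cm_index`, `load_vector_with_stride`, and `stride_bytes_along_k` are made up for this example.

```c
#include <stddef.h>

enum { DIM = 100 };   /* assumed leading dimension of B */

/* 0-based index of element (row, col) in a column-major DIM x DIM array. */
size_t cm_index(size_t row, size_t col)
{
    return col * DIM + row;
}

/* Software emulation of LVWS: gather vl elements that are
   stride_elems elements apart into a contiguous "vector register". */
void load_vector_with_stride(double *vreg, const double *base,
                             size_t stride_elems, size_t vl)
{
    for (size_t e = 0; e < vl; e++)
        vreg[e] = base[e * stride_elems];   /* one address per element */
}

/* Byte distance between B(i,k) and B(i,k+1); 800 when double is 8 bytes. */
size_t stride_bytes_along_k(void)
{
    return (cm_index(0, 1) - cm_index(0, 0)) * sizeof(double);
}
```

Loading a row of B then amounts to calling `load_vector_with_stride` with the address of B(i,1) as the base and a stride of `DIM` elements.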
