Write C console programs of the “axpy” algorithm for integer data types. Instrum
ID: 3588300 • Letter: W
Question
Write C console programs of the “axpy” algorithm for integer data types. Instrument and monitor and measure execution time as a function of problem size. For each of the 2 data types:
Write it as a straight forward, single processor, sequential execution program, which you can scale the size of the axpy computation for N = 1, 5, 10 , 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000 and 1,000,000 vector elements . Use a random number generator to fill the vectors. Set the vector dimension using a DEFINE statement then recompile & execution each vector length.
Your program should print out the vector length, and wall clock execution time in the console window, and computation time per axpy vector for each vector length used in your program code, and the number of accuracy improvement loops you run the daxpy computation to improve accuracy.
Your console program should print out using formatted printf commands
Vector length(number of vector elements)
The number of accuracy improvement loops you run the daxpy computation to improve accuracy
Total computation time
Computation time per axpy vector
Computation time per vector element
Number of machine cycles per arithmetic operation
Explanation / Answer
AXPY algorithm means to multiply a scalar value to a vector X and then add it to vector Y
Y = a*X + Y
c code for axpy algorithm is
#include <stdio.h>
#include <stdlib.h> // for random number generator rand()
#include <time.h> // to calculate time works on all platform
#define DIM 1000000
static __inline__ unsigned long long rdtsc(void) // inline function to calculate the CPU clock cycles
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x)); //executes x86 instruction, it has opcode of 0x0f, 0x31 (0F31). A processor always keeps track of cpu clock cycles and the function RDTSC reads that value
return x;
}
void fillIntVector(int *a)
{
int i=0;
for(i=0;i<DIM;i++)
a[i]=(rand()%100)+1; // generates a random number variable in the range of 1 to 100. You can change it by changing the number in modulo operation
}
void performAXPYOperationInt(int *x,int *y, int a)
{
int i=0;
for(i=0;i<DIM;i++)
y[i]=a*x[i] + y[i]; // performs AXPY operation
}
void performAXPYOperationIntImproved(int *x,int *y, int a)
{
int i=0;
for(i=0;i<DIM;)
{
y[i]=a*x[i] + y[i];
y[i+1]=a*x[i+1] + y[i+1];
y[i+2]=a*x[i+2] + y[i+2];
i=i+3; // mentioned the logic below
}
}
void fillDoubleVector(double *a)
{
int i=0;
for(i=0;i<DIM;i++)
{
double d=RAND_MAX / 100;
a[i]= 1.0 + (rand() / d); // generates the random number in double in the range of 1.0 to 100.0
}
}
void performAXPYOperationDouble(double *x, double *y, int a)
{
int i=0;
for(i=0;i<DIM;i++)
y[i]=a*x[i]+y[i]; // the AXPY operation for double
}
int main()
{
int a=2;
clock_t t; // clock object
int x[DIM]={0};
int y[DIM]={0};
double xd[DIM]={0};
double yd[DIM]={0};
fillIntVector(x);
fillIntVector(y);
t=clock(); // clock starts (wall clock)
unsigned long long cycles = rdtsc(); //current value of CPU cycle
performAXPYOperationInt(x,y,a);
cycles = rdtsc() - cycles; // current value of CPU cycles - previous cycles. So we got total cycles needed to compute AXPY operation.
t=clock()-t; // clock ends
double time_taken_int = ((double)t)/CLOCKS_PER_SEC; // wall clock time
unsigned long long cyclesi=rdtsc(); // cycle starts
performAXPYOperationIntImproved(x,y,a);
cyclesi = rdtsc() - cyclesi; // cycle ends
t=clock(); // clock starts
performAXPYOperationDouble(xd,yd,a);
t=clock()-t; // clock ends
double time_taken_double = ((double)t)/CLOCKS_PER_SEC; // total time in seconds
printf("Vector length : %d , execution time (wall clock) for int : %f , number of CPU cycles : %d , After improvement CPU cycles : %d, saved cycles : %d, execution time(wall clock) for double : %f",DIM,time_taken_int, (unsigned)cycles,(unsigned)cyclesi,(unsigned)(cycles-cyclesi), time_taken_double);
return 0;
}
Here I have done everything for int. You can do same for double.
Accuracy Improvement in AXPY algorithm
For that we will combine c programming with MIPS.
The main loop to perform AXPY operation is
for(i=0;i<DIM;i++)
y[i] +=a*x[i];
Convert this code to assembly language
When you draw a timing diagram for this assembly code you will find 3 stalls (CPU cycles gets free because of dependency). So unroll the loop 3 times
for eg
for(i=0;i<DIM;)
{
y[i]=a*x[i] + y[i];
y[i+1]=a*x[i+1] + y[i+1];
y[i+2]=a*x[i+2] + y[i+2];
i=i+3;
}
When you convert this code to assembly language and then calculate the timing diagram, then there will be no stall. So to improve the accuracy unroll the loop for 3 times
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.