Consider the following code that adds two matrices A and B and stores the result
ID: 3774581 • Letter: C
Question
Consider the following code that adds two matrices A and B and stores the result in a matrix C:
for (i= 0 to 15) {
for (j= 0 to 63) {
C[i][j] = A[i][j] + B[i][j];
}
}
If we had a quadcore multiprocessor, where the elements of the matrices A, B, C are stored in row major order, which one of the following two parallelizations is better and why? What about when they are stored in column major order?
(a) For each Pk in {0, 1, 2, 3}:
for (i= 0 to 15) {
for (j= Pk*15 + Pk to (Pk+1)*15 + Pk) {
// Inner Loop Parallelization
C[i][j] = A[i][j] + B[i][j];
}
}
(b) For each Pk in {0, 1, 2, 3}:
for (i= Pk*3 + Pk to (Pk+1)*3 + Pk) {
// Outer Loop Parallelization
for (j= 0 to 63) {
C[i][j] = A[i][j] + B[i][j];
}
}
Explanation / Answer
First we will take case of row major matrix,case(a) is always better because it accesses elements row by row,hence thread can take advantage of spatial locality property of cache.Similary,when we access column by column in column major matrix there will be more number of cache miss because elements of column are separeted by number of row size.This happens in memory also and in cache also.Hence for column major matrix case(b) is better
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.