Consider the following GPU that consists of 8 multiprocessors clocked at 1.5 GHz
Question
Consider the following GPU that consists of 8 multiprocessors clocked at 1.5
GHz, each of which contains 8 multithreaded single-precision floating-point units and
integer processing units. It has a memory system that consists of 8 partitions of 1 GHz
Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable
assumptions (state them), and a naive matrix multiplication algorithm, compute how much
time the computation C = A * B would take. A, B, and C are n * n matrices and n is
determined by the amount of memory the system has.
Explanation / Answer
Assume each single-precision unit can issue one FP multiply-add instruction per clock, counted as 2 floating-point operations.
Peak single-precision FP multiply-add performance =
#MPs * #SP/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec =
8 * 8 * 2 * 1 * 1.5 G = 192 GFLOP/s
Total DDR3 DRAM capacity = 8 * 256 MB = 2048 MB
Assuming double data rate (2 transfers per clock), the peak DDR3 bandwidth =
#Partitions * #bytes/transfer * #transfers/clock * #clocks/sec =
8 * 8 * 2 * 1 G = 128 GB/sec
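The two peak figures above can be reproduced with a short Python sketch. The hardware numbers come from the problem statement; the constants marked as assumptions (one multiply-add per unit per clock counted as 2 FLOPs, and 2 transfers per memory clock) are the same assumptions the answer makes, and all names are illustrative.

```python
# Hardware parameters taken from the problem statement.
NUM_MPS = 8                 # multiprocessors
SPS_PER_MP = 8              # single-precision units per multiprocessor
CORE_CLOCK_HZ = 1.5e9       # 1.5 GHz core clock
NUM_PARTITIONS = 8          # memory partitions
BYTES_PER_TRANSFER = 8      # each partition is 8 bytes wide
MEM_CLOCK_HZ = 1.0e9        # 1 GHz memory clock
PARTITION_BYTES = 256 * 1024 * 1024  # 256 MB per partition

# Assumptions (not given in the problem statement).
FLOPS_PER_INSTR = 2         # a multiply-add counts as 2 FLOPs
INSTR_PER_CLOCK = 1         # one multiply-add per unit per clock
TRANSFERS_PER_CLOCK = 2     # double data rate


def peak_flops():
    """Peak single-precision throughput in FLOP/s."""
    return (NUM_MPS * SPS_PER_MP * FLOPS_PER_INSTR
            * INSTR_PER_CLOCK * CORE_CLOCK_HZ)


def peak_bandwidth():
    """Peak memory bandwidth in bytes/s."""
    return (NUM_PARTITIONS * BYTES_PER_TRANSFER
            * TRANSFERS_PER_CLOCK * MEM_CLOCK_HZ)


def total_memory_bytes():
    """Total graphics memory in bytes."""
    return NUM_PARTITIONS * PARTITION_BYTES


if __name__ == "__main__":
    print(f"peak compute:   {peak_flops() / 1e9:.0f} GFLOP/s")
    print(f"peak bandwidth: {peak_bandwidth() / 1e9:.0f} GB/s")
    print(f"total memory:   {total_memory_bytes() // 2**20} MB")
```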
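A single-precision value occupies 32 bits (4 bytes).
So, if all three n * n SP matrices must fit in graphics memory at once, the maximum n satisfies
3n^2 * 4 <= 2048 * 1024 * 1024
n <= sqrt(2^31 / 12), which is approximately 13377.5, so nmax = 13377 = n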
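As a check on this bound, here is a minimal Python sketch that solves 3n^2 * 4 <= total memory for the largest integer n. It assumes the three matrices are the only data in graphics memory; the function and parameter names are illustrative.

```python
import math

TOTAL_MEMORY_BYTES = 8 * 256 * 1024 * 1024  # 8 partitions of 256 MB
BYTES_PER_ELEMENT = 4                       # 32-bit single precision
NUM_MATRICES = 3                            # A, B, and C


def max_matrix_dim(total_bytes=TOTAL_MEMORY_BYTES,
                   num_matrices=NUM_MATRICES,
                   bytes_per_element=BYTES_PER_ELEMENT):
    """Largest n such that num_matrices n*n matrices fit in total_bytes."""
    # Maximum number of elements allowed per matrix (integer division
    # is safe because n*n is itself an integer).
    max_elements = total_bytes // (num_matrices * bytes_per_element)
    # Integer square root avoids floating-point rounding at the boundary.
    return math.isqrt(max_elements)


if __name__ == "__main__":
    print(max_matrix_dim())
```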
The number of operations that a naive matrix multiplication algorithm (triply nested loop) needs is calculated as follows:
For each element of the result, we need n multiply-adds
For each row of the result, we need n * n multiply-adds
For the entire result matrix, we need n * n * n multiply-adds
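The counting argument above can be illustrated with a plain-Python version of the triply nested loop that tallies one multiply-add per innermost iteration. This is only a sketch for small matrices, not a GPU kernel, and `naive_matmul` is an illustrative name.

```python
def naive_matmul(a, b):
    """Multiply two square matrices given as lists of rows.

    Returns the product and the number of multiply-adds performed.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    multiply_adds = 0
    for i in range(n):              # each row of the result
        for j in range(n):          # each element in that row
            acc = 0.0
            for k in range(n):      # n multiply-adds per element
                acc += a[i][k] * b[k][j]
                multiply_adds += 1
            c[i][j] = acc
    return c, multiply_adds


if __name__ == "__main__":
    product, count = naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, count)
```

For any n * n input the counter ends at n * n * n, matching the per-element, per-row, and whole-matrix counts.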
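With n = 13377, that is n^3 = 2393.7 * 10^9 multiply-adds (approximately), or about 4787 GFLOP, since each multiply-add counts as 2 floating-point operations.
Ignoring caching effects and assuming each matrix crosses the memory interface exactly once (A and B are loaded, C is stored), the traffic is 3 * n^2 * 4 bytes, or about 2.15 GB, which is essentially the whole 2048 MB of graphics memory. This is a lower bound on traffic; a kernel that re-fetches operands from DRAM for every multiply-add would move far more data.
At peak bandwidth this transfer will take 2.15 / 128 = 0.017 seconds (approximately)
At the peak rate, the processing will take 4787 / 192 = 24.93 seconds
Thus, assuming transfer and computation do not overlap, the entire matrix multiplication will take about 24.95 seconds, almost all of it compute time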
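The whole estimate can be recomputed with a small Python sketch. It counts each multiply-add as 2 FLOPs so that the operation count is in the same units as the 192 GFLOP/s peak, and it converts the 3n^2 matrix elements to bytes before dividing by the 128 GB/s bandwidth. The function name and its defaults are illustrative, and the model assumes peak rates, one pass over each matrix, and no overlap between transfer and compute.

```python
def estimate_matmul_time(n,
                         peak_flops=192e9,        # FLOP/s, from the answer
                         peak_bandwidth=128e9,    # bytes/s, from the answer
                         bytes_per_element=4):    # single precision
    """Return (compute_seconds, transfer_seconds, total_seconds)."""
    multiply_adds = n ** 3
    flops = 2 * multiply_adds                 # multiply + add
    compute_s = flops / peak_flops

    # Load A and B once, store C once: 3 matrices of n*n elements.
    bytes_moved = 3 * n * n * bytes_per_element
    transfer_s = bytes_moved / peak_bandwidth

    # Assumption: transfer and compute are not overlapped.
    return compute_s, transfer_s, compute_s + transfer_s


if __name__ == "__main__":
    compute_s, transfer_s, total_s = estimate_matmul_time(13377)
    print(f"compute:  {compute_s:.2f} s")
    print(f"transfer: {transfer_s:.4f} s")
    print(f"total:    {total_s:.2f} s")
```

Changing `peak_flops` or `peak_bandwidth` shows how sensitive the estimate is to each assumption; under this model the result is dominated by compute time.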