Consider two relations A and B. A is of size 10,000 disk pages, and B is of size
ID: 3593388 • Letter: C
Question
Consider two relations A and B. A is of size 10,000 disk pages, and B is of size 1,000 pages Consider the following SQL statement: SELECT FROM A, B WHERE A.a = B.a We wish to evaluate an equijoin between A and B, with an equality condition A.a B.a. There are 502 buffer pages available for this operation. Both relations are stored as simple heap files. Neither relation has any indexes built on it. Consider alternative join strategies described below and calculate the cost of each alternative Evaluate the algorithms using the number of disk I/O's as the cost. For each strategy, provide the formulae you use to calculate your cost estimates a) Page-oriented Nested Loops Join. Consider A as the outer relation. (1 mark) b) Block-oriented Nested Loops Join. Consider A as the outer relation. (1 mark) c) Sort-Merge Join (1 mark) d) Hash Join (1 mark) e) What would the lowest possible I/O cost be for joining A and B using any join algorithm and how much buffer space would be needed to achieve this cost? Explain briefly. (1 mark)Explanation / Answer
For each tuple r in R do
for each tuple s in S do
if ri == sj then add <r, s> to result
• Page-oriented Nested Loops join:
==================================
For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page
and S is in S-page.
Cost: M + M*N = 1000 + 1000*500 (1.4 hours)
For each tuple r in R do
for each tuple s in S do
if ri == sj then add <r, s> to result
Block-Nested Loops Join :
========================
(Case I):
– Load the complete smaller R relation to memory (assuming it fits)
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join.
COST=M+N
(Case II):
– Load the complete smaller R relation to memory and Build a Hashtable
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join (by using the in-memory Hashtable).
COST=M+N (CPU cost is lower)
(Case III):
– Scan B-2 pages of smaller R to memory (named R-block) (additionally, could build a hash table for this in-memory table)
– Use 1 page as an output buffer and 1 page to scan S relation to memory a page-at-a-time (named S-page) and perform the join.
– Need to repeat the above M/(B-2) times (i.e., Number of Rblocks)
COST=M + N * [M/(B-2)]
EXAMPLE:
========
Reserves (R) as outer and B=102
• Cost = 1000 + 500 * 1000/100 = 1000 + 500*10 = 6000 I/Os
Sort-Merge Join:
=================
Cost: M log M + N log N + (M+N)
EXAMPLE:
==========
–The cost of scanning, M+N, could be M*N With 35, 100, or 300 buffer pages,
both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500
(BNL cost: 2500 to 15000 I/Os)
HASH JOIN:
==========
Hash phase:
------------
– Scan R, compute hash table, writing full blocks to disk immediately
– Scan S, compute hash table, writing full blocks to disk immediately
– Notice: Probably better to use some n
Merge phase:
-------------
– Iteratively, load same bucket of R and of S in memory
– Compute join
Total cost
– Hash phase costs 2*b(R)+2*b(S)
– Merge phase costs b(R) + b(S)
– Total: 3*(b(R)+b(S))
Summary:
==========
Page-oriented Nexted Loop: Block-oriented Nexted Loop with block size = 1 for both relations
Block-oriented NL: Read in multiple pages of outer relation
Sort-Merge join: great if things already sorted (or need to be later on….)
Hash join:As always, we may save sorting if good hash function available
• Assume a very good hash function
– Distributes hash values almost uniformly over hash table
– If we have good histograms (later), a simple interval
-based hash function will usually work
• Page-oriented Nested Loops join:
==================================
For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page
and S is in S-page.
Cost: M + M*N = 1000 + 1000*500 (1.4 hours)
For each tuple r in R do
for each tuple s in S do
if ri == sj then add <r, s> to result
Block-Nested Loops Join :
========================
(Case I):
– Load the complete smaller R relation to memory (assuming it fits)
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join.
COST=M+N
(Case II):
– Load the complete smaller R relation to memory and Build a Hashtable
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join (by using the in-memory Hashtable).
COST=M+N (CPU cost is lower)
(Case III):
– Scan B-2 pages of smaller R to memory (named R-block) (additionally, could build a hash table for this in-memory table)
– Use 1 page as an output buffer and 1 page to scan S relation to memory a page-at-a-time (named S-page) and perform the join.
– Need to repeat the above M/(B-2) times (i.e., Number of Rblocks)
COST=M + N * [M/(B-2)]
EXAMPLE:
========
Reserves (R) as outer and B=102
• Cost = 1000 + 500 * 1000/100 = 1000 + 500*10 = 6000 I/Os
Sort-Merge Join:
=================
Cost: M log M + N log N + (M+N)
EXAMPLE:
==========
–The cost of scanning, M+N, could be M*N With 35, 100, or 300 buffer pages,
both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500
(BNL cost: 2500 to 15000 I/Os)
HASH JOIN:
==========
Hash phase:
------------
– Scan R, compute hash table, writing full blocks to disk immediately
– Scan S, compute hash table, writing full blocks to disk immediately
– Notice: Probably better to use some n
Merge phase:
-------------
– Iteratively, load same bucket of R and of S in memory
– Compute join
Total cost
– Hash phase costs 2*b(R)+2*b(S)
– Merge phase costs b(R) + b(S)
– Total: 3*(b(R)+b(S))
Summary:
==========
Page-oriented Nexted Loop: Block-oriented Nexted Loop with block size = 1 for both relations
Block-oriented NL: Read in multiple pages of outer relation
Sort-Merge join: great if things already sorted (or need to be later on….)
Hash join:As always, we may save sorting if good hash function available
• Assume a very good hash function
– Distributes hash values almost uniformly over hash table
– If we have good histograms (later), a simple interval
-based hash function will usually work
• Page-oriented Nested Loops join:
==================================
For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page
and S is in S-page.
Cost: M + M*N = 1000 + 1000*500 (1.4 hours)
For each tuple r in R do
for each tuple s in S do
if ri == sj then add <r, s> to result
Block-Nested Loops Join :
========================
(Case I):
– Load the complete smaller R relation to memory (assuming it fits)
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join.
COST=M+N
(Case II):
– Load the complete smaller R relation to memory and Build a Hashtable
– Use one page as an output buffer
– Use remaining pages (even 1 page is adequate) to load the larger S in memory and perform the join (by using the in-memory Hashtable).
COST=M+N (CPU cost is lower)
(Case III):
– Scan B-2 pages of smaller R to memory (named R-block) (additionally, could build a hash table for this in-memory table)
– Use 1 page as an output buffer and 1 page to scan S relation to memory a page-at-a-time (named S-page) and perform the join.
– Need to repeat the above M/(B-2) times (i.e., Number of Rblocks)
COST=M + N * [M/(B-2)]
EXAMPLE:
========
Reserves (R) as outer and B=102
• Cost = 1000 + 500 * 1000/100 = 1000 + 500*10 = 6000 I/Os
Sort-Merge Join:
=================
Cost: M log M + N log N + (M+N)
EXAMPLE:
==========
–The cost of scanning, M+N, could be M*N With 35, 100, or 300 buffer pages,
both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500
(BNL cost: 2500 to 15000 I/Os)
HASH JOIN:
==========
Hash phase:
------------
– Scan R, compute hash table, writing full blocks to disk immediately
– Scan S, compute hash table, writing full blocks to disk immediately
– Notice: Probably better to use some n
Merge phase:
-------------
– Iteratively, load same bucket of R and of S in memory
– Compute join
Total cost
– Hash phase costs 2*b(R)+2*b(S)
– Merge phase costs b(R) + b(S)
– Total: 3*(b(R)+b(S))
Summary:
==========
Page-oriented Nexted Loop: Block-oriented Nexted Loop with block size = 1 for both relations
Block-oriented NL: Read in multiple pages of outer relation
Sort-Merge join: great if things already sorted (or need to be later on….)
Hash join:As always, we may save sorting if good hash function available
• Assume a very good hash function
– Distributes hash values almost uniformly over hash table
– If we have good histograms (later), a simple interval
-based hash function will usually work
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.