Friday, May 16, 2014

Join Algorithms (Join Processing in Database with Large Main Memories)

Joins are very common and very expensive in databases.
Equi joins are very common because of normalization and primary key foreign key relationship.

Assume,
R -> Smaller relation
S -> Bigger relation

Lets discuss different joining algorithms
Block Nested Join
  • Bring ‘b’ part of smaller relation into memory
  • Bring each page of larger relation and scan ‘b’ and produce output
  • Generally b = B – 2, 2 pages are left 1 for S and 1 for output tuple
  • Once whole S is scanned then bring next ‘b’ pages of small relation.
  • Ideal: if whole R can fit into memory then join can be done in 1pass. And IO will be |R| + |S|
  • I/O cost
    • |R / b| * S + R
  • It has very good sequential behavior
Sort Merge Join
  • Create sorted runs of R and S and then merge
  • It works in 2 passes
    • Pass1: If we use tournament sort then we can produce sorted runs of size 2B.
    • Pass 2: Use N way merge to produce output
  • # of runs after pass 1 = (R/2B) + (S/2B)
  • as S > R so we can say at the max we will have (S/2B) + (S/2B) sorted runs after pass 1 i.e. S/B runs
  • So if you want to finish joins in 2 pass then S/B <= B i.e. If B >= sqrt(S) = sqrt(large relation) then join can be done in 2 pass.
  • I/O Cost
    • 3|R + S|
  • Output is sorted order so you can easily perform group by operation