DB Join

1. Join

Join is an RDB operation mainly used when querying related data spread across different tables together in a single query result. [Figure 1] shows an example dataset for explaining join operations. The Departments table and Employees table exist, and join is performed using the id column of the Departments table and the dept_id column of the Employees table as join keys.

1.1. Join Type

Join operations can be classified into Inner Join, Outer Join, and Cross Join by operation type.

1.1.1. Inner Join

SELECT   d.id, d.dept_name, e.name
FROM     departments d
INNER JOIN employees e
  ON d.id = e.dept_id;

[Query 1] Inner Join

Inner Join is the most basic join operation, and only rows where join keys of the two tables match are included in the result. [Query 1] shows an SQL query that performs Inner Join, and [Figure 2] shows the Inner Join result. In [Figure 2], you can see that rows with NULL values are excluded, and only rows where the id column of the Departments table and the dept_id column of the Employees table match are included in the result.

1.1.2. Outer Join

Outer Join also includes rows where join keys of the two tables do not match. The reason for using Outer Join is to include all rows of the left, right, or both tables in the result. Reasons for including all rows include identifying missing data, verifying data consistency, or performing aggregate operations. Depending on which table’s rows are all included in the result, it is classified into Left Outer Join, Right Outer Join, and Full Outer Join.

SELECT   d.id, d.dept_name, e.name
FROM     departments d
LEFT JOIN employees e
  ON d.id = e.dept_id;

[Query 2] Left Outer Join

Left Outer Join, as can be inferred from the name, uses the left table of the join query as the base, includes all rows of the left table in the result, and fills rows where join keys do not match with rows having NULL values. [Query 2] shows an SQL query that performs Left Outer Join, and [Figure 3] shows the Left Outer Join result. In [Figure 3], you can see that all rows of the Departments table are included in the result, and the row with d.id = 40 where the join key does not match is filled with a row having NULL values.

SELECT   d.id, d.dept_name, e.name
FROM     departments d
RIGHT JOIN employees e
  ON d.id = e.dept_id;

[Query 3] Right Outer Join

Right Outer Join, as can be inferred from the name, uses the right table of the join query as the base, includes all rows of the right table in the result, and fills rows where join keys do not match with rows having NULL values. [Query 3] shows an SQL query that performs Right Outer Join, and [Figure 4] shows the Right Outer Join result. In [Figure 4], you can see that all rows of the Employees table are included in the result, and the row with name = Ssup where the join key does not match is filled with a row having NULL values.

SELECT   d.id, d.dept_name, e.name
FROM     departments d
FULL OUTER JOIN employees e
  ON d.id = e.dept_id;

[Query 4] Full Outer Join

Full Outer Join, as can be inferred from the name, includes all rows of both tables. [Query 4] shows an SQL query that performs Full Outer Join, and [Figure 5] shows the Full Outer Join result. In [Figure 5], you can see that all rows of both tables are included in the result, and rows where join keys do not match are filled with rows having NULL values.

1.1.3. Cross Join

SELECT   d.id, d.dept_name, e.name, e.dept_id
FROM     departments d
CROSS JOIN employees e;

[Query 5] Cross Join

Cross Join generates results by combining all rows of two tables. Therefore, the number of rows generated by Cross Join is the product of the row counts of the two tables. Because of this characteristic, it is also called Cartesian Product. [Query 5] shows an SQL query that performs Cross Join, and [Figure 6] shows the Cross Join result. In [Figure 6], you can see that results are generated by combining all rows of the two tables.

1.2. Join Algorithm

To perform a join operation, join target tables must be traversed at least once. The table that is traversed only once as the base is called the Outer Table, and the table that is traversed repeatedly is called the Inner Table. Which table is chosen as the Outer Table can greatly affect join algorithm performance, and in general the DB Optimizer automatically selects the table with fewer rows as the Outer Table based on statistics.

There are three algorithms depending on traversal method: Nested Loop Join, Sort Merge Join, and Hash Join.

1.2.1. Nested Loop Join

Nested Loop Join is the most basic join algorithm, operating by traversing all rows of the Inner Table while traversing each row of the Outer Table.

[Outer: Engineering (id=10)]
  → Alice (dept_id=10)   O
  → Bob   (dept_id=10)   O
  → Coral (dept_id=20)   X
  → Dave  (dept_id=30)   X
  → Eve   (dept_id=20)   X
  → Ssup  (dept_id=NULL) X

[Outer: Marketing (id=20)]
  → Alice (dept_id=10)   X
  → Bob   (dept_id=10)   X
  → Coral (dept_id=20)   O
  → Dave  (dept_id=30)   X
  → Eve   (dept_id=20)   O
  → Ssup  (dept_id=NULL) X

[Outer: HR (id=30)]
  → Alice (dept_id=10)   X
  → Bob   (dept_id=10)   X
  → Coral (dept_id=20)   X
  → Dave  (dept_id=30)   O
  → Eve   (dept_id=20)   X
  → Ssup  (dept_id=NULL) X

[Outer: NULL (dept_id=40)]
  → Alice (dept_id=10)   X
  → Bob   (dept_id=10)   X
  → Coral (dept_id=20)   X
  → Dave  (dept_id=30)   X
  → Eve   (dept_id=20)   X
  → Ssup  (dept_id=NULL) X

[Text 1] Nested Loop Join Process / Outer Table: Departments, Inner Table: Employees, dept_id Index: X

[Text 1] shows the processing during Nested Loop Join when the Departments table is the Outer Table, the Employees table is the Inner Table, and there is no index on the dept_id column. You can see that it operates by traversing each row of the Outer Table Departments while traversing all rows of the Inner Table Employees and comparing values.

When there is no index on the dept_id column, the location of rows with a specific dept_id value cannot be known immediately, so to find rows satisfying the join condition (Employees.dept_id = Departments.id), the entire Employees table must be read and compared. The Departments table has 4 records and the Employees table has 6 records, so a total of 24 row scans occur. Time complexity is O(N * M).

[Outer: Engineering (id=10)]
  → Alice (dept_id=10)   O
  → Bob   (dept_id=10)   O

[Outer: Marketing (id=20)]
  → Coral (dept_id=20)   O
  → Eve   (dept_id=20)   O

[Outer: HR (id=30)]
  → Dave  (dept_id=30)   O

[Outer: NULL (id=40)]

[Text 2] Nested Loop Join Process / Outer Table: Departments, Inner Table: Employees, dept_id Index: O

When there is an index on the dept_id column, only Employees rows with the corresponding dept_id can be found through Index Lookup for each row of the Departments table. [Text 2] shows the processing during Nested Loop Join in this case. Because only rows matching the join condition are accessed without scanning all records of the Employees table, only 5 row scans are performed as in [Text 2], not 24, and in this case only 4 Index Lookups are needed (10, 20, 30, 40).

[Outer: Alice (dept_id=10)]
  → Engineering (id=10)   O

[Outer: Bob (dept_id=10)]
  → Engineering (id=10)   O

[Outer: Coral (dept_id=20)]
  → Marketing   (id=20)   O

[Outer: Dave (dept_id=30)]
  → HR          (id=30)   O

[Outer: Eve (dept_id=20)]
  → Marketing   (id=20)   O

[Outer: Ssup (dept_id=NULL)]

[Text 3] Nested Loop Join Process / Outer Table: Employees, Inner Table: Departments

[Text 3] shows the processing during Nested Loop Join when the Employees table is the Outer Table, the Departments table is the Inner Table, and there is an index on the dept_id column. You can see that it operates by traversing each row of the Outer Table Employees and directly searching only matching rows through the index of the Inner Table Departments. In this case, 5 row scans are performed as in [Text 2], but 5 Index Lookups are also needed (10, 10, 20, 30, 20). Ssup has dept_id of NULL, so Index Lookup is not performed and it is immediately excluded from the join.

[Text 2] and [Text 3] are the same join operation, but the number of Index Lookups can differ depending on which table is chosen as the Outer Table, and this can greatly affect join algorithm performance. To reduce the number of Index Lookups, the Outer Table should have fewer rows than the Inner Table, and therefore the DB Optimizer generally automatically selects the table with fewer rows as the Outer Table.

1.2.2. Sort Merge Join

Sort Merge Join, as can be inferred from the name, sorts data by join key and then performs the join. To perform Sort Merge Join, both tables must be sorted by join key, and after sorting completes, join keys are compared and the join is performed by moving pointers forward from the front of each table. In this case, the Inner Table is also traversed only once, unlike Nested Loop Join.

[Figure 7] shows only the Employees table sorted by the dept_id column from the dataset in [Figure 1] for Sort Merge Join. Departments.id is already sorted by the Primary Key Clustered Index, so it maintains the same order as [Figure 1].

Engineering(10)  · Alice(10)      → O     →            `ptr_e++`
Engineering(10)  · Bob(10)        → O     →            `ptr_e++`
Engineering(10)  · Coral(20)      → X     → `ptr_d++`
Marketing(20)    · Coral(20)      → O     →            `ptr_e++`
Marketing(20)    · Eve(20)        → O     →            `ptr_e++`
Marketing(20)    · Dave(30)       → X     → `ptr_d++`
HR(30)           · Dave(30)       → O     →            `ptr_e++`
HR(30)           · Ssup(NULL)     → X     → `ptr_d++`
(Unassigned)40   · Ssup(NULL)     → X

[Text 4] Sort Merge Join Process

[Text 4] shows the merge phase of Sort Merge Join in [Figure 7]. Unlike Nested Loop Join, it does not traverse the entire Inner Table for each row of the Outer Table, and proceeds by reading forward with two pointers where ptr_d points to the current row of the Departments table and ptr_e points to the current row of the Employees table. When join keys match, the next Employee is examined with ptr_e++. When join keys differ, the next Department is examined with ptr_d++. The join ends when both pointers reach the end by moving pointers in this way.

In Sort Merge Join, the sort criterion is the join key, and index presence affects sort cost. When an index exists on the join key (Clustered Index, Non-clustered Index), sorted input by join key can be created more easily, reducing sort cost or in some cases omitting sorting. When sorting is needed, the DB includes a Sort operation in the query execution plan and performs sorting at execution time. Sort work is basically processed in the Sort Buffer (Work Memory) in memory, and when memory is insufficient, spill occurs and disk in the temporary area is used.

Nested Loop Join must search the Inner Table with Index Lookup for each row of the Outer Table. This approach does not repeat full scans of the Inner Table, but lookups are repeated for the number of Outer rows, so when join key values are scattered, much Random I/O can occur. On the other hand, Sort Merge Join sorts both inputs by join key and then performs the join by scanning each only once in sorted order, so the access pattern in the merge phase is close to Sequential I/O. Therefore, in environments where random I/O cost is high with large-scale data, it can be more advantageous than Nested Loop Join even considering sort cost, and when sorted input can be created through Index Scan on the join key, sort cost can be reduced or omitted.

Sort Merge Join is mainly used for Equality Join (=). As in [Text 4], when join keys match, matching occurs and advancement is done with ptr_e++, and when join keys differ, only the pointer on the side with the smaller join key advances; this approach is tailored to Equality Join. Non-equality joins such as <, >, and BETWEEN are difficult to process in a single sequential merge satisfying the condition, and pointers may need to be rewound or one table may need to be scanned repeatedly. In this case, the advantage of scanning each of the two tables sequentially only once in Sort Merge Join disappears, so a method that evaluates conditions like Nested Loop Join is generally chosen.

1.2.3. Hash Join

Hash Join applies a hash function to the join key to create a Hash Table and then performs the join. Generally, the table with fewer rows is set as the Build Table to construct the hash table, then the table with more rows is set as the Probe Table and traversed only once while looking up join keys in the hash table. Unlike Sort Merge Join, it has the advantage that join key sorting is not required, but has the disadvantage that it can only be used for Equality Join (=).

[Build] Departments → Hash Table (key: id)
  10 → Engineering
  20 → Marketing
  30 → HR
  40 → Unassigned

[Probe] Employees (dept_id)
  Alice(10)   → lookup 10     → O
  Bob(10)     → lookup 10     → O
  Coral(20)   → lookup 20     → O
  Eve(20)     → lookup 20     → O
  Dave(30)    → lookup 30     → O
  Ssup(NULL)  → lookup NULL   → X

[Text 5] Hash Join Process

[Text 5] shows the Hash Join process assuming the Departments table as the Build Table and the Employees table as the Probe Table based on the dataset in [Figure 1]. In the build phase, a hash table is created with Departments.id as the hash key, and in the probe phase, the hash table is looked up with dept_id for each row of Employees. Only the hash table is newly created in the build phase; the Probe Table is not created separately and the original table is scanned once.

The hash table is basically created in memory, and when memory is insufficient, the disk temporary area can be used similarly to spill in Sort Merge Join. Hash Join is close to O(N + M) with each table scanned once in build and probe, but hash table creation cost and memory usage are added. Because Hash Join matches through hash lookup without sorting, it is often selected for large-scale table joins in Equality Join, and the Optimizer decides Build Table/Probe Table and whether to use Hash Join based on statistics.

1.2.4. Join Algorithm Comparison

	Nested Loop Join	Sort Merge Join	Hash Join
Join Key Sort	Not required	Required	Not required
Join Condition	Equality, Non-equality	Mainly Equality (`=`)	Equality (`=`) only
Table Traversal	Inner repeated per Outer row	Each table once	Build table once + Probe table once
I/O Pattern	Random I/O with Index Nested Loop	Sequential I/O in merge phase	Build·Probe Sequential Scan
Time Complexity	`O(N × M)` (improved with Index)	`O(N log N + M log M)` + `O(N + M)`	`O(N + M)`
Additional Cost	Inner table index dependency	Sort, Spill	Hash table creation, Memory, Spill
Favorable Cases	Small tables, well-indexed	Large scale, sorted input (Index Scan) usable	Large-scale Equality Join

[Table 1] Join Algorithm Comparison

[Table 1] compares join algorithms. The DB Optimizer generally tends to choose Nested Loop Join for small table joins, Hash Join for large-scale Equality Join, and Sort Merge Join for large-scale joins where sorted input can be utilized. Actual selection varies depending on DB type, statistics, memory settings, and join conditions.

DB Isolation Level DB N+1 Problem