A way to think about EXIST
is by comparing it to JOIN
. The difference between them is that EXIST
is SQL’s way of expressing semijoins (and antijoins). I tried reading a few resources but nothing really clicked for me, and so I wrote some imperative Python code to illustrate these concepts:
managers = [('Bob', 1), ('Sally', 2)]
employees = [('Andy', 1), ('Mark', 1), ('Henry', None)]
def join(t1, t2):
for u in t1:
for v in t2:
if u[1] == v[1]:
yield (u, v)
def semijoin(t1, t2):
for u in t1:
for v in t2:
if u[1] == v[1]:
yield u
break
def antijoin(t1, t2):
for u in t1:
matched = False
for v in t2:
if u[1] == v[1]:
matched = True
break
if not matched:
yield u
Here, we have a manager
table which contains a list of managers and their IDs, and a employee
table which contains a list of employees and their manager IDs. (we can ignore the fact that managers are also employees).
There are two managers, Bob and Sally. Bob has two employees under him, Sally has no employees under her. Henry has no manager currently assigned to him.
Let’s see what happens when we do this:
print(list(join(managers, employees)))
>>> [(('Bob', 1), ('Andy', 1)), (('Bob', 1), ('Mark', 1))]
An (inner) join returns a list of all manager-employee pairs.
What about:
print(list(semijoin(managers, employees)))
>>> [('Bob', 1)]
A semijoin returns all managers that have at least one employee.
The reason this is the case here is because semijoins short-circuit on the predicate. The predicate here is manager.id = employee.manager_id
, which means that as soon as we find an employee whose manager is the current one we’re considering, we can output the current manager and move on to the next manager. This concept extends similarly for any arbitrary predicate.
print(list(antijoin(managers, employees)))
>>> [('Sally', 2)]
An antijoin return all managers that do not have at least one employee.
An antijoin works in the exactly opposite way from a semijoin. Antijoins short-circuit on the negation of the predicate — as soon as we find an employee whose manager is the current one we’re considering, we skip the manager. If we’ve checked all the employees and no employee’s manager is the current manager, we can output the current manager.
In this way, the set of rows output by a semijoin is the exact complementary of the set of rows output by an antijoin.
Note that joins are commutative, but semijoins and antijoins are not. If you flip the tables around:
print(list(semijoin(employees, managers)))
>>> [('Andy', 1), ('Mark', 1)]
This returns all employees that have a manager.
print(list(antijoin(employees, managers)))
>>> [('Henry', None)]
This returns all employees that do not have a manager.
Also note that joins can do the work of a semijoin:
# let's implement semijoin with join
def semijoinviajoin(t1, t2):
results = join(t1, t2)
return set([i[0] for i in results])
print(list(semijoinviajoin(employees, managers)))
>>> [('Bob', 1)]
What we did here was to project only the columns from the first table and deduplicate the results. Of course, this involves more memory to store the entire Cartesian product of the join and more time to deduplicate the results.
These are the equivalent SQL queries for join
, semijoin
, and antijoin
:
SELECT manager.id, employee.id
FROM manager
JOIN employee
ON manager.id = employee.manager_id;
SELECT id
FROM manager
WHERE EXISTS (
SELECT 1 FROM employee
WHERE manager.id = employee.manager_id
);
SELECT id
FROM manager
WHERE NOT EXISTS (
SELECT 1 FROM employee
WHERE manager.id = employee.manager_id
);
A good rule of thumb from a comment on one of the articles I linked above:
I usually follow another rule which might be helpful. If you do not need columns from a table, then move it from FROM clause to WHERE clause.
Although this explanation is not as rigorous, I hope it’s still useful as a companion piece alongside other more accurate and elaborate articles.