This is a question I seem to answer for people quite often, so I thought I’d write a blog post about it.
Suppose you’re joining between two tables (or more, but we’ll only consider two to keep things simple) in the AdventureWorks sample database on SQL Server 2005.
SELECT *
FROM Production.ProductSubcategory s
LEFT JOIN
Production.Product p
ON p.ProductSubcategoryID = s.ProductSubcategoryID;
I’m doing a LEFT JOIN because, in my results, I want to make sure that I have all the subcategories listed. For fun, you might want to insert a subcategory into Production.ProductSubcategory so that this query returns a row with NULL Product columns for you. Try: INSERT Production.ProductSubcategory (Name, ProductCategoryID) VALUES ('Empty Subcategory', 1);
But suppose we only want to consider Products that have ListPrice > $1000. Because this doesn’t feel like part of the join logic, the obvious place to put it is the WHERE clause, right? Well, only if we also want to filter out the empty subcategory, and any other subcategories that contain only cheap items.
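To see why, here’s the WHERE-clause version of the same query, shown as a sketch of the approach to avoid:
-- Filtering in the WHERE clause silently turns the LEFT JOIN into an
-- INNER JOIN: unmatched subcategories come back with NULL product
-- columns, NULL > 1000 is not TRUE, so those rows are discarded.
SELECT *
FROM Production.ProductSubcategory s
LEFT JOIN
Production.Product p
ON p.ProductSubcategoryID = s.ProductSubcategoryID
WHERE p.ListPrice > 1000;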
Instead, it belongs in the ON clause, like this:
SELECT *
FROM Production.ProductSubcategory s
LEFT JOIN
Production.Product p
ON p.ProductSubcategoryID = s.ProductSubcategoryID
AND p.ListPrice > 1000;
I know it doesn’t seem like a join condition, because it only involves one of the tables in the join. But with outer joins we need to be very clear about where the join ends (and thus where the non-matches are put back in) compared to where the filtering is done. If you don’t want to filter out the subcategories, you can’t be filtering in the WHERE clause. It is part of the join condition, because we only consider a match if the product is in the subcategory AND the product is expensive.
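If the ON clause still feels unnatural, an equivalent way to write the same query (just a sketch, using the same tables) is to filter Product in a derived table before the outer join even happens:
-- Logically equivalent: restrict Product to expensive items first,
-- then outer join; subcategories with no match still appear, with
-- NULL product columns.
SELECT *
FROM Production.ProductSubcategory s
LEFT JOIN
(
    SELECT *
    FROM Production.Product
    WHERE ListPrice > 1000
) p
ON p.ProductSubcategoryID = s.ProductSubcategoryID;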
This Post Has 5 Comments
It also tends to be faster, as the result set being joined is smaller.
This is an important rule to follow whenever you have to apply criteria (except for “Is Null” criteria) to a table that is outer joined; otherwise you end up converting the join to an inner join without knowing it.
Thanks for publishing it!
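The “Is Null” exception mentioned above is the classic anti-join pattern, where the WHERE clause is exactly the right place to test. As a sketch built on the same query, this finds subcategories with no products over $1000:
-- The outer join produces NULL product columns for subcategories
-- with no expensive products; testing for that NULL in the WHERE
-- clause keeps only those subcategories.
SELECT s.*
FROM Production.ProductSubcategory s
LEFT JOIN
Production.Product p
ON p.ProductSubcategoryID = s.ProductSubcategoryID
AND p.ListPrice > 1000
WHERE p.ProductID IS NULL;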
Yep, we’re planning to update most of our queries (most of which use a large number of INNER JOINs to large tables), specifically to reduce the join sets. Most of these queries currently do all the filtering in the WHERE clause, and they take forever for moderately large tables (e.g. Table1: 2,000 records, Table2: 20,000 records, Table3: 20,000 records).
Of course, the example above suggests that would have been the logical thing to do in the first place 🙂
From a performance perspective, the system should be able to filter in the most appropriate place regardless. This method is about making sure that the correct results are achieved.
Rob
A helpful way to think of it in pretty much any similar situation is to work through the query in logical order. Assuming the price condition is listed in the WHERE clause…
Step 1) Rows are returned by the equality condition in the join. If the two keys match, there will be data for all fields; if not, the Product columns will be NULL.
Step 2) For all rows with NULL product data, the comparison “NULL > $1000” evaluates to UNKNOWN rather than TRUE (see the quick demonstration below).
Step 3) All of these rows (along with any rows whose products are under $1K) are eliminated from the final result set, because the WHERE clause only keeps rows where the predicate is TRUE, and it can’t be established that NULL is greater than $1K.
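You can see step 2 in isolation with a one-line query (a quick sketch; the CAST just gives the NULL a comparable type):
-- NULL > 1000 evaluates to UNKNOWN, not TRUE, so a WHERE clause
-- treats the row the same as if the predicate were FALSE.
SELECT CASE WHEN CAST(NULL AS money) > 1000
            THEN 'kept'
            ELSE 'filtered out'
       END AS predicate_result;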