JOIN simplification in SQL Server

November 9, 2008

This is another of my tips/techniques that I demonstrated recently. It is not a new SQL Server 2008 feature – it works just fine in older versions – I demonstrated it running on SQL Server 2005 but using SQL Server 2008 Management Studio. [Note: It’s ended up appearing earlier than another post that I wrote on the plane. This one is Tip #4.]

When you have a single-table query and then introduce a JOIN, there are a number of ways in which that JOIN impacts your query. I want to talk about four such JOIN effects that I’ve identified. Understanding these four effects will help you take advantage of the simplification I am about to show you. Don’t focus on the simplification, focus on the effects. I will show you later how the concept can be applied to help database developers.

1. Extra columns

This seems fairly straight forward – we have access to additional columns (the one in the new table). It’s the main reason we tend to join to other tables – we simply don’t have all the data in our original table.

2. Duplicated rows

Not necessarily duplicated completely, but certainly a row in our original table may appear multiple times in our result set, if it matches with multiple rows in the new table. Notice that this doesn’t occur if the join-fields (those used in the ON clause) in the new table are known to be unique (as is the case with a Foreign-Key lookup).

3. Eliminated rows

In a similar manner, if a row in our original table doesn’t match with any rows in the new table, it might get eliminated from the results. Notice that this doesn’t occur if a LEFT JOIN or FULL JOIN is being used.

4. Added NULLs

This effect is somewhat rarer, but still worth mentioning. If a RIGHT JOIN or FULL JOIN is being used, then there may be NULLs added to our result set to reflect rows in our new table that don’t have matches in our original one. This doesn’t happen unless a RIGHT JOIN or FULL JOIN is being used.

Let’s examine some queries. They all use the AdventureWorks sample database.

SELECT p.ProductID, p.Name, p.Color, p.Size
FROM Production.Product p;

This is our starting place. Our control query if you like. Its execution plan is very simple – a clustered index scan on the Product table. There are no other tables that need to be involved. It returns 504 rows.

Now let’s involve another table – Production.ProductSubcategory.

SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name as SubcatName
FROM Production.Product p
JOIN
Production.ProductSubcategory s
ON p.ProductSubcategoryID = s.ProductSubcategoryID;

This is a fairly standard lookup query. It’s essentially still about the Product table, but we’re hooking into the Subcategory table to be able to fetch some more information. Clearly we want effect #1 to take place.

But what about the other effects?

We don’t have duplicated rows, as s.ProductSubcategoryID is known to be unique (it’s the Primary Key).

We’re not introducing NULLs on the Product table side of the query, as we’re not using RIGHT JOIN or FULL JOIN.

And because this is a foreign key relationship, our constraint should stop us from eliminating rows. Only valid values can be used in p.ProductSubcategoryID – but this field can take NULLs, so in fact, the Eliminated Rows effect is taking place. Looking at the results of the query show that only 295 rows are returned. This can be avoided by using a LEFT JOIN instead of the INNER JOIN.

SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name as SubcatName
FROM Production.Product p
LEFT JOIN
Production.ProductSubcategory s
ON p.ProductSubcategoryID = s.ProductSubcategoryID;

Now we are only seeing the first JOIN effect. As expected, the execution plan involves indexes on both queries.

But look what happens if you remove s.Name from the query.

SELECT p.ProductID, p.Name, p.Color, p.Size
FROM Production.Product p
LEFT JOIN
Production.ProductSubcategory s
ON p.ProductSubcategoryID = s.ProductSubcategoryID;

Now, none of our four effects are coming into play. And when we look at the execution plan, we see something surprising.

The query has simplified back down to to a single-table query. The query optimizer has deemed that the JOIN wasn’t having any effect on the query and therefore removed it.

At the moment, this may seem quite academic, but this can be leveraged to make life a lot easier for the average database developer.

It turns out that about half the time I need to refer to Products, I need to perform a bunch of lookups. I need to look up ProductType information, ProductSubcategory information, ProductCategory information, and so on. But each of these lookups makes my query-writing a little slower. I might be quite quick at writing these queries, but my queries typically have ten lines or more that are just providing the lookup logic.

It would be so much nicer to have a view called ProductsPlus, which was a slightly denormalised version of the Product table – including all my lookup information.

The problem is that without an understanding of the four join-effects, the system is probably going to be performing all the lookups every time you use the ProductsPlus view, even if you’re not interested in all the lookup columns. Examining the join-effects, we can eliminate them, just like in the previous example.

Let’s consider the contents of our ProductsPlus view. We have foreign key relationships in place to assist. All the FK columns can take NULL, so we are using LEFT JOIN. If they didn’t allow NULL, INNER JOIN would be okay.

CREATE VIEW dbo.ProductsPlus
AS
SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name AS SubcatName, c.Name AS CatName, m.Name AS ModelName, su.Name AS SizeUnit
FROM Production.Product p
LEFT JOIN
Production.ProductSubcategory s
ON p.ProductSubcategoryID = s.ProductSubcategoryID
LEFT JOIN
Production.ProductCategory c
ON s.ProductCategoryID = c.ProductCategoryID
LEFT JOIN
Production.ProductModel m
ON m.ProductModelID = p.ProductModelID
LEFT JOIN
Production.UnitMeasure su
ON su.UnitMeasureCode = p.SizeUnitMeasureCode
;

When I query this view using SELECT * FROM dbo.ProductsPlus, I get all the tables involved, of course.

But because we took care when creating the view to avoid effects #2, #3 and #4, when we only need certain columns, the other tables are completely ignored.