This month, Deepthi Goguri (@dbanuggets) asks us about our favourite new feature in SQL Server 2022 or Azure. And while there are always a few, I’m going to write about why I have a particular fondness of “IS [NOT] DISTINCT FROM“, despite the fact that it’s overly wordy and the functionality isn’t actually new at all.
People understand my point that it’s a little wordy. Typing “IS NOT DISTINCT FROM” instead of “=” doesn’t sound fun to anyone, and I think “==” or “IS” ought to be fine. The fact that the functionality isn’t new… well that statement seems to raise a few eyebrows.
The ISNULL() function is why a lot of people won’t bother using “IS [NOT] DISTINCT FROM”, which is really disappointing because there are important reasons why the two are very different. Let’s recap and explain what’s going on.
(My starting example is going to seem a little contrived, but I’ll show you a more realistic example further down.)
Suppose we have a variable that we’re wanting to use to filter a column. We’ll use AdventureWorks.
1 2 3 4 5 |
DECLARE @subcat int = NULL; SELECT COUNT(*) FROM Production.Product p WHERE p.ProductSubcategoryID = @subcat; |
This is great. No surprises here. We get a result saying 0.
Except that it’s wrong. Kinda. The AdventureWorks database has 209 rows in Production.Product without a subcategory. We could have said “WHERE p.ProductSubcategoryID IS NULL”, but “IS” doesn’t support a variable. Comparing a value to NULL results in NULL. The predicate is not satisfied, and the row doesn’t get counted in my query. This is not a new thing. It’s “first week of writing T-SQL” stuff.
And as long as this has been a problem, people have solved it using the ISNULL() function. Like this:
1 2 3 4 5 |
DECLARE @subcat int = NULL; SELECT COUNT(*) FROM Production.Product p WHERE ISNULL(p.ProductSubcategoryID,-1) = ISNULL(@subcat,-1); |
But this sucks. It’s essentially correct, although it does mean that if we have a row where ProductSubcategoryID = -1, then our query doesn’t differentiate between it and NULL, and we can search for NULL values by setting the variable to -1 (and vice-versa). Of course, you’d only use this if the value of -1 couldn’t be used.
Critically for me, it doesn’t tend to use indexes well. AdventureWorks doesn’t have an index on Production.Product.ProductSubcategoryID out of the box, so I’ve made one (called ixRF_Subcat), but even with this index, this query runs by scanning the table, as you can see in this estimated plan.
Even though all the NULLs are grouped together in my index, applying a function like ISNULL() in my predicate means that the index isn’t so useful. It’s not like I indexed ISNULL(ProductSubcategoryID,-1), which would’ve been great for my query. If I’ve indexed one thing and searched on another, we’re not likely to see a Seek. The Query Optimizer chooses to scan. It checks every row, ignorant of the fact that it really doesn’t need to.
Alternatively, people use OR.
1 2 3 4 5 6 |
DECLARE @subcat int = NULL; SELECT COUNT(*) FROM Production.Product p WHERE p.ProductSubcategoryID = @subcat OR p.ProductSubcategoryID IS NULL and @subcat IS NULL; |
But this sucks too, and not only because I’ve been lazy with my use of brackets (parens) here (because I know that AND is equivalent to ‘multiply’ in BODMAS or PEDMAS or PEMDAS or however you learned it in school – and that OR is like ‘addition’ – point being that the AND happens before the OR).
You might be thinking it sucks because ORs cause Scans too. And you’d almost be right. That’s essentially what my problem is with using OR here, except that it’s not actually true.
It’s smarter than that. And I get the same plan whether I’m passing in an actual number or passing in NULL. This is the behaviour I want from the query. It’s basically doing a NULL-aware filter on the column. By ‘NULL-aware’, I mean one where it finds the NULL values if it’s filtering on NULL, rather than treating them in the typical NULL way.
The reason I don’t like the OR version is simply that I don’t want to condone the use of OR. The Query Optimizer is horribly fickle about OR and will often decide to give you a Scan just because it’s had a hard day and the thing you’re asking for is just too much in that moment and it wants to go and have a lie down.
(I’m sounding melodramatic here, but it’s not too far from the truth to suggest that if you give the Query Optimizer a large query, it’s less likely to work out the best plan, just because there are so many subtrees it has to analyse, and it might just get something adequate and figure it’s done.)
So I don’t want you to use OR like this. I want you to use IS NOT DISTINCT FROM instead.
1 2 3 4 5 |
DECLARE @subcat int = null; SELECT COUNT(*) FROM Production.Product p WHERE p.ProductSubcategoryID IS NOT DISTINCT FROM @subcat; |
The query plan for this is exactly the same as the one when I used OR. It’s totally NULL-aware, and it uses the index perfectly. Even the estimates inside the tooltip are identical to that lovely plan that I got from the OR query.
So we can see that using “IS NOT DISTINCT FROM” provides the same behaviour as what we’ve always been able to write. That’s why I say the functionality isn’t new. It’s not.
I don’t use OR to produce this behaviour though, and I’m not about to start, even on pre-2022 servers.
I will use “IS [NOT] DISTINCT FROM” when I need a single NULL-aware equality predicate. I do wish it was shorter to type (and honestly, if we could use ‘==’ for it, I’d probably start using it everywhere instead of ‘=’) but I’ll use it for when I only need it in one predicate. But when I need lots of them… I don’t want to be writing code like this:
1 2 3 4 |
WHERE old.Col1 IS DISTINCT FROM new.Col1 OR old.Col2 IS DISTINCT FROM new.Col2 OR old.Col3 IS DISTINCT FROM new.Col3 --etc |
I really don’t. Even shortcut keys to let me type that quicker aren’t going to make me want to produce that code.
Instead, I’ll keep doing what I do at the moment, which is to use EXISTS and EXCEPT/INTERSECT. Like this, which is satisfied if there is a difference in any of three columns from two different tables:
1 2 3 4 5 |
WHERE EXISTS ( SELECT new.Col1, new.Col2, new.Col3 EXCEPT SELECT old.Col1, old.Col2, old.Col3 ) |
Using this pattern in our original example looks somewhat obtuse, but the pattern works just the same, and still gives the ideal query plan:
1 2 3 4 5 |
DECLARE @subcat int = null; SELECT COUNT(*) FROM Production.Product p WHERE EXISTS (SELECT p.ProductSubcategoryID INTERSECT SELECT @subcat); |
I think you’ll agree that it’s a tad clunky when I’m using it to compare values in a single column, but that it looks a lot more elegant when there are several values involved.
It’s not perfect – it’s not like I can use it for anything except equality comparisons. But I can get creative with WHERE clauses in that subquery, and really significantly, I don’t need to think about whether columns are going to have NULL values. It even helps me sort out the kind of pain you can get with NULLs and the NOT IN construct (that pain comes from NULL comparisons, which simply goes away with INTERSECT/EXCEPT).
I like ‘IS [NOT] DISTINCT FROM’ a lot. Not so much for the functionality, as I had ways to handle that already. But because of raising awareness of the NULL problem. I’m not even going to mind seeing queries that use long lists of “IS NOT DISTINCT FROM”, because it’ll be way better than having them use ISNULL(), and I’m all for correctness and good use of indexes.