I don’t write about situations at my customers. It’s just a rule I have. Regular readers may have heard my say it before. So if I’m writing about a problem, it’s much more likely that it’s something I’ve come across elsewhere, or a general thing. This month’s topic for T-SQL Tuesday (hosted by Kerry Tyler (@airbornegeek) is about “crash reports” – the stuff that we’ve learned from disasters that can teach us (and others) about how to avoid disaster in the future.
So it’s a bit of a tricky topic without writing about something that’s happened at a customer. I’m still not going to write about what happened at a customer, but I am going to write about a problem that I’ve seen plenty of times, and what I do to avoid it.
It’s about not sufficiently protecting the data that is being populated into a warehouse. I’ve been guilty of this before, and I continue to try to do better.
The scenario is that there’s a source system, and there’s a table in a warehouse that is being used to report on it. Maybe it’s being populated by Integration Services or Data Factory. Maybe it’s being populated by T-SQL. I don’t really care. What I care about is whether the data in the warehouse is a true representation of what’s in the source system.
If it’s not a true representation, then we have all kinds of problems.
Mostly, that our warehouse is rubbish.
Seriously – if the data in the warehouse can’t be shown to be correct, then what’s the point of it? People won’t want to use it, because they won’t want their reports to be wrong. If they use it because it’s the best they’ve got, knowing the data is sometimes wrong, then anything that comes from it will need further analysis and the inaccuracies will be a constant frustration. A warehouse needs to be correct.
It should be easy to validate too. It should be easy to pull totals from the source system (which could be a staging environment – it’s doesn’t have to be production, it just has to be trusted to be correct), to see if they match what’s in the warehouse. It should be able to be tested following refresh of the warehouse data, and alerts should be raised if there’s a problem.
So there’s a problem, and you probably have a good idea about how to detect it. But to be a “crash report”, we need to explain what caused it in the first place. There could be several factors involved, but in the situation(s) I’m remembering, it came down to not protecting the base table.
An ETL process gets data from source system(s) into a warehouse, doing some amount of transformation to get it into the right shape, because the things that we’re counting might need information from other tables in the source to work out how they should be represented in the warehouse tables, etc., etc., but the whole way along, we need to be protecting that data. And particularly from JOINs.
A join isn’t a bad thing. It lets us get to the relevant data from other tables. It’s very useful and you learn about it very early on in a T-SQL class. If you’re using ADF or SSIS to do your ETL, you might have T-SQL in a source query. Or you might be pushing data through a Lookup or a Merge. You’re almost certainly going to be needing information from another table. And you need to take care with that join.
You see, joins don’t just pull data from other tables. They can also filter your data, if you’re doing an inner join. They might make extra copies of the rows in your base table, if you hit multiple rows on the other side. You might even introduce rows full of nulls if you’re doing a right outer join or full outer join.
This is a very real problem that I’ve had to deal with.
In an ideal world, every row in my base table goes through to my warehouse as exactly one row. But these three things that joins do… well, they can trip us up.
It’s probably quite easy to stop pulling in null rows, because you’re probably not using many right outer joins or full outer joins.
As for the filtering – well, it’s easy to use left joins everywhere… except that you will need to be careful that you don’t accidentally eliminate any nulls that you’ve grabbed from that by using them in another join or whatever. You might figure that you have foreign keys in place to make sure that you can’t filter any rows with your join, but you’ll need to make sure that the column in your base table doesn’t allow nulls, even if it’s currently always populated.
Hitting multiple rows should be easy to handle, because you can make sure the join predicates you’re using can only hit one row. But this isn’t always straightforward either. If you’re referring to a primary (or other enforced-to-be-unique) key, then great. But will you be told if that changes? Or what if it’s more complicated than that? Suppose you’re joining to a row that describes a date range (like a dimension table with startdate and enddate), but somehow you’ve got overlapping ranges?
I’ve seen too many crash reports that cover this. Some have been of my own making. It’s not always in warehouses, sometimes it’s in basic queries used for other things, and I’ve failed to protect the base table.
What I try to do is to make sure that it’s impossible for my joins to impact the number of rows in my base table. This means knowing what I consider the base table to be in the first place even, but then being cautious about the way I’m joining. I often find myself replacing joins with “outer apply (select top (1) …)”, even when I’m pretty sure that it should be okay. And this might impact the performance of my query – but correctness is critical. Plus, depending on how it’s being used, I can often find ways to ensure that performance is even better this way (because the Query Optimizer will know that the number of rows isn’t being affected by the join).
Disasters are no fun. I’m all about making sure that my customers can trust their data. If my query gives incorrect results, they lose confidence in future queries that I write, and possibly in their data environment in general. It’s critical that I get it right.
My “crash report” today is stopping before we reach the point of the customer losing confidence in things, but I can assure you that I’ve seen this happen. Today, I’m choosing those times when we’ve noticed what’s going on and mitigated things before the disaster takes hold. But I implore you – protect your base tables. It really is very important.