I’m feeling the pressure for this month’s T-SQL Tuesday, probably because I’m also the host. I’ll be posting a roll-up for it soon too, which I’m sure will be great fun researching.
Given that the topic is on relationships, and the main SQL Server database is a relational engine, relationships are incredibly relevant to databases. The idea behind RDBMSs is that keys are used to refer to entities. This leads to foreign keys, such that a particular column(s) is constrained to values which appear in another table, thus referential integrity is enforced. And yet there are so many database designs out there that do not have relationships defined between tables. I shudder when I find them, but that doesn’t make them any less commonplace.
In data warehouses, designed primarily for reporting systems rather than transactional systems, tables are often designed in a more denormalized manner, to avoid needing to perform so many joins to access the data required for reports. This involves having tables with many extra columns, containing data that would otherwise be stored in other tables. Fewer tables are used, and therefore the system has fewer relationships.
I sometimes wonder how this should affect relational database design. I have written before about the fact that the Query Optimizer can leverage foreign key relationships to be able to simplify queries by noticing that joins can be redundant and simplified out of plans, but to summarise:
A foreign key relationship is between a column (or set of columns) in a table to a unique key (typically the primary key) in another table (which could even be the same one). Because of this uniqueness, the relationship can map to at most one record on the other side. And because the foreign key relationship enforces referential integrity, it must map to exactly one record on the other side (a caveat being that the foreign key column(s) could be configured to allow NULLs, which won’t map). Therefore, a join that doesn’t actually select data from any of the columns in the second table might be able to be simplified out of the query completely, as if the query didn’t have the join at all. But read my other post for more on that.
Thinking about many of the queries that I’ve written over the years, I know that I often only want one field from the table I’m joining. For example, I might want to get a ProductName when I have a ProductID, or I might want the Login from a UserID. A standard Lookup situation, which in a data warehouse would often be handled by storing the Name in a dimension table rather than the ID.
So my surmising leads me to this question:
If there is a unique index on the field that I typically want to lookup from a table, does this make it a better candidate for foreign key relationships than the primary key, so that the system can avoid needing as many joins?
This screams against everything I ever learned about relational databases. It would obviously make for a larger row, but if this were offset by performance gains in querying, could it be worthwhile? Maintaining referential integrity based on a string field may be more costly than on an integer field, but I’m wondering if the impact on SELECT queries through the reduction of the number of joins required might not make it a worthy consideration.
Please note: I’m not saying it’s necessarily a good idea – I’m surmising here, and would love to hear comments about whether or not other people have tried it, what circumstances they were trying to handle, and whether or not the idea worked.
But back to the Query Optimizer…
The QO needs information to be able to work out how to run your query. It needs statistical information about the columns that are filtered; it needs to be able to figure out which parts of the query can utilise indexes effectively (see my recent posts about SARGability); but also very significantly, it needs to have information about the relationships between tables.
Any time you write a query that involves a join (which I imagine most of your queries do), the system can use information about the relationship between the two tables. If it is defined as a foreign key relationship that doesn’t allow NULL values, then the system knows that each record in the ‘child’ table must match exactly one record in the other table. The Query Optimizer can use this information, and it should be available. Any time you’re thinking of not putting enforced relationships in your system, consider the fact that you’re blinding the Query Optimizer and ultimately making it do more work.
Put your relationships into your database design. Put constraints where they can apply, mark items as unique. The Query Optimizer will thank you for it, expressing its thanks as better performance.