When you have data stored across several servers, skew becomes very significant.
In SQL Server Parallel Data Warehouse (PDW), part of the Analytics Platform System (APS), data is stored in one of two ways – distributed or replicated. Replicated data is copied in full across every compute node (those servers which actually store user data), while distributed data is spread as evenly as possible across all the compute nodes, with the decision about where each row of data is stored is dependent on the hash of a column’s value. This column is known as the distribution key.
For performance reasons, though, data warehouse designers will typically choose a distribution key that avoids data movement, by choosing a column which is typically used in joins or aggregations. This performance is traded off against skew, which itself can have a performance impact. If a disproportionate amount of data is stored on a single compute node (which is really a question of ‘distributions’ which correspond to filegroups – eight per compute node), then storage is effectively used up more quickly (after all, you only need one disk to fill up to have problems), and queries run slower because one distribution has to churn through more data than the others.
It’s not even just monitoring it that is important, but trending it. You see, your data can change over time.
Imagine you sell products, to customers, in stores, over time. You could distribute your data by store, product, customer, or day (or even by some surrogate key, but then you need to generate that surrogate key). And whatever you choose might start off as just fine – until your business changes a little… such as a product comes out that outsells the others by a long way, or when you start selling online, to anonymous users, and offer sales on certain days of the year. Skew might not be there at the start, but it might become a factor over time.
So how can you measure skew? Well, using DBCC PDW_SHOWSPACEUSED.
You might have created a table called dbo.Sales, but if you have six compute nodes, then your data is actually stored across 48 tables. DBCC PDW_SHOWSPACEUSED("dbo.Sales") will provide the number of rows, and the space used and reserved for each of these 48 tables, letting me see what the skew is like.
If I have 480 million rows, then ideally I have 10 million rows in each distribution. But if my data shows that 4 of my distributions have 15 million rows, and the other 44 have about 9.5 million, then it might not seem like much skew – most of them are 95% of what I had hoped for. But if the disks storing those 15 million-row tables are starting to fill up, then I’m essentially missing out on the opportunity to store 5.5 million rows on each of the 44 emptier ones – 240 million rows’ worth! To put it another way – if they were all as full as my fullest distribution, I could’ve had 48 * 15 million rows, which is 720 million rows! This is not efficient use of my storage.
Applying a bit of maths like this can help me work out the wastage – looking to see what the maximum number of rows (or storage allocated) is, and multiplying that by the number of distributions that I have, to see what I could be storing if I didn’t have skew. This wastage could be a stored as a percentage easily enough, and this can be tracked over time.
You could very easily run this command across your distributed tables on a regular basis, storing the results in a table, and produce a report which shows both the overall skew across the appliance and individually across each table. This can then become part of your capacity planning process, as I’m sure you’re considering across all your SQL Server instances.