The Impact of Compression Delay in Real-time Operational Analytics

August 9, 2016

I have a session coming up at both the PASS Summit in October and the 24HOP Summit Preview event in September, on Operational Analytics. Actually, my session is covering how combining both In-Memory and R into the Operational Analytics story can deliver even greater benefits…

…but I thought I’d do some extra reading on Real-Time Operational Analytics, which also suits this month’s T-SQL Tuesday topic, hosted by Jason Brimhall (@sqlrnnr). He’s challenged us all to sharpen our skills in some area, and write about the experience.

Now, you can’t look at Real-Time Operational Analytics without exploring Sunil Agarwal (@S_u_n_e_e_l)’s excellent blog series. He covers a few things, but the one that I wanted to write about here is Compression Delay.

I’ve played with Compression Delay a little, but I probably haven’t appreciated the significance of it all that much. Sure, I get how it works, but I have always figured that the benefits associated with Compression Delay would be mostly realised by having Columnstore in general. So I was curious to read Sunil’s post where he looks at the performance numbers associated with Compression Delay. You can read this yourself if you like – it’s here – but I’m going to summarise it, and add some thoughts of my own.

The thing with Operational Analytics is that the analytical data, reporting data, warehouse-style data, is essentially the same data as the transactional data. Now, it doesn’t look quite the same, because it hasn’t been turned into a star schema, or had slowly changing dimension considerations applied, but for the purposes of seeing what’s going on, it’s data that’s capable of handling aggregations over large amounts of data. It’s columnstore.

Now, columnstore data isn’t particularly suited to transactional data. Finding an individual row within columnstore data can be tricky – that kind of lookup is much more suited to rowstore. So when data is being manipulated quite a lot, it’s not necessarily that good to be using columnstore. Rowstore is simply better for this.

But with SQL 2016, we get updateable non-clustered columnstore indexes – data which is a copy of the underlying table (a non-clustered index is a copy; the clustered index or heap is the underlying table). This alone presents a useful opportunity, as we can maintain a columnstore copy of the data for analytics, while handling individual row updates in the rowstore.
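To make that concrete, here’s a minimal sketch of what that looks like, using a hypothetical Sales.Orders table (the table and column names are just for illustration):

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders_Analytics
ON Sales.Orders (OrderDate, CustomerID, Quantity, UnitPrice);
-- In SQL 2016 this non-clustered columnstore index is updateable, so
-- inserts/updates/deletes on Sales.Orders keep it in sync automatically.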

Except that it’s a bit more complicated than that. Because every change to the underlying rowstore is going to need the same change made in columnstore. We’re not actually benefiting much.

Enter the filtered index. With a flag to indicate that frequent changes for that row have finished, we can choose to have the columnstore copy of the data only on those rows which are now relatively static. Excellent. Plus, the Query Optimizer does some clever things to help with queries in this situation.
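In case it helps picture it, here’s the same hypothetical index again, rebuilt as a filtered index – assuming an OrderStatus column where ‘Closed’ means the row has stopped changing frequently:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders_Analytics
ON Sales.Orders (OrderDate, CustomerID, Quantity, UnitPrice)
WHERE OrderStatus = 'Closed'      -- only cover the relatively static rows
WITH (DROP_EXISTING = ON);        -- replace the earlier, unfiltered sketch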

But many systems don’t have a flag like that. What then?

Well, one nice option is to consider using Compression Delay.

Compression Delay tells our columnstore index to delay compressing the data for some period of time. That is, to not turn it into proper columnstore data for a while. Remember I said that columnstore data doesn’t enjoy being updated much – this is to prevent that pain, by leaving it as rowstore data for a while.
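As a sketch of how that’s specified – against the same hypothetical index as before – the delay is just an index option, measured in minutes:

ALTER INDEX ncci_Orders_Analytics ON Sales.Orders
SET (COMPRESSION_DELAY = 60);  -- keep newly arrived rows in rowstore form for at least 60 minutes

The same option can be set when the index is created, with WITH (COMPRESSION_DELAY = 60).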

I haven’t really explored this much myself yet. I have a few simulations to run, to see what kind of performance gains can be had from this. But Sunil’s experiments saw a 15% improvement on the OLTP workload by choosing an appropriate Compression Delay, and that sounds pretty good to me.
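While I run those simulations, one handy way to watch what the delay is actually doing is to check the rowgroup states – a sketch, assuming the hypothetical index from above:

SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id, state_desc, total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Sales.Orders');
-- Rowgroups showing OPEN are still sitting in the rowstore delta store;
-- COMPRESSED ones have been turned into proper columnstore segments.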

I feel like there’s so much more to be explored with these new technologies. Having that flag to indicate when a row can be pulled into a filtered columnstore index seems really useful. Compression Delay seems great too, and in many ways feels like a nicer solution than ending up with a filtered index that might not catch everything. Compression Delay to me feels like having a filtered columnstore index that uses getdate() (which I think would be a lovely feature), although it’s not quite the same.

So I’m going to keep playing with this, and will be teaching you plenty about it at both the upcoming events. I could present a lot of it now, but I would prefer to deepen my understanding more before I have to stand in front of you all. For me, the best preparation for presentations is to try to know every tiny detail about the technology – but that’s a path I’m still on, as I continue to sharpen.

@rob_farley
