Twenty years ago, in January 1998, I touched SQL Server for the first time as a consultant – that was SQL Server 6.0, and I’d be involved in an upgrade to 6.5 at that same customer only a couple of months later. I would go on to spend quite a bit of time with this customer, and helped them implement a lot of things over time. I was a programmer back then, and saw the database as just part of the application. It was only later that I started to see applications as peripheral to the data rather than the other way around.
One major problem at this customer was the reporting. Every month they would need to provide reports to a part of the government, and these reports were simply painful to produce. Let me explain…
The basic system was designed to monitor the health of machines dotted all around the city – machines that would send a signal every few moments to indicate their status. Normally these were “nothing wrong” messages, but occasionally there would be a problem with a particular part of the machine, and our application would manage getting the message out to someone who could go and fix it. Later, a “fixed” message would come through; I’m sure you get the picture. There were also manual reports of downtime for parts that didn’t have automated signals, for those times when people would phone up and say there was a problem, and these needed to be included in the mix.
The reporting was about the overall uptime of each machine: whether there was a machine available at each location at a given time, whether the combination of unavailable components meant the entire machine was considered down, whether so many machines at a location were down that the location itself needed to be marked as unavailable, and so on. The data for a single machine looked like this:
| Machine | Time | Message |
|---------|------|---------|
| 100 | 19980109 12:33:05 | OK |
| 100 | 19980109 12:34:23 | HEAT WARNING |
| 100 | 19980109 12:34:26 | COMP1 ERROR |
| 100 | 19980109 12:34:29 | TOUCHPAD ERROR |
| 100 | 19980109 12:35:12 | COMP1 NOERROR |
| 100 | 19980109 12:35:19 | HEAT NORMAL |
| 100 | 19980109 12:30:00 | Report – TOUCHPAD ERROR |
| 100 | 19980109 12:35:00 | Report – TOUCHPAD FIXED |
…and so on. Timestamps were available for when phone calls came in, but apparently the reporting needed to show the time that the person said they saw it was unavailable, not the time the alert came through. Go figure.
The reporting needed to look at this data and figure out that the touchpad was unavailable between 12:30 and 12:35, and that Component 1 was unavailable for 46 seconds. There were tables to show that the touchpad being down meant the machine was considered only 50% available, that Component 1 being down meant 60% available, and that a combination of both touchpad and Component 1 meant completely down. So this data would mean: 4 minutes 26 seconds at 50% available, 34 seconds at 0% available, and 12 seconds at 60% available. The reports were in place before I went there – they just didn’t run well.
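To make that combination table concrete, here’s a minimal sketch of the idea, using made-up names and only the two components from the example (the real system tracked more than this):

```sql
-- Hypothetical availability lookup: which combinations of failed components
-- leave the machine partially available, and at what proportion.
CREATE TABLE dbo.ComponentAvailability
(
    TouchpadDown bit NOT NULL,
    Comp1Down    bit NOT NULL,
    Availability decimal(5,2) NOT NULL,
    CONSTRAINT PK_ComponentAvailability PRIMARY KEY (TouchpadDown, Comp1Down)
);

INSERT INTO dbo.ComponentAvailability (TouchpadDown, Comp1Down, Availability)
VALUES (0, 0, 1.00),  -- everything up
       (1, 0, 0.50),  -- touchpad down only
       (0, 1, 0.60),  -- COMP1 down only
       (1, 1, 0.00);  -- both down: the machine is considered completely down
```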
The reports ran by querying all the data, and then looping through it (in VB3) to figure out the gaps, trying to keep track of what was available and what wasn’t. A report at the machine level would pull back everything for that machine during the time period of interest and then try to figure it all out, but one at the location level, or even city-wide, was dealing with a very different amount of data. It would take a few minutes to run a report for a single machine for a week, and many of the available reports had simply never been run. Fixing this was one of the biggest challenges I had in my first few months of consulting.
I didn’t know until years later that what I’d done was create a data warehouse of sorts.
I could see that the moment when someone hit the button to say “Show me the report for machine 100 last week” was the wrong time to do the reporting, because no one wanted to watch a computer for a few minutes – or a few HOURS if they wanted a longer time period or more than a handful of machines. That meant the work needed to be done ahead of time.
So I relaxed many of the rules of database design I’d learned, and came up with some denormalised tables that were purely designed for analysing the data. These would store the amount of downtime per machine or per location, but be populated ahead of time – updated with maybe an hour or so of latency, long enough to allow for the phoned-in alerts. I’d store the number of seconds since the last machine event, and what the availability proportion was during that time period, and include extra dummy events for the start of each hour, so that the report could handle exact time periods. I spread the heavy lifting across time, pulling the new messages in as regularly as I could, to avoid doing that heavy lifting when time was more urgent. This was ETL, although I just thought of it as reporting preparation, and didn’t learn what ETL meant until much later.
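A rough sketch of the shape of table I mean (the names and types here are mine, for illustration, not the original schema): one row per machine per interval between events, with the availability for that interval already resolved, populated by the regular load rather than at report time.

```sql
-- Hypothetical pre-calculated reporting table. The dummy events injected at
-- the start of each hour simply close off whatever interval was open and start
-- a new one with the same availability, so a report for an exact time window
-- never has to split an interval itself.
CREATE TABLE dbo.MachineAvailabilityInterval
(
    MachineID     int          NOT NULL,
    IntervalStart datetime     NOT NULL,
    Seconds       int          NOT NULL,   -- length of the interval
    Availability  decimal(5,2) NOT NULL,   -- 1.00 = fully up, 0.00 = down
    CONSTRAINT PK_MachineAvailabilityInterval
        PRIMARY KEY (MachineID, IntervalStart)
);
```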
Once that data was in place, I could just sum the product of number of seconds * availability, and the reports could be produced quickly and easily, no matter how large they were. There was no longer a need for a cursor at report runtime – just regular aggregations.
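With a table shaped like the sketch above (again, my names, not the original system’s), a week’s report for one machine is just an aggregation along these lines:

```sql
-- Weighted uptime for machine 100 for one week: no cursor, just aggregation.
SELECT MachineID,
       SUM(Seconds * Availability) / SUM(Seconds) AS WeightedAvailability
FROM dbo.MachineAvailabilityInterval
WHERE MachineID = 100
  AND IntervalStart >= '19980105'
  AND IntervalStart <  '19980112'
GROUP BY MachineID;
```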
For me, this trick of doing the work at a time when it’s less urgent is key to a lot of things in the database world (and also in the rest of life). I use indexes to keep a sorted copy of data so that aggregations and joins can run faster. I use filtered indexes to separate out lists of new data from old data. I use data warehouses and cubes to handle history, data quality, and complex calculations, so that business users can explore data to their hearts’ content. If I can put up with doing the work later, then maybe it’s fine not to have those indexes in place – perhaps I’m totally okay with scanning a table of 1000 rows every so often. But if the cost of maintaining a separate copy (index, warehouse, whatever) isn’t going to be significant when spread out over time, and that copy is going to make life much easier when it comes to the crunch, then I’m going to do it.
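As a present-day illustration of that “separate new data from old” point (filtered indexes arrived in SQL Server 2008, long after the system described here), assuming a hypothetical message table with a Processed flag:

```sql
-- A small index over only the unprocessed rows, so the frequent
-- "what's new since last time?" query stays cheap as the table grows.
CREATE NONCLUSTERED INDEX IX_Messages_Unprocessed
    ON dbo.Messages (MachineID, MessageTime)
    INCLUDE (MessageText)
    WHERE Processed = 0;
```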
I often think back to some of the customers / projects that I’ve had over the years and roll my eyes at what I had to overcome. Generally though, the worse they were, the more I learned and the stronger I became.
It’s all just practice for whatever is going to come in the future, right?
This post was written in response to this month’s T-SQL Tuesday, hosted by Arun Sirpal.