In SQL Server, the cost of a particular plan is based largely on I/O, which makes this post a good candidate for this month’s T-SQL Tuesday, hosted by Mike Walsh who, like me, walks a straight path.
In considering I/O – the movement of data generally In and Out of disk and memory – my thoughts turn very much to query cost. If I run set statistics io on, I can see the reads and writes, both logical and physical, of the queries I run, and I can use this information to gauge the impact of my query on the system.
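For reference, this is all it takes to see those numbers for a query – a minimal sketch, with a placeholder table name:

set statistics io on;

-- run any query; the Messages tab then reports logical and physical reads per table
select count(*) from dbo.SomeTable;   -- hypothetical table, for illustration only

set statistics io off;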
I can also see that the Estimated Subtree Cost as indicated in the Execution Plan seems to grow with the number of logical reads. This makes sense, as the number of logical reads required by a query can really make a query costly. A logical read is the examination of a page of data in RAM, but it should always be remembered that if the required page is not already in RAM, it will need to be pulled in off disk first, and that shows up as a physical read.
The standard line when measuring cost in terms of I/O comes down to the impact of a physical read on the system. When a physical read occurs, this involves getting data off the actual disk(s), which I liken to the process of picking a track on an old vinyl record player (the records were vinyl, not the record players). Of course, the disk controller lines up the tracks far quicker than I ever could, but the principle still applies. Compared to shifting bits in RAM, it’s incredibly slow. Much less so with Solid State Disks, but that’s another matter entirely.
But the impact of I/O is certainly not the only factor on query cost. I remember some time ago having a discussion about string splitting with a friend of mine. I had seen some posts discussing the different ways of splitting strings, with the conclusion being that CLR functions are the best, as SQL Server generally isn’t very good at string manipulation, but it led to an interesting discussion.
Take one of the standard approaches to splitting strings, such as using a table of numbers. For the purposes of the experiment, I have a table called dbo.nums(num int primary key), populated with the numbers 1 up to at least 50.
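The post assumes dbo.nums already exists; if you want to follow along, a quick sketch like the following would do (an assumed setup, not the author’s script – any method of generating the numbers is fine, this one just leans on sys.all_objects for convenience):

-- create and seed a small numbers table
create table dbo.nums (num int primary key);

insert dbo.nums (num)
select top (1000) row_number() over (order by (select null))
from sys.all_objects;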
with table_of_strings (string) as (
    select 'This is one string'
    union all
    select 'And here is another'
    union all
    select 'a third...'
    union all
    select '...and a fourth'
)
select substring(s.string, n.num,
        charindex(' ', substring(s.string, n.num, len(s.string)) + ' ')) as word
from table_of_strings s
join dbo.nums n
    on n.num <= len(s.string)
    and substring(' ' + s.string, n.num, 1) = ' '
;
There are obviously other methods that can be used – this is just one. But now compare it to the following:
declare @qry nvarchar(max);

with table_of_strings (string) as (
    select 'This is one string'
    union all
    select 'And here is another'
    union all
    select 'a third...'
    union all
    select '...and a fourth'
)
select @qry =
    stuff(
        (select replace(' ' + replace(string, '''', ''''''), ' ', ''' as word union all select ''')
         from table_of_strings
         for xml path('')),
        1, 20, '') + ''' as word'
;

exec sp_executesql @qry;
Here I’m constructing a query using dynamic SQL to split strings, building a single query which performs a UNION ALL of one constant SELECT per word. And when I look at the cost of a batch that runs both these methods (by asking Management Studio to show me the Plans), I find that the second one is incredibly cheap. In fact, if I turn on statistics io, I can see that there is no I/O at all for the second method, whereas the first requires 12 logical reads – presumably to get at my table of numbers.
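To see why the plan looks so cheap, it helps to picture what ends up in @qry – roughly something like this (abridged): a long chain of constant SELECTs touching no table at all.

select 'This' as word union all
select 'is' as word union all
select 'one' as word union all
select 'string' as word union all
select 'And' as word union all
-- ...one SELECT of a literal per word...
select 'fourth' as word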
Clearly my dynamic SQL method of splitting strings is fantastically fast, and this is completely true. It is fantastic – a thing of fantasy. It’s actually really awful.
You see, as I mentioned earlier, SQL Server isn’t particularly good at string manipulation. So much so that creating this piece of dynamic SQL is very costly, won’t scale, and should never be implemented in a real environment. You can have a look at it to verify what I’m saying, but if you try to do this on a large set of data, you will very quickly find out just how poor it is.
To test it, I created a table called dbo.table_of_strings (string varchar(1000)) and populated it with those four strings – a sketch of that setup follows the script below. I then doubled its size ten times using:
insert dbo.table_of_strings select * from dbo.table_of_strings
go 10
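For completeness, the test table itself might be set up along these lines (an assumed sketch – the post doesn’t show the exact script):

create table dbo.table_of_strings (string varchar(1000));

insert dbo.table_of_strings (string)
select 'This is one string'
union all select 'And here is another'
union all select 'a third...'
union all select '...and a fourth';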
Using the table of numbers to split the strings involved over 12,000 logical reads, but the elapsed time showed 289 milliseconds to provide the 13,312 rows of results.
The dynamic SQL method used a total of just 26 logical reads, but took 34 seconds to run on my (old) machine.
I’m not suggesting you ignore the I/O-based cost of execution plans, but you should most definitely consider more than just that – particularly as Solid State storage becomes more and more commonplace.
This Post Has 6 Comments
Bummer! I was going to do a "Bad habits to kick : ignoring I/O" post for tomorrow, but my ideas were pretty much identical to yours. Kudos.
You should still write it though.
I’d give SET STATISTICS TIME ON a shout out too. I’ve found my own fair share of instances where a query that appears to run acceptably time-wise and I/O-wise may be over-utilizing CPU.
Yup, you’re really right. I abandoned my original idea though, and published a broader "bad habits" type of I/O post.
I’d run Profiler to get the correct reads for your second method.
I’ve seen STATS IO get totally screwed up when doing string manipulations.
I have a query where STATS IO shows 30k reads while Profiler shows 500k reads.
Yes, I demonstrated that in my sqlbits.com talk, where a Compute Scalar had lots of hidden reads. Stats IO showed just 2 reads, when there were many in Profiler.
But the problem with this query is the intense CPU needed. There are some hidden I/O costs, but that’s a small impact compared to the CPU.