How to make text searching go faster

March 8, 2016

…but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.

I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed strings patterns.

First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and found at a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’, you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sung (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.

But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.

Consider that I’m looking for rows in a table WHERE CommentText LIKE ‘%Farl%’. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.

For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(120). You can create it using code like this:

I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.

Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = ‘SQL_Latin1_General_CP1_CI_AS’; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.

Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.

Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.

I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations [citation needed].

But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.

This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.

Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If there were someone in my table who lived in FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column, but it’s definitely worth considering the benefits of asking for a collation-specific search.

And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.

If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing it away the index order (ORDER BY doesn’t get handled well by changing the collation) means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, then I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.

TSQL2sDay150x150

@rob_farley

This Post Has 8 Comments

  1. Uri Dimant

    Hi Rob
    It is great post, but question is not answered :-))) Waiting for your investigation.

  2. Rob Farley

    You mean the question about why a Hash is 10% (ish) faster in a binary collation?
    It’s because with a non-binary collation, it needs to make a call to GetSortKey(), so that it can hash a value which is equal for all the things that sort to the same point in that collation. Like how in a CI collation, H and h need to end up in the same hash bucket. In a binary collation, it doesn’t have to call GetSortKey(), because things are only equal when they have exactly the same binary value.
    I worked this out by asking someone who had access to a debugger which they regularly hook SQL Server into. He’s a good friend, and happens to live only a few time zones away from me. (Thanks Paul!)

  3. Uri Dimant

    I am aware of this Rob.
    Using binary collation as you pointed out may return incorrect data (not all data found)

  4. Rob Farley

    Yes. If you need to find "Uri" and "URI" and "uri", then you can’t use a binary collation. But if it’s always stored as "URI", then you can.

  5. Rob Farley

    Yes, Saeid. Lots of situations. Nice article – I wasn’t aware of it. I presented this as one of my T-SQL Tips at TechEd AU back in 2008, but it wasn’t new then either. Interesting that you refer to Brent’s article on Sargability – he refers to one of mine in his, and was in the audience for my SQLBits session on it in 2010, which you can see at http://bit.ly/Sargability

  6. Saeid Hasani

    Thanks Rob for your fast response. By referring to that article , I wanted to confirm your solution. I’m not against Uri’s idea. He is a bright person in SQL world. But, this solution has its opponents (like few comments on my article). So, I was become happy when I read your great post. I didn’t consider Brent’s article references. I downloaded your SQLBits session and I saw it . It is an awesome session. Wish I saw it few years ago! 🙂 Thank you so much!

  7. Rob Farley

    No worries, and thanks for supporting it.

Leave a Reply

LobsterPot Blogs

Blog posts by Rob Farley and other LobsterPot Solutions team members.

Search

Related Blogs