If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.
You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!
You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.
1 |
CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation); |
And my actual query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
WITH MyLocations AS (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)), ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326)) ) t (Name, Geo)) SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country FROM MyLocations AS l CROSS APPLY ( SELECT TOP (1) * FROM Person.Address AS ad ORDER BY l.Geo.STDistance(ad.SpatialLocation) ) AS a JOIN Person.StateProvince AS s ON s.StateProvinceID = a.StateProvinceID JOIN Person.CountryRegion AS c ON c.CountryRegionCode = s.CountryRegionCode ; |
Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.
But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.
This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.
The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.
Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.
I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.
But I can persuade it with hints!
If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:
It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.
The way it works is to hook into the GeodeticTessellation function, which is essentially finds where the point is, and works out through the spatial index cells that surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. One of those times when we have a much more complex-looking plan, but just because of the good that’s going on.
This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.
When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:
1 2 |
Msg 8622, Level 16, State 1, Line 1 Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN. |
And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
WITH MyLocations AS (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)), ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326)) ) t (Name, Geo)) SELECT l.Name, COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1), COALESCE(a1.City,a2.City,a3.City), s.Name AS [State], c.Name AS Country FROM MyLocations AS l OUTER APPLY ( SELECT TOP (1) * FROM Person.Address AS ad WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000 ORDER BY l.Geo.STDistance(ad.SpatialLocation) ) AS a1 OUTER APPLY ( SELECT TOP (1) * FROM Person.Address AS ad WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000 AND a1.AddressID IS NULL ORDER BY l.Geo.STDistance(ad.SpatialLocation) ) AS a2 OUTER APPLY ( SELECT TOP (1) * FROM Person.Address AS ad WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000 AND a2.AddressID IS NULL ORDER BY l.Geo.STDistance(ad.SpatialLocation) ) AS a3 JOIN Person.StateProvince AS s ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID) JOIN Person.CountryRegion AS c ON c.CountryRegionCode = s.CountryRegionCode ; |
But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.
It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query…
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
WITH MyLocations AS (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)), ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326)) ) t (Name, Geo)) SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country FROM MyLocations AS l CROSS APPLY ( SELECT TOP (1) * FROM Person.Address AS ad ORDER BY l.Geo.STDistance(ad.SpatialLocation) ) AS a JOIN Person.StateProvince AS s ON s.StateProvinceID = a.StateProvinceID JOIN Person.CountryRegion AS c ON c.CountryRegionCode = s.CountryRegionCode ; |
Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
The following requirements must be met for a Nearest Neighbor query to use a spatial index: <ol> <li> A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses. </li> <li> The TOP clause cannot contain a PERCENT statement. </li> <li> The WHERE clause must contain a STDistance() method. </li> <li> If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause. </li> <li> The first expression in the ORDER BY clause must use the STDistance() method. </li> <li> Sort order for the first STDistance() expression in the ORDER BY clause must be ASC. </li> <li> All the rows for which STDistance returns NULL must be filtered out. </li> </ol> |
Let’s start from the top.
1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.
2. No ‘PERCENT’. Yeah, I don’t have that.
3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.
4. Yeah, I don’t have multiple predicates.
5. The first expression in the ORDER BY is my distance, that’s fine.
6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.
7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.
…but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…
…then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.
It just wasn’t overly intuitive, despite being documented.
This Post Has 3 Comments
Hi Rob, we tried to make it as intuitive as possible, without trying to explain all constraints and requirements in order to have correct behavior in all cases.
NULL values in SQL Server are on top with ORDER BY clause and that would ruin your TOP k nearest neighbor results but more importantly would kill performance benefit of using NN query plan. As you mentioned NULLs can be produced by STDistance if the two objects have different SRID. Unfortunately SQL Server does not support ORDER BY NULLS LAST and we don’t keep index on SRID value to know if you are using multiple SRIDs. We would need a table scan to figure it out. That’s why the easiest solution is to require proper query semantic to make sure that NULL values for STDistance are filtered out.
I’d like to hear if there are any suggestions on how to make docs more intuitive.
Hi Milan! Thanks for responding. We should talk about PDW more too…
Anyway – I think the docs aren’t bad, but the keys are the bits around points 3 and 7, plus the fact that the example doesn’t actually use the indexes on my machine!
So I wouldn’t mind an explicit explanation in the page saying "The query must include an explicit predicate to eliminate the possibility of NULL results which could occur in certain situations even if the columns disallow NULL values. For further information, see the pages regarding STDistance(), etc"
This kind of statement is going to mean that I don’t just dismiss the requirement as needless in my situation.
As for the example not working, a query hint of OPTION (MAXDOP 1) could help.
Of course, as a fan of execution plans, it would be nice to show what people should be looking for as evidence that the index is being used effectively…
…or maybe just provide a link to this post to help people. 😉
hey rob – great post. any suggestions on where to get a list of lat/long for intersections in NYC?