Should SQL Server even have functions? (Oh yeah – this is a T-SQL Tuesday post, hosted this month by Brad Schulz)
Functions serve an important part of programming, in almost any language. A function is a piece of code that is designed to return something, as opposed to a piece of code which isn’t designed to return anything (which is known as a procedure). SQL Server is no different. You can call stored procedures, even from within other stored procedures, and you can call functions and use these in other queries. Stored procedures might query something, and therefore ‘return data’, but a function in SQL is considered to have the type of the thing returned, and can be used accordingly in queries. Consider the internal GETDATE() function.
1 2 |
SELECT GETDATE(), SomeDatetimeColumn FROM dbo.SomeTable; |
There’s no logical difference between the field that is being returned by the function and the field that’s being returned by the table column. Both are the datetime field – if you didn’t have inside knowledge, you wouldn’t necessarily be able to tell which was which. And so as developers, we find ourselves wanting to create functions that return all kinds of things – functions which look up values based on codes, functions which do string manipulation, and so on.
But it’s rubbish.
Ok, it’s not all rubbish, but it mostly is. And this isn’t even considering the SARGability impact. It’s far more significant than that. (When I say the SARGability aspect, I mean “because you’re unlikely to have an index on the result of some function that’s applied to a column, so try to invert the function and query the column in an unchanged manner”)
I’m going to consider the three main types of user-defined functions in SQL Server:
- Scalar
- Inline Table-Valued
- Multi-statement Table-Valued
I could also look at user-defined CLR functions, including aggregate functions, but not today. I figure that most people don’t tend to get around to doing CLR functions, and I’m going to focus on the T-SQL-based user-defined functions.
Most people split these types of function up into two types. So do I. Except that most people pick them based on ‘scalar or table-valued’. I’d rather go with ‘inline or not’.
If it’s not inline, it’s rubbish. It really is.
Let’s start by considering the two kinds of table-valued function, and compare them.
These functions are going to return the sales for a particular salesperson in a particular year, from the AdventureWorks database.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
CREATE FUNCTION dbo.FetchSales_inline(@salespersonid int, @orderyear int) RETURNS TABLE AS RETURN ( SELECT e.LoginID as EmployeeLogin, o.OrderDate, o.SalesOrderID FROM Sales.SalesOrderHeader AS o LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = o.SalesPersonID WHERE o.SalesPersonID = @salespersonid AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101') AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101') ) ; GO CREATE FUNCTION dbo.FetchSales_multi(@salespersonid int, @orderyear int) RETURNS @results TABLE ( EmployeeLogin nvarchar(512), OrderDate datetime, SalesOrderID int ) AS BEGIN INSERT @results (EmployeeLogin, OrderDate, SalesOrderID) SELECT e.LoginID, o.OrderDate, o.SalesOrderID FROM Sales.SalesOrderHeader AS o LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = o.SalesPersonID WHERE o.SalesPersonID = @salespersonid AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101') AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101') ; RETURN END ; GO |
You’ll notice that I’m being nice and responsible with the use of the DATEADD function, so that I have SARGability on the OrderDate filter.
Regular readers will be hoping I’ll show what’s going on in the execution plans here. Here I’ve run two SELECT * queries with the “Show Actual Execution Plan” option turned on.
Notice that the ‘Query cost’ of the multi-statement version is just 2% of the ‘Batch cost’. But also notice there’s trickery going on. And it’s nothing to do with that extra index that I have on the OrderDate column.
Trickery.
Look at it – clearly, the first plan is showing us what’s going on inside the function, but the second one isn’t. The second one is blindly running the function, and then scanning the results. There’s a Sequence operator which is calling the TVF operator, and then calling a Table Scan to get the results of that function for the SELECT operator. But surely it still has to do all the work that the first one is doing…
To see what’s actually going on, let’s look at the Estimated plan.
Now, we see the same plans (almost) that we saw in the Actuals, but we have an extra one – the one that was used for the TVF. Here’s where we see the inner workings of it. You’ll probably recognise the right-hand side of the TVF’s plan as looking very similar to the first plan – but it’s now being called by a stack of other operators, including an INSERT statement to be able to populate the table variable that the multi-statement TVF requires.
And the cost of the TVF is 57% of the batch!
But it gets worse.
Let’s consider what happens if we don’t need all the columns. We’ll leave out the EmployeeLogin column.
Here, we see that the inline function call has been simplified down. It doesn’t need the Employee table. The join is redundant and has been eliminated from the plan, making it even cheaper. But the multi-statement plan runs the whole thing as before, only removing the extra column when the Table Scan is performed.
A multi-statement function is a lot more powerful than an inline one. An inline function can only be the result of a single sub-query. It’s essentially the same as a parameterised view, because views demonstrate this same behaviour of extracting the definition of the view and using it in the outer query. A multi-statement function is clearly more powerful because it can contain far more complex logic.
But a multi-statement function isn’t really a function at all. It’s a stored procedure. It’s wrapped up like a function, but behaves like a stored procedure. It would be completely unreasonable to expect that a stored procedure could be simplified down to recognise that not all the columns might be needed, but yet this is part of the pain associated with this procedural function situation.
The biggest clue that a multi-statement function is more like a stored procedure than a function is the “BEGIN” and “END” statements that surround the code. If you try to create a multi-statement function without these statements, you’ll get an error – they are very much required. When I used to present on this kind of thing, I even used to call it “The Dangers of BEGIN and END”, and yes, I’ve written about this type of thing before in a similarly-named post over at my old blog.
Now how about scalar functions…
Suppose we wanted a scalar function to return the count of these.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
CREATE FUNCTION dbo.FetchSales_scalar(@salespersonid int, @orderyear int) RETURNS int AS BEGIN RETURN ( SELECT COUNT(*) FROM Sales.SalesOrderHeader AS o LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = o.SalesPersonID WHERE o.SalesPersonID = @salespersonid AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101') AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101') ); END ; GO |
Notice the evil words? They’re required. Try to remove them, you just get an error.
That’s right – any scalar function is procedural, despite the fact that you wrap up a sub-query inside that RETURN statement. It’s as ugly as anything. Hopefully this will change in future versions.
Let’s have a look at how this is reflected in an execution plan. Here’s a query, its Actual plan, and its Estimated plan:
1 2 3 4 5 |
SELECT e.LoginID, y.year, dbo.FetchSales_scalar(p.SalesPersonID, y.year) AS NumSales FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year) CROSS JOIN Sales.SalesPerson AS p LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = p.SalesPersonID; |
We see here that the cost of the scalar function is about twice that of the outer query. Nicely, the query optimizer has worked out that it doesn’t need the Employee table, but that’s a bit of a red herring here. There’s actually something way more significant going on.
If I look at the properties of that UDF operator, it tells me that the Estimated Subtree Cost is 0.337999.
If I just run the query SELECT dbo.FetchSales_scalar(281,2003); we see that the UDF cost is still unchanged.
You see, this 0.0337999 is the cost of running the scalar function ONCE.
But when we ran that query with the CROSS JOIN in it, we returned quite a few rows. 68 in fact. Could’ve been a lot more, if we’d had more salespeople or more years.
And so we come to the biggest problem. This procedure (I don’t want to call it a function) is getting called 68 times – each one between twice as expensive as the outer query. And because it’s calling it in a separate context, there is even more overhead that I haven’t considered here.
The cheek of it, to say that the Compute Scalar operator here costs 0%! I know a number of IT projects that could’ve used that kind of costing method, but that’s another story that I’m not going to go into here.
Let’s look at a better way.
Suppose our scalar function had been implemented as an inline one. Then it could have been expanded out like a sub-query. It could’ve run something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT e.LoginID, y.year, (SELECT COUNT(*) FROM Sales.SalesOrderHeader AS o LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = o.SalesPersonID WHERE o.SalesPersonID = p.SalesPersonID AND o.OrderDate >= DATEADD(year,y.year-2000,'20000101') AND o.OrderDate < DATEADD(year,y.year-2000+1,'20000101') ) AS NumSales FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year) CROSS JOIN Sales.SalesPerson AS p LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = p.SalesPersonID; |
Don’t worry too much about the Scan of the SalesOrderHeader underneath a Nested Loop. If you remember from plenty of other posts on the matter, execution plans don’t push the data through. That Scan only runs once. The Index Spool sucks the data out of it and populates a structure that is used to feed the Stream Aggregate. The Index Spool operator gets called 68 times, but the Scan only once (the Number of Executions property demonstrates this).
Here, the Query Optimizer has a full picture of what’s being asked, and can make the appropriate decision about how it accesses the data. It can simplify it down properly.
To get this kind of behaviour from a function, we need it to be inline. But without inline scalar functions, we need to make our function be table-valued. Luckily, that’s ok.
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE FUNCTION dbo.FetchSales_inline2(@salespersonid int, @orderyear int) RETURNS table AS RETURN (SELECT COUNT(*) as NumSales FROM Sales.SalesOrderHeader AS o LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = o.SalesPersonID WHERE o.SalesPersonID = @salespersonid AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101') AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101') ); GO |
But we can’t use this as a scalar. Instead, we need to use it with the APPLY operator.
1 2 3 4 5 6 |
SELECT e.LoginID, y.year, n.NumSales FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year) CROSS JOIN Sales.SalesPerson AS p LEFT JOIN HumanResources.Employee AS e ON e.EmployeeID = p.SalesPersonID OUTER APPLY dbo.FetchSales_inline2(p.SalesPersonID, y.year) AS n; |
And now, we get the plan that we want for this query.
All we’ve done is tell the function that it’s returning a table instead of a single value, and removed the BEGIN and END statements. We’ve had to name the column being returned, but what we’ve gained is an actual inline simplifiable function. And if we wanted it to return multiple columns, it could do that too. I really consider this function to be superior to the scalar function in every way.
It does need to be handled differently in the outer query, but in many ways it’s a more elegant method there too. The function calls can be put amongst the FROM clause, where they can then be used in the WHERE or GROUP BY clauses without fear of calling the function multiple times (another horrible side effect of functions).
So please. If you see BEGIN and END in a function, remember it’s not really a function, it’s a procedure. And then fix it.
This Post Has 6 Comments
Impressive under-the-hood information about the dark side of functions!
Very nice post! I have had to explain this numerous times to folks when their query takes > 30 mins to run and by inlining things it goes down to 5-10 seconds. Now I can stop explaining and refer them to this post. 🙂 thank you very much!
Nice article. I alwasy felt that the UDF was added to make proceural programmers feel good and to help them avoid learning declarative programming 🙂
Great post ! Very informative.
A ref to keep at hand.
[i]It will be a heck of a job to put that to a guitar tune :-)[/i]
Thanks everyone. I’m going to write a follow-up to this about the use of stored procedures in SSRS. It’ll be a little controversial. But I’m moving house this week, so it might not be for a few days.
And ALZDBA – it could be okay, perhaps something along the lines of "You’re not what you appear to be, you shouldn’t run procedurally" 🙂
I always thought inline TVFs should have been named parametrized views and grouped with them instead. Their treatment by the optimizer and performance profile is much more closely aligned with that.
I very much hope one day we will get inline scalar functions. It is a shame that very simple utility scalar functions are so much more easily outperformed by inline code. For maintainability, sometimes some simple scalar functions can be very useful – it would be nice if you didn’t have to take such a performance hit to use them.