While untangling a pretty nasty correlated subquery someone had written at work that was going three layers deep with IN (select X from Y) type of stuff, I used twitter to vent some of my rage SQL Training Wheels. This found its way to facebook, where one of my friends (and fellow LessThanDot members) asked “How So”.
Well, the answer is simple. Like training wheels, they are easy. Damn easy. But, also like training wheels, they can really get in the way. As such, use of IN is often good for quick ad-hoc queries, but for anything to be run more than once it’s probably worth at least exploring the option of using a derived table and a join.
Ok here is a second try at an example. The query in question was meant to get the most recent appointment for each invoice, when there is not really a direct link. So take some sample data:
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | --invoices, invoice activities, appointments create table #invoices (id int identity(1,1) primary key clustered, bill_date datetime) create table #invoice_activities (id int identity(1,1) primary key clustered, invoice_id int, activity_date datetime, cost smallmoney) create table #activities_appointments (act_id int, appt_id int) create table #appointments (id int identity(1,1) primary key clustered, appointment_date datetime) insert #invoices select GETDATE() union all select GETDATE()-1 union all select GETDATE() - 4 insert #invoice_activities select 1, GETDATE() - 5, 426.6 union all select 1, GETDATE() - 7, 435.9 union all select 2, GETDATE()-10, 100.4 union all select 3, GETDATE() - 15, 1023.54 union all select 3, GETDATE() - 20, 195.32 union all select 3, GETDATE() - 25, 191.65 insert #activities_appointments select 1, 1 union all select 2,2 union all select 4, 3 union all select 5, 4 union all select 6, 5 insert #appointments select GETDATE() - 5 union all select GETDATE() - 7 union all select GETDATE() - 10 union all select GETDATE() - 20 union all select GETDATE() - 25 |
--invoices, invoice activities, appointments create table #invoices (id int identity(1,1) primary key clustered, bill_date datetime) create table #invoice_activities (id int identity(1,1) primary key clustered, invoice_id int, activity_date datetime, cost smallmoney) create table #activities_appointments (act_id int, appt_id int) create table #appointments (id int identity(1,1) primary key clustered, appointment_date datetime) insert #invoices select GETDATE() union all select GETDATE()-1 union all select GETDATE() - 4 insert #invoice_activities select 1, GETDATE() - 5, 426.6 union all select 1, GETDATE() - 7, 435.9 union all select 2, GETDATE()-10, 100.4 union all select 3, GETDATE() - 15, 1023.54 union all select 3, GETDATE() - 20, 195.32 union all select 3, GETDATE() - 25, 191.65 insert #activities_appointments select 1, 1 union all select 2,2 union all select 4, 3 union all select 5, 4 union all select 6, 5 insert #appointments select GETDATE() - 5 union all select GETDATE() - 7 union all select GETDATE() - 10 union all select GETDATE() - 20 union all select GETDATE() - 25
and two different ways to approach it:
select id, (select max(appointment_date) from #appointments where id in ( select distinct(appt_id) from #activities_appointments where act_id in ( select id from #invoice_activities where invoice_id = i.id ) ) ) from #invoices i select i.id, MAX(appointment_date) from #invoices i left join #invoice_activities ia on i.id = ia.invoice_id left join #activities_appointments aa on ia.id = aa.act_id left join #appointments a on aa.appt_id = a.id group by i.id
Notice that one uses a correlated subquery and some IN’s, and the other uses simple joins. For this example, there is not much performance difference (most queries are pretty easy with a small data set) but in the real database there are several million rows in each table, and one of the tables involved takes up about 1/2 of the entire database for the application we support.
The example from work was a bit more complex (it wasn’t straightforward joins, and I needed to create a pretty gross derived table to get the dataset I wanted to join to) but the performance gain was much greater. Because the correlated subquery as I understand it gets called 1x per row in the resultset, and it was not exactly a lightweight query, the original query my coworker had written took 15-30 seconds to return a relatively small resultset. The more rows you want back, the worse it gets.
By joining to a derived table we were able to get the query down below 2 seconds. This gain was probably more an indication of how bad the subquery really was (if you think my example is ridiculous, try thinking of one ten times more so!) but I think the lesson holds that you don’t want to always use IN just because its’ more comfortable.
For another reason I dislike IN, check out Denis’ Old Post
*** Got a SQL Server question? Check out our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum
4 Comments
Excellent post….and I think you should give that friend that prompted this post a beer or 12 🙂
you have more choices, what about EXISTS for example?I prefer EXISTS over IN anytime because of possible NULL values
My experience is that old-school style (IN/EXISTS) has a certain advantage in this particular case.
In the FROM clause of the first query you have only one table – #main – so the cardinality of the resultset if less than or equal to |#main|. In the second query just by looking at the FROM clause the query optimizer puts an upper bound of the resultset cardinality as |#main|*|#status|. It has to look for unique constraints on the columns in the equijoin to limit the resultset to |#main|. If the ON clause is not simple = operator things get ugly.
This said I’ve seen numerous occasions when the query optimizer does not figure out there is a unique (e.g. derived table with a GROUP BY) on one the columns in the equijoin and creates sub-optimal execution plans.
Obviously the JOIN is somewhat more powerful construct here because it brings the columns from second table (#status) to the namespace (column list) of the SELECT clause. But if the goal is only to “filter” rows from #main, as is this case, the “weaker” IN/EXISTS will be sufficient *and* often more performant.
cheers,
</wqw>
I’ll keep this in mind and play with it.