Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « SQL Server: When 8000 Characters Is Not EnoughSQL Friday, The Best SQL Server Links Of The Past Week Episode 16 »
    comments

    While untangling a pretty nasty correlated subquery someone had written at work that was going three layers deep with IN (select X from Y) type of stuff, I used twitter to vent some of my rage SQL Training Wheels. This found its way to facebook, where one of my friends (and fellow LessThanDot members) asked "How So".

    Well, the answer is simple. Like training wheels, they are easy. Damn easy. But, also like training wheels, they can really get in the way. As such, use of IN is often good for quick ad-hoc queries, but for anything to be run more than once it's probably worth at least exploring the option of using a derived table and a join.

    Ok here is a second try at an example. The query in question was meant to get the most recent appointment for each invoice, when there is not really a direct link. So take some sample data:

    1. --invoices, invoice activities, appointments
    2.  
    3. create table #invoices (id int identity(1,1) primary key clustered, bill_date datetime)
    4. create table #invoice_activities (id int identity(1,1) primary key clustered, invoice_id int, activity_date datetime, cost smallmoney)
    5. create table #activities_appointments (act_id int, appt_id int)
    6. create table #appointments (id int identity(1,1) primary key clustered, appointment_date datetime)
    7.  
    8.  
    9. insert #invoices
    10. select GETDATE()
    11. union all select GETDATE()-1
    12. union all select GETDATE() - 4
    13.  
    14. insert #invoice_activities
    15. select 1, GETDATE() - 5, 426.6
    16. union all select 1, GETDATE() - 7, 435.9
    17. union all select 2, GETDATE()-10, 100.4
    18. union all select 3, GETDATE() - 15, 1023.54
    19. union all select 3, GETDATE() - 20, 195.32
    20. union all select 3, GETDATE() - 25, 191.65
    21.  
    22. insert #activities_appointments
    23. select 1, 1
    24. union all select 2,2
    25. union all select 4, 3
    26. union all select 5, 4
    27. union all select 6, 5
    28.  
    29. insert #appointments
    30. select GETDATE() - 5
    31. union all select GETDATE() - 7
    32. union all select GETDATE() - 10
    33. union all select GETDATE() - 20
    34. union all select GETDATE() - 25

    and two different ways to approach it:

    1. select id,
    2.     (select max(appointment_date)
    3.         from #appointments
    4.         where id in (
    5.             select distinct(appt_id)
    6.             from #activities_appointments
    7.             where act_id in (
    8.                 select id
    9.                 from #invoice_activities
    10.                 where invoice_id = i.id
    11.             )
    12.         )
    13.     )
    14. from #invoices i
    15.  
    16. select i.id, MAX(appointment_date)
    17. from #invoices i
    18. left join #invoice_activities ia
    19. on i.id = ia.invoice_id
    20. left join #activities_appointments aa
    21. on ia.id = aa.act_id
    22. left join #appointments a
    23. on aa.appt_id = a.id
    24. group by i.id

    Notice that one uses a correlated subquery and some IN's, and the other uses simple joins. For this example, there is not much performance difference (most queries are pretty easy with a small data set) but in the real database there are several million rows in each table, and one of the tables involved takes up about 1/2 of the entire database for the application we support.

    The example from work was a bit more complex (it wasn't straightforward joins, and I needed to create a pretty gross derived table to get the dataset I wanted to join to) but the performance gain was much greater. Because the correlated subquery as I understand it gets called 1x per row in the resultset, and it was not exactly a lightweight query, the original query my coworker had written took 15-30 seconds to return a relatively small resultset. The more rows you want back, the worse it gets.

    By joining to a derived table we were able to get the query down below 2 seconds. This gain was probably more an indication of how bad the subquery really was (if you think my example is ridiculous, try thinking of one ten times more so!) but I think the lesson holds that you don't want to always use IN just because its' more comfortable.

    For another reason I dislike IN, check out Denis' Old Post

    *** Got a SQL Server question? Check out our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum

    About the Author

    User bio imageAlex is a .net and SQL Server developer from southeastern PA, where he lives with a lovely wife and a veritable smorgasbord of pets. He recently completed a masters degree in Software Engineering from Penn State. He loves mountain biking, open source software, home brewing, Syracuse basketball, and the mono runtime.
    Social SitingsTwitterLinkedInHomePageLTD RSS Feed
    4093 views
    Instapaper

    4 comments

    Comment from: Leeland Clay [Visitor]
    *****
    Leeland Clay Excellent post....and I think you should give that friend that prompted this post a beer or 12 :)
    04/03/09 @ 22:22
    Comment from: SQLDenis [Member] Email
    SQLDenis you have more choices, what about EXISTS for example?I prefer EXISTS over IN anytime because of possible NULL values
    04/04/09 @ 04:57
    Comment from: wqw [Visitor]
    wqw My experience is that old-school style (IN/EXISTS) has a certain advantage in this particular case.

    In the FROM clause of the first query you have only one table - #main - so the cardinality of the resultset if less than or equal to |#main|. In the second query just by looking at the FROM clause the query optimizer puts an upper bound of the resultset cardinality as |#main|*|#status|. It has to look for unique constraints on the columns in the equijoin to limit the resultset to |#main|. If the ON clause is not simple = operator things get ugly.

    This said I've seen numerous occasions when the query optimizer does not figure out there is a unique (e.g. derived table with a GROUP BY) on one the columns in the equijoin and creates sub-optimal execution plans.

    Obviously the JOIN is somewhat more powerful construct here because it brings the columns from second table (#status) to the namespace (column list) of the SELECT clause. But if the goal is only to "filter" rows from #main, as is this case, the "weaker" IN/EXISTS will be sufficient *and* often more performant.

    cheers,
    </wqw>
    04/04/09 @ 06:42
    Comment from: David Forck (thirster42) [Member]
    ***--
    I'll keep this in mind and play with it.
    04/06/09 @ 07:31

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)