Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « Please Mr. DBA, Change default passwords and use strong passwordsWhat makes a successful SQL Server Reporting Services implementation? »
    comments

    I'm editing in a link to Adam Machanic's blog on this. In the comments on this topic here you will see there are imperfections found in my methods. Reading Adam's blog shows this in more detail.
    http://sqlblog.com/blogs/adam_machanic/archive/2009/04/26/faster-more-scalable-sqlclr-string-splitting.aspx

    Thanks Adam!

    I wrote this short CLR split function awhile back based on a few other articles I read when 2005 was released. I decided to play with it today and see if I could put it with the SQL split solutions.

    Let's get the basics out of the way on SQL CLR. SQL CLR is only good once it's in memory. The CLR function split basically won over the T-SQL split functions after it was cached. This is a critical variable to consider when thinking CLR vs. T-SQL options on coding. If you are doing heavy manipulation of data and heavy math, CLR will typically help you, but you should be very careful with CLR and memory management. You can run your server resources out and literally stop functionality. I highly recommend reading MrDenny's blog on CLR here. Denny touches on important topics on when to use CLR and why you shouldn't. After that, look into how SQL Server 32bit, 32bit with AWE and 64bit versions handle memory. Each handles memory differently. AWE enalbed instances will probably be the one that will cause you more headaches then the rest. I had severe memory issues a few months ago on a production database server that forced restarts nightly until I fixed the problem. I analyzed the problem and it came to be several factors that caused it and SQL CLR memory was one of those factors. Here is my chance to thank mrdenny and ptheriault again for the assisatnce on that strange problem.

    I went out and google'd "fast split function t-sql". Found a few and tested them against the CLR split method. I found a dozen or so split functions that looked good. I still went with a numbers table one after testing them out next to each other. Here is one of the functions I used. If you have a better one, post it in the comments and I can edit the post.

    1. ALTER FUNCTION [dbo].[Split] (
    2. @List varchar(7998), --The delimited list
    3. @Del char(1) = ',' --The delimiter
    4. )
    5. RETURNS @T TABLE (Item varchar(7998))
    6. AS
    7. BEGIN
    8. DECLARE @WrappedList varchar(8000), @MaxItems int
    9. SELECT @WrappedList = @Del + @List + @Del, @MaxItems = LEN(@List)
    10.  
    11. INSERT INTO @T (Item)
    12. SELECT SUBSTRING(@WrappedList, Number + 1, CHARINDEX(@Del, @WrappedList, Number + 1) - Number - 1)
    13. FROM dbo.Numbers n
    14. WHERE n.Number <= LEN(@WrappedList) - 1
    15. AND SUBSTRING(@WrappedList, n.Number, 1) = @Del
    16.  
    17. RETURN
    18. END

    Here is my CLR split

    1. using System;
    2. using System.Data;
    3. using System.Collections;
    4. using System.Data.SqlClient;
    5. using System.Data.SqlTypes;
    6. using Microsoft.SqlServer.Server;
    7.  
    8. public partial class UserDefinedFunctions
    9. {
    10.     [SqlFunction(Name = "CLR_Split",
    11.     FillRowMethodName = "FillRow",
    12.     TableDefinition = "id nvarchar(10)")]
    13.  
    14.     public static IEnumerable SqlArray(SqlString str, SqlChars delimiter)
    15.     {
    16.         if (delimiter.Length == 0)
    17.             return new string[1] { str.Value };
    18.         return str.Value.Split(delimiter[0]);
    19.     }
    20.  
    21.     public static void FillRow(object row, out SqlString str)
    22.     {
    23.         str = new SqlString((string)row);
    24.     }
    25. };

    I loaded a text file with a huge amount of delimited data to really get a gauge on time this would take. The string is basically, "data%data%data%data%data" and on. Around 600 indexes. I restarted my local instance of SQL Server 2005 that I did these on to ensure you can see CLR before cache and after.

    So on a fresh restart you can see by checking type MEMORYCLERK_SQLCLR from dm_os_memory_clerks that we are clear

    After execution of the CLR function you can see the differences

    Call these functions as

    1. Declare @bigdamnvar varchar(max)
    2. Set @bigdamnvar = (Select * From OpenRowSet ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=C:\', 'SELECT * from Data.txt'))
    3. Select * From dbo.CLR_Split(@bigdamnvar, '%')

    Below you can see first execution and then how quickly performance picks up on the second cached plan

    and

    1. Declare @bigdamnvar varchar(max)
    2. Set @bigdamnvar = (Select * From OpenRowSet ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=C:\', 'SELECT * from Data.txt'))
    3. Select * From dbo.[Split](@bigdamnvar, '%')

    I executed this a few times to get it in cache as well. The odd increase was the server working on something else. I validated that so ignore it.

    You may notice the bytes recieved from the server are different and SQL CLR is much heavier. That is something to keep in mind. SQL will always be light compared to SQL CLR. That is my experience in using the two side by side.

    The differences are small but if the task you intend to perform is something cached typically, keep CLR in mind. Mostly when it come to complicated tasks that may be left to other languages you have available.

    About the Author

    Ted Krueger is a SQL Server MVP and has been working in development and database administration for 13+ years. Specialties range from High Availability and Disaster / Recovery setup and testing methods down to custom assembly development for SQL Server Reporting Services. Ted blogs and is also one of the founders of LessThanDot.com technology community. Some of the articles focused on are Backup / Recovery, Security, SSIS and working on SQL Server and using all of the SQL Server features available to create stable and scalable database services. @onpnt Personal Blog over at http://onpnt.wordpress.com/
    Social SitingsTwitterLinkedInLTD RSS Feed
    22342 views
    InstapaperVote on HN

    24 comments

    Comment from: SQLDenis [Member] Email
    *****
    SQLDenis I prefer the numbers table way, seems purer

    Any reason you haven't used varchar(max) instead?
    04/22/09 @ 13:30
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) I didn't write the split function above (SQL version). It was here http://mo.notono.us/2005/08/sql-fast-string-split-function-using.html (I think, it was a link from another blog)
    I went out and grabbed a bunch that were posted on the web. Apologies to the author. I forgot to put the references
    04/22/09 @ 14:38
    Comment from: Jonathan Kehayias [Visitor] Email · http://www.sqlclr.net
    *----
    Jonathan Kehayias The testing that you used here is inherently flawed for your comparisions. You are using a single value to perform your testing versus doing this test against a table of data which is where the performance degredation of SQLCLR really begins to show for string splitting like this. Take this out to a large scale operation, and TSQL will have the advantage because of the cost of switching in and out of SQLCLR.

    If you'd like I can provide previous statistics from benchmarking methods very similar to yours where Jeff Moden and I squared of in a SQLCLR versus TSQL challenge to debunk the myth that SQLCLR is better for simple problems like this one. That is not to say that SQLCLR can't beat TSQL for complex operations like regular expression matches/replacements, and complex mathematics, but most common string operations can be done faster with TSQL when working against a dataset versus a single row.
    04/22/09 @ 19:09
    Comment from: SQLDenis [Member] Email
    SQLDenis Jonathan, feel free to post statistics or provide a link where you have it, it would be interesting to see what your results are
    04/22/09 @ 19:37
    Comment from: George Mastros (gmmastros) [Member]
    *****
    George Mastros (gmmastros) Jonathan,

    Perhaps I don't understand something. Are you suggesting that a column in a table that contains delimited text will perform better using T-SQL than SQL-CLR? If that's the case, I would argue that normalizing the data for persistent storage (and eliminating the need to split it) would perform better.
    04/22/09 @ 20:01
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) Hi Jonathan, Thanks for the comments. I would like to see the benchmarks as well if you can dig them up. I don't think something like s split function is really viable on a large scale table unless the design of the tables or data is just forcing that but of course I see what you are saying. Given that scenario if not configured correctly this type of test could and probably would cause more issues that just slow performance on the statement being executed. I'm just not sure given a split function it is really that common to run on much more than parameters or string values required for processing tasks.
    04/22/09 @ 20:18
    Comment from: Phil Factor [Visitor] · http://www.simple-talk.com
    ***--
    Phil Factor CLR-based routines are better for splitting small strings and TSQL solutions are better for large strings. Your tally-based solution is a poor performer compared to a well-written WHILE-loop solution when the strings you are splitting are fairly long. On powerful machines with plenty of CPU, the difference between CLR and TSQL solutions are less pronounced.
    Basically, TSQL-based solutions scale in a linear way with the size of the string it is splitting, whereas the CLR-based solutions scale in a second-order polynomial curve.
    To test all this out, you will need to create a test harness that executes all the candidate solutions with a wide variety of strings. It needs to repeat the tests on a number of servers, as these will give you striking differences in results.
    Sorry, friend. More work to do!
    04/25/09 @ 06:56
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) I'm not sure I agree when it comes to the length of string I used and the means in which the write up was played towards. You can say every word you just stated about any testing scenario no matter what the comparison is or the technology is. Our job is to write what we learn, play with and try out in a set base and then work off of that while allowing peers to also. Testing in situations like you said isn't in scope. If we had that at our disposal then maybe, but I’m far from a testing farm ;) That will be for the person that reads this and tries it out objectively in their environment and configuration. All results will differ obviously. This was done on a laptop messing around and was nothing more than that as the blog stated. The results were favored in CLR slightly and as mentioned in a higher level on how CLR differs, there are condition variables to take into account.

    There is no more work for me to do other than decide if this would benefit my own environment and production coding after testing there. The "more work to do" is for the readers to take it and test it on their own servers. I also do not think a 3045 character string is small and think it was a good median to test on.

    So it seems I agree but given a common landscape disagree with your choice of deciding this was a poor test :)
    04/25/09 @ 11:14
    Comment from: Adam Machanic [Visitor] · http://sqlblog.com
    ***--
    Adam Machanic I'm going to have to agree with others; your testing method is inherently flawed. Your two functions should have identical inputs and outputs or you're not doing an "apples-to-apples" comparison. That means that bytes received from the server should be identical. Your functions are not outputting the same thing, or perhaps not in the same format. Did you autodeploy the CLR function? If so, it's spitting out NVARCHAR, not VARCHAR, which explains the discrepancy.

    As for Phil's statement that "[a] tally-based solution is a poor performer compared to a well-written WHILE-loop solution when the strings you are splitting are fairly long", that's not really correct either. The fastest general solutions involve a numbers table and "chunking" of the string, since string operations on VARCHAR(MAX) are slower than string operations on VARCHAR(N).

    Here's the best T-SQL method I've come up with:

    http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/splitting-a-string-of-unlimited-length.aspx

    It will be slightly slower on small strings than the method you've posted here, but will scale better.

    I'll post my own CLR method on my blog soon.
    04/25/09 @ 16:46
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) Hi Adam. That wasn't the best choice on my part with the differences in types. After the feedback I've started working on a much cleaner and better formed test and SQLCLR function. Denis actually pointed out my flaw to me on teh side and I didn't catch it at first.

    Thanks for everyones feedback. I'm looking forward to hearing what you all think on the next one :)
    04/25/09 @ 17:12
    Comment from: Adam Machanic [Visitor] · http://sqlblog.com
    Adam Machanic Posted:

    http://sqlblog.com/blogs/adam_machanic/archive/2009/04/26/faster-more-scalable-sqlclr-string-splitting.aspx
    04/26/09 @ 18:46
    Comment from: Paul White [Visitor] Email
    **---
    Paul White Hey everyone,

    Recent intensive testing on SQLServerCentral.com with Jeff Moden, Phil Factor and many others has shown CLR TVFs to outperform all comers so far, on text ranging from the very small to the complete text of Moby Dick. This is for tables of strings to split, *not* single values.

    The two exceptions so far, where T-SQL wins is where all lines of a 10,000 row table with huge strings are *identical* (the QO splits one and uses a table spool for the remaining rows - very clever) ... and the second is where the text of Moby Dick is loaded into a single (N)VARCHAR(MAX) variable (Phil's solution).

    The best performing CLR TVF on huge sets so far uses RegEx splitting, with char-array splitting and simple String.Split() close behind.

    See http://www.sqlservercentral.com/Forums/Topic695508-338-1.aspx?Update=1 for the full details.

    Cheers,

    Paul
    04/27/09 @ 06:27
    Comment from: Phil Factor [Visitor] · http://www.simple-talk.com
    Phil Factor Oops. I didn't mean to sound as if I was being rude. I was just trying to give a feel for the complexity of any meaningful comparisons between splitting methods. There are a number of variables that seem to affect the result, and I was just trying to say that it needed a lot of testing to come to a clear conclusion.

    Adam gives a very interesting illustration, and seems to have found a way around the problems with the CLR solution.

    04/27/09 @ 07:10
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) I don't think you were rude at all Phil. I think that goes for the others as well.

    This entire topic has become extremely good and I'm glad all of you took the time to get the content on this blog and then take so much time to spawn off it. I have little doubt there are going to be a high number of our peers that will learn a valuable lesson from this and all the other conversations linked back. :)
    04/27/09 @ 07:21
    Comment from: Paul White [Visitor] Email
    **---
    Paul White Update - I encourage everyone to check out Adam's solution. I had missed the scalability he has added by splitting the string bit-by-bit in the IEnumerator, rather than up front. This should make all the difference when very long strings are involved.
    04/27/09 @ 07:50
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) After Adam let me know it was posted I edited this blog and addded the link to his so everyone goes there before really reading into my post or just to understand the issues more and see the better solution
    04/27/09 @ 08:12
    Comment from: Erik [Member] Email
    Erik The Split function you listed can be optimized. Aside from removing the unnecessary variable declaration, this handling of the delimiter start and end (instead of adding the delimiter ahead of time) yields a slight performance improvement:

    ALTER FUNCTION [dbo].[Split] (
    @List VARCHAR(8000),
    @Del CHAR(1) = ','
    )
    RETURNS @T TABLE (Item VARCHAR(8000))
    AS
    BEGIN
    INSERT INTO @T (Item)
    SELECT SUBSTRING(@List, Number, CHARINDEX(@Del, @List + @Del, Number) - Number)
    FROM dbo.Numbers n
    WHERE
    n.Number <= LEN(@List) + 1
    AND SUBSTRING(@Del + @List, n.Number, 1) = @Del

    RETURN
    END

    I'll go read Adam's posted article now.
    04/27/09 @ 14:52
    Comment from: niikola [Member] Email
    *****
    niikola Could you compare results with the next one using CTE.
    My test shows this one is much faster.

    Alter Function fnSplitWithCTE(
     @List varchar(8000),
     @Delimiter char(1),
     @ShowEmpty bit
    )
    Returns @ret Table (
     pos int Primary Key,
     value varchar(8000)
    )
    as Begin
       Declare @len int
       Set @len=len(@List)+1

       ;With a as
       (
         Select 1 as nStart,
                isNull(NullIf(charindex(@Delimiter,@List,1),0),@len) as nEnd,
                rtrim(ltrim(substring(@List,1,isNull(NullIf(charindex(@Delimiter,@List,1),0),@len)-1))) as value
         Union All
         Select nEnd+1,
                isNull(NullIf(charindex(@Delimiter,@List,nEnd+1),0),@len),
                rtrim(ltrim(substring(@List,nEnd+1,isNull(NullIf(charindex(@Delimiter,@List,nEnd+1),0),@len)-nEnd-1)))
         From a
         Where nEnd<@len
       )
       Insert Into @ret
       Select Row_Number() over (order by nStart),
              NullIf(value,'')
        from a
        Where (@ShowEmpty=1 or Len(value)>0)
        Return
    End
    09/02/09 @ 08:40
    Comment from: Ted Krueger (onpnt) [Member]
    Ted Krueger (onpnt) More than happy to do that. I will later and post results. In the 24 hours of pass sessions now :-)
    09/02/09 @ 08:48
    Comment from: Erland Sommarskog [Visitor] · http://www.sommarskog.se
    Erland Sommarskog I see many comments about this and that being faster in that and that context. What all that leads to is that if you need the absolutely best solution in some real application, you need to run your own benchmark.

    If you want to read performance data about string-splitting until you are blue in the face, I can offer http://www.sommarskog.se/arrays-in-sql-perftest.html.

    I like to comment on Phil Factor who said: "CLR-based routines are better for splitting small strings and TSQL solutions are better for large strings." I found nothing in my tests that indicates this. Au contrîare, when the string exceeds a certain size - around 600 KB on x64, 840 on x86 - execution times takes an incontiguous jump, and from there the slope is steeper. See my article above for details.
    09/05/09 @ 08:21
    Comment from: Jeff Moden [Visitor]
    Jeff Moden @Paul White

    I've got to revisit that rather long thread... except for the CLR test which did beat the Tally table on the longer (ie. Varchar(MAX)), Tally splits on VARCHAR(8000) came very close to the CLR. The problem I have with the tests that where done is just as you say... most of the tests were done with equal space values. Heh... except for Phil's Moby Dick split.

    What Phil didn't tell everyone is that he an I did a race on splitting the text of Moby Dick on another site. On his machine, the While Loop he uses won. On my machine, the Tally table won. We didn't test using a CTE.

    That's kind of brings us to what Adam said... you need to test against the actual data you're going to be splitting and you need to do it on the machine that will be doing all the heavy lifting.

    Shifting gears, I will say that the Tally table has a comparative performance problem when splitting VARCHAR(MAX) even if it contains less than 8000 characters because of it's very nature to be "out of row". But, only a very well written very short CLR will beat the Tally table on VARCHAR(8000). Even then, the Tally table comes so close that it's not worth the trip to the world of CLR's for me.

    Also, anyone who uses a user defined function to split a whole table of CSV's is making a mistake. The ONLY exception to that rule is Erland's TVF splitter.

    VTW, Erland... nice article.
    09/13/09 @ 20:10
    Comment from: praveen [Visitor] · http://praveenbattula.blogspot.com
    praveen Here is my version of the T-SQL UDF function.
    http://praveenbattula.blogspot.com/2010/06/best-split-udf-function-with-delimeter.html
    06/04/10 @ 12:42
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Great round up by Aaron Bertrand here
    http://sqlblog.com/blogs/aaron_bertrand/archive/2010/07/07/splitting-a-list-of-integers-another-roundup.aspx

    CLR is a winner in his tests when we're dealing with big lists.
    08/09/10 @ 08:39
    Comment from: Pavel Pawlowski [Visitor] · http://www.pawlowski.cz
    Pavel Pawlowski I was also playing with the strings splitting a while ago and found that the CLR is the fastest when used in wit small and even very large strings.

    It is only important to write the CLR in right way so CLR memory isn't wasted and App Domains are not recycled etc.

    I found, that the fastest method is simply iterate the characters in the string using a foreach loop and use an intermediate buffer.

    An article is here: http://www.pawlowski.cz/2010/10/fastest-csv-strings-splitting-using-clr-t-sql-vs-clr-revisited/

    The function mentioned on the site uses a parameter for defining maximum internal buffer lengths (maximum length of single delimited string) but it's very easy to update it to version which automatically adjusts the buffer if it's length isn't enough.
    01/01/12 @ 14:48

    Leave a comment


    Your email address will not be revealed on this site.

    To mislead the spambots.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)