LessThanDot

I’m editing in a link to Adam Machanic’s blog on this. In the comments on this topic here you will see there are imperfections found in my methods. Reading Adam’s blog shows this in more detail.

http://sqlblog.com/blogs/adam_machanic/archive/2009/04/26/faster-more-scalable-sqlclr-string-splitting.aspx

Thanks Adam!

I wrote this short CLR split function awhile back based on a few other articles I read when 2005 was released. I decided to play with it today and see if I could put it with the SQL split solutions.

Let’s get the basics out of the way on SQL CLR. SQL CLR is only good once it’s in memory. The CLR function split basically won over the T-SQL split functions after it was cached. This is a critical variable to consider when thinking CLR vs. T-SQL options on coding. If you are doing heavy manipulation of data and heavy math, CLR will typically help you, but you should be very careful with CLR and memory management. You can run your server resources out and literally stop functionality. I highly recommend reading MrDenny’s blog on CLR here. Denny touches on important topics on when to use CLR and why you shouldn’t. After that, look into how SQL Server 32bit, 32bit with AWE and 64bit versions handle memory. Each handles memory differently. AWE enalbed instances will probably be the one that will cause you more headaches then the rest. I had severe memory issues a few months ago on a production database server that forced restarts nightly until I fixed the problem. I analyzed the problem and it came to be several factors that caused it and SQL CLR memory was one of those factors. Here is my chance to thank mrdenny and ptheriault again for the assisatnce on that strange problem.

I went out and google’d “fast split function t-sql”. Found a few and tested them against the CLR split method. I found a dozen or so split functions that looked good. I still went with a numbers table one after testing them out next to each other. Here is one of the functions I used. If you have a better one, post it in the comments and I can edit the post.

ALTER FUNCTION [dbo].[Split] ( 
@List varchar(7998), --The delimited list 
@Del char(1) = ',' --The delimiter 
) 
RETURNS @T TABLE (Item varchar(7998)) 
AS 
BEGIN 
DECLARE @WrappedList varchar(8000), @MaxItems int 
SELECT @WrappedList = @Del + @List + @Del, @MaxItems = LEN(@List) 

INSERT INTO @T (Item) 
SELECT SUBSTRING(@WrappedList, Number + 1, CHARINDEX(@Del, @WrappedList, Number + 1) - Number - 1) 
FROM dbo.Numbers n 
WHERE n.Number <= LEN(@WrappedList) - 1 
AND SUBSTRING(@WrappedList, n.Number, 1) = @Del 

RETURN 
END

Here is my CLR split

using System;
using System.Data;
using System.Collections;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction(Name = "CLR_Split",
    FillRowMethodName = "FillRow", 
    TableDefinition = "id nvarchar(10)")]

    public static IEnumerable SqlArray(SqlString str, SqlChars delimiter)
    {
        if (delimiter.Length == 0)
            return new string[1] { str.Value };
        return str.Value.Split(delimiter[0]);
    }

    public static void FillRow(object row, out SqlString str)
    {
        str = new SqlString((string)row);
    }
};

I loaded a text file with a huge amount of delimited data to really get a gauge on time this would take. The string is basically, “data%data%data%data%data” and on. Around 600 indexes. I restarted my local instance of SQL Server 2005 that I did these on to ensure you can see CLR before cache and after.

So on a fresh restart you can see by checking type MEMORYCLERK_SQLCLR from dm_os_memory_clerks that we are clear

After execution of the CLR function you can see the differences

Call these functions as

Declare @bigdamnvar varchar(max)
Set @bigdamnvar = (Select * From OpenRowSet ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=C:', 'SELECT * from Data.txt'))
Select * From dbo.CLR_Split(@bigdamnvar, '%')

Below you can see first execution and then how quickly performance picks up on the second cached plan

and

Declare @bigdamnvar varchar(max)
Set @bigdamnvar = (Select * From OpenRowSet ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=C:', 'SELECT * from Data.txt'))
Select * From dbo.[Split](@bigdamnvar, '%')

I executed this a few times to get it in cache as well. The odd increase was the server working on something else. I validated that so ignore it.

You may notice the bytes recieved from the server are different and SQL CLR is much heavier. That is something to keep in mind. SQL will always be light compared to SQL CLR. That is my experience in using the two side by side.

The differences are small but if the task you intend to perform is something cached typically, keep CLR in mind. Mostly when it come to complicated tasks that may be left to other languages you have available.

Split string in SQL Server 2005+ CLR vs. T-SQL

About Ted Krueger (onpnt)