Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « East Iowa SQL Saturday Oct. 17thHow I got started with SQL Server »
    comments

    In a recent article, I explained how to extract numbers from a string. This blog post is slightly different. In the previous article, I show how to extract single numeric values (consecutive numbers). In this article, I will show how to extract all values (non-consecutive).

    This functionality is particularly useful for formatting data (when you need to export in a consistent way). For example, phone numbers (in the U.S.) are usually 10 digits long, but some people like to use (111) 222-3333 or 111-222-3333 or even 111.222.3333.

    The idea here is to remove all characters that we don't want so that we can apply any formatting we do want.

    Most of the time, when you see code like this, it will step through the data character by character. This code is different. It uses PatIndex to find the first occurrence of an invalid character. It removes that character, and then continues on until all instances of invalid characters are removed.

    Please note that I am NOT claiming that this code is the fastest, or the most efficient way to accomplish this. I think the efficiency of this code depends on the nature of your data. If there are a lot of invalid characters in relatively long strings, this is probably not the right code to use. If there are few invalid characters in relatively long strings, then this code should be more efficient than looping over every character in the string.

    The following code will remove non-alpha characters.

    1. CREATE Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(1000))
    2. Returns VarChar(1000)
    3. AS
    4. Begin
    5.  
    6.     While PatIndex('%[^a-z]%', @Temp) > 0
    7.         Set @Temp = Stuff(@Temp, PatIndex('%[^a-z]%', @Temp), 1, '')
    8.  
    9.     Return @TEmp
    10. End

    The following code will remove non-alpha numeric characters.
    ** Note: The floowing code has changed recently based on a comment from one of the readers Andrei. Originally I had a search pattern of [^a-z^0-9], which was incorrect. This search pattern incorrectly allowed the ^ to stay in the data when it should have been removed. The correct search pattern should be: [^a-z0-9]

    1. CREATE Function [dbo].[RemoveNonAlphaNumericCharacters](@Temp VarChar(1000))
    2. Returns VarChar(1000)
    3. AS
    4. Begin
    5.  
    6.     While PatIndex('%[^a-z0-9]%', @Temp) > 0
    7.         Set @Temp = Stuff(@Temp, PatIndex('%[^a-z0-9]%', @Temp), 1, '')
    8.  
    9.     Return @Temp
    10. End

    The following code will remove non-numeric characters.

    1. CREATE Function [dbo].[RemoveNonNumericCharacters](@Temp VarChar(1000))
    2. Returns VarChar(1000)
    3. AS
    4. Begin
    5.  
    6.     While PatIndex('%[^0-9]%', @Temp) > 0
    7.         Set @Temp = Stuff(@Temp, PatIndex('%[^0-9]%', @Temp), 1, '')
    8.  
    9.     Return @TEmp
    10. End

    As you can see, the real work is done by using the like comparison functionality of the PatIndex function.

    '%[^a-z]%'
    '%[^a-z0-9]%'
    '%[^0-9]%'

    It should be relatively simple to modify the previous functions to suit any sort of comparisons. You can use it to remove ranges of characters, or even for single characters (similar to the replace function).

    For example, if you wanted to remove certain punctuation characters, but leave everything else, you could use '%[^.^,^-]%' (to remove dot, comma, and dash).

    You can use the following code to test this functionality. It creates a table variable with some sample data. The query at the end returns the original data, and another column showing the output from the three functions presented here.

    1. Declare @Test Table(Data VarChar(100))
    2.  
    3. Insert Into @Test Values('(111) 222-3333')
    4. Insert Into @Test Values('111-222-3333')
    5. Insert Into @Test Values('111.222.3333')
    6. Insert Into @Test Values('(800) XXX-3333')
    7. Insert Into @Test Values('')
    8. Insert Into @Test Values(NULL)
    9.  
    10. Select  Data,
    11.         dbo.RemoveNonAlphaCharacters(Data) As AlphaOnly,
    12.         dbo.RemoveNonAlphaNumericCharacters(Data) As AlphaNumericOnly,
    13.         dbo.RemoveNonNumericCharacters(Data) As NumericOnly
    14. From    @Test
    
    
    Data           AlphaOnly   AlphaNumericOnly NumericOnly
    ---------      ----------- ---------------- -----------
    (111) 222-3333             1112223333       1112223333
    111-222-3333               1112223333       1112223333
    111.222.3333               1112223333       1112223333
    (800) XXX-3333 XXX         800XXX3333       8003333
    
    NULL           NULL        NULL             NULL
    

    About the Author

    George has been developing software professionally for 19 years, first for the department of defense, and then for various other companies. In 1998, George started his software company, Orbit Software, specializing in School Bus Transportation software. His specialty is refining SQL Server queries to deliver optimal performance.
    Social SitingsTwitterLTD RSS Feed
    Instapaper

    12 comments

    Comment from: SQLDenis [Member] Email
    *****
    SQLDenis Awesome...I will be using this in about 10 minutes or so...I have to import about 1620 contacts and will have to scrub some phone numbers.
    09/09/09 @ 09:33
    Comment from: Naomi Nosonovsky [Member]
    *****
    Naomi Nosonovsky Perfect timing for you
    09/09/09 @ 20:03
    Comment from: Gerard [Visitor]
    ****-
    Gerard Nice stuff. Already used it
    09/14/09 @ 01:09
    Comment from: Haley [Visitor]
    *****
    Haley Is there a way to do this in Access? Changed jobs and went from SQL server to Access. Working on getting sql server here, but in the meantime I would like to create an address matchkey in access where I extract numbers only, then letters without vowels only and put them back together.
    09/14/09 @ 14:06
    Comment from: George Mastros (gmmastros) [Member]
    George Mastros (gmmastros) Sorry. I haven't used Access in almost 10 years. I encourage you to post a question in one of our forums.... probably this one.

    http://forum.lessthandot.com/viewforum.php?f=96
    09/14/09 @ 14:28
    Comment from: remou [Member] Email
    remou @Haley
    You can use a user defined function (UDF), as George says, post if you need more details.
    09/16/09 @ 08:39
    Comment from: Haley [Visitor]
    Haley Could you elaborate on how to set up user defined function?
    09/16/09 @ 09:12
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Just in case, this is a similar function for MySQL - I need to find out how to call it in UPDATE command and if it's really a best one

    tek-tips forums thread
    09/17/09 @ 15:46
    Comment from: George Mastros (gmmastros) [Member]
    George Mastros (gmmastros) That function looks similar, with a slight difference. My function does not loop through all the characters in the string. It uses PatIndex to quickly find the first character in the string that does not meet the criteria. Basically, my code will have less iterations for the loop, which should give better performance if there is a relatively small percentage of characters that need to be removed.
    09/17/09 @ 15:56
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Right, I agree. Need to learn a bit of MySQL to find out which functions are available and how to call them.
    09/17/09 @ 17:40
    Comment from: Andrei [Visitor]
    Andrei I beleive that correct filter for RemoveNonAlphaNumericCharacters would be ^a-z0-9 and not ^a-z^0-9

    I was going to provide a proof for a previous comment of mine, but the site's engine won't allows it, saying it looks like spam :( So I didnt :)
    08/16/11 @ 05:31
    Comment from: George Mastros (gmmastros) [Member]
    George Mastros (gmmastros) Andrei,

    Thank you for the comment. I have modified the blog to include your corrections.
    08/16/11 @ 14:36

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)