Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « T-SQL Tuesday #016 Grouping Market Data With T-SQLLearn the platform differences to be more effective »
    comments


    This month's T-SQL Tuesday is being hosted by our very own, Jes Borland (Twitter | Blog). Not only is she hosting this month but she is making it possible for LessThanDot's first T-SQL Tuesday event. The topic that is brought to us is to discuss with everyone how we solved business problems with aggregate functions. I thought this would be a good time to delete some data so here is my post on the topic.



    Duplicates are Evil

    Duplicates in data can be detrimental to how you return data from tables.  They can be so detrimental that businesses can report large discrepancies on sales, inventory and other critical calculations.   Dealing with duplicates begins with the design of you database.  It ends with the design of your applications that are inserting data into those databases.  Although constraints and everything we can put into maintaining the integrity of our databases are out there for us to use; bad designs happen.

    Seek and Destroy

    Removing duplicates begins with finding them.  Hopefully at the stage in which you are trying to find duplicates in a table (or several), you have proactively found problems they cause before they have had a negative impact on business. 

    Many methods are out there to find duplicates.  Common Table Expressions (CTE) is a known method as well as joining derived tables to each other.  In some odd cases, case statements are used along with ranking functions in them.  All of these methods are viable solutions but over the years I have come to like the use of COUNT and the HAVING clause.  This month’s T-SQL Tuesday on aggregates (e.g. COUNT) got me to thinking this would make a good post.

    Some things to consider

    COUNT has some concerns.  For finding duplicates in a table where a primary key is set as an identity seed, it has challenges.  HEAP tables are actually much easier to use this method as the grouping becomes much less complex.  One other problem that is well known with COUNT is the fact it does not interpret NULL values. 

    To show this, let’s create a table named DUPS.

    1. IF EXISTS(SELECT 1 FROM SYS.objects WHERE [name] = 'DUPS')
    2.  BEGIN
    3.     DROP TABLE DUPS
    4.  END
    5. GO
    6. CREATE TABLE DUPS (IDENT BIGINT IDENTITY(1,1) PRIMARY KEY, CUST VARCHAR(20), ORDERNUM VARCHAR(20))
    7. GO
    8. CREATE TABLE DUPS (IDENT BIGINT IDENTITY(1,1) PRIMARY KEY, CUST VARCHAR(20), ORDERNUM VARCHAR(20))
    9. GO

    Now insert some values into this new table with NULL values in the CUST column

    1. INSERT INTO DUPS
    2. VALUES (NULL,'Test'),
    3. ('Test','Test'),
    4. (NULL,'Test'),
    5. ('Test','Test'),
    6. (NULL,'Test'),
    7. ('Test','Test')

    You may write a simple query using COUNT to return the count of the column CUST as:

    1. SELECT COUNT(CUST) FROM DUPS

     

    Running this query should return 6.  After all, we just inserted 6 rows.  It actually returns 3 though. 

    Looking for duplicates and NULL values plays a key role in what we just went over and using the method I am about to show.  Although it is very uncommon that the unique values that are deemed a duplicate would have NULL as an allowable value, bad designs do happen.  We are checking for duplicates ;-)

    Seek

    The combination of COUNT, HAVING and GROUP BY is how we will look for duplicates today.  We will use a test script that is shown below.  The test script creates out table and inserts 10,000 rows.  There are three columns.  One is the primary key and is an identity insert.  The other two are customer number (CUST) and an order number (ORDERNUM).  A loop is used to insert test data into the new table.

    1. IF EXISTS(SELECT 1 FROM SYS.objects WHERE [name] = 'DUPS')
    2.  BEGIN
    3.     DROP TABLE DUPS
    4.  END
    5. GO
    6. CREATE TABLE DUPS (IDENT BIGINT IDENTITY(1,1) PRIMARY KEY, CUST VARCHAR(20), ORDERNUM VARCHAR(20))
    7. GO
    8. DECLARE @LOOP INT
    9. SET @LOOP = 1
    10.  
    11. WHILE @LOOP <= 10000
    12.  BEGIN
    13.     INSERT INTO DUPS
    14.     SELECT 'Customer ' + CAST(@LOOP as VARCHAR(5)),
    15.            'OrderNum ' + CAST(@LOOP as VARCHAR(5))
    16.     SET @LOOP += 1
    17.  END
    18.  
    19. SET @LOOP = 1
    20.  
    21. WHILE @LOOP <= 10000
    22.  BEGIN
    23.     IF (@LOOP % 2 > 0)
    24.      BEGIN
    25.         INSERT INTO DUPS
    26.         SELECT 'Customer ' + CAST(@LOOP as VARCHAR(5)),
    27.                'OrderNum ' + CAST(@LOOP as VARCHAR(5))
    28.      END
    29.     SET @LOOP += 1
    30.  END

     

    The results from running this transaction will insert 15,000 rows.  We know this from using COUNT(*).  Ah, COUNT(*) doesn’t care about NULL values.  (tip just provided). 

    The HAVING clause will be exactly what grouping will result from a query.  An example of this can be shown by querying the sys.master_files system view for a unique database ID.

    1. SELECT
    2.  SUM(database_id)
    3. FROM sys.master_files
    4. GROUP BY database_id
    5. HAVING database_id = 1

    To use this in a duplicate search, add COUNT to the HAVING clause

    1. SELECT
    2.  database_id
    3. FROM sys.master_files
    4. GROUP BY database_id
    5. HAVING COUNT(database_id) > 2

    This would show us the entire database ID’s that have more than 2 files associated with them.

    Taking this to work for us with our earlier table and data, we could do the following

    1. SELECT
    2.     MAX(IDENT),
    3.     ORDERNUM
    4. FROM DUPS
    5. GROUP BY ORDERNUM
    6. HAVING COUNT(ORDERNUM) > 1

    The results shown list all the order numbers that are found to be duplicates (or more than 1)

    Destroy

    Loaded with this information, adding a DELETE to the statement and anything that is not listed as our MAX identity, will remove all duplicates and leave the last one inserted (based on the identity seed)

    1. DELETE FROM DUPS
    2. WHERE IDENT NOT IN (
    3. SELECT
    4.     MAX(IDENT)
    5. FROM DUPS
    6. GROUP BY ORDERNUM
    7. HAVING COUNT(ORDERNUM) > 1)

    Once this statement is executed, the table is cleansed of the duplicates and back to the row count of 10,000 unique order numbers. 

    Note: Always back you table up before you delete large amounts of data. This can be done (if the data is not too large of a volume) with a DBA designated database set in simple recovery and using SELECT INTO. Always ensure you have a quick recovery plan. 

    The MIN can also be used if the first inserted row is to be retained.  The other CTE method mentioned earlier can also be done by using PARTITION BY and ROW_NUMBER

    1. ;WITH DUP_CTE AS
    2. (
    3. SELECT ORDERNUM,ROW_NUMBER() OVER (PARTITION BY ORDERNUM ORDER BY (SELECT 0)) RN FROM DUPS
    4. )
    5. DELETE FROM DUP_CTE
    6. WHERE RN <> 1

    This allows more selectivity to the row ranking and removal process. 

    There you have it.  Delete away!  (kidding, make sure you delete what you are supposed to be deleting)

    About the Author

    Ted Krueger is a SQL Server MVP and has been working in development and database administration for 13+ years. Specialties range from High Availability and Disaster / Recovery setup and testing methods down to custom assembly development for SQL Server Reporting Services. Ted blogs and is also one of the founders of LessThanDot.com technology community. Some of the articles focused on are Backup / Recovery, Security, SSIS and working on SQL Server and using all of the SQL Server features available to create stable and scalable database services. @onpnt
    Social SitingsTwitterLinkedInLTD RSS Feed
    1221 views
    Instapaper

    1 comment

    Comment from: Jes Schultz Borland (grrlgeek) [Member]
    Jes Schultz Borland (grrlgeek) Thanks for posting Ted!

    Also, I just realized your post says SEEK and DESTROY and laughed out loud. Really loud.
    03/09/11 @ 18:34

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)