This month’s T-SQL Tuesday is being hosted by our very own Jes Borland (Twitter | Blog). Not only is she hosting this month, she is also making this LessThanDot’s first T-SQL Tuesday event. The topic is how we have solved business problems with aggregate functions. I thought this would be a good time to delete some data, so here is my post on the topic.

Duplicates are Evil

Duplicates in data can be detrimental to how you return data from tables. They can be so detrimental that businesses report large discrepancies in sales, inventory and other critical calculations. Dealing with duplicates begins with the design of your database. It ends with the design of the applications that insert data into those databases. Constraints and the other tools for maintaining database integrity are out there for us to use, but bad designs happen.

Seek and Destroy

Removing duplicates begins with finding them. Hopefully, by the time you are hunting for duplicates in a table (or several), you have found the problems they cause before they have had a negative impact on the business.

Many methods are out there to find duplicates. Common table expressions (CTEs) are a well-known method, as is joining derived tables to each other. In some odd cases, CASE expressions are used with ranking functions inside them. All of these methods are viable solutions, but over the years I have come to like the use of COUNT and the HAVING clause. This month’s T-SQL Tuesday on aggregates (e.g. COUNT) got me thinking this would make a good post.

Some things to consider

COUNT has some concerns. Finding duplicates in a table whose primary key is an identity column has its challenges, because the key makes every row unique and has to be left out of the grouping. Heap tables are actually much easier to handle with this method, as the grouping becomes much less complex. One other well-known problem with COUNT is that COUNT(column) skips NULL values.

To show this, let’s create a table named DUPS.

T-SQL
-- Drop the test table if it already exists so the script can be re-run
IF EXISTS(SELECT 1 FROM sys.objects WHERE [name] = 'DUPS')
 BEGIN
    DROP TABLE DUPS
 END
GO
-- IDENT is the identity primary key; CUST allows NULL for the test below
CREATE TABLE DUPS (IDENT BIGINT IDENTITY(1,1) PRIMARY KEY, CUST VARCHAR(20), ORDERNUM VARCHAR(20))
GO

Now insert some values into this new table, with NULL values in the CUST column.

T-SQL
INSERT INTO DUPS 
VALUES (NULL,'Test'),
('Test','Test'),
(NULL,'Test'),
('Test','Test'),
(NULL,'Test'),
('Test','Test')

You may write a simple query using COUNT to return the count of the column CUST as:

T-SQL
SELECT COUNT(CUST) FROM DUPS

You might expect this query to return 6. After all, we just inserted 6 rows. It actually returns 3, because COUNT(CUST) skips the three rows where CUST is NULL.
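To see the difference directly, the two forms of COUNT can be run side by side. This is a small sketch against the six-row DUPS sample above; the column aliases are only illustrative.

```sql
-- COUNT(*) counts rows; COUNT(CUST) counts only rows where CUST is not NULL
SELECT
    COUNT(*)               AS ALL_ROWS,       -- 6 with the sample data
    COUNT(CUST)            AS CUST_NOT_NULL,  -- 3
    COUNT(*) - COUNT(CUST) AS CUST_NULL       -- 3
FROM DUPS
```

The difference between the two counts is a quick way to check how many NULL values a column holds before relying on it in a duplicate search.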

How COUNT treats NULL values plays a key role in the method I am about to show. Although it is very uncommon for a column that defines a duplicate to allow NULL, bad designs do happen. We are checking for duplicates, after all 😉

Seek

The combination of COUNT, HAVING and GROUP BY is how we will look for duplicates today. We will use the test script shown below. The test script creates our table, inserts 10,000 unique rows, and then inserts every odd-numbered row a second time to create 5,000 duplicates. There are three columns. One is the primary key and is an identity column. The other two are a customer number (CUST) and an order number (ORDERNUM). Loops are used to insert the test data into the new table.

T-SQL
IF EXISTS(SELECT 1 FROM SYS.objects WHERE [name] = 'DUPS')
 BEGIN
    DROP TABLE DUPS
 END
GO 
CREATE TABLE DUPS (IDENT BIGINT IDENTITY(1,1) PRIMARY KEY, CUST VARCHAR(20), ORDERNUM VARCHAR(20))
GO
DECLARE @LOOP INT
SET @LOOP = 1
 
WHILE @LOOP <= 10000
 BEGIN
    INSERT INTO DUPS
    SELECT 'Customer ' + CAST(@LOOP as VARCHAR(5)),
           'OrderNum ' + CAST(@LOOP as VARCHAR(5))
    SET @LOOP += 1
 END
 
SET @LOOP = 1
 
WHILE @LOOP <= 10000
 BEGIN
    IF (@LOOP % 2 > 0)
     BEGIN
        INSERT INTO DUPS
        SELECT 'Customer ' + CAST(@LOOP as VARCHAR(5)),
               'OrderNum ' + CAST(@LOOP as VARCHAR(5))
     END
    SET @LOOP += 1
 END

Running this script inserts 15,000 rows. We know this from using COUNT(*). Ah, COUNT(*) doesn’t care about NULL values. (Tip just provided.)

The HAVING clause filters the groups that GROUP BY produces, much as WHERE filters rows. An example of this can be shown by querying the sys.master_files system view and keeping only the group for a single database ID.

T-SQL
SELECT 
 SUM(database_id)
FROM sys.master_files
GROUP BY database_id
HAVING database_id = 1

To use this in a duplicate search, add COUNT to the HAVING clause:

T-SQL
SELECT 
 database_id
FROM sys.master_files
GROUP BY database_id
HAVING COUNT(database_id) > 2

This would show us all the database IDs that have more than 2 files associated with them.

Applying this to our earlier table and data, we could do the following:

T-SQL
SELECT 
    MAX(IDENT),
    ORDERNUM
FROM DUPS 
GROUP BY ORDERNUM 
HAVING COUNT(ORDERNUM) > 1

The results list every order number that appears more than once, along with the highest IDENT value for each.
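Before deleting anything, it can help to see how many copies each duplicate has. The following is a sketch only, assuming a duplicate means the same CUST and ORDERNUM together; the column aliases are illustrative. It uses COUNT(*) rather than COUNT(ORDERNUM) so that groups made up of NULL values are counted too.

```sql
-- One row per duplicated CUST/ORDERNUM pair, with the number of copies
-- and the first and last identity values in each group
SELECT
    CUST,
    ORDERNUM,
    COUNT(*)   AS COPIES,
    MIN(IDENT) AS FIRST_IDENT,
    MAX(IDENT) AS LAST_IDENT
FROM DUPS
GROUP BY CUST, ORDERNUM
HAVING COUNT(*) > 1
ORDER BY COPIES DESC, ORDERNUM
```

Against the test data above this should return 5,000 rows, one for each odd-numbered order, each with two copies.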

Destroy

Armed with this information, we can add a DELETE that removes every row whose IDENT is not the MAX for its order number. That removes all duplicates and leaves the last row inserted for each order number (based on the identity value).

T-SQL
-- Keep the newest row (highest IDENT) for every ORDERNUM and delete the rest.
-- The subquery has no HAVING filter: it must also return the IDENT of
-- order numbers that appear only once, or those rows would be deleted too.
DELETE FROM DUPS 
WHERE IDENT NOT IN (
SELECT 
    MAX(IDENT)
FROM DUPS 
GROUP BY ORDERNUM)

Once this statement is executed, the table is cleansed of duplicates and back to 10,000 rows, one per unique order number.

Note: Always back your table up before you delete large amounts of data. If the data volume is not too large, this can be done with SELECT INTO, copying the table into a designated DBA database set to the simple recovery model. Always ensure you have a quick recovery plan.
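A minimal sketch of that backup step might look like the following. DBA_Scratch and DUPS_BACKUP are hypothetical names used only for illustration, and SELECT INTO will fail if the target table already exists.

```sql
-- Assumption: DBA_Scratch is a hypothetical utility database in simple recovery
SELECT *
INTO DBA_Scratch.dbo.DUPS_BACKUP
FROM DUPS
GO

-- ... run the DELETE and verify the results ...

-- If something went wrong, put back any rows that are now missing.
-- IDENTITY_INSERT lets the original IDENT values be restored as-is.
SET IDENTITY_INSERT DUPS ON
INSERT INTO DUPS (IDENT, CUST, ORDERNUM)
SELECT B.IDENT, B.CUST, B.ORDERNUM
FROM DBA_Scratch.dbo.DUPS_BACKUP B
WHERE NOT EXISTS (SELECT 1 FROM DUPS D WHERE D.IDENT = B.IDENT)
SET IDENTITY_INSERT DUPS OFF
GO
```

Drop the backup table once you are satisfied with the cleanup so it does not linger in the utility database.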

MIN can be used instead if the first inserted row is the one to retain. The CTE method mentioned earlier can also be used, with ROW_NUMBER and PARTITION BY:

T-SQL
;WITH DUP_CTE AS
(
SELECT ORDERNUM,ROW_NUMBER() OVER (PARTITION BY ORDERNUM ORDER BY (SELECT 0)) RN FROM DUPS 
)
DELETE FROM DUP_CTE
WHERE RN <> 1

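This allows more control over the row ranking and removal process. As written, ORDER BY (SELECT 0) makes the choice of surviving row arbitrary; ordering by a real column decides which row is ranked first and therefore kept.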
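As one possible variation, the ranking can be ordered by the identity column so the row that survives is predictable. This sketch assumes the most recently inserted row (highest IDENT) is the one to keep, matching the MAX approach shown earlier.

```sql
-- Rank the rows within each ORDERNUM from newest to oldest,
-- then delete everything except the newest (RN = 1)
;WITH DUP_CTE AS
(
SELECT ORDERNUM,
       ROW_NUMBER() OVER (PARTITION BY ORDERNUM ORDER BY IDENT DESC) AS RN
FROM DUPS
)
DELETE FROM DUP_CTE
WHERE RN > 1
```

Changing IDENT DESC to IDENT ASC would keep the first inserted row instead, the same effect as swapping MAX for MIN.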

There you have it.  Delete away!  (kidding, make sure you delete what you are supposed to be deleting)