LessThanDot

A recent MSDN thread presented a very interesting problem – find duplicates based on any 4 of the 5 columns and eliminate the duplicates.

Here is the data table we will be working with:

CREATE TABLE tblTEST ( 
  ID           INT    UNIQUE    NOT NULL, 
  FirstColumn  VARCHAR(50)    NOT NULL, 
  SecondColumn VARCHAR(50)    NOT NULL, 
  ThirdColumn  VARCHAR(50)    NOT NULL, 
  FourthColumn VARCHAR(50)    NOT NULL, 
  FifthColumn  VARCHAR(50)    NOT NULL ) 

INSERT INTO tblTEST 
           (ID, 
            FirstColumn, 
            SecondColumn, 
            ThirdColumn, 
            FourthColumn, 
            FifthColumn) 
VALUES     (1,'value1','value2','value3','value4','value5'), 
           (2,'value2','value3','value4','value5','value6'), 
           (3,'value1','value4','value5','value8','value9'), 
           (4,'value1','value2','value7','value8','value9'), 
           (5,'value3','value4','value5','value6','value7'), 
           (6,'value1','value2','value9','value4','value5'), 
           (7,'value2','value3','value4','value5','value11'), 
           (8,'value4','value5','value3','value9','value1')

Our test table:

Test Table with Duplicates
ID	FirstColumn	SecondColumn	ThirdColumn	FourthColumn	FifthColumn
1	value1	value2	value3	value4	value5
2	value2	value3	value4	value5	value6
3	value1	value4	value5	value8	value9
4	value1	value2	value7	value8	value9
5	value3	value4	value5	value6	value7
6	value1	value2	value9	value4	value5
7	value2	value3	value4	value5	value11
8	value4	value5	value3	value9	value1

This problem is different from the typical find the duplicates problem and at first does not even seem solvable.

The first idea that comes to mind is to UNPIVOT values from these 5 columns into rows. This is simple enough with this code:

;with UnPvt AS (SELECT ID, ColValue FROM tblTEST 
UNPIVOT (ColValue FOR ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),

The second step is also more or less clear – find possible duplicates by counting IDs partitioned by ColValue

SameVals as (SELECT * FROM (select *, COUNT(ID) OVER (PARTITION by ColValue) as cntSame from UnPvt) X WHERE cntSame >=2),

Now, what can we do next? The next step was a road block for me. But then, an Eureka moment – we can use CROSS APPLY and find all records that have more than 4 values matching the current record values

DupRecs as (select T.*,S.cntDups, S.ID as DupID
 from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),

Candidates as (select distinct ID,DupID  from DupRecs where cntDups >=4)

Ok, but do we want to delete all records identified in DupID column? Turned out, that it’s not even that simple.

Say, in our sample rows 1 & 2 match. So, do we want to delete the row 2? Well, row 2 matches with the row 5. Which rows of ²⁄₅ we delete and which to keep? That’s not easy at all.

So, my final select is

select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)

This select will only produce real duplicates.

Now, this is the whole solution again as one statement:

;with UnPvt as (select ID, ColValue from tblTEST 
UNPIVOT (ColValue for ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),
SameVals as (select * from (select *, COUNT(ID) over (partition by ColValue) as cntSame from UnPvt) X where cntSame >=2),
DupRecs as (select T.*,S.cntDups, S.ID as DupID
 from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),

Candidates as (select distinct ID,DupID  from DupRecs where cntDups >=4) 

select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)

Well, even the above solution is not quite right as now it doesn’t delete all duplicates.

After some more discussion in the mentioned thread and with the help of Peter Larsson, here is the solution that seems to work for the problem:

DECLARE  @DupLoop INT 

SET @DupLoop = 1 

WHILE @DupLoop = 1 
  BEGIN 
     
     
    WITH Dups 
         AS (SELECT DISTINCT a.ID AS DuplicateID 
             FROM   dbo.tblTEST AS t 
                    LEFT JOIN dbo.tblTEST AS a 
                      ON a.ID > t.ID 
                         AND a.FirstColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
                                               t.FifthColumn) 
                    LEFT JOIN dbo.tblTEST AS b 
                      ON b.ID = a.ID 
                         AND b.SecondColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
                                                t.FifthColumn) 
                    LEFT JOIN dbo.tblTEST AS c 
                      ON c.ID = a.ID 
                         AND c.ThirdColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
                                               t.FifthColumn) 
                    LEFT JOIN dbo.tblTEST AS d 
                      ON d.ID = a.ID 
                         AND d.FourthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
                                                t.FifthColumn) 
                    LEFT JOIN dbo.tblTEST AS e 
                      ON e.ID = a.ID 
                         AND e.FifthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
                                               t.FifthColumn) 
             WHERE  CASE 
                      WHEN a.ID IS NULL THEN 0 
                      ELSE 1 
                    END + CASE 
                            WHEN b.ID IS NULL THEN 0 
                            ELSE 1 
                          END + CASE 
                                  WHEN c.ID IS NULL THEN 0 
                                  ELSE 1 
                                END + CASE 
                                        WHEN d.ID IS NULL THEN 0 
                                        ELSE 1 
                                      END + CASE 
                                              WHEN e.ID IS NULL THEN 0 
                                              ELSE 1 
                                            END >= 4) 
    DELETE FROM tblTEST 
    WHERE       ID IN (SELECT   TOP ( 1 ) DuplicateID 
                       FROM     Dups 
                       ORDER BY DuplicateID) 
     
    SELECT @DupLoop = @@ROWCOUNT 
  END 

SELECT * 
FROM   tblTest

Well, hopefully this was interesting to read as for me trying to solve such problem. Thanks for listening!

*** Remember, if you have a SQL related question, try our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum

Finding Duplicates – Interesting twist

About Naomi Nosonovsky