A recent MSDN thread presented a very interesting problem - find duplicates based on any 4 of the 5 columns and eliminate the duplicates.
Here is the data table we will be working with:
- CREATE TABLE tblTEST (
- ID INT UNIQUE NOT NULL,
- FirstColumn VARCHAR(50) NOT NULL,
- SecondColumn VARCHAR(50) NOT NULL,
- ThirdColumn VARCHAR(50) NOT NULL,
- FourthColumn VARCHAR(50) NOT NULL,
- FifthColumn VARCHAR(50) NOT NULL )
- INSERT INTO tblTEST
- (ID,
- FirstColumn,
- SecondColumn,
- ThirdColumn,
- FourthColumn,
- FifthColumn)
- VALUES (1,'value1','value2','value3','value4','value5'),
- (2,'value2','value3','value4','value5','value6'),
- (3,'value1','value4','value5','value8','value9'),
- (4,'value1','value2','value7','value8','value9'),
- (5,'value3','value4','value5','value6','value7'),
- (6,'value1','value2','value9','value4','value5'),
- (7,'value2','value3','value4','value5','value11'),
- (8,'value4','value5','value3','value9','value1')
Our test table:
| ID | FirstColumn | SecondColumn | ThirdColumn | FourthColumn | FifthColumn |
|---|---|---|---|---|---|
| 1 | value1 | value2 | value3 | value4 | value5 |
| 2 | value2 | value3 | value4 | value5 | value6 |
| 3 | value1 | value4 | value5 | value8 | value9 |
| 4 | value1 | value2 | value7 | value8 | value9 |
| 5 | value3 | value4 | value5 | value6 | value7 |
| 6 | value1 | value2 | value9 | value4 | value5 |
| 7 | value2 | value3 | value4 | value5 | value11 |
| 8 | value4 | value5 | value3 | value9 | value1 |
This problem is different from the typical find the duplicates problem and at first does not even seem solvable.
The first idea that comes to mind is to UNPIVOT values from these 5 columns into rows. This is simple enough with this code:
- ;with UnPvt AS (SELECT ID, ColValue FROM tblTEST
- UNPIVOT (ColValue FOR ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),
The second step is also more or less clear - find possible duplicates by counting IDs partitioned by ColValue
- SameVals as (SELECT * FROM (select *, COUNT(ID) OVER (PARTITION by ColValue) as cntSame from UnPvt) X WHERE cntSame >=2),
Now, what can we do next? The next step was a road block for me. But then, an Eureka moment - we can use CROSS APPLY and find all records that have more than 4 values matching the current record values
- DupRecs as (select T.*,S.cntDups, S.ID as DupID
- from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
- WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),
- Candidates as (select distinct ID,DupID from DupRecs where cntDups >=4)
Ok, but do we want to delete all records identified in DupID column? Turned out, that it's not even that simple.
Say, in our sample rows 1 & 2 match. So, do we want to delete the row 2? Well, row 2 matches with the row 5. Which rows of 2/5 we delete and which to keep? That's not easy at all.
So, my final select is
- select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)
This select will only produce real duplicates.
Now, this is the whole solution again as one statement:
- ;with UnPvt as (select ID, ColValue from tblTEST
- UNPIVOT (ColValue for ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),
- SameVals as (select * from (select *, COUNT(ID) over (partition by ColValue) as cntSame from UnPvt) X where cntSame >=2),
- DupRecs as (select T.*,S.cntDups, S.ID as DupID
- from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
- WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),
- Candidates as (select distinct ID,DupID from DupRecs where cntDups >=4)
- select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)
Well, even the above solution is not quite right as now it doesn't delete all duplicates.
After some more discussion in the mentioned thread and with the help of Peter Larsson, here is the solution that seems to work for the problem:
- DECLARE @DupLoop INT
- SET @DupLoop = 1
- WHILE @DupLoop = 1
- BEGIN
- WITH Dups
- AS (SELECT DISTINCT a.ID AS DuplicateID
- FROM dbo.tblTEST AS t
- LEFT JOIN dbo.tblTEST AS a
- ON a.ID > t.ID
- AND a.FirstColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
- t.FifthColumn)
- LEFT JOIN dbo.tblTEST AS b
- ON b.ID = a.ID
- AND b.SecondColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
- t.FifthColumn)
- LEFT JOIN dbo.tblTEST AS c
- ON c.ID = a.ID
- AND c.ThirdColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
- t.FifthColumn)
- LEFT JOIN dbo.tblTEST AS d
- ON d.ID = a.ID
- AND d.FourthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
- t.FifthColumn)
- LEFT JOIN dbo.tblTEST AS e
- ON e.ID = a.ID
- AND e.FifthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
- t.FifthColumn)
- WHERE CASE
- WHEN a.ID IS NULL THEN 0
- ELSE 1
- END + CASE
- WHEN b.ID IS NULL THEN 0
- ELSE 1
- END + CASE
- WHEN c.ID IS NULL THEN 0
- ELSE 1
- END + CASE
- WHEN d.ID IS NULL THEN 0
- ELSE 1
- END + CASE
- WHEN e.ID IS NULL THEN 0
- ELSE 1
- END >= 4)
- DELETE FROM tblTEST
- WHERE ID IN (SELECT TOP ( 1 ) DuplicateID
- FROM Dups
- ORDER BY DuplicateID)
- SELECT @DupLoop = @@ROWCOUNT
- END
- SELECT *
- FROM tblTest
Well, hopefully this was interesting to read as for me trying to solve such problem. Thanks for listening!
*** Remember, if you have a SQL related question, try our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum




LTD Social Sitings
Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.