Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « SQL Server 2008 R2 Setup starts and then disappearsDo you still use sp_change_users_login instead of ALTER USER UserName WITH LOGIN = UserName »
    comments

    A recent MSDN thread presented a very interesting problem - find duplicates based on any 4 of the 5 columns and eliminate the duplicates.

    Here is the data table we will be working with:

    1. CREATE TABLE tblTEST (
    2.   ID           INT    UNIQUE    NOT NULL,
    3.   FirstColumn  VARCHAR(50)    NOT NULL,
    4.   SecondColumn VARCHAR(50)    NOT NULL,
    5.   ThirdColumn  VARCHAR(50)    NOT NULL,
    6.   FourthColumn VARCHAR(50)    NOT NULL,
    7.   FifthColumn  VARCHAR(50)    NOT NULL )
    8.  
    9. INSERT INTO tblTEST
    10.            (ID,
    11.             FirstColumn,
    12.             SecondColumn,
    13.             ThirdColumn,
    14.             FourthColumn,
    15.             FifthColumn)
    16. VALUES     (1,'value1','value2','value3','value4','value5'),
    17.            (2,'value2','value3','value4','value5','value6'),
    18.            (3,'value1','value4','value5','value8','value9'),
    19.            (4,'value1','value2','value7','value8','value9'),
    20.            (5,'value3','value4','value5','value6','value7'),
    21.            (6,'value1','value2','value9','value4','value5'),
    22.            (7,'value2','value3','value4','value5','value11'),
    23.            (8,'value4','value5','value3','value9','value1')

    Our test table:

    Test Table with Duplicates
    ID FirstColumn SecondColumn ThirdColumn FourthColumn FifthColumn
    1 value1 value2 value3 value4 value5
    2 value2 value3 value4 value5 value6
    3 value1 value4 value5 value8 value9
    4 value1 value2 value7 value8 value9
    5 value3 value4 value5 value6 value7
    6 value1 value2 value9 value4 value5
    7 value2 value3 value4 value5 value11
    8 value4 value5 value3 value9 value1

    This problem is different from the typical find the duplicates problem and at first does not even seem solvable.

    The first idea that comes to mind is to UNPIVOT values from these 5 columns into rows. This is simple enough with this code:

    1. ;with UnPvt AS (SELECT ID, ColValue FROM tblTEST
    2. UNPIVOT (ColValue FOR ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),

    The second step is also more or less clear - find possible duplicates by counting IDs partitioned by ColValue

    1. SameVals as (SELECT * FROM (select *, COUNT(ID) OVER (PARTITION by ColValue) as cntSame from UnPvt) X WHERE cntSame >=2),

    Now, what can we do next? The next step was a road block for me. But then, an Eureka moment - we can use CROSS APPLY and find all records that have more than 4 values matching the current record values

    1. DupRecs as (select T.*,S.cntDups, S.ID as DupID
    2.  from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
    3. WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),
    4.  
    5. Candidates as (select distinct ID,DupID  from DupRecs where cntDups >=4)

    Ok, but do we want to delete all records identified in DupID column? Turned out, that it's not even that simple.

    Say, in our sample rows 1 & 2 match. So, do we want to delete the row 2? Well, row 2 matches with the row 5. Which rows of 2/5 we delete and which to keep? That's not easy at all.

    So, my final select is

    1. select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)

    This select will only produce real duplicates.

    Now, this is the whole solution again as one statement:

    1. ;with UnPvt as (select ID, ColValue from tblTEST
    2. UNPIVOT (ColValue for ColName IN ([FirstColumn],[SecondColumn],[ThirdColumn],[FourthColumn], [FifthColumn])) unpvt),
    3. SameVals as (select * from (select *, COUNT(ID) over (partition by ColValue) as cntSame from UnPvt) X where cntSame >=2),
    4. DupRecs as (select T.*,S.cntDups, S.ID as DupID
    5.  from tblTEST T CROSS APPLY (select S.ID, COUNT(*) as cntDups from SameVals S
    6. WHERE T.ID < S.ID and S.ColValue IN (T.FirstColumn,T.SecondColumn,T.ThirdColumn,T.FourthColumn, T.FifthColumn) GROUP BY S.ID) S),
    7.  
    8. Candidates as (select distinct ID,DupID  from DupRecs where cntDups >=4)
    9.  
    10. select Distinct DupID from Candidates where DupID not IN (select ID from Candidates)

    Well, even the above solution is not quite right as now it doesn't delete all duplicates.

    After some more discussion in the mentioned thread and with the help of Peter Larsson, here is the solution that seems to work for the problem:

    1. DECLARE  @DupLoop INT
    2.  
    3. SET @DupLoop = 1
    4.  
    5. WHILE @DupLoop = 1
    6.   BEGIN
    7.      
    8.      
    9.     WITH Dups
    10.          AS (SELECT DISTINCT a.ID AS DuplicateID
    11.              FROM   dbo.tblTEST AS t
    12.                     LEFT JOIN dbo.tblTEST AS a
    13.                       ON a.ID > t.ID
    14.                          AND a.FirstColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
    15.                                                t.FifthColumn)
    16.                     LEFT JOIN dbo.tblTEST AS b
    17.                       ON b.ID = a.ID
    18.                          AND b.SecondColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
    19.                                                 t.FifthColumn)
    20.                     LEFT JOIN dbo.tblTEST AS c
    21.                       ON c.ID = a.ID
    22.                          AND c.ThirdColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
    23.                                                t.FifthColumn)
    24.                     LEFT JOIN dbo.tblTEST AS d
    25.                       ON d.ID = a.ID
    26.                          AND d.FourthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
    27.                                                 t.FifthColumn)
    28.                     LEFT JOIN dbo.tblTEST AS e
    29.                       ON e.ID = a.ID
    30.                          AND e.FifthColumn IN (t.FirstColumn,t.SecondColumn,t.ThirdColumn,t.FourthColumn,
    31.                                                t.FifthColumn)
    32.              WHERE  CASE
    33.                       WHEN a.ID IS NULL THEN 0
    34.                       ELSE 1
    35.                     END + CASE
    36.                             WHEN b.ID IS NULL THEN 0
    37.                             ELSE 1
    38.                           END + CASE
    39.                                   WHEN c.ID IS NULL THEN 0
    40.                                   ELSE 1
    41.                                 END + CASE
    42.                                         WHEN d.ID IS NULL THEN 0
    43.                                         ELSE 1
    44.                                       END + CASE
    45.                                               WHEN e.ID IS NULL THEN 0
    46.                                               ELSE 1
    47.                                             END >= 4)
    48.     DELETE FROM tblTEST
    49.     WHERE       ID IN (SELECT   TOP ( 1 ) DuplicateID
    50.                        FROM     Dups
    51.                        ORDER BY DuplicateID)
    52.      
    53.     SELECT @DupLoop = @@ROWCOUNT
    54.   END
    55.  
    56. SELECT *
    57. FROM   tblTest

    Well, hopefully this was interesting to read as for me trying to solve such problem. Thanks for listening!

    *** Remember, if you have a SQL related question, try our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum

    About the Author

    Naomi Nosonovsky, Senior Software Developer, has more than 15 years of enterprise experience in analysis, design, development and implementation of high-performance client/server data management solutions. She is a Personality of the Year for 2008, 2009, 2010 and 2011 at www.universalthread.com in .NET, SQL Server & VFP categories. She is also an All Star contributor/MCC Community Contributor at forums.asp.net and MSDN T-SQL forum.
    Social SitingsFacebookLinkedIndeliciousLTD RSS Feed
    981 views
    Instapaper

    5 comments

    Comment from: George [Visitor]
    George Hello Naomi,
    I hope you don't mind I'm back with my questions. After testing it looks like we still have error in the result. This is what I did:
    INSERT INTO tblTEST(ID, FirstColumn,SecondColumn,ThirdColumn,FourthColumn, FifthColumn)
    values
    ('1','Alex','George','Gerard','Michael','Paul'),
    ('2','Alex','George','Gerard','Michael','Peter'),
    ('3','Alex','George','Gerard','Michael','Thomas'),
    ('4','Alex','George','Gerard','Paul','Peter'),
    ('5','Alex','George','Gerard','Paul','Thomas'),
    ('6','Alex','George','Gerard','Peter','Thomas'),
    ('7','Alex','George','Michael','Paul','Peter'),
    ('8','Alex','George','Michael','Paul','Thomas'),
    ('9','Alex','George','Michael','Peter','Thomas'),
    ('10','Alex','George','Paul','Peter','Thomas'),
    ('11','Alex','Gerard','Michael','Paul','Peter'),
    ('12','Alex','Gerard','Michael','Paul','Thomas'),
    ('13','Alex','Gerard','Michael','Peter','Thomas'),
    ('14','Alex','Gerard','Paul','Peter','Thomas'),
    ('15','Alex','Michael','Paul','Peter','Thomas'),
    ('16','George','Gerard','Michael','Paul','Peter'),
    ('17','George','Gerard','Michael','Paul','Thomas'),
    ('18','George','Gerard','Michael','Peter','Thomas'),
    ('19','George','Gerard','Paul','Peter','Thomas'),
    ('20','George','Michael','Paul','Peter','Thomas'),
    ('21','Gerard','Michael','Paul','Peter','Thomas')

    And the result is rows:1, 6, 15, 20 and 21.

    Rows with ID 20, 21 should not be in it. I've tried few things already and it didn't help. I'm going to keep tying no matter what but your ideas and solutions may really cut the long way short.
    George
    06/09/11 @ 05:17
    Comment from: niikola [Member] Email
    niikola Very nice article, which I'm going to use as an example why normalization matters :D
    06/09/11 @ 06:21
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky George, I see the problem, but I don't see an easy solution right away :(
    06/09/11 @ 06:53
    Comment from: Harizone Estimada [Visitor]
    Harizone Estimada For finding duplicates, I used the following:

    SELECT COUNT(idfield), field1_to_check, field2_to_check, fieldX_to_check FROM SomeTable
    GROUP BY field1_to_check, field2_to_check, fieldX_to_check
    HAVING COUNT(idfield) > 1

    In the case above, this is the query to find the duplicate record.

    SELECT count(ID), FirstColumn, SecondColumn, ThirdColumn, FourthColumn
    FROM tblTEST
    GROUP BY FirstColumn, SecondColumn, ThirdColumn, FourthColumn
    HAVING COUNT(ID)>1

    -Harizone
    06/29/11 @ 02:27
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Harizone,

    Unfortunately, the problem is more complex, as we want to find duplicates by any 4 out of the 5 columns and we don't expect the exact same 4 for each record. That's why we presented more complex solutions, but we gave up at the end - the problem seems too complicated to solve.
    06/29/11 @ 06:22

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)