Archive

Posts Tagged ‘ROW_NUMBER()’

Listing duplicates from the same table without repetition

December 8, 2011 Leave a comment

 

I recently came across what was apparently a simple problem, but which had me initially stumped. The problem was, how to list out duplicates within the same table without repetition. For example if rows A and B in a table are deemed to have matched, then selecting from the table joined to itself will give two results, A matches B and B matches A. Both are valid results, but it effectively doubles the number of duplicates reported. I only needed one of the two results.
The solution turned out to be simple, of course, but since I spent some time thinking it through I thought it worth creating a quick post.

Walking through a simple example, create the following sample table.

CREATE TABLE dbo.DuplicateSample(
    id int IDENTITY(1,1) NOT NULL,
    name varchar(50) NOT NULL
)

 

Throw a load of sample data into it.

Insert dbo.DuplicateSample (name) Values('Bob')
Insert dbo.DuplicateSample (name) Values('Paul')
Insert dbo.DuplicateSample (name) Values('James')
Insert dbo.DuplicateSample (name) Values('Claire')
Insert dbo.DuplicateSample (name) Values('Alex')
Insert dbo.DuplicateSample (name) Values('Bob')
Insert dbo.DuplicateSample (name) Values('Rebecca')
Insert dbo.DuplicateSample (name) Values('Paul')
Insert dbo.DuplicateSample (name) Values('Andrew')
Insert dbo.DuplicateSample (name) Values('Giles')
Insert dbo.DuplicateSample (name) Values('Terry')
Insert dbo.DuplicateSample (name) Values('Robert')
Insert dbo.DuplicateSample (name) Values('Ralph')
Insert dbo.DuplicateSample (name) Values('Alex')
Insert dbo.DuplicateSample (name) Values('Heather')
Insert dbo.DuplicateSample (name) Values('Alice')
Insert dbo.DuplicateSample (name) Values('Heather')
Insert dbo.DuplicateSample (name) Values('Oliver')
Insert dbo.DuplicateSample (name) Values('Jack')
Insert dbo.DuplicateSample (name) Values('Harry')
Insert dbo.DuplicateSample (name) Values('Alfie')
Insert dbo.DuplicateSample (name) Values('Charlie')
Insert dbo.DuplicateSample (name) Values('Thomas')
Insert dbo.DuplicateSample (name) Values('William')
Insert dbo.DuplicateSample (name) Values('Joshua')
Insert dbo.DuplicateSample (name) Values('George')
Insert dbo.DuplicateSample (name) Values('James')

 

Now our initial attempt at selecting duplicates might be something like the following.

Select          ds1.id As id1
               ,ds1.name As name1
               ,ds2.id As id2
               ,ds2.name As name2
From            DuplicateSample ds1
Inner Join      DuplicateSample ds2 On (ds1.name = ds2.name And ds1.id != ds2.id)

 

This will give us the following results.

id1 name1 id2 name2
6 Bob 1 Bob
8 Paul 2 Paul
27 James 3 James
14 Alex 5 Alex
1 Bob 6 Bob
2 Paul 8 Paul
5 Alex 14 Alex
17 Heather 15 Heather
15 Heather 17 Heather
3 James 27 James

 

Note that all the rows are shown twice, but switched round, i.e. Bob 1 matches Bob 6, but also Bob 6 matches Bob 1.

So, how do we solve this? By adding in an order to the rows and only matching ahead.

;With cte_DuplicateSample As (
    Select          Row_Number() Over (Order By id) As row_number
                   ,id
                   ,name
    From            DuplicateSample
)
Select          ds1.id As id1
               ,ds1.name As name1
               ,ds2.id As id2
               ,ds2.name As name2
From            cte_DuplicateSample ds1
Inner Join      cte_DuplicateSample ds2 On (ds1.name = ds2.name And ds1.row_number < ds2.row_number)

 

This will give us the following, much better, results.

id1 name1 id2 name2
1 Bob 6 Bob
2 Paul 8 Paul
5 Alex 14 Alex
15 Heather 17 Heather
3 James 27 James

 

The CTE selects out the same source data, but adds in a row_number, then in the joining clause we only match for rows that are ahead of the current row, i.e. when the row_number is greater that the current row_number.

Simple in the end.

Categories: SQL Server Tags: , ,