Archive
Listing duplicates from the same table without repetition
I recently came across what was apparently a simple problem, but which had me initially stumped. The problem was, how to list out duplicates within the same table without repetition. For example if rows A and B in a table are deemed to have matched, then selecting from the table joined to itself will give two results, A matches B and B matches A. Both are valid results, but it effectively doubles the number of duplicates reported. I only needed one of the two results.
The solution turned out to be simple, of course, but since I spent some time thinking it through I thought it worth creating a quick post.
Walking through a simple example, create the following sample table.
CREATE TABLE dbo.DuplicateSample( id int IDENTITY(1,1) NOT NULL, name varchar(50) NOT NULL )
Throw a load of sample data into it.
Insert dbo.DuplicateSample (name) Values('Bob') Insert dbo.DuplicateSample (name) Values('Paul') Insert dbo.DuplicateSample (name) Values('James') Insert dbo.DuplicateSample (name) Values('Claire') Insert dbo.DuplicateSample (name) Values('Alex') Insert dbo.DuplicateSample (name) Values('Bob') Insert dbo.DuplicateSample (name) Values('Rebecca') Insert dbo.DuplicateSample (name) Values('Paul') Insert dbo.DuplicateSample (name) Values('Andrew') Insert dbo.DuplicateSample (name) Values('Giles') Insert dbo.DuplicateSample (name) Values('Terry') Insert dbo.DuplicateSample (name) Values('Robert') Insert dbo.DuplicateSample (name) Values('Ralph') Insert dbo.DuplicateSample (name) Values('Alex') Insert dbo.DuplicateSample (name) Values('Heather') Insert dbo.DuplicateSample (name) Values('Alice') Insert dbo.DuplicateSample (name) Values('Heather') Insert dbo.DuplicateSample (name) Values('Oliver') Insert dbo.DuplicateSample (name) Values('Jack') Insert dbo.DuplicateSample (name) Values('Harry') Insert dbo.DuplicateSample (name) Values('Alfie') Insert dbo.DuplicateSample (name) Values('Charlie') Insert dbo.DuplicateSample (name) Values('Thomas') Insert dbo.DuplicateSample (name) Values('William') Insert dbo.DuplicateSample (name) Values('Joshua') Insert dbo.DuplicateSample (name) Values('George') Insert dbo.DuplicateSample (name) Values('James')
Now our initial attempt at selecting duplicates might be something like the following.
Select ds1.id As id1 ,ds1.name As name1 ,ds2.id As id2 ,ds2.name As name2 From DuplicateSample ds1 Inner Join DuplicateSample ds2 On (ds1.name = ds2.name And ds1.id != ds2.id)
This will give us the following results.
id1 | name1 | id2 | name2 |
---|---|---|---|
6 | Bob | 1 | Bob |
8 | Paul | 2 | Paul |
27 | James | 3 | James |
14 | Alex | 5 | Alex |
1 | Bob | 6 | Bob |
2 | Paul | 8 | Paul |
5 | Alex | 14 | Alex |
17 | Heather | 15 | Heather |
15 | Heather | 17 | Heather |
3 | James | 27 | James |
Note that all the rows are shown twice, but switched round, i.e. Bob 1 matches Bob 6, but also Bob 6 matches Bob 1.
So, how do we solve this? By adding in an order to the rows and only matching ahead.
;With cte_DuplicateSample As ( Select Row_Number() Over (Order By id) As row_number ,id ,name From DuplicateSample ) Select ds1.id As id1 ,ds1.name As name1 ,ds2.id As id2 ,ds2.name As name2 From cte_DuplicateSample ds1 Inner Join cte_DuplicateSample ds2 On (ds1.name = ds2.name And ds1.row_number < ds2.row_number)
This will give us the following, much better, results.
id1 | name1 | id2 | name2 |
---|---|---|---|
1 | Bob | 6 | Bob |
2 | Paul | 8 | Paul |
5 | Alex | 14 | Alex |
15 | Heather | 17 | Heather |
3 | James | 27 | James |
The CTE selects out the same source data, but adds in a row_number, then in the joining clause we only match for rows that are ahead of the current row, i.e. when the row_number is greater that the current row_number.
Simple in the end.