Home > SQL Server > SQL SERVER – Find and delete duplicate records in a table @SQLSERVER

SQL SERVER – Find and delete duplicate records in a table @SQLSERVER


Finding duplicate records in a table is an interesting task for me. I always like to do data cleaning task. Because, I work with lots of data sets and each data set is coming from different parts of the world. So, I get different data format with different data modeling. Of course, I do get data with duplicate records. So, I have written lots of logic to find duplicate rows in a table. In which, some rows have exactly same data and some of them only has duplicate key values. But, I don’t use same logic to solve this problem every time and I try  different methods every-time to solve this case. Based on my experience I am going to share my idea or technique to find duplicate rows in a table in this blog post.

I have two scenarios to explain now.

  1. I may have multiple rows exactly same
  2. I may have to check duplicates with specific column(s).

Implementation

Scenario 1:

In this demo I am going to show you how to fetch the row that has more than one entry in a table. I am using binary_checksum(*) to create hash key for the entire row. But there are some restrictions in using binary_checksum(*) function. Please refer msdn for more details about this function.

I have a table with few US state names and I want to pick the duplicate rows from the table using the below SQL query with the help of Common Table Expression (CTE).

Exploring table data:

pic1

Finding Duplicates rows:

pic2

This query output gives only the duplicate rows from the table. You might have noticed an additional column RowID. This column is used to generate sequential number for each group. For instance, State name “Arizona” appears 3 time in the source table and this query prints RowID 2 and 3. Because, we are keeping the first record in the source.

If you want to get only unique record from the group then you should use RowID=1 in the where clause.

pic3

You can use this query to delete duplicate rows except the first row.

pic4Scenario 2:

In this demo, I will use column names to find duplicate row. To do so, we take our query from scenario 1 and put the column name(s) instead binary_checksum(*) function in CTE.

pic5

Conclusion

I hope this post may be useful for you. Please feel free to use this code in your project.

Download Script here.

As always, Please share your comments about this post.

Advertisements
  1. Smital
    July 3, 2013 at 10:44 pm

    hello … need help
    i tried this code but getting right parenthesis missing error. can u tell me what mistake i am doing here

    with getDuplicateValues
    as
    (
    SELECT row_number() OVER (PARITION BY surname ORDER BY surname) as rowid, * “605910766”.UKB_BTS_CUST_PROD_DET_BACKUP
    )
    SELECT * FROM GETDUPLICATEVALUES WHERE ROWID > 1
    go;

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: