Duplicate records in a database can lead to inaccurate data analysis, increased storage usage, and slower queries. It’s crucial to identify and remove these duplicates to maintain the integrity and performance of your database. In this blog post, we’ll explore three methods to find and delete duplicates in SQL: using GROUP BY, subqueries, and Common Table Expressions (CTE). We’ll also provide the SQL script to create the sample table used in these examples.
There are three methods to find and delete duplicates:
1. Using GROUP BY
2. Using a Subquery
3. Using a Common Table Expression (CTE) and a Window Function
Let's go through each method one by one, with an explanation.
Let’s start by creating a sample Employee table and populating it with some data, including duplicates.
-- Create Employee Table
CREATE TABLE Employee (
ID INT PRIMARY KEY,
employee_name VARCHAR(100),
city VARCHAR(100)
);
-- Insert Sample Data
INSERT INTO Employee (ID, employee_name, city) VALUES
(1, 'Amit', 'Delhi'),
(2, 'Rohit', 'Mumbai'),
(3, 'Suman', 'Delhi'),
(4, 'Amit', 'Delhi'), -- Duplicate
(5, 'Rohit', 'Mumbai'), -- Duplicate
(6, 'Neha', 'Pune'),
(7, 'Amit', 'Delhi'); -- Duplicate
With the table and data ready, let's explore the methods to find and delete duplicates.
Method 1: Using GROUP BY
The GROUP BY clause groups records with identical values in specified columns. By combining it with the HAVING clause, we can find groups with more than one occurrence, indicating duplicates.
SELECT
employee_name,
city,
COUNT(*) AS duplicate_count
FROM Employee
GROUP BY employee_name, city
HAVING COUNT(*) > 1;
Explanation:
The query groups rows by employee_name and city.
COUNT(*) AS duplicate_count counts the occurrences of each group.
The HAVING COUNT(*) > 1 condition filters the results to show only groups with more than one record.
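The grouped result shows which name and city combinations repeat, but not the individual rows behind them. As a rough sketch, joining the grouped result back to the table lists every row in a duplicated group along with its ID. The aliases e and d are illustrative, and the equality join assumes employee_name and city contain no NULLs, as in the sample data.

```sql
-- List every row that belongs to a duplicated (employee_name, city) group.
SELECT e.ID, e.employee_name, e.city
FROM Employee AS e
JOIN (
    -- Same grouping as above: keep only combinations that occur more than once
    SELECT employee_name, city
    FROM Employee
    GROUP BY employee_name, city
    HAVING COUNT(*) > 1
) AS d
  ON d.employee_name = e.employee_name
 AND d.city = e.city
ORDER BY e.employee_name, e.city, e.ID;
```

On the sample data this returns five rows: IDs 1, 4, and 7 for Amit in Delhi, and IDs 2 and 5 for Rohit in Mumbai.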
Usage:
This method is useful when you want to identify duplicates without immediately deleting them. It allows you to review which records are duplicated before proceeding with any deletions.
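If you do want to delete with GROUP BY once the review is done, one possible approach is to keep the smallest ID in each group and remove everything else. This is a sketch rather than the only way to do it. The alias keepers and the column name keep_id are illustrative, and the extra derived-table wrapper is there because MySQL rejects a DELETE whose subquery reads directly from the table being deleted from; other engines generally accept the query with or without it.

```sql
-- Keep the row with the smallest ID per (employee_name, city); delete the rest.
DELETE FROM Employee
WHERE ID NOT IN (
    SELECT keep_id
    FROM (
        -- One surviving ID per group
        SELECT MIN(ID) AS keep_id
        FROM Employee
        GROUP BY employee_name, city
    ) AS keepers
);
```

Run against the original sample data, this removes IDs 4, 5, and 7 and leaves IDs 1, 2, 3, and 6.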
Method 2: Using Subquery
This method uses a subquery to identify duplicate records and deletes them, keeping only the first occurrence based on the ID column.
--To view duplicates
SELECT * FROM Employee
WHERE EXISTS
(
SELECT 1
FROM Employee AS e2
WHERE Employee.employee_name = e2.employee_name
AND Employee.city = e2.city
AND Employee.ID > e2.ID
);
-- To delete duplicates
DELETE FROM Employee
WHERE EXISTS
(
SELECT 1
FROM Employee AS e2
WHERE Employee.employee_name = e2.employee_name
AND Employee.city = e2.city
AND Employee.ID > e2.ID
);
Explanation:
The subquery checks if a duplicate exists by comparing the current row (Employee) with another row (e2) that has the same employee_name and city.
AND Employee.ID > e2.ID ensures that only the duplicate rows with higher ID values are deleted, preserving the first occurrence.
Usage: This approach is effective when you want to remove duplicates directly while keeping the earliest record (smallest ID).
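Before running the DELETE, it can help to preview the rows that would survive. A minimal sketch is to invert the same correlated subquery with NOT EXISTS, so a row is returned only when no matching row has a smaller ID. The alias e1 is illustrative.

```sql
-- Preview the rows that would remain after the delete:
-- a row survives when no row with the same name and city has a smaller ID.
SELECT e1.ID, e1.employee_name, e1.city
FROM Employee AS e1
WHERE NOT EXISTS (
    SELECT 1
    FROM Employee AS e2
    WHERE e2.employee_name = e1.employee_name
      AND e2.city = e1.city
      AND e2.ID < e1.ID
)
ORDER BY e1.ID;
```

On the original sample data this returns IDs 1, 2, 3, and 6, which are the rows the DELETE leaves in place.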
Method 3: Using Common Table Expression (CTE)
A Common Table Expression (CTE) provides a more readable and flexible approach to handling duplicates using the ROW_NUMBER() function.
-- Using a CTE and a window function
-- To view duplicates
WITH cte AS
(
SELECT
employee_name,
city,
id,
ROW_NUMBER()
OVER(PARTITION BY employee_name, city ORDER BY id) AS Rwn
FROM Employee
)
SELECT * FROM cte WHERE Rwn > 1;
-- To delete duplicate records
WITH cte AS (
SELECT
employee_name,
city,
id,
ROW_NUMBER()
OVER(PARTITION BY employee_name, city ORDER BY id) AS Rwn
FROM Employee
)
DELETE FROM Employee
WHERE ID IN (SELECT id FROM cte WHERE Rwn > 1);
Explanation:
The CTE assigns a row number to each row within a partition defined by employee_name and city.
ROW_NUMBER() generates a sequential number for each row in a partition, ordered by id, so the row with the smallest id gets 1. Ordering by employee_name instead would leave the numbering arbitrary, because every row in a partition shares the same employee_name.
The main query then selects rows with a row number greater than one (Rwn > 1), identifying duplicates.
The DELETE statement removes these duplicates, keeping only the first occurrence.
Usage: This method is ideal when you need a flexible, scalable way to handle duplicates, especially in complex datasets.
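Part of that flexibility comes from the ORDER BY inside the window: it decides which row in each group counts as the one to keep. As an illustrative sketch, ordering by ID in descending order keeps the most recent record (largest ID) instead of the earliest. The names ranked and rn are hypothetical, and the example assumes an engine that allows a WITH clause in front of DELETE.

```sql
-- Keep the row with the largest ID in each (employee_name, city) group.
WITH ranked AS (
    SELECT
        ID,
        ROW_NUMBER() OVER (
            PARTITION BY employee_name, city
            ORDER BY ID DESC   -- largest ID gets rn = 1 and is kept
        ) AS rn
    FROM Employee
)
DELETE FROM Employee
WHERE ID IN (SELECT ID FROM ranked WHERE rn > 1);
```

Run against the original sample data, this removes IDs 1, 2, and 4 and keeps IDs 3, 5, 6, and 7.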
Conclusion
Managing duplicates is a common but critical task in SQL database management. The methods outlined above provide various approaches depending on your specific needs—whether it’s to identify, review, or delete duplicates directly. By using GROUP BY, subqueries, or CTEs, you can ensure that your database remains clean, efficient, and ready for accurate data analysis.
Feel free to choose the method that best suits your scenario, and always remember to back up your data before performing delete operations!
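For a small table, one simple way to take that backup is to copy the rows into a second table before deleting anything. This is only a sketch: the table name Employee_backup is hypothetical, the CREATE TABLE ... AS SELECT form is accepted by PostgreSQL, MySQL, and SQLite, and SQL Server uses SELECT ... INTO instead. The copy holds the data but typically not the primary key or other constraints.

```sql
-- Copy all current rows into a backup table (PostgreSQL, MySQL, SQLite).
CREATE TABLE Employee_backup AS
SELECT * FROM Employee;

-- SQL Server equivalent (run this instead of the statement above):
-- SELECT * INTO Employee_backup FROM Employee;

-- Sanity check: the backup should hold the same number of rows as the source.
SELECT COUNT(*) AS backup_rows FROM Employee_backup;
```

If a delete removes more than intended, the missing rows can be copied back from Employee_backup.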