Search and Delete duplicate MYSQL records

1

Well, my situation is as follows, I have a rather large database, and there are a lot of duplicate records (there are actually many that are repeated more than twice but I do not know how to call that). What I'm looking for is a query to help me search , or maybe delete directly duplicate records and > leaving only one . So far I have the following query:

SELECT email, COUNT(*) Total
FROM usuarios
GROUP BY email
HAVING COUNT(*) > 1

Which selects me the records that are repeated more than once, or in this case, the records that have the same email more than once. This query throws me nothing more and nothing less than 58.680 results, and then, it is quite cumbersome to have to delete them one by one. I have the feeling that this consultation can be extended to achieve what I want but I have no idea how to do it, I hope you can help me. If it works, my database manager is Phpmyadmin 4.7.4 .

    
asked by Jalkhov 20.03.2018 в 01:41
source

2 answers

4

There are several ways to do what you want.

I. Using JOIN :

This technique consists of joining the table to itself by JOIN . And use a combined comparison of the column id (which is assumed never to be repeated) and the repeated column ( email in this case).

The query would be this:

DELETE t1 FROM usuarios t1
INNER JOIN usuarios t2 
WHERE t1.id > t2.id AND t1.email = t2.email; 

Let's see a complete example:

VIEW DEMO IN REXTESTER

-- I. Datos de prueba 

CREATE TABLE usuarios (
    id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL, 
    email VARCHAR(255) NOT NULL
);

INSERT INTO usuarios (first_name,last_name,email) 
VALUES ('Pedro ','Pérez','[email protected]'),
       ('Juan','Roa','[email protected]'),
       ('Santiago','Guerrero','[email protected]'),
       ('Juan','Roa','[email protected]'),
       ('Andrés ','Brito','[email protected]'),
       ('Juan','Roa','[email protected]'),
       ('Santiago','Guerrero','[email protected]'),
       ('Felipe','Castro','[email protected]')
       ;

/* Prueba de datos antes de borrar*/

SELECT 
    id, email
FROM usuarios
ORDER BY email;    

-- II. Borrar mediante JOIN

DELETE t1 FROM usuarios t1
INNER JOIN usuarios t2 
WHERE t1.id > t2.id AND t1.email = t2.email; 


/* Prueba de datos después de borrar*/

SELECT 
    id, email
FROM usuarios
ORDER BY id;    

-- III. Restringir duplicados

ALTER TABLE usuarios ADD UNIQUE (email);

/* 
    Prueba de la nueva restricción creada:
    al descomentar el código tendremos el siguiente error:
    Error(s), warning(s): Duplicate entry '[email protected]' for key 'email'
*/

-- INSERT INTO usuarios (first_name,last_name,email) VALUES ('Pedro ','Pérez','[email protected]');

Results:

Before deleting duplicates:

id  email
5   [email protected]
8   [email protected]
3   [email protected]
7   [email protected]
1   [email protected]
2   [email protected]
4   [email protected]
6   [email protected]

After deleting duplicates:

id  email
1   [email protected]
2   [email protected]
3   [email protected]
5   [email protected]
8   [email protected]

Note that the values whose id was lower were kept here. If you want to keep those whose id is greater, just put the WHERE of the query as: ... WHERE t1.id < t2.id AND t1.email = t2.email

Note that by applying this technique, there will be breaks in the id column. If you want a table with a better order in that column, see the method mentioned below.

II. By creating a new table

This technique consists of creating a new table with the same structure as the original table. Insert the records from the table with duplicates by applying a filter that avoids duplicates ( GROUP BY in this case) and then delete the old table, renaming the new one.

1. We create an empty copy of the table usuarios

CREATE TABLE usuarios_copy LIKE usuarios;

2. Insert all records in the new table, grouping by the repeated column

There are two ways to do this:

  • If you want to have a new table with the column id ordered, you can explicitly name each column, indicating the value NULL for the incremental auto column. This will allow that column to obtain its value in a natural way:

       INSERT INTO usuarios_copy   
            SELECT 
                NULL,
                first_name,
                last_name,
                email 
             FROM usuarios 
             GROUP BY email;     
    
  • If you do not mind having so many jumps in the id column, you can do the insertion like this:

       INSERT INTO usuarios_copy   
            SELECT *
             FROM usuarios 
             GROUP BY email; 
    

3. We delete the old table and rename the new one

DROP TABLE usuarios;
ALTER TABLE usuarios_copy RENAME TO usuarios;

In the data test, you will get this result. Note that here the column id regained its normal order:

id  email
1   [email protected]
2   [email protected]
3   [email protected]
4   [email protected]
5   [email protected]

Let's see a complete example applying this possibility:

SEE DEMO IN REXTESTER

III. Solve the root problem

It will be of little use to do all the previous work, if we do not change the design of the table, making the column email is UNIQUE , that is, that does not allow duplicate emails.

This would be achieved with a single line of code, which must be executed once our table is clean of duplicates:

ALTER TABLE usuarios ADD UNIQUE (email);

If now we try to insert a duplicate:

INSERT INTO usuarios (first_name,last_name,email) 
VALUES ('Pedro ','Pérez','[email protected]');

The DBMS will stop us, saying the following:

Duplicate entry '[email protected]' for key 'email'

NOTE: Maybe in a table like this, there is a group of columns that are UNIQUE , in that case, you would have to create a single combined index ... But that's another matter . I just wanted to indicate that this type of problem must be solved at the root.

    
answered by 20.03.2018 / 11:04
source
1

Basically, this is your problem

to fix it:

First step: Disable the safe updates mode (this if you use Workbench) you can do it from the Edit menu - > preferences

Next step: Create a temporary table with unique emails

create temporary table t1
select * from usuarios group by email;

will stay like this (without repeated emails)

Third Step: Delete repeated

DELETE FROM usuarios
WHERE id NOT IN ( SELECT id FROM t1 );

Ready with that you would delete the records with repeated emails.

    
answered by 20.03.2018 в 02:09