Problems with merge and dplyr

2

I am joining two data.frames huge by a common variable using merge , and the data.frame final% has many more lines than the initial ones, which suggests that you are duplicating observations. I'm using:

df3 <- merge(df1, df2, by="ID", all=FALSE)

It is assumed that with that all=False would be avoiding duplicates, right?

I will give an example

df1

ID           Ubicación     AñoParto      Hijos
26            0012           2000          2
26            0012           2002          3
26            0012           2005          2
42            0013           2001          1
42            0013           2002          1
42            0013           2007          2

And another df like that

ID           Ubicación     AñoParto      Observaciones     Peso
26            0012           2000          1                300
26            0012           2000          2                450
26            0012           2000          3                650
26            0012           2002          1                250
26            0012           2002          2                450
26            0012           2005          1                550
26            0012           2005          2                650
26            0012           2005          3                900
42            0013           2001          1                300
42            0013           2001          2                450
42            0013           2002          1                520
42            0013           2007          1                250
42            0013           2007          2                550

In the end what I want is

ID           Ubicación     AñoParto      Observaciones     Peso   Hijos
26            0012           2000          1                300    2
26            0012           2000          2                450    2
26            0012           2000          3                650    2
26            0012           2002          1                250    3
26            0012           2002          2                450    3
26            0012           2005          1                550    2
26            0012           2005          2                650    2
26            0012           2005          3                900    2
42            0013           2001          1                300    1
42            0013           2001          2                450    1
42            0013           2002          1                520    1
42            0013           2007          1                250    2
42            0013           2007          2                550    2

What interests me is to stay with a final df that only contains the elements that were joined by the common variable "nombre" , but that I keep all the columns of df2 . I tried too

df4 <- semi_join(df1, df2) 

But I see that it only keeps the variables of the df1 , although I would say that if you only leave in common the variable "nombre" .

What should I do?

    
asked by Caro 26.06.2018 в 09:16
source

1 answer

2

The problem you have, I understand, is that you have not correctly defined the JOIN arguments. By doing this:

df3 <- merge(df1, df2, by="ID", all=FALSE)

Basically you are combining all the rows that share the same ID , which means that each ID of df1 will be combined with each of the ID of df2 , hence you get more records of what you expect, for example in df1 you have 3 rows with ID == 26 and 8 rows with the same ID in df2 , the final result will be 3 * 8 = 24 rows for ID == 26 .

Viewing your data, what I think is that you need to add to the argument of join the variables of Ubicación and AñoParto in the following way:

> merge(df1, df2, by=c("ID", "Ubicación", "AñoParto"))

   ID Ubicación AñoParto Hijos Observaciones Peso
1  26      0012     2000     2             1  300
2  26      0012     2000     2             2  450
3  26      0012     2000     2             3  650
4  26      0012     2002     3             1  250
5  26      0012     2002     3             2  450
6  26      0012     2005     2             1  550
7  26      0012     2005     2             2  650
8  26      0012     2005     2             3  900
9  42      0013     2001     1             1  300
10 42      0013     2001     1             2  450
11 42      0013     2002     1             1  520
12 42      0013     2007     2             1  250
13 42      0013     2007     2             2  550

An interesting detail of merge is that if it detects equal column names in both data.frames it uses them as an argument of join automatically, so that in your case make this simple: merge(df1, df2) should work the same.

    
answered by 26.06.2018 в 16:08