I am joining two huge data.frames by a common variable using merge, and the final data.frame has many more rows than the original ones, which suggests that observations are being duplicated. I'm using:
df3 <- merge(df1, df2, by="ID", all=FALSE)
I assumed that all=FALSE would avoid duplicates, right?
Here is an example:
df1
ID Ubicación AñoParto Hijos
26 0012 2000 2
26 0012 2002 3
26 0012 2005 2
42 0013 2001 1
42 0013 2002 1
42 0013 2007 2
And another data.frame (df2) like this:
ID Ubicación AñoParto Observaciones Peso
26 0012 2000 1 300
26 0012 2000 2 450
26 0012 2000 3 650
26 0012 2002 1 250
26 0012 2002 2 450
26 0012 2005 1 550
26 0012 2005 2 650
26 0012 2005 3 900
42 0013 2001 1 300
42 0013 2001 2 450
42 0013 2002 1 520
42 0013 2007 1 250
42 0013 2007 2 550
In the end, what I want is:
ID Ubicación AñoParto Observaciones Peso Hijos
26 0012 2000 1 300 2
26 0012 2000 2 450 2
26 0012 2000 3 650 2
26 0012 2002 1 250 3
26 0012 2002 2 450 3
26 0012 2005 1 550 2
26 0012 2005 2 650 2
26 0012 2005 3 900 2
42 0013 2001 1 300 1
42 0013 2001 2 450 1
42 0013 2002 1 520 1
42 0013 2007 1 250 2
42 0013 2007 2 550 2
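To make this reproducible, here is a minimal sketch that rebuilds the two example tables and runs the same merge call on them. I am assuming here that Ubicación is stored as text so that the leading zeros survive; my real data may be typed differently. The result I want has 13 rows, but merging on ID alone gives many more:

```r
# Toy versions of the two tables above.
# Assumption: Ubicación is character, to keep the leading zeros.
# Backticks and check.names = FALSE keep the accented column names as written.
df1 <- data.frame(
  ID          = c(26, 26, 26, 42, 42, 42),
  `Ubicación` = c("0012", "0012", "0012", "0013", "0013", "0013"),
  `AñoParto`  = c(2000, 2002, 2005, 2001, 2002, 2007),
  Hijos       = c(2, 3, 2, 1, 1, 2),
  check.names = FALSE
)

df2 <- data.frame(
  ID            = c(rep(26, 8), rep(42, 5)),
  `Ubicación`   = c(rep("0012", 8), rep("0013", 5)),
  `AñoParto`    = c(2000, 2000, 2000, 2002, 2002, 2005, 2005, 2005,
                    2001, 2001, 2002, 2007, 2007),
  Observaciones = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 1, 2),
  Peso          = c(300, 450, 650, 250, 450, 550, 650, 900,
                    300, 450, 520, 250, 550),
  check.names = FALSE
)

# Same call as in my real code: join on ID only
df3 <- merge(df1, df2, by = "ID", all = FALSE)

nrow(df1)  # 6
nrow(df2)  # 13
nrow(df3)  # 39: every df1 row for an ID is paired with every df2 row for that ID
```

For ID 26 that is 3 x 8 = 24 rows and for ID 42 it is 3 x 5 = 15 rows, which matches the kind of blow-up I see in the real data.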
What I want is a final data.frame that contains only the rows that were matched on the common variable "nombre", while keeping all the columns of df2. I also tried:
df4 <- semi_join(df1, df2)
But I see that it keeps only the variables of df1, although it does seem to keep only the rows that match on the common variable "nombre".
What should I do?