Extract / Filter info from a data.frame in R

Question

Extract / Filter info from a data.frame in R

Navigation

#1 by (0 votes)

-1

The issue is this, I hope you can help me:

I have two files (tables) imported into R, the first is a small table with reference names, the second are names repeated with statistical data, I need to make a filter where ONLY the names that match the first file, together with the other statistical data, that is to say to reduce the second file with the criteria of the first file, since I have many names, I need to make a kind of match and that I keep all the information of these to a table. Any ideas?

r filtro

asked by Pedro R. Orozco 18.12.2018 в 16:45

source

1 answer

Send email with Firebase Cloud Functions Use certain parts of a JSON file in PHP

score 0 · Answer 1

To be able to test the solution we create random data with the structure of your data. A data.frame with keys to keep, which we call df1 and a data.frame with those same keys and other different keys, in addition to some random data that simulate the statistics.

set.seed(2018)
df1 <- data.frame(claves = letters[1:3],    # a, b y c
                  stringsAsFactors = FALSE)
df2 <- data.frame(claves = sample(letters, 100, replace = TRUE), #100 letras al azar, algunas son a, b o c 
                  estadisticos = rnorm(100),                     #Unos datos numéricos al azar que simulan los estadísticos
                  stringsAsFactors = FALSE)

To filter we are going to ask R to keep only the rows of df2 in which the value of the key column is also present in the keys of df1 . We use the operator %in (membership). If we used == (equality) it would only verify equality with an element, and we would verify equality in several elements. We use df2[ , ] to specify that we want a subset of that object. The name claves is irrelevant, the columns could be called in any way.

df2[df2$claves %in% df1$claves, ]

It's a basic operation and there are many ways to do it. This solution uses simpler operators possible. If your data is very large it could be somewhat slow. In that case, it could be faster to work with a join of dplyr , which tend to be faster:

library(dplyr)
left_join(df1, df2)

left_join indicates that only the data.frame values on the left side (in this case df1 ) that match those on the right side ( df2 ) will be preserved.

In this case it is necessary that the tie column names (key) are exactly the same in both data.frame, otherwise left_join would not know which column to join. You can specify names with the by= argument, but it's more complicated.