Filter rows of a dataframe whose columns have different / equal values in R

3

I have a data frame with a tumor classification.

The program I use applies two different classification methods, and some samples end up classified differently by each.

I would like to build two tables: one with the tumors whose classification matches (rows where every column has the same value) and another with those that do not match (rows where the column values differ).

I tried using subset together with $ to refer to the columns and compare their values:

subset(tabla_comp, tabla_comp$Rfcms.RF.nearestCMS == tabla_comp$Rfcms.RF.predictedCMS) 

This is only an example (it compares just the first two columns, but it shows the idea I have been following). It does not work and gives me:

  

Error in Ops.factor(tabla_comp$Rfcms.RF.nearestCMS, tabla_comp$Rfcms.RF.predictedCMS)

    
asked by Carlos Carretero Puche 22.02.2017 at 10:45

2 answers

3

To achieve what you want you have several options.

Using subset

When calling subset, you do not need to prefix each column with the table name and the dollar sign ( $ ). The column names you use are looked up inside the table passed as the x argument.

So, you can do the following:

subset(x = tabla_comp, 
       subset = Rfcms.RF.nearestCMS == Rfcms.RF.predictedCMS &
                Rfcms.RF.predictedCMS == SScms.SSP.nearestCMS & 
                SScms.SSP.nearestCMS == SScms.SSP.predictedCMS)
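The Ops.factor error in the question suggests that the columns are stored as factors, and == between two factors fails when their level sets differ. A minimal sketch, assuming that is the cause, compares the values as character strings instead. The data below are invented for illustration and only two columns are used.

```r
# Toy data (hypothetical values); the two factor columns have different level sets
tabla_comp <- data.frame(
  Rfcms.RF.nearestCMS   = factor(c("CMS1", "CMS2", "CMS3")),
  Rfcms.RF.predictedCMS = factor(c("CMS1", "CMS2", "CMS4"))
)

# Comparing as character strings avoids the factor level check
iguales <- subset(tabla_comp,
                  as.character(Rfcms.RF.nearestCMS) ==
                    as.character(Rfcms.RF.predictedCMS))

# Negating the same condition gives the rows that do not match
distintos <- subset(tabla_comp,
                    as.character(Rfcms.RF.nearestCMS) !=
                      as.character(Rfcms.RF.predictedCMS))
```

With more columns, the same as.character() wrapping would apply to each comparison joined with &.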

Using brackets

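The same comparison can be written with bracket notation ( [] ). Here each column does need the table name and $, because the condition is evaluated outside the table: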

tabla_comp[tabla_comp$Rfcms.RF.nearestCMS == tabla_comp$Rfcms.RF.predictedCMS &
           tabla_comp$Rfcms.RF.predictedCMS == tabla_comp$SScms.SSP.nearestCMS & 
           tabla_comp$SScms.SSP.nearestCMS == tabla_comp$SScms.SSP.predictedCMS, ]
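Since the question asks for two tables, one way to get both with brackets is to store the condition in a logical vector and reuse it. This is only a sketch on invented data with two character columns; the name coinciden is a helper chosen for the example.

```r
# Toy data (hypothetical values), stored as character rather than factor
tabla_comp <- data.frame(
  Rfcms.RF.nearestCMS   = c("CMS1", "CMS2", "CMS1"),
  Rfcms.RF.predictedCMS = c("CMS1", "CMS2", "CMS2"),
  stringsAsFactors = FALSE
)

# TRUE for the rows where both classifications agree
coinciden <- tabla_comp$Rfcms.RF.nearestCMS == tabla_comp$Rfcms.RF.predictedCMS

iguales   <- tabla_comp[coinciden, ]   # rows that match
distintos <- tabla_comp[!coinciden, ]  # rows that do not match
```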

Using brackets and functions

The previous methods have the disadvantage that long column names must be typed out in full several times, which makes typing errors likely, and a single typo is enough to give an unexpected result.

They also have to be rewritten whenever the table has different column names or a different number of columns, which makes the code harder to maintain.

It is therefore better to let R do the work by combining bracket notation with a few functions.

mi_df[which(apply(X = mi_df, MARGIN = 1, FUN = function(renglon) { length(unique(renglon)) } ) == 1), ]

We use apply with an anonymous function.

First we get the distinct values in each row of the table with unique, and then we use length to count how many distinct values that row has. If the result is 1, all the columns in that row hold the same value.

We use which inside the brackets to select only the rows for which the comparison == 1 is true ( TRUE ).

This procedure can be reused no matter how "tall" or "wide" the table is, as long as it holds the same kind of data as the table shown in this question.
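As a worked illustration of the apply approach, the sketch below runs it on a small invented table and keeps the per-row count so that both tables can be built from it. The column names metodo_a, metodo_b and metodo_c and the variable n_unicos are hypothetical.

```r
# Toy table (hypothetical column names and values)
mi_df <- data.frame(
  metodo_a = c("CMS1", "CMS2", "CMS1", "CMS2"),
  metodo_b = c("CMS1", "CMS2", "CMS1", "CMS1"),
  metodo_c = c("CMS1", "CMS2", "CMS2", "CMS1"),
  stringsAsFactors = FALSE
)

# Number of distinct values found in each row
n_unicos <- apply(X = mi_df, MARGIN = 1,
                  FUN = function(renglon) length(unique(renglon)))

iguales   <- mi_df[n_unicos == 1, ]  # every column agrees
distintos <- mi_df[n_unicos > 1, ]   # at least one column differs
```

Here the first two rows end up in iguales and the last two in distintos.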

    
answered by 27.02.2017 at 23:46
2

If you want to use the dplyr package (I always prefer it!), you can do the following.

First I generate an example of your dataframe :

Rfcms.RF.nearestCMS   <- c("CMS1", "CMS2", "CMS1", "CMS2")
Rfcms.RF.predictedCMS <- c("CMS1", "CMS2", "CMS1", "CMS1")
SScms.SSP.nearestCMS  <- c("CMS1", "CMS2", "CMS2", "CMS1")

library(dplyr) 
# tibble() replaces data_frame(), which is deprecated
(tabla_comp <- tibble(Rfcms.RF.nearestCMS, 
                      Rfcms.RF.predictedCMS, 
                      SScms.SSP.nearestCMS))

# A tibble: 4 × 3
  Rfcms.RF.nearestCMS Rfcms.RF.predictedCMS SScms.SSP.nearestCMS
            <chr>                 <chr>                <chr>
1                CMS1                  CMS1                 CMS1
2                CMS2                  CMS2                 CMS2
3                CMS1                  CMS1                 CMS2
4                CMS2                  CMS1                 CMS1

Next I build the expression that will be evaluated in the following step, using lazyeval::interp. This is where the values of the columns are compared.

expr_iguales <- lazyeval::interp(quote(x == y & x == z), 
                                 x = as.name(names(tabla_comp)[1]), 
                                 y = as.name(names(tabla_comp)[2]),
                                 z = as.name(names(tabla_comp)[3]))

Now I generate a dataframe with the rows that have the same values in all the columns using dplyr::filter .

# The stored expression must be unquoted with !! (dplyr >= 0.7; older versions used filter_())
(iguales <- tabla_comp %>% filter(!!expr_iguales))

# A tibble: 2 × 3
  Rfcms.RF.nearestCMS Rfcms.RF.predictedCMS SScms.SSP.nearestCMS
                <chr>                 <chr>                <chr>
1                CMS1                  CMS1                 CMS1
2                CMS2                  CMS2                 CMS2

Then I do a dplyr::anti_join to generate the other dataframe that has the rows whose columns are not all the same.

(distintos <- anti_join(tabla_comp, iguales))

Joining, by = c("Rfcms.RF.nearestCMS", "Rfcms.RF.predictedCMS", "SScms.SSP.nearestCMS")
# A tibble: 2 × 3
  Rfcms.RF.nearestCMS Rfcms.RF.predictedCMS SScms.SSP.nearestCMS
                <chr>                 <chr>                <chr>
1                CMS2                  CMS1                 CMS1
2                CMS1                  CMS1                 CMS2

Since I do not specify which columns to join by, anti_join uses all of them.
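Because anti_join matches rows by their values, an alternative that does not rely on a join is to compute the comparison once as a logical column and filter on it and on its negation. This is only a sketch of that idea with the same example data; the column name coinciden is a helper chosen for illustration.

```r
library(dplyr)

# Same example data as above
tabla_comp <- tibble(
  Rfcms.RF.nearestCMS   = c("CMS1", "CMS2", "CMS1", "CMS2"),
  Rfcms.RF.predictedCMS = c("CMS1", "CMS2", "CMS1", "CMS1"),
  SScms.SSP.nearestCMS  = c("CMS1", "CMS2", "CMS2", "CMS1")
)

# Flag the rows where all three classifications agree
con_marca <- tabla_comp %>%
  mutate(coinciden = Rfcms.RF.nearestCMS == Rfcms.RF.predictedCMS &
                     Rfcms.RF.nearestCMS == SScms.SSP.nearestCMS)

# Split on the flag and drop the helper column
iguales   <- con_marca %>% filter(coinciden)  %>% select(-coinciden)
distintos <- con_marca %>% filter(!coinciden) %>% select(-coinciden)
```

Both results keep the original row order, and each row of tabla_comp lands in exactly one of the two tables.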

    
answered by 28.02.2017 at 19:25