Extract the values that match between columns

Hi, I am trying to find the names that appear in both columns of a matrix, as in the following example.

       [,1]      [,2]     
 [1,] "jara"   "moreno" 
 [2,] "moreno"  "lopez"  
 [3,] "diaz"    "Swanson"
 [4,] "powell"  "jara"   
 [5,] "Mckinze" "jenner" 
 [6,] "jenner"  "londra" 
 [7,] "londra"  "kennedy"

In the end I need a matrix containing the names that appear in both columns, like this:

      [,1]    
[1,] "moreno"
[2,] "jara"  
[3,] "jenner"
[4,] "londra" 

Is there a function to do this? The real data has 10 columns with roughly 50 thousand values each.

Thanks

    
asked by nicolas garzon 08.05.2018 at 22:08

2 answers

First let's prepare your data in a reproducible example:

dat <- read.table(text='N1, N2
                        "jara",   "moreno" 
                        "moreno",  "lopez"  
                        "diaz",    "Swanson"
                        "powell",  "jara"   
                        "Mckinze", "jenner" 
                        "jenner",  "londra" 
                        "londra",  "kennedy"', 
                  header = TRUE, sep = ',', stringsAsFactors = FALSE, quote = '"', strip.white = TRUE)

This gives us a data.frame, but to stay faithful to your question, where the data appears to be a matrix without column names, we convert it:

dat <- as.matrix(dat)
colnames(dat) <- NULL
dat

     [,1]      [,2]     
[1,] "jara"    "moreno" 
[2,] "moreno"  "lopez"  
[3,] "diaz"    "Swanson"
[4,] "powell"  "jara"   
[5,] "Mckinze" "jenner" 
[6,] "jenner"  "londra" 
[7,] "londra"  "kennedy"

Now that the data matches what you described, let's turn to the solution. One way to get the values of one column that also appear in another is dat[dat[,1] %in% dat[,2], 1], which returns the values of column 1 that also occur in column 2. Doing it this way gets complicated, though, because you would also have to check in the other direction (the values of column 2 that occur in column 1), and then repeat that for every pair among the 10 columns you mention.
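For just two columns, the two-way check can be sketched with base R's intersect(), which returns the distinct values present in both vectors. The matrix is rebuilt by hand from the question's example so the snippet runs on its own; the name common is only illustrative.

```r
# Example matrix from the question, rebuilt so this snippet is self-contained
dat <- cbind(
  c("jara", "moreno", "diaz", "powell", "Mckinze", "jenner", "londra"),
  c("moreno", "lopez", "Swanson", "jara", "jenner", "londra", "kennedy")
)

# Distinct values that occur in both columns; the operation is symmetric,
# so there is no separate "backwards" check to write
common <- intersect(dat[, 1], dat[, 2])

# One-column matrix, the shape asked for in the question
matrix(common, ncol = 1)
```

This stays manageable for two columns, but with 10 columns it would mean one call per pair of columns.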

Fortunately, there is a very useful function for counting frequencies, table(), so we can do this:

tbl <- table(dat)
names(tbl[tbl > 1])
[1] "jara"   "jenner" "londra" "moreno"

table(dat) gives a frequency table of every value in the matrix, across all columns; if only some columns interest you, subset the matrix to those columns first. The result looks like this:

   diaz    jara  jenner kennedy  londra   lopez Mckinze  moreno  powell Swanson 
      1       2       2       1       2       1       1       2       1       1 

This is quite clear. All that remains is to keep the names that occur more than once, which is what names(tbl[tbl > 1]) does.

Important clarification: this solution also counts values that are repeated within a single column. If you do not want names that are repeated only within one column, a small trick adapts the solution:

tbl <- table(apply(dat, 2, function(x) {ifelse(duplicated(x), NA, x)}))
names(tbl[tbl > 1])

What apply(dat, 2, function(x) {ifelse(duplicated(x), NA, x)}) does is replace, within each column, the repeated occurrences of a value with NA, so that each value is counted at most once per column. table() ignores NA by default, so the counts then reflect the number of columns in which each name appears.
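To see the difference between the two variants, here is a small sketch on a made-up 3-column matrix (the names in it are hypothetical, not from the question) in which "ana" is repeated only inside the first column:

```r
# Hypothetical 3-column matrix; "ana" repeats only inside column 1
m <- cbind(
  c("ana", "ana", "luis"),
  c("luis", "eva", "rosa"),
  c("eva", "juan", "luis")
)

# Plain table(): "ana" is reported even though it never appears in another column
tbl_all <- table(m)
names(tbl_all[tbl_all > 1])

# Replace within-column repeats with NA first, then count;
# each name is now counted at most once per column
dedup <- apply(m, 2, function(x) ifelse(duplicated(x), NA, x))
tbl_cols <- table(dedup)
shared <- names(tbl_cols[tbl_cols > 1])
shared
```

With the second variant, "ana" drops out and only the names present in at least two different columns remain.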

    
answered by 08.05.2018 / 23:23

From what I can see, your data is in a matrix. For this kind of operation I think it is much simpler to work with a data.frame, mainly because it is easier to refer to the columns by name.

Solution for the example

library(tidyverse)   # for tribble() and other functions

nombres <- tribble(
     ~a,     ~b,
   "jara"   , "moreno" ,
   "moreno" , "lopez"  ,
   "diaz"   , "Swanson",
   "powell" , "jara"   ,
   "Mckinze", "jenner" ,
   "jenner" , "londra" ,
   "londra" , "kennedy")

nombres$a[nombres$a %in% nombres$b]

This reads as "all elements of nombres$a that also belong to nombres$b", and it returns the following vector:

[1] "jara"   "moreno" "jenner" "londra"

You could use as.matrix() to convert the result to a matrix.
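As a minimal sketch of that conversion, the snippet below uses a plain data.frame with the same values so that it runs without tidyverse; the names iguales and res are only illustrative.

```r
# Same data as a plain data.frame, so the snippet does not need tidyverse
nombres <- data.frame(
  a = c("jara", "moreno", "diaz", "powell", "Mckinze", "jenner", "londra"),
  b = c("moreno", "lopez", "Swanson", "jara", "jenner", "londra", "kennedy"),
  stringsAsFactors = FALSE
)

# Values of a that also occur in b
iguales <- nombres$a[nombres$a %in% nombres$b]

# as.matrix() on a plain vector yields a one-column matrix
res <- as.matrix(iguales)
dim(res)
```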

Using dplyr::filter()

nombres %>% 
  filter(a %in% b)

This returns a data.frame (a tibble) with all the rows whose value in a also appears in b.

# A tibble: 4 x 2
a      b      
<chr>  <chr>  
1 jara   moreno 
2 moreno lopez  
3 jenner londra 
4 londra kennedy

To extend the comparison to more than two columns, you would first need to clarify what counts as a match in that case. Is it a value that appears in at least two columns? A value in a that appears in any column other than a? What should happen when a value matches in more than two columns? I suggest editing the question to add an example with more than two columns and the expected result. The minimal example you included is well formulated, but the real case you mention at the end is a different problem.
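Until that is clarified, the following is only a sketch of two possible readings, using made-up data with three columns (the data and the names col1, col2, col3, in_all and in_two_or_more are assumptions for illustration):

```r
# Hypothetical data with three columns; values and column names are made up
df <- data.frame(
  col1 = c("jara", "moreno", "diaz"),
  col2 = c("moreno", "jara", "lopez"),
  col3 = c("jara", "kennedy", "lopez"),
  stringsAsFactors = FALSE
)

# Reading 1: the name must appear in every column
in_all <- Reduce(intersect, df)

# Reading 2: the name must appear in at least two different columns;
# unique() per column keeps within-column repeats from inflating the count
n_cols <- table(unlist(lapply(df, unique)))
in_two_or_more <- names(n_cols[n_cols >= 2])

in_all
in_two_or_more
```

Which of the two is appropriate, if either, depends on the answer to the questions above.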

    
answered by 08.05.2018 at 23:15