Identify which variables in a data set should be converted to factor type in R

1

I need a loop that goes through a data set and identifies the variables that I have to convert to factor type: those that contain both 0 and 1. Some variables contain only 0 or only 1, and I need to leave those numeric, because they are missing the other level and lm() only accepts factors with two or more levels.

For example, we have the following data, which could be read from a database, an Excel file, or a CSV file:

txt <- "c1,c2,c3,c4,c5
1,0,100,1,alto
1,1,101,0,medio
0,0,200,1,alto
1,1,101,1,bajo
"
df <- as.data.frame(read.table(textConnection(txt), sep = ",", header=TRUE))

If we inspect the data.frame, we see that the character columns are converted to factor:

> class(df$c5)
[1] "factor"

However, the numerical values remain as such, for example:

> class(df$c3)
[1] "integer"

I would like to be able to specify which columns to convert to factor.

    
asked by Laura Isaza Echeverri 14.07.2017 at 16:47

3 answers

1

If we have a large set of columns/variables and we want to convert some of them to factor and not others, the simplest way is to build a vector of the columns to convert. Suppose we have the following data.frame:

txt <- "c1,c2,c3,c4,c5,c6
1,0,100,1,alto, 0
1,1,101,0,medio, 0
0,0,200,1,alto, 0
1,1,101,1,bajo, 0
"

df <- as.data.frame(read.table(textConnection(txt), sep = ",", header=TRUE))

Inspecting:

> str(df)
'data.frame':   4 obs. of  6 variables:
 $ c1: int  1 1 0 1
 $ c2: int  0 1 0 1
 $ c3: int  100 101 200 101
 $ c4: int  1 0 1 1
 $ c5: Factor w/ 3 levels "alto","bajo",..: 1 3 1 2
 $ c6: int  0 0 0 0

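Not all variables are factors: by default only the character (string) columns are read as factors. (This default applies to R versions before 4.0.0; from 4.0.0 on, pass stringsAsFactors = TRUE to read.table to get the same result.) Now suppose we want to convert columns 1, 2, 3 and 4 to factor, but not 6, nor 5, which is already a factor. We can do it this way: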

col.to.factor <- c(1,2,3,4)
df[col.to.factor] <- lapply(df[col.to.factor], as.factor)

The result:

> str(df)
'data.frame':   4 obs. of  6 variables:
 $ c1: Factor w/ 2 levels "0","1": 2 2 1 2
 $ c2: Factor w/ 2 levels "0","1": 1 2 1 2
 $ c3: Factor w/ 3 levels "100","101","200": 1 2 3 2
 $ c4: Factor w/ 2 levels "0","1": 2 1 2 2
 $ c5: Factor w/ 3 levels "alto","bajo",..: 1 3 1 2
 $ c6: int  0 0 0 0

We can clearly see that only the column/variable c6 is still an integer and the rest have been converted to factor. A more interesting approach would be, for example, to automatically convert to factor every variable/column that contains only 1 and 0, and leave the rest as they are.

First we generate a logical vector, apply.factor, that tells us which columns contain only 1 and 0:

apply.factor <- sapply(df, function(x) isTRUE(all.equal(levels(as.factor(x)),as.vector(as.factor(c("0", "1"))))))
> apply.factor
   c1    c2    c3    c4    c5    c6 
 TRUE  TRUE FALSE  TRUE FALSE FALSE 

The key piece is as.vector(as.factor(c("0", "1"))), which builds the set of values we want to check for in a column/variable (it evaluates to the character vector c("0", "1")). It can be replaced with whatever values we need, since the levels of each column are compared against this same vector.
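To check columns against a different set of values, the comparison can be wrapped in a small helper. This is only a sketch: has_exact_values is a hypothetical name (not part of base R), and it compares the sorted distinct values of a column, as text, with the expected ones. The data is a reduced copy of the example above.

```r
# Hypothetical helper: TRUE when the distinct non-missing values of x
# are exactly the ones given in `values` (compared as text).
has_exact_values <- function(x, values) {
  observed <- sort(unique(as.character(x[!is.na(x)])))
  identical(observed, sort(as.character(values)))
}

df <- read.table(text = "c1,c3,c5,c6
1,100,alto,0
1,101,medio,0
0,200,alto,0
1,101,bajo,0", sep = ",", header = TRUE)

# Extra arguments to sapply() are passed on to the helper.
apply.factor <- sapply(df, has_exact_values, values = c(0, 1))
print(apply.factor)
```

Here only c1 meets the criterion; c6 does not, because it contains 0 but never 1.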

Then, in col.to.factor, we build the vector with the indices of the columns we are going to convert (the columns that met our criterion):

col.to.factor <- seq(length(apply.factor))[apply.factor]
> col.to.factor
[1] 1 2 4
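As a side note, base R's which() returns the positions of the TRUE elements of a logical vector, so it can stand in for the seq(...)[...] construction. A minimal sketch, with apply.factor typed in by hand to match the output shown earlier:

```r
# Logical vector as produced by the sapply() step (entered manually here).
apply.factor <- c(c1 = TRUE, c2 = TRUE, c3 = FALSE,
                  c4 = TRUE, c5 = FALSE, c6 = FALSE)

# which() gives the indices of the TRUE entries and keeps their names.
col.to.factor <- which(apply.factor)
print(col.to.factor)
```

The result can be used to index the data.frame in the same way as the plain index vector.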

And finally we apply the conversion only to the chosen columns:

df[col.to.factor] <- lapply(df[col.to.factor], as.factor)

Summing up everything:

> apply.factor <- sapply(df, function(x) isTRUE(all.equal(levels(as.factor(x)),as.vector(as.factor(c("0", "1"))))))
> col.to.factor <- seq(length(apply.factor))[apply.factor]
> df[col.to.factor] <- lapply(df[col.to.factor], as.factor)
> str(df)
'data.frame':   4 obs. of  6 variables:
 $ c1: Factor w/ 2 levels "0","1": 2 2 1 2
 $ c2: Factor w/ 2 levels "0","1": 1 2 1 2
 $ c3: int  100 101 200 101
 $ c4: Factor w/ 2 levels "0","1": 2 1 2 2
 $ c5: Factor w/ 3 levels "alto","bajo",..: 1 3 1 2
 $ c6: int  0 0 0 0

We see, then, that only the columns we wanted have been converted to factor.
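If this has to be repeated for several data sets, the steps can be collected in one function. The following is a sketch under assumptions, not a definitive implementation: binary_to_factor is a hypothetical name, and the criterion used here is "numeric column in which both 0 and 1 appear and nothing else does (apart from NA)".

```r
# Hypothetical wrapper: convert to factor every numeric column that
# contains both 0 and 1 and no other value (NA is tolerated).
binary_to_factor <- function(data) {
  is.binary <- vapply(
    data,
    function(x) {
      is.numeric(x) && all(c(0, 1) %in% x) && all(x %in% c(0, 1, NA))
    },
    logical(1)
  )
  data[is.binary] <- lapply(data[is.binary], as.factor)
  data
}

df <- read.table(text = "c1,c2,c3,c4,c5,c6
1,0,100,1,alto,0
1,1,101,0,medio,0
0,0,200,1,alto,0
1,1,101,1,bajo,0", sep = ",", header = TRUE)

df <- binary_to_factor(df)
str(df)
```

With this criterion c6 stays an integer, because it never takes the value 1 and would otherwise become a one-level factor.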

I hope it's useful for you.

    
answered by 17.07.2017 at 17:22
1

With the dplyr package this can be done with the mutate_if() function, which applies a function to every column of a data.frame that meets a given condition.

# Load the library
library(dplyr)

# Use the data that Patricio prepared

txt <- "c1,c2,c3,c4,c5,c6
1,0,100,1,alto, 0
1,1,101,0,medio, 0
0,0,200,1,alto, 0
1,1,101,1,bajo, 0"
df <- as.data.frame(read.table(textConnection(txt), sep = ",", header=TRUE))


# The following line does all the work:

mutate_if(df, df[1,]==1 | df[1,]==0, as.factor) 

mutate_if() has three arguments: the data, the condition and the function that we are going to apply when the condition is met. In this case:

  • the data is df
  • the condition is that the value in the first row of df equals 1 or equals 0; the | operator is the logical OR that combines the two tests.

    Caution: the result of the evaluation depends only on the values in the first row, so there can be ambiguities. Say you have a numeric variable that you do NOT want to convert to a factor and by chance its first value is 0 or 1: in that case it would also become a factor.

  • the function is as.factor, which coerces the column to factor type.
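One way to avoid the first-row ambiguity mentioned in the caution is to pass mutate_if() a predicate function that looks at the whole column. This is a sketch, assuming dplyr is installed; only_zero_one is a hypothetical helper name.

```r
library(dplyr)

df <- read.table(text = "c1,c2,c3,c4,c5,c6
1,0,100,1,alto,0
1,1,101,0,medio,0
0,0,200,1,alto,0
1,1,101,1,bajo,0", sep = ",", header = TRUE)

# Hypothetical predicate: TRUE for numeric columns whose values are all 0 or 1.
only_zero_one <- function(x) is.numeric(x) && all(x %in% c(0, 1))

# mutate_if() returns a new data frame, so the result is assigned.
df2 <- mutate_if(df, only_zero_one, as.factor)
str(df2)
```

With this predicate c6, which contains only zeros, is converted as well; to leave such columns numeric, the predicate could also require that both values appear, for example by adding all(c(0, 1) %in% x).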
answered by 17.07.2017 at 18:21
1

The best option would be to import the data with the correct types directly. The readr package is very useful for this; it is part of the tidyverse, a set of very good packages that I recommend you explore.

readr provides a series of functions whose names start with read_ and that import data as tibbles. A tibble is another data type, very similar to the data.frame; it can be used in exactly the same way but has some advantages. More info: https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html

The interesting thing about these functions is that they guess the type of each column from its first observations and do not assume the factor type when there is text. They also allow the type of each column to be specified, in the following way:

library(readr)

txt <- "c1,c2,c3,c4,c5
1,0,100,1,alto
1,1,101,0,medio
0,0,200,1,alto
1,1,101,1,bajo
"
df <- read_delim(file = txt,
                 delim = ',',
                 col_types = cols(c5 = col_factor(levels = NULL))
                 )

In this case we are only converting column c5 to a factor, and by setting levels = NULL the factor levels are taken from the unique values of c5, in the order in which they appear.

If we inspect df, we see that only c5 is a factor. The output looks like this because df is a tibble and not a plain data.frame:

> df
# A tibble: 4 x 5
     c1    c2    c3    c4     c5
  <int> <int> <int> <int> <fctr>
1     1     0   100     1   alto
2     1     1   101     0  medio
3     0     0   200     1   alto
4     1     1   101     1   bajo

> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4 obs. of  5 variables:
 $ c1: int  1 1 0 1
 $ c2: int  0 1 0 1
 $ c3: int  100 101 200 101
 $ c4: int  1 0 1 1
 $ c5: Factor w/ 3 levels "alto","medio",..: 1 2 1 3
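For comparison, base R's read.table() can also set column types at import time through its colClasses argument. The sketch below uses the same data and assumes, purely for illustration, that c1, c2, c4 and c5 are the columns wanted as factors; columns not named in colClasses are typed automatically.

```r
txt <- "c1,c2,c3,c4,c5
1,0,100,1,alto
1,1,101,0,medio
0,0,200,1,alto
1,1,101,1,bajo
"

# Named colClasses entries are matched to the column names in the header;
# c3 is not listed, so read.table() infers its type (integer here).
df <- read.table(text = txt, sep = ",", header = TRUE,
                 colClasses = c(c1 = "factor", c2 = "factor",
                                c4 = "factor", c5 = "factor"))
str(df)
```

Unlike col_factor(levels = NULL), factors created this way have their levels in sorted order rather than in order of appearance.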
    
answered by 02.09.2017 at 20:58