Random Forest in R

I am practicing with the well-known Kaggle Titanic challenge in R. This is the code I have so far, but I am stuck because R reports that there are missing values in the object.

# Set the working directory the files are read from
setwd("C:/Users/User/Desktop/Titanic")
# Load the CSV files
Titanic.train <- read.csv(file="train.csv", stringsAsFactors = FALSE, header = TRUE)
Titanic.test <- read.csv(file="test.csv", stringsAsFactors = FALSE, header = TRUE)
# Add a flag column to each table: TRUE for the training set, FALSE for the test set
Titanic.train$iSTrainSet <- TRUE
Titanic.test$iSTrainSet <- FALSE
# Titanic.test has one column fewer, so we create it
Titanic.test$Survived <- NA
# Combine the two objects
Titanic.full <- rbind(Titanic.train, Titanic.test)
# Two rows had no value in the Embarked column, so we assign them 'S'
Titanic.full[Titanic.full$Embarked=="", "Embarked"] <- 'S'
# Compute the median of all ages, ignoring NA values
age.median <- median(Titanic.full$Age, na.rm = TRUE)
# Replace the missing ages with that median
Titanic.full[is.na(Titanic.full$Age), "Age"] <- age.median
# Compute the median of all fares, ignoring NA values
fare.median <- median(Titanic.full$Fare, na.rm = TRUE)
# Replace the missing fares with that median
Titanic.full[is.na(Titanic.full$Fare), "Fare"] <- fare.median

# Categorical casting
Titanic.full$Pclass <- as.factor(Titanic.full$Pclass)
Titanic.full$Sex <- as.factor(Titanic.full$Sex)
Titanic.full$Embarked <- as.factor(Titanic.full$Embarked)
# Split the data back into the training set (TRUE) and the test set (FALSE)
Titanic.train <- Titanic.full[Titanic.full$iSTrainSet==TRUE,]
Titanic.test <- Titanic.full[Titanic.full$iSTrainSet==FALSE,]
# Categorical casting
Titanic.train$Survived <- as.factor(Titanic.train$Survived)
# Define the survival equation and turn it into a formula
Survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
Survived.formula <- as.formula(Survived.equation)
# Install the randomForest package
install.packages("randomForest")
# Load the library
library(randomForest)
# Fatal error
Titanic.model <- randomForest(formula = Survived.formula, data = Titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(Titanic.test) )

Error:

Error in na.fail.default(list(Survived = c(1L, 2L, 2L, 2L, 1L, 1L, 1L, : missing values in object

asked by TGB 13.10.2017 at 18:33

1 answer

The most likely reason is that Titanic.train contains NA values. randomForest gives you a few options for dealing with them. Let's look at some examples:

First, we take a dataset and delete some values on purpose.

library(randomForest)
data(iris)

iris.na <- iris
set.seed(111)

## Delete some values
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA

If we call randomForest with NA values in the predictor variables:

# This raises an error
iris.rf <- randomForest(Species ~ ., iris.na)
Error in na.fail.default(list(Species = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  missing values in object
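
Before choosing a fix, it helps to see where the missing values actually are. A minimal sketch using only base R (colSums, is.na, complete.cases); the data frame is rebuilt here so the snippet runs on its own, and the names na.per.column and rows.with.na are just illustrative:

```r
data(iris)
iris.na <- iris
set.seed(111)
# Drop between 1 and 20 values from each of the four numeric columns
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Number of NA values in each column
na.per.column <- colSums(is.na(iris.na))
print(na.per.column)

# Number of rows with at least one NA
rows.with.na <- sum(!complete.cases(iris.na))
print(rows.with.na)
```

Running the same two checks on Titanic.train should show which of the columns in the formula still contain NA.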

What can we do?

  • Option 1: omit the rows that contain NA with na.action=na.omit

    iris.rf <- randomForest(Species ~ ., iris.na, na.action=na.omit)
    
  • Option 2: impute the NA values with the column median (or the most frequent level, for factors) with na.action=na.roughfix

    iris.rf <- randomForest(Species ~ ., iris.na, na.action=na.roughfix)
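
To see what each of the two options does to the data before the forest is grown, both functions can also be applied directly to the data frame. A sketch, assuming the randomForest package is installed; the data frame is rebuilt so the snippet runs on its own, and iris.omitted and iris.fixed are illustrative names:

```r
library(randomForest)

data(iris)
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Option 1: na.omit drops every row that has at least one NA
iris.omitted <- na.omit(iris.na)
nrow(iris.na)       # 150 rows before
nrow(iris.omitted)  # fewer rows afterwards

# Option 2: na.roughfix keeps all rows; numeric NAs become the column median
# (factor NAs would become the most frequent level)
iris.fixed <- na.roughfix(iris.na)
nrow(iris.fixed)
sum(is.na(iris.fixed))
```

na.omit throws away the complete values in the dropped rows as well, so when many rows are incomplete na.roughfix may be the better starting point.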
    

In your example, you could use:

Titanic.model <- randomForest(formula = Survived.formula, data = Titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(Titanic.test), na.action=na.omit ) 

or:

Titanic.model <- randomForest(formula = Survived.formula, data = Titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(Titanic.test), na.action=na.roughfix) 

You can also use rfImpute() to impute the NA values before fitting the model, which lets you inspect the imputed data.
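
A minimal sketch of the rfImpute() route on the same iris example, assuming the randomForest package is installed. rfImpute() only fills in predictors, so the response (Species here) must not contain NA; the name iris.imputed is illustrative:

```r
library(randomForest)

data(iris)
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Impute the predictor NAs using proximities from a random forest
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)

# The result is a complete data frame with the response in the first column
sum(is.na(iris.imputed))

# Fit the model on the imputed data
iris.rf <- randomForest(Species ~ ., data = iris.imputed)
```

For the Titanic data this would apply to Titanic.train only, since Survived is NA in the test set.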

After modifying my code as you suggested, I get the following:

Titanic.model <- randomForest(formula = Survived.formula, data = Titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(Titanic.test), na.action = na.omit)
Error in randomForest.default(m, y, ...) :
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
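
The "NAs introduced by coercion" warnings come from data.matrix, which suggests (this is an assumption, since the data itself is not shown) that at least one predictor was still a character column rather than a factor or a number when the model was fitted. A hedged sketch of how to check for that, using a small made-up data frame (toy.df and its values are hypothetical) in place of Titanic.train:

```r
# Hypothetical stand-in for Titanic.train with one character predictor
toy.df <- data.frame(
  Survived = factor(c(0, 1, 1, 0)),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, NA, 35),
  stringsAsFactors = FALSE
)

# Class of every column; character columns are the suspects
column.classes <- sapply(toy.df, class)
print(column.classes)

# Convert every character column to a factor
is.char <- sapply(toy.df, is.character)
toy.df[is.char] <- lapply(toy.df[is.char], as.factor)

# NA count per column, to confirm what still needs imputing
colSums(is.na(toy.df))
```

Applying the same checks to Titanic.train, restricted to the columns named in Survived.formula, should show whether the factor conversions were in effect at the time of the call.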


answered 13.10.2017 at 19:23