Linear regression and calculation of R2 for a qualitative vs. a quantitative variable


I want to perform a linear regression on a data set from a landslide inventory. Each event has an area in square meters (the quantitative variable of interest) and a type of movement (the qualitative variable).

My current solution is to generate as many independent variables as there are categories in the qualitative variable, and then code each of them with zeros and ones according to the category each event belongs to. Once this is done, I import the data into Excel and perform the linear regression there. The code I have so far is the following.

import numpy as np
from scipy.interpolate import interp1d
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

nombreFichero = 'muestra.csv'
# sep=None with engine='python' lets pandas detect the delimiter
muestra = pd.read_csv(nombreFichero, header=0, sep=None, engine='python')

# One list of 0/1 indicators per category of 'Subtipo'
Rotacional = []
Traslacional = []
Sin_catalogar = []

def subtipo(lista):
    # Append 1 to the list of the matching category and 0 to the other two
    for i in lista:
        if i == 'Rotacional':
            Rotacional.append(1)
            Traslacional.append(0)
            Sin_catalogar.append(0)
        elif i == 'Traslacional':
            Rotacional.append(0)
            Traslacional.append(1)
            Sin_catalogar.append(0)
        elif i == 'Sin catalogar':
            Rotacional.append(0)
            Traslacional.append(0)
            Sin_catalogar.append(1)

a = muestra['Subtipo'].tolist()
subtipo(a)

# Attach the indicator lists as new columns
muestra['Rotacional'] = pd.Series(Rotacional)
muestra['Traslacional'] = pd.Series(Traslacional)
muestra['Sin catalogar'] = pd.Series(Sin_catalogar)
# Note: this overwrites the input file and also writes the index as a column
muestra.to_csv('muestra.csv')
muestra.head()

Although this gives me what I want, the process is very impractical: the more categories the variable has, the longer the code becomes (for example, the variable Name_c_1 has 11 categories).

Here is the link to the data: link

Is there any way to optimize the code to calculate the R2 between the quantitative variable and the qualitative variable?

Thank you in advance.

    
asked by Juan Pablo Cuevas 17.01.2018 at 01:19

1 answer


What you're trying to do is called one-hot encoding, for future reference.

You do not need to write your own function to split a qualitative variable into several vectors with 1 in the category of interest and 0 in the others.

pandas itself already provides this functionality, since it is a fairly common need.

import pandas as pd
pd.get_dummies(muestra.Clas_slp_1)
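
As a minimal sketch of what this produces, assuming a small made-up frame that reuses the Subtipo column from the question (the values, the Area column name and the name muestra_demo are invented for illustration):

```python
import pandas as pd

# Toy data; the values are invented for illustration only
muestra_demo = pd.DataFrame({
    'Subtipo': ['Rotacional', 'Traslacional', 'Sin catalogar', 'Rotacional'],
    'Area': [120.0, 340.0, 80.0, 150.0],
})

# One 0/1 indicator column per category found in 'Subtipo'
dummies = pd.get_dummies(muestra_demo['Subtipo'], dtype=int)

# Attach the indicator columns to the original frame
muestra_demo = pd.concat([muestra_demo, dummies], axis=1)
print(muestra_demo)
```

This works the same way no matter how many categories the column has, so a variable with 11 categories needs no extra code.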

This guide has multiple examples.
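
For the R2 itself, one possible approach is to fit ordinary least squares on the dummy columns with NumPy instead of exporting to Excel. This is only a sketch under assumptions: the Area column name and all values are made up, and drop_first=True is used so that the dummies are not collinear with the intercept column.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only
df = pd.DataFrame({
    'Subtipo': ['Rotacional', 'Rotacional', 'Traslacional',
                'Traslacional', 'Sin catalogar', 'Sin catalogar'],
    'Area': [100.0, 140.0, 300.0, 340.0, 60.0, 80.0],
})

# Dummy-code the qualitative variable, dropping one category as the baseline
X = pd.get_dummies(df['Subtipo'], drop_first=True, dtype=float)
X.insert(0, 'intercept', 1.0)
y = df['Area'].to_numpy()

# Ordinary least squares fit and fitted values
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
y_hat = X.to_numpy() @ coef

# R2 = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

With a single qualitative predictor the fitted value of each event is simply the mean area of its category, so this R2 is the share of the variance in area explained by the differences between category means.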

    
answered by 03.06.2018 / 23:45