DataFrame cleanup

2

I have a DataFrame with an index and several columns, and some of the columns have empty cells. I do not want to drop those rows, because I would lose too much data; I would rather fill them in. I know that much of the missing information can be found in another column called 'descripcion', but, as the name suggests, that is a free-text column where the user writes a lot. My idea is to scrape the words from each row of that column and place them in whichever column is empty.

The following example shows what I mean:

import pandas as pd
import numpy as np
Mundo = {
    'ciudades': ['San Jose','buenos aires','NaN'],
    'culinaria': ['pescado','NaN','tacos'],
    'precio': ['Nan','$60','$20'],
    'descripcion': ['en la ciudad de san jose comemos mucho carne y su precio es 40', 'en la ciudad de buenos aires comemos mucho pescado y su precio es 60','en la ciudad de mexico df comemos muchos tacos y su precio es 20']
}

df = pd.DataFrame(Mundo)
df
    
asked by Emaa 29.08.2018 at 15:52

1 answer

0

You need to use .apply row by row, across multiple columns. I changed your DataFrame a bit and used both real NaN values (np.nan) and the string 'NaN' (as in your example), because I did not know where you were reading the values from or how pandas had interpreted them.
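
If the missing markers do arrive as strings, one option is to convert them to real missing values up front, so that a single pd.isna check is enough later on. This is only a minimal sketch: it assumes 'NaN' and 'Nan' (the two spellings in the question) are the only string markers in the data, and it leaves out the 'descripcion' column for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ciudades': ['San Jose', 'buenos aires', 'NaN'],
    'culinaria': ['pescado', 'NaN', 'tacos'],
    'precio': ['Nan', '$60', '$20'],
})

# Assumption: these two spellings are the only string markers for missing data.
df = df.replace(['NaN', 'Nan'], np.nan)

# Every former marker is now reported as missing.
print(df.isna())
```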

Inside the function scrape_row you can call any function that computes the value you want from the value you pass in.

import pandas as pd
import numpy as np

def calcular_valor(fuente):
    # Your extraction logic goes here.
    return "nuevo valor"

def is_nan(value):
    # True only for a real float NaN; the string 'NaN' is checked separately.
    return isinstance(value, float) and np.isnan(value)

# Replace each placeholder string below with a call to the function that
# extracts the desired value from "descripcion", for example:
# row['precio'] = calcular_valor(row['descripcion'])

def scrape_row(row):
    # Work on a copy so the row handed over by apply is not mutated.
    row = row.copy()
    if is_nan(row['culinaria']) or row['culinaria'] == 'NaN':
        row['culinaria'] = "calcular nuevo valor culinaria."
    if is_nan(row['ciudades']) or row['ciudades'] == 'NaN':
        row['ciudades'] = "calcular nueva ciudad"
    if is_nan(row['precio']) or row['precio'] == 'NaN':
        row['precio'] = "calcular nuevo precio"

    return row


if __name__ == '__main__':
    Mundo = {
        # np.nan is used because the np.NaN alias was removed in NumPy 2.0.
        'ciudades': ['San Jose', 'buenos aires', np.nan],
        'culinaria': ['pescado', 'NaN', 'tacos'],
        'precio': [np.nan, '$60', '$20'],
        'descripcion': ['en la ciudad de san jose comemos mucho carne y su precio es 40',
                        'en la ciudad de buenos aires comemos mucho pescado y su precio es 60',
                        'en la ciudad de mexico df comemos muchos tacos y su precio es 20']
    }

    df = pd.DataFrame(Mundo)
    # Apply to whole rows (not only the three target columns) so that
    # scrape_row can also read row['descripcion'].
    df = df.apply(scrape_row, axis=1)

    print(df)
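
As a rough idea of what the extraction logic could look like, here is a sketch based on regular expressions. It assumes every description follows the exact phrasing of the sample rows ("en la ciudad de ... comemos mucho(s) ... y su precio es ..."); real free text will need more robust patterns. The names PATRONES and rellenar_fila, and the extra columna parameter on calcular_valor, are hypothetical and only used for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical patterns that only fit the fixed phrasing of the sample rows.
PATRONES = {
    'ciudades': re.compile(r'ciudad de (.+?) comemos'),
    'culinaria': re.compile(r'comemos muchos? (.+?) y su precio'),
    'precio': re.compile(r'precio es (\d+)'),
}


def calcular_valor(fuente, columna):
    """Return the text captured for `columna`, or NaN if nothing matches."""
    match = PATRONES[columna].search(fuente)
    if match is None:
        return np.nan
    valor = match.group(1)
    # Prices in the sample data carry a leading '$'.
    return '$' + valor if columna == 'precio' else valor


def rellenar_fila(row):
    row = row.copy()
    for columna in PATRONES:
        # Fill only cells that are missing (real NaN or the string 'NaN').
        if pd.isna(row[columna]) or row[columna] == 'NaN':
            row[columna] = calcular_valor(row['descripcion'], columna)
    return row


df = pd.DataFrame({
    'ciudades': ['San Jose', 'buenos aires', np.nan],
    'culinaria': ['pescado', 'NaN', 'tacos'],
    'precio': [np.nan, '$60', '$20'],
    'descripcion': [
        'en la ciudad de san jose comemos mucho carne y su precio es 40',
        'en la ciudad de buenos aires comemos mucho pescado y su precio es 60',
        'en la ciudad de mexico df comemos muchos tacos y su precio es 20',
    ],
})

df = df.apply(rellenar_fila, axis=1)
print(df[['ciudades', 'culinaria', 'precio']])
```

Values that are already present are left untouched, so the first row keeps 'pescado' even though its description mentions carne.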

Good luck, R6

    
answered by 29.08.2018 at 18:02