Extract cell data using pandas

3

I have a table in csv format with more than 30 columns with values of 0 per row, but some columns have values. Then I would like to use Pandas to extract those values and store them by Zone # Area # and Ptc # and store the values according to their OBJ. The percentage column has its values separated by commas.

input

OBJ total run  PERCENTAGE             C5    M1     M12  M15 R7A R10 MR7 R9
0   0.59 |0.40| 100.00                 0    0.401   0   0   0   0   0   0
1   0.88 |0.60| 100.00                 0    0.602   0   0   0   0   0   0
2   0.52 |0.51| 100.00                 0    0.511   0   0   0   0   0   0
3   1.05 |0.98| 100.00                 0    0.979   0   0   0   0   0   0
4   0.58 |0.66| 0.088, 0.12, 99.0     0.001 0.656   0   0 0.00096 0 0   0
5   0.55 |0.88| 0.62, 99.37            0    0.878   0   0 0.00556 0 0   0
6   1.04 |0.86| 0.46, 99.53            0    0.851 0 0   0.0040  0   0   0

what I'm looking for

OBJ total   run    Zone1 Area1  Ptc1  Zone2 Area2   Ptc2    Zone3   Area3   Ptc3
0   0.59    0.40    M1  0.401   100     0    0      0         0        0    0
1   0.88    0.60    M1  0.602   100     0    0      0         0        0    0 
2   0.52    0.51    M1  0.511   100     0    0      0         0        0    0
3   1.05    0.98    M1  0.979   100     0    0      0         0        0    0
4   0.58    0.66    C5  0.001   0.088   M1   0.656  99.76    R7A    0.00096 0.12
5   0.55    0.88    M1  0.878   99.37   R7A  0.0055 0.62      0        0    0
6   1.04    0.86    M1  0.851   99.53   M15  0.0040 0.46      0        0    0
    
asked by Jose Vasquez 23.06.2018 в 17:17
source

2 answers

0

I do not understand your question well but to find the rows in a data frame you can use this:

import pandas as pd
dataframe=pd.read_csv('Path de tu archivo')
for indice_fila, fila in dataframe.iterrows():
    print(indice_fila, fila)

Here you print the index of your row and row, and there you can iterate the row like this:

print(fila['nombre de la columna en el dataframe'])

in the end it would be something like that

import pandas as pd
dataframe=pd.read_csv('Path de tu archivo')
for indice_fila, fila in dataframe.iterrows():
    print(indice_fila, fila)
    print(fila['nombre de la columna en el dataframe'])
    
answered by 08.10.2018 в 23:17
0

Since what you want to do is a bit complex, I can not figure out how to achieve it through "vectorized" operations (so to speak), that is, functional programming like map() , assign() , etc. that operates with all the rows of the dataframe at the same time.

It only occurs to me to iterate through the rows and do the conversion "by hand", which is not very efficient and may take time if your dataframe is very large.

Based on the example you provide and assuming that all rows meet the requirement that they contain as many columns with non-zero value (from the fourth column, which is "C5") as comma-separated elements exist in the column "PERCENTAGE", the following code would do what you ask.

I have commented, because it is not obvious to see how it works:

from collections import OrderedDict

# Crearemos una lista en la que cada elemento será una fila del
# dataframe que queremos obtener como resultado. Cada una de esas filas
# será de momento un diccionario (OrderedDict)cuyas claves son los nombres
# de las columnas y los valores los de las correspondientes celdas
new_rows = []

# Iteramos sobre cada fila del dataframe original
for indice, data in df.iterrows():
  # Del campo PERCENTAGE separamos por las comas y convertimos
  # a float cada trozo. El resultado es una tupla con tantos elementos
  # como porcentajes hallados
  percts = tuple(float(p)  for p in data.PERCENTAGE.split(","))

  # El diccionario es de tipo OrderedDict() para que las columnas
  # salgan al final en el orden que me interesa
  extraido = OrderedDict()
  # Insertamos las "columnas" en el orden deseado
  extraido["total"] = data["total"]
  extraido["run"] = data["run"]

  # Contador de "zonas" a crear
  contador = 1
  for c in data.index[4:]:
    # Para cada columna a partir de la cuarta... 
    if data[c]>0:
      # Si esa columna tiene un dato distinto de cero, hemos encontrado
      # una zona. Añadimos al diccionario los datos para esa zona
      extraido["Zone{}".format(contador)] =  c
      extraido["Area{}".format(contador)] = data[c]
      extraido["Pct{}".format(contador)] = percts[contador-1]
      contador += 1       
  # Añadimos este diccionario como nueva "fila" al resultado
  new_rows.append(extraido)

When executing the previous code (being df your input dataframe), we will obtain a list of dictionaries like this:

>>> new_rows
[OrderedDict([('total', 0.59),
              ('run', 0.4),
              ('Zone1', 'M1'),
              ('Area1', 0.401),
              ('Pct1', 100.0)]),
 OrderedDict([('total', 0.88),
              ('run', 0.6),
              ('Zone1', 'M1'),
              ('Area1', 0.602),
              ('Pct1', 100.0)]),
...

This will be converted to the final dataframe, with the precaution of using fillna(0) to fill in zero all non-existent values, and changing the name of the index to be "OBJ" as in the input dataframe:

resultado = pd.DataFrame(new_rows).fillna(0)
resultado.index.name = "OBJ"

The result is:

     total   run Zone1  Area1     Pct1 Zone2    Area2   Pct2 Zone3    Area3  Pct3
OBJ                                                                              
0     0.59  0.40    M1  0.401  100.000     0  0.00000   0.00     0  0.00000   0.0
1     0.88  0.60    M1  0.602  100.000     0  0.00000   0.00     0  0.00000   0.0
2     0.52  0.51    M1  0.511  100.000     0  0.00000   0.00     0  0.00000   0.0
3     1.05  0.98    M1  0.979  100.000     0  0.00000   0.00     0  0.00000   0.0
4     0.58  0.66    C5  0.001    0.088    M1  0.65600   0.12   R7A  0.00096  99.0
5     0.55  0.88    M1  0.878    0.620   R7A  0.00556  99.37     0  0.00000   0.0
6     1.04  0.86    M1  0.851    0.460   R7A  0.00400  99.53     0  0.00000   0.0
    
answered by 09.10.2018 в 09:41