Delete and replace values in python pandas using conditionals

Question

I have the following DataFrame:

prueba = 

     M1    M2    M3    M4
0     1     1     1   NaN
1     2     3     3   NaN
2     3     2     2     1
3     4   NaN     1   NaN
4     1   NaN   NaN   NaN
5     1     3     2     2
6     3     3   NaN     1
7     2     2     3   NaN
8     1     3   NaN     1
9     6     4     5     5

I need to do two things in each row:

If a column is empty (NaN) and a later column has a value, that value should move into the first empty column, leaving NaN behind. In other words, the values should be shifted to the left.

If a value is repeated within a row, it should be kept only in the first column where it appears. For example, if M1 and M2 are equal, the value stays in M1 and M2 becomes NaN; if the value is repeated across several M columns, it stays only in the first one and the others become NaN.

I have tried the following:

For the first task, I tried comparing the columns in pairs. For example, for M2 and M3:

for row in prueba.itertuples(): prueba['M2']= prueba.where((prueba['M2'].isnull() & prueba['M3'].notnull()), prueba['M3'])

but this raises an error.

For the second task (this part works):

prueba.loc[prueba['M1']== prueba['M2'] , 'M2'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M3']== prueba['M4'] , 'M4'] = 'NaN'

I am new to programming and would appreciate any help with these two tasks. Execution time matters, because there is a lot of data.

The processed DataFrame should look like this:

     M1     M2      M3    M4
0     1     NaN     NaN   NaN
1     2     3       NaN   NaN
2     3     2       1     NaN
3     4     1       NaN   NaN
4     1     NaN     NaN   NaN
5     1     3       2     NaN
6     3     1       NaN   NaN
7     2     3       NaN   NaN
8     1     3       NaN   NaN
9     6     4       5     NaN

python python-3.x pandas

asked by Carolina 20.05.2018 at 21:13

2 answers

score 0

I am not sure I understood correctly, but I think the task is this:

In each row of the dataframe:
Remove duplicates, keeping a single instance of each number that appears
Fill the rest of the row with NaN

Although this statement does not match yours word for word, I believe the end result is the same, and it is clearer when expressed this way.

In fact, this suggests another way to compute the result without using Pandas operations, by working on the two-dimensional array underlying the dataframe. The idea is to go through the array row by row and build a set from the elements of each row (sets eliminate duplicates automatically). After this transformation some rows will have only two elements, others four, and so on.

Finally, you can build a new dataframe from all those sets. Since Pandas makes every row the same length when it builds the dataframe, it fills the missing elements with NaN.

The problem with this idea is that sets have no internal order, so the first row, for example, could produce a set with the elements 1, NaN or one with the elements NaN, 1. That is, the order in which elements were added to the set is not preserved, which does not suit us because we want to respect that order when the sets are expanded back into rows.

One solution is the following trick. Instead of a set, we use an OrderedDict(), a dictionary that preserves the order in which keys are added. We use the elements of each row as the keys of that dictionary (the values are irrelevant, so I will use True). Adding a key that is already present does not create a new entry; the existing one is reused. If at the end we take the keys of the resulting dictionary ( .keys() ), we get the distinct numbers of that row in the order in which they were inserted, which is the column order from left to right.
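As a small illustration of the trick on a single row (a sketch only; the list below is row 2 of the example, and OrderedDict.fromkeys is used here as a shortcut for building the keys with a throwaway value):

```python
from collections import OrderedDict

fila = [3, 2, 2, 1]  # row 2 of the example dataframe

# fromkeys() inserts each element as a key; a repeated element
# lands on the key that already exists, so it is not added twice.
unicos = list(OrderedDict.fromkeys(fila))
print(unicos)  # [3, 2, 1]
```

Since Python 3.7 a plain dict also preserves insertion order, so dict.fromkeys(fila) would behave the same way there.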

In short, this is my idea:

import io
import pandas as pd
from collections import OrderedDict

datos = """\
     M1    M2    M3    M4
0     1     1     1   NaN
1     2     3     3   NaN
2     3     2     2     1
3     4   NaN     1   NaN
4     1   NaN   NaN   NaN
5     1     3     2     2
6     3     3   NaN     1
7     2     2     3   NaN
8     1     3   NaN     1
9     6     4     5     5"""

# Read the dataframe in question
df = pd.read_table(io.StringIO(datos), sep=r'\s+')

# Build the list of new rows
r = []
for fila in df.values:
  # Pairs are inserted one by one, so the column order is preserved
  r.append(list(OrderedDict((k, True) for k in fila).keys()))

# Convert the resulting list back into a dataframe
resultado = pd.DataFrame(r, columns=df.columns)

Result:

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  NaN  1.0 NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  NaN  1.0 NaN
7  2.0  3.0  NaN NaN
8  1.0  3.0  NaN NaN
9  6.0  4.0  5.0 NaN

Update

@Carolina points out in a comment that row 3 does not meet the specification: not all the NaN values end up together on the right.

The error arises because the loop that inserts each element into the ordered dictionary also inserts the NaN values. We only want to preserve the original order of the numbers, not of the NaN values. It is therefore enough to leave the NaN values out of the dictionary, that is:

import numpy as np

r = []
for fila in df.values:
  r.append(list(OrderedDict((k, True) for k in fila if not np.isnan(k)).keys()))
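A side note on why the filter uses np.isnan rather than a comparison: NaN is not equal to itself, so an equality test never detects it. The sketch below shows this, and applies the filter to row 3 of the example (the helper name unique_non_nan is made up for this illustration):

```python
from collections import OrderedDict

import numpy as np

# NaN never compares equal to anything, itself included,
# so an equality test cannot be used to detect it.
print(np.nan == np.nan)  # False

def unique_non_nan(values):
    """Distinct non-NaN values of a row, in first-seen order (illustrative helper)."""
    kept = (float(v) for v in values if not np.isnan(v))
    return list(OrderedDict.fromkeys(kept))

fila = np.array([4.0, np.nan, 1.0, np.nan])  # row 3 of the example
print(unique_non_nan(fila))  # [4.0, 1.0]
```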

The problem is that the list r now contains the resulting rows without any NaN, so a DataFrame created from them may end up with fewer columns than the original. This happens when, as in this case, the rightmost columns would contain only NaN.

To fix this second problem, I use the following trick. First I create a dataframe from the data in r. In general this dataframe has N columns, with N <= 4. Next I name those columns by copying the first N names from the original dataframe. Finally I call reindex() on the columns to expand the dataframe to the columns it originally had, which fills the added columns with NaN. That is:

result = pd.DataFrame(r)
result.columns = df.columns[:result.shape[1]]
# Actually, if it is enough for the resulting dataframe to have only
# the columns M1, M2 and M3 (since M4 would be all NaN), we could stop
# here. If you want the result to have the same number of columns, then...

result = result.reindex(columns=df.columns)
print(result)

And now it is correct:

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  1.0  NaN NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  1.0  NaN NaN
7  2.0  3.0  NaN NaN
8  1.0  3.0  NaN NaN
9  6.0  4.0  5.0 NaN
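For convenience, the pieces of this answer can be put together into one script. This is only a sketch under a few assumptions: the dataframe is built directly from a list instead of being parsed from text, OrderedDict.fromkeys replaces the dictionary with True values, and the function name compact_row is invented for the example.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[1, 1, 1, nan], [2, 3, 3, nan], [3, 2, 2, 1], [4, nan, 1, nan],
     [1, nan, nan, nan], [1, 3, 2, 2], [3, 3, nan, 1], [2, 2, 3, nan],
     [1, 3, nan, 1], [6, 4, 5, 5]],
    columns=["M1", "M2", "M3", "M4"],
)

def compact_row(fila):
    """Distinct non-NaN values of one row, in left-to-right order."""
    return list(OrderedDict.fromkeys(v for v in fila if not np.isnan(v)))

# One shortened list per row; pandas pads the short ones with NaN.
r = [compact_row(fila) for fila in df.to_numpy(dtype=float)]
result = pd.DataFrame(r, index=df.index)

# Name the columns that survived, then restore the ones that were lost.
result.columns = df.columns[:result.shape[1]]
result = result.reindex(columns=df.columns)
print(result)
```

The row loop still runs in Python, so on very large data it may be worth timing this against other approaches.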

answered by 21.05.2018 at 20:44

score 1 · Accepted Answer

For the second point, you can use pandas.Series.drop_duplicates with the argument keep="first" to keep only the first occurrence. A Boolean mask built with pandas.Series.duplicated would also work.
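A minimal sketch of both options on a single row (the Series below is row 2 of the question's data; the variable names are only illustrative):

```python
import pandas as pd

row = pd.Series([3.0, 2.0, 2.0, 1.0], index=["M1", "M2", "M3", "M4"])

# Option 1: drop repeated values; the surviving entries keep their labels.
deduped = row.drop_duplicates(keep="first")
print(deduped.index.tolist())  # ['M1', 'M2', 'M4']

# Option 2: keep the row's shape and replace the repeats with NaN.
masked = row.mask(row.duplicated(keep="first"))
print(masked.isna().tolist())  # [False, False, True, False]
```

The first option shortens the row, while the second leaves the NaN in place, so it covers only the second task and the values still have to be shifted left afterwards.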

For the first point, I cannot think of a vectorized approach. It can be done with pandas.DataFrame.apply over the rows ( axis=1 ), calling for each row a Python function that uses the pandas.Series.dropna method to build the new row.

import io
import pandas as pd
import numpy as np


data = io.StringIO('''\
M1,M2,M3,M4
1,1,1,NaN
2,3,3,NaN
3,2,2,1
4,NaN,1,NaN
1,NaN,NaN,NaN
1,3,2,2
3,3,NaN,1
2,2,3,NaN
3,3,NaN,1
6,4,5,5
''')

df = pd.read_csv(data, dtype="f")

With the above we obtain a DataFrame that allows us to reproduce your example:

>>> df

    M1   M2   M3   M4
0  1.0  1.0  1.0  NaN
1  2.0  3.0  3.0  NaN
2  3.0  2.0  2.0  1.0
3  4.0  NaN  1.0  NaN
4  1.0  NaN  NaN  NaN
5  1.0  3.0  2.0  2.0
6  3.0  3.0  NaN  1.0
7  2.0  2.0  3.0  NaN
8  3.0  3.0  NaN  1.0
9  6.0  4.0  5.0  5.0

Now let's apply the idea explained before:

res = df.apply(lambda row: pd.Series(row.drop_duplicates(keep="first")
                                        .dropna()
                                        .values
                                     ),
                axis=1
              )

With this we get something that is quite close:

>>> res

     0    1    2
0  1.0  NaN  NaN
1  2.0  3.0  NaN
2  3.0  2.0  1.0
3  4.0  1.0  NaN
4  1.0  NaN  NaN
5  1.0  3.0  2.0
6  3.0  1.0  NaN
7  2.0  3.0  NaN
8  3.0  1.0  NaN
9  6.0  4.0  5.0

We just need to add the missing columns (those that are entirely NaN) and rename the rest:

p_cols, m_cols = df.columns[:res.shape[1]], df.columns[res.shape[1]:] 
res.columns = p_cols

for col in m_cols:
    res[col] = np.nan

Result:

>>> res

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  1.0  NaN NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  1.0  NaN NaN
7  2.0  3.0  NaN NaN
8  3.0  1.0  NaN NaN
9  6.0  4.0  5.0 NaN
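
Since the question stresses running time, here is one possible NumPy sketch that avoids the per-row Python function. It is not part of the answer above and is offered as an assumption-laden alternative: it uses the data from the question, assumes all columns are numeric, and compares every pair of columns, so its memory use grows with the square of the number of columns (negligible for four).

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[1, 1, 1, nan], [2, 3, 3, nan], [3, 2, 2, 1], [4, nan, 1, nan],
     [1, nan, nan, nan], [1, 3, 2, 2], [3, 3, nan, 1], [2, 2, 3, nan],
     [1, 3, nan, 1], [6, 4, 5, 5]],
    columns=["M1", "M2", "M3", "M4"],
)

a = df.to_numpy(dtype=float, copy=True)

# eq[r, i, j] is True when columns i and j hold the same value in row r.
# NaN == NaN is False, so NaN cells are never flagged as duplicates.
eq = a[:, :, None] == a[:, None, :]

# Keep only the pairs with i < j, then flag column j if any earlier
# column in the same row matches it.
dup = np.triu(eq, k=1).any(axis=1)
a[dup] = np.nan

# A stable sort on "is NaN" pushes the NaN cells to the right
# without changing the relative order of the remaining values.
order = np.argsort(np.isnan(a), axis=1, kind="stable")
compact = np.take_along_axis(a, order, axis=1)

res_vec = pd.DataFrame(compact, index=df.index, columns=df.columns)
print(res_vec)
```

Because the array keeps its original shape throughout, no columns need to be added or renamed at the end.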