I'm not sure I understood correctly, but I think the goal is:
- For each row of the dataframe:
  - Remove duplicates, keeping a single instance of each number that appears
  - Fill the rest of the row with NaN
Although this statement does not match the way you phrased it, I believe the end result is the same, and expressed this way it is clearer.
In fact, this suggests another way to compute the result without relying on Pandas operations: extract the two-dimensional array underlying the dataframe, traverse it row by row, and build a set from the elements of each row (sets automatically eliminate duplicates). After this transformation some rows will have only two elements, others four, and so on.
Finally, you can build a new dataframe from all those sets. Since Pandas makes every row the same length when it builds a dataframe, it will fill the missing elements with NaN.
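To illustrate the padding behaviour just described, here is a minimal sketch. The rows and the names `rows` and `padded` are made up for the example; they are not taken from the question.

```python
import pandas as pd

# Hypothetical rows of unequal length (one, two and three elements)
rows = [[1], [2, 3], [3, 2, 1]]

# Pandas pads the shorter rows with NaN so that all rows have the same length
padded = pd.DataFrame(rows)
print(padded)
```

The shorter rows end up with NaN in the positions they lack, which is exactly the effect we want after removing duplicates.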
The problem with the previous idea is that sets have no internal order, so the first row, for example, could yield a set with the elements 1, NaN or just as well NaN, 1. That is, the order in which the elements were added to the set is not preserved, and this does not suit us because we want to respect that order when they are expanded back into rows.
One solution is the following trick. Instead of a set, we use an OrderedDict(), a dictionary that preserves the order in which its keys are added. We use the elements of each row as the keys of that dictionary (the values are irrelevant, so I will use True). Inserting a repeated key does not create a new entry: it overwrites the existing one, which keeps its original position. If at the end we take the keys of the resulting dictionary (.keys()), we get the ordered set of the numbers in that row, in the order in which they were inserted, which is the order of the columns from left to right.
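The trick can be tried in isolation on a single row. This is only a sketch: the row is invented, and it uses `OrderedDict.fromkeys()`, a shortcut that builds the keys directly instead of the explicit `True` values described above.

```python
from collections import OrderedDict

row = [3, 2, 2, 1]  # hypothetical row with a repeated value

# Only the keys and their insertion order matter; the values are ignored
unique_in_order = list(OrderedDict.fromkeys(row))
print(unique_in_order)  # [3, 2, 1]

# On Python 3.7+ a plain dict preserves insertion order as well
same_with_dict = list(dict.fromkeys(row))
```

The repeated 2 is stored only once, and the first-seen order 3, 2, 1 is kept.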
So, getting to the point, this is my idea:
import io
import pandas as pd
from collections import OrderedDict
datos = """\
M1 M2 M3 M4
0 1 1 1 NaN
1 2 3 3 NaN
2 3 2 2 1
3 4 NaN 1 NaN
4 1 NaN NaN NaN
5 1 3 2 2
6 3 3 NaN 1
7 2 2 3 NaN
8 1 3 NaN 1
9 6 4 5 5"""
# Read the dataframe in question
df = pd.read_table(io.StringIO(datos), sep=r'\s+')
# Build the list of new rows
r = []
for fila in df.values:
    r.append(list(OrderedDict({k: True for k in fila}).keys()))
# Convert the resulting list back into a dataframe
resultado = pd.DataFrame(r, columns=df.columns)
Result:
    M1   M2   M3   M4
0  1.0  NaN  NaN  NaN
1  2.0  3.0  NaN  NaN
2  3.0  2.0  1.0  NaN
3  4.0  NaN  1.0  NaN
4  1.0  NaN  NaN  NaN
5  1.0  3.0  2.0  NaN
6  3.0  NaN  1.0  NaN
7  2.0  3.0  NaN  NaN
8  1.0  3.0  NaN  NaN
9  6.0  4.0  5.0  NaN
Update
@Carolina points out in a comment that row 3 does not meet the specification. Indeed, its NaN values are not all together on the right.
The error comes from the fact that the loop that inserts each element into the ordered dictionary also inserts the NaN values. What we actually want is to preserve the original order of the numbers only, not of the NaN values. Therefore, it is enough not to put those NaN values in the dictionary, that is:
import numpy as np

r = []
for fila in df.values:
    r.append(list(OrderedDict({k: True for k in fila if not np.isnan(k)}).keys()))
The problem is that the list r now contains the resulting rows without any NaN, so when we create a DataFrame from them, the final number of columns may be smaller than the original one. This happens when, as in this case, the rightmost columns would have contained nothing but NaN.
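The shrinking can be reproduced on a tiny case. This is a sketch under assumptions: `df_demo`, `rows` and `shrunk` are invented names and data, and a plain `dict` stands in for the `OrderedDict` (which assumes Python 3.7+).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three columns
df_demo = pd.DataFrame({"M1": [1.0, 2.0], "M2": [1.0, 3.0], "M3": [np.nan, 3.0]})

# Deduplicate each row, skipping NaN, keeping first-seen order
rows = [
    list(dict.fromkeys(v for v in fila if not np.isnan(v)))
    for fila in df_demo.values
]

shrunk = pd.DataFrame(rows)
print(df_demo.shape, shrunk.shape)  # (2, 3) (2, 2)
```

The longest deduplicated row has two elements, so the rebuilt dataframe has only two columns instead of the original three.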
To fix this second problem, I will use the following trick. First I create a dataframe from the data in r. In general this dataframe will have N columns, with N <= 4. Next I name these columns by copying the names from the original dataframe, but only the first N of them. Finally I use reindex() on the columns to expand the dataframe to the number of columns it originally had. This fills the extra columns that have to be added with NaN. That is:
result = pd.DataFrame(r)
result.columns = df.columns[:result.shape[1]]
# Actually, if you are fine with the resulting dataframe having only
# the columns M1, M2 and M3 (since M4 would be all NaN), we could leave it
# like this. If you want the result to keep the same number of columns, then...
result = result.reindex(columns=df.columns)
print(result)
And now the result is correct:
    M1   M2   M3   M4
0  1.0  NaN  NaN  NaN
1  2.0  3.0  NaN  NaN
2  3.0  2.0  1.0  NaN
3  4.0  1.0  NaN  NaN
4  1.0  NaN  NaN  NaN
5  1.0  3.0  2.0  NaN
6  3.0  1.0  NaN  NaN
7  2.0  3.0  NaN  NaN
8  1.0  3.0  NaN  NaN
9  6.0  4.0  5.0  NaN
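If it helps, the steps of the update can be gathered into a single function. This is only a sketch, not a definitive implementation: the names `dedupe_rows` and `demo` are invented for the example, the sample data is a three-row subset in the spirit of the original, and a plain `dict` replaces the `OrderedDict` (assuming Python 3.7+).

```python
import numpy as np
import pandas as pd


def dedupe_rows(df):
    """Keep the first occurrence of each non-NaN value per row, packed to the left."""
    rows = [
        list(dict.fromkeys(v for v in fila if not pd.isna(v)))
        for fila in df.values
    ]
    out = pd.DataFrame(rows, index=df.index)
    # Reuse the first N original column names, then restore the missing columns as NaN
    out.columns = df.columns[:out.shape[1]]
    return out.reindex(columns=df.columns)


# Hypothetical sample data
demo = pd.DataFrame(
    {
        "M1": [3, 4, 1],
        "M2": [2, np.nan, np.nan],
        "M3": [2, 1, np.nan],
        "M4": [1, np.nan, np.nan],
    },
    dtype=float,
)
print(dedupe_rows(demo))
```

Passing `index=df.index` keeps the original row labels, which matters if the dataframe does not use the default 0..n-1 index.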