I have a series of dataframes that execute orders in an automaton. The data is collected from a database that has been filled "in any way" for more than ten years, so there are thousands of records.
Now we want to standardize the values of some records based on the values contained in some columns.
For this I have made an Excel sheet in which each record states what value each column should contain, or has an empty cell if the value in that column does not matter.
Each test consists of a few hundred to a few thousand records, depending on the test.
I use the .iterrows() function to iterate over the records, and for each one I check, column by column, whether all the columns that have a value match those of the sample. If so, I execute the action of changing the associated values.
The problem is that the iteration becomes extremely slow, taking several minutes for each record.
Is there a simpler way to do this check?
Example:
The dataframe with the values to check has the following columns:
+========+===============+===========+==============+=============+
| 'MODO' | 'TIPO'        | 'TAG_OPC' | 'CANAL_MULT' | 'VALOR_OPC' |
+========+===============+===========+==============+=============+
| NaN    | manual        | NaN       | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_FAIL      | NaN       | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_FAIL_READ | NaN       | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_FAIL_READ | KRBT_OK   | NaN          | True        |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_MEDIDA    | NaN       | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_MEDIDA    | TI03      | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_MEDIDA    | TI02      | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_MEDIDA    | TI01      | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
| NaN    | OPC_MEDIDA    | VL1_N     | NaN          | NaN         |
+--------+---------------+-----------+--------------+-------------+
Using the previous dataframe as the set of rules, each line of the dataframe to check is compared against each rule. If the values of its columns coincide with those of the rule (a NaN in the previous table is considered a match regardless of the value of the corresponding column in the line), an action is executed.
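To make the matching rule concrete, here is a minimal sketch of how I understand it for a single rule and a single line. The function name rule_matches and the sample values are made up for illustration; only the column names come from the table above.

```python
import numpy as np
import pandas as pd

def rule_matches(rule: pd.Series, record: pd.Series) -> bool:
    """Return True if every non-NaN cell of `rule` equals the same cell of `record`."""
    required = rule.dropna()  # NaN cells are wildcards, so they are ignored
    return all(record[col] == value for col, value in required.items())

# Hypothetical rule (one row of the table above) and two hypothetical lines
rule = pd.Series({'MODO': np.nan, 'TIPO': 'OPC_MEDIDA', 'TAG_OPC': 'TI03',
                  'CANAL_MULT': np.nan, 'VALOR_OPC': np.nan})
record_a = pd.Series({'MODO': 'auto', 'TIPO': 'OPC_MEDIDA', 'TAG_OPC': 'TI03',
                      'CANAL_MULT': 2, 'VALOR_OPC': False})
record_b = pd.Series({'MODO': 'auto', 'TIPO': 'OPC_MEDIDA', 'TAG_OPC': 'TI01',
                      'CANAL_MULT': 2, 'VALOR_OPC': False})

print(rule_matches(rule, record_a))  # only TIPO and TAG_OPC are compared
print(rule_matches(rule, record_b))  # TAG_OPC differs
```

Here record_a satisfies the rule because only TIPO and TAG_OPC are checked, while record_b does not because its TAG_OPC differs.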
I currently have this:
import pandas as pd

def parsea(conditions_list: list, file_to_parse: pd.DataFrame):
    for index, row in file_to_parse.iterrows():
        print('working on line', index, 'of file_to_parse')
        all_ok = True
        while all_ok:
            for lista in conditions_list:
                for condition in lista['condiciones'].keys():
                    if lista['condiciones'][condition] == file_to_parse.at[index, condition]:
                        print('Condition %s is true on line %s' % (condition, index))
                        # actions to execute
                    else:
                        all_ok = False
            # force the while loop to end after a single pass
            all_ok = False
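For comparison, this is a rough sketch of the kind of thing I would like to end up with: one boolean mask per rule, built with whole-column comparisons instead of visiting every line in Python. It is untested against my real data. The helper name matching_mask, the ESTADO column and the value 'standardized' are placeholders that do not exist in my files, and the sample frames are invented.

```python
import numpy as np
import pandas as pd

def matching_mask(rule: pd.Series, data: pd.DataFrame) -> pd.Series:
    """Boolean mask of the rows of `data` that satisfy every non-NaN cell of `rule`."""
    mask = pd.Series(True, index=data.index)
    for col, value in rule.dropna().items():  # NaN cells impose no condition
        mask &= data[col] == value            # whole-column comparison
    return mask

# Invented sample data; ESTADO is a placeholder column for the "action"
data = pd.DataFrame({
    'TIPO': ['manual', 'OPC_MEDIDA', 'OPC_MEDIDA'],
    'TAG_OPC': [np.nan, 'TI03', 'TI01'],
    'ESTADO': ['', '', ''],
})
rules = pd.DataFrame({
    'TIPO': ['manual', 'OPC_MEDIDA'],
    'TAG_OPC': [np.nan, 'TI03'],
})

# The loop runs once per rule; the data rows are handled by pandas in bulk
for _, rule in rules.iterrows():
    mask = matching_mask(rule, data)
    data.loc[mask, 'ESTADO'] = 'standardized'  # placeholder action

print(data)
```

The loop still uses iterrows, but only over the rules, and each comparison covers every line of the data at once.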