You can make a groupby()
by the names of all the columns you want to consider when viewing two rows as equal. For example, you can consider all of them, except packet_length
. Once grouped, you'll keep the packet_length
column to apply the sum
function on it.
The result would be a series with the sums of packet_lenght
, but indexed hierarchically (multi-index) according to the groups formed. You can do a .reset_index()
to "undo" the hierarchical index, converting it back into columns like the original dataframe had.
Code
We read the dataframe
# Comienzo cargando el dataframe de ejemplo, que fue dado en la pregunta
# como cadena, por lo que lo leeré usando StringIO y .read_table
data = """ip_src ip_dst packet_length src_port dst_port comm_type
0 192.168.0.7 52.31.84.22 280 46306.0 443.0 10
1 192.168.0.7 192.168.0.17 105 33105.0 9999.0 00
2 192.168.0.7 192.168.0.17 105 33105.0 9999.0 00
3 192.168.0.7 192.168.0.17 105 33105.0 9999.0 00
4 192.168.0.17 192.168.0.7 127 9999.0 33105.0 00
5 192.168.0.17 192.168.0.7 127 9999.0 33105.0 00
6 192.168.0.17 192.168.0.7 127 9999.0 33105.0 00
7 192.168.0.7 34.192.110.9 583 45914.0 443.0 01
8 52.31.84.22 192.168.0.7 1066 443.0 46306.0 10
9 192.168.0.7 52.31.84.22 73 46306.0 443.0 10
10 34.192.110.9 192.168.0.7 1032 443.0 45914.0 01
11 192.168.0.7 34.192.110.9 73 45914.0 443.0 01
12 192.168.0.7 192.168.0.17 120 33110.0 9999.0 00
13 192.168.0.7 192.168.0.17 120 33110.0 9999.0 00
14 192.168.0.7 192.168.0.17 120 33110.0 9999.0 00"""
import pandas as pd
import io
df = pd.read_table(io.StringIO(data), sep=r'\s+', dtype={"comm_type": str})
# El separador r'\s+' representa una secuencia de espacios o tabuladores
# Mediante dtype fuerzo a que la columna comm_type se evalúe como string
# en lugar de entero
We apply the operation described:
result = (df.groupby(("ip_src", "ip_dst", "src_port", "dst_port", "comm_type")) # Agrupar
.packet_length # Quedarse con la columna packet_length
.apply(sum) # Calcular su suma
.reset_index()) # Deshacer índice jerárquico
The result is:
ip_src ip_dst src_port dst_port comm_type packet_length
0 192.168.0.17 192.168.0.7 9999.0 33105.0 00 381
1 192.168.0.7 192.168.0.17 33105.0 9999.0 00 315
2 192.168.0.7 192.168.0.17 33110.0 9999.0 00 360
3 192.168.0.7 34.192.110.9 45914.0 443.0 01 656
4 192.168.0.7 52.31.84.22 46306.0 443.0 10 353
5 34.192.110.9 192.168.0.7 443.0 45914.0 01 1032
6 52.31.84.22 192.168.0.7 443.0 46306.0 10 1066
Second version
The previous method would group the rows with the same source IP, destination, ports and type of communication, but without taking into account the packet_lenght
, that is to say, those with different packet_length
would also be part of the same group. such that the other attributes match.
If what you want is to only group the ones with the same all fields (including packet_length
) you can pass% column% of the list of columns. In this case we have the problem that the final dataframe would have two columns groupby()
, one with the original value, and another with the sum. Pandas does not allow two columns of the same name (besides it would be very confusing), but the problem is fixed if we compute two aggregation functions, one that adds the packet_lenght
and another that counts how many there is in each group (which would be the number of repetitions within each group).
The following code does that:
result = (df.groupby(list(df.columns)) # Agrupar filas iguales
.agg({'packet_length': ['count', 'sum']}) # Computar sobre la columna packet_length la cuenta y la suma
.reset_index()) # Deshacer índice jerárquico
Result:
ip_src ip_dst packet_length src_port dst_port comm_type packet_length
count sum
0 192.168.0.17 192.168.0.7 127 9999.0 33105.0 00 3 381
1 192.168.0.7 192.168.0.17 105 33105.0 9999.0 00 3 315
2 192.168.0.7 192.168.0.17 120 33110.0 9999.0 00 3 360
3 192.168.0.7 34.192.110.9 73 45914.0 443.0 01 1 73
4 192.168.0.7 34.192.110.9 583 45914.0 443.0 01 1 583
5 192.168.0.7 52.31.84.22 73 46306.0 443.0 10 1 73
6 192.168.0.7 52.31.84.22 280 46306.0 443.0 10 1 280
7 34.192.110.9 192.168.0.7 1032 443.0 45914.0 01 1 1032
8 52.31.84.22 192.168.0.7 1066 443.0 46306.0 10 1 1066
Update
I have noticed a problem with the previous solution, and that is that the grouping occurs regardless of whether the rows are contiguous or not .
For example, in version 1 of the solution rows 0 and 9 are grouped because they have the same source IP, destination, ports and communication type, although they are not contiguous in the dataframe. In solution 2 they are not grouped because they do not have the same value of packet_lengths
, but if they had it they would also be grouped.
We see another side effect, and that is that in the result the groups appear arranged in a different way to how they appeared in the original sequence (apparently it puts the groups with the highest number of repetitions first).
If what we want is that only the identical and contiguous rows are grouped together, and that the "temporary" order of the series is preserved, the solution could be the following. I do not explain it because it would be too long, but I can add an explanation later if someone deems it necessary.
result = (df.groupby((df!=df.shift()).cumsum().to_records(index=False)) # Agrupación "mágica"
.agg({'ip_src': 'last', # Cómputo aplicable a cada columna
'ip_dst': 'last', # En casi todas uso 'last' que toma el
'packet_length': ['count', 'sum'], # último valor que aparezca en el grupo
'src_port': 'last', # excepto para packet_lenght que
'dst_port': 'last', # computo la cuenta y suma
'comm_type': 'last'
})
.reset_index(drop=True)) # Deshacer el índice jerárquico
# Renombrar columnas para eliminar multi-niveles introducidos por .agg
result.columns = ['ip_src', 'ip_dst',
'count', 'packet_length_sum',
'src_port', 'dst_port', 'comm_type']
New result:
ip_src ip_dst count packet_length_sum src_port dst_port comm_type
0 192.168.0.7 52.31.84.22 1 280 46306.0 443.0 10
1 192.168.0.7 192.168.0.17 3 315 33105.0 9999.0 00
2 192.168.0.17 192.168.0.7 3 381 9999.0 33105.0 00
3 192.168.0.7 34.192.110.9 1 583 45914.0 443.0 01
4 52.31.84.22 192.168.0.7 1 1066 443.0 46306.0 10
5 192.168.0.7 52.31.84.22 1 73 46306.0 443.0 10
6 34.192.110.9 192.168.0.7 1 1032 443.0 45914.0 01
7 192.168.0.7 34.192.110.9 1 73 45914.0 443.0 01
8 192.168.0.7 192.168.0.17 3 360 33110.0 9999.0 00