Group consecutive duplicates in Python

1

I want to group rows of a dataframe, but only consecutive ones. To make it clearer, here is an example:

ip_src        ip_dst  packet_length  src_port  dst_port comm_type
0      192.168.0.7   52.31.84.22            280   46306.0     443.0        10
1      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
2      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
3      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
4     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
5     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
6     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
7      192.168.0.7  34.192.110.9            583   45914.0     443.0        01
8      52.31.84.22   192.168.0.7           1066     443.0   46306.0        10
9      192.168.0.7   52.31.84.22             73   46306.0     443.0        10
10    34.192.110.9   192.168.0.7           1032     443.0   45914.0        01
11     192.168.0.7  34.192.110.9             73   45914.0     443.0        01
12     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00
13     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00
14     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00

In this df, I want to group, for example, rows [1,2,3], [4,5,6] and [12,13,14], so that I can do some calculations with them, for example grupo['packet_length'].sum() (that is, I am left with a single row whose 'packet_length' is the sum of that value over every member of the group).

I hope my question has been understood.

Thank you very much!

    
asked by sooaran 15.05.2018 at 05:13

2 answers

2

You can do a groupby() by the names of all the columns you want to consider when deciding whether two rows are equal. For example, you can use all of them except packet_length . Once grouped, keep the packet_length column and apply the sum function to it.

The result would be a series with the sums of packet_length , but indexed hierarchically (multi-index) according to the groups formed. You can do a .reset_index() to "undo" the hierarchical index, converting it back into columns like the original dataframe had.

Code

We read the dataframe:

# Start by loading the example dataframe, which was given in the question
# as a string, so I will read it using StringIO and .read_table

data = """ip_src        ip_dst  packet_length  src_port  dst_port comm_type
0      192.168.0.7   52.31.84.22            280   46306.0     443.0        10
1      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
2      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
3      192.168.0.7  192.168.0.17            105   33105.0    9999.0        00
4     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
5     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
6     192.168.0.17   192.168.0.7            127    9999.0   33105.0        00
7      192.168.0.7  34.192.110.9            583   45914.0     443.0        01
8      52.31.84.22   192.168.0.7           1066     443.0   46306.0        10
9      192.168.0.7   52.31.84.22             73   46306.0     443.0        10
10    34.192.110.9   192.168.0.7           1032     443.0   45914.0        01
11     192.168.0.7  34.192.110.9             73   45914.0     443.0        01
12     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00
13     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00
14     192.168.0.7  192.168.0.17            120   33110.0    9999.0        00"""

import pandas as pd
import io
df = pd.read_table(io.StringIO(data), sep=r'\s+', dtype={"comm_type": str})
# The separator r'\s+' matches any run of spaces or tabs
# With dtype I force the comm_type column to be parsed as a string
# instead of an integer

We apply the operation described:

result = (df.groupby(["ip_src", "ip_dst", "src_port", "dst_port", "comm_type"]) # Group
          .packet_length    # Keep the packet_length column
          .sum()            # Compute its sum
          .reset_index())   # Undo the hierarchical index

The result is:

         ip_src        ip_dst  src_port  dst_port comm_type  packet_length
0  192.168.0.17   192.168.0.7    9999.0   33105.0        00            381
1   192.168.0.7  192.168.0.17   33105.0    9999.0        00            315
2   192.168.0.7  192.168.0.17   33110.0    9999.0        00            360
3   192.168.0.7  34.192.110.9   45914.0     443.0        01            656
4   192.168.0.7   52.31.84.22   46306.0     443.0        10            353
5  34.192.110.9   192.168.0.7     443.0   45914.0        01           1032
6   52.31.84.22   192.168.0.7     443.0   46306.0        10           1066
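As a side note (my addition, not part of the original answer): the same result can be obtained without reset_index() by passing as_index=False, which keeps the grouping keys as ordinary columns. A minimal sketch with a reduced version of the example data:

```python
import io
import pandas as pd

data = """ip_src ip_dst packet_length src_port dst_port comm_type
192.168.0.7 192.168.0.17 105 33105.0 9999.0 00
192.168.0.7 192.168.0.17 105 33105.0 9999.0 00
192.168.0.17 192.168.0.7 127 9999.0 33105.0 00"""
df = pd.read_table(io.StringIO(data), sep=r'\s+', dtype={"comm_type": str})

# as_index=False keeps the grouping keys as regular columns,
# so no reset_index() is needed afterwards
result = df.groupby(["ip_src", "ip_dst", "src_port", "dst_port", "comm_type"],
                    as_index=False)["packet_length"].sum()
print(result)
```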

Second version

The previous method groups the rows with the same source IP, destination IP, ports and type of communication, but without taking packet_length into account; that is, rows with different packet_length would also end up in the same group, as long as the other attributes match.

If instead you want to group only the rows in which all fields are equal (including packet_length ), you can pass groupby() the list of all the columns. In that case we run into the problem that the final dataframe would have two packet_length columns, one with the original value and another with the sum. Pandas does not allow two columns with the same name (and it would be very confusing anyway), but the problem is solved if we compute two aggregation functions: one that sums packet_length and another that counts how many rows there are in each group (which is the number of repetitions within each group).

The following code does that:

result = (df.groupby(list(df.columns))              # Group equal rows
          .agg({'packet_length': ['count', 'sum']}) # Compute count and sum over packet_length
          .reset_index())                           # Undo the hierarchical index

Result:

         ip_src        ip_dst packet_length src_port dst_port comm_type packet_length      
                                                                                count   sum
0  192.168.0.17   192.168.0.7           127   9999.0  33105.0        00             3   381
1   192.168.0.7  192.168.0.17           105  33105.0   9999.0        00             3   315
2   192.168.0.7  192.168.0.17           120  33110.0   9999.0        00             3   360
3   192.168.0.7  34.192.110.9            73  45914.0    443.0        01             1    73
4   192.168.0.7  34.192.110.9           583  45914.0    443.0        01             1   583
5   192.168.0.7   52.31.84.22            73  46306.0    443.0        10             1    73
6   192.168.0.7   52.31.84.22           280  46306.0    443.0        10             1   280
7  34.192.110.9   192.168.0.7          1032    443.0  45914.0        01             1  1032
8   52.31.84.22   192.168.0.7          1066    443.0  46306.0        10             1  1066
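A small note (my addition, not in the original answer): the two-level header that .agg produces can be flattened by joining the levels with an underscore, which avoids the duplicate packet_length columns in the printout above. A minimal sketch on a hypothetical tiny dataframe:

```python
import pandas as pd

# hypothetical two-level columns like the ones .agg produces
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})
result = df.groupby("a").agg({"b": ["count", "sum"]}).reset_index()

# join the two header levels with '_' and strip the trailing '_'
# from columns whose second level was empty
result.columns = ["_".join(col).rstrip("_") for col in result.columns]
print(result.columns.tolist())  # ['a', 'b_count', 'b_sum']
```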

Update

I have noticed a problem with the previous solutions: the grouping happens regardless of whether the rows are contiguous or not .

For example, in version 1 of the solution, rows 0 and 9 are grouped because they have the same source IP, destination IP, ports and communication type, even though they are not contiguous in the dataframe. In solution 2 they are not grouped because they do not have the same packet_length , but if they did, they would be grouped too.

We also see another side effect: in the result the groups appear in a different order from the original sequence (by default, groupby sorts the result by the group keys).

If what we want is to group only rows that are identical and contiguous, preserving the chronological order of the series, the solution could be the following. I do not explain it in detail because it would take too long, but I can add an explanation later if someone finds it necessary.

result = (df.groupby((df!=df.shift()).cumsum().to_records(index=False)) # "Magic" grouping
          .agg({'ip_src': 'last',                  # Aggregation applied to each column
                'ip_dst': 'last',                  # For almost all of them I use 'last',
                'packet_length': ['count', 'sum'], # which takes the last value appearing
                'src_port': 'last',                # in the group, except for packet_length,
                'dst_port': 'last',                # for which I compute count and sum
                'comm_type': 'last'
               })
          .reset_index(drop=True))  # Discard the group index

# Rename columns to remove the multi-level header introduced by .agg
result.columns = ['ip_src', 'ip_dst', 
                  'count', 'packet_length_sum',
                  'src_port', 'dst_port', 'comm_type']

New result:

         ip_src        ip_dst  count  packet_length_sum  src_port  dst_port comm_type
0   192.168.0.7   52.31.84.22      1                280   46306.0     443.0        10
1   192.168.0.7  192.168.0.17      3                315   33105.0    9999.0        00
2  192.168.0.17   192.168.0.7      3                381    9999.0   33105.0        00
3   192.168.0.7  34.192.110.9      1                583   45914.0     443.0        01
4   52.31.84.22   192.168.0.7      1               1066     443.0   46306.0        10
5   192.168.0.7   52.31.84.22      1                 73   46306.0     443.0        10
6  34.192.110.9   192.168.0.7      1               1032     443.0   45914.0        01
7   192.168.0.7  34.192.110.9      1                 73   45914.0     443.0        01
8   192.168.0.7  192.168.0.17      3                360   33110.0    9999.0        00
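Since the "magic" grouping is left unexplained above, here is a minimal sketch (my addition) of the idea on a single column: comparing each row with the previous one marks where a new run starts, and the cumulative sum of those marks gives each run of consecutive equal values its own group id.

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "b", "b", "a"])

# True wherever the value differs from the previous row
# (the first row compares against NaN, so it is also True)
change = s != s.shift()

# the cumulative sum turns those change points into run ids
run_id = change.cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 2, 3]

# grouping by run_id groups only consecutive equal values:
# the two "a" runs get different ids and stay separate
sizes = s.groupby(run_id).size()
print(sizes.tolist())   # [2, 3, 1]
```

In the answer's code the same trick is applied to all columns at once, so a new group starts whenever any column changes with respect to the previous row.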
    
answered on 15.05.2018 / 09:33
1

Ah, interesting! A follow-up question that came up: starting from the grouping that was originally made, instead of computing the sum of packet_length , what if you need a list of the src_port values?

To get something like this:


         ip_src        ip_dst  src_port  
0  192.168.0.17   192.168.0.7    9999.0 
1   192.168.0.7  192.168.0.17  [ 33105.0   ,33110.0,45914.0 ,46306.0]
5  34.192.110.9   192.168.0.7     443.0 
6   52.31.84.22   192.168.0.7     443.0
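One possible way to get a list per group (a sketch of my own, since this follow-up was not answered in the thread) is to aggregate the chosen column with Python's built-in list; the reduced dataframe below is a hypothetical example, not the original data:

```python
import pandas as pd

# hypothetical reduced example: group by ip_src/ip_dst and
# collect the src_port values of each group into a list
df = pd.DataFrame({
    "ip_src":   ["192.168.0.7", "192.168.0.7", "192.168.0.7"],
    "ip_dst":   ["192.168.0.17", "192.168.0.17", "52.31.84.22"],
    "src_port": [33105.0, 33110.0, 46306.0],
})

result = (df.groupby(["ip_src", "ip_dst"])
          .src_port
          .agg(list)          # collect each group's values into a list
          .reset_index())
print(result)
```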
    
answered by 20.09.2018 at 09:52