How to optimize a search for duplicate byte arrays in C#?


I have the following list of objects:

List<PacketData> packetList = new List<PacketData>();

public class PacketData{
    public DateTime timestamp {get; set;}
    public byte[] data {get; set;}
    public bool isDuplicated {get; set;}
}

From that list I have to eliminate all data duplicates within a given time range, based on timestamp.

To remove the duplicates I did the following (it works, but I feel that the performance is not optimal):

//create a list to hold the result
List<PacketData> newListPackets = new List<PacketData>();

//walk the original list
foreach (var packet in packetList)
{
    //if the packet is not already marked as a duplicate
    if (!packet.isDuplicated)
    {
        //search with LINQ
        packetList.Where(
                p =>
                    //packets not marked as duplicates
                    !p.isDuplicated &&
                    //whose time difference is less than one second
                    Math.Abs((packet.timestamp - p.timestamp).TotalSeconds) < 1 &&
                    //and whose data (byte array) is equal
                    p.data.SequenceEqual(packet.data)
            ).Select(p =>
            {
                //mark the results as duplicates
                p.isDuplicated = true;
                return p;
            }).ToList();

        newListPackets.Add(packet);
    }
}

I have tried other methods, but this is the most efficient so far. With a total of 46,633 items in the list, it takes 6 minutes and 16 seconds to process.

Thank you in advance for your time.

    
asked by Tecnologer 10.03.2017 at 03:16

1 answer


One of the problems hurting your performance is that for each packet in the foreach you run a Where, which amounts to scanning the entire list. It is true that you eventually skip the packets already marked as duplicates, but I assume those are the minority, so roughly speaking, for a list of 46,633 packets you end up doing on the order of 46,633 × 46,633 ≈ 2,174,636,689 iterations.

Another problem is that, in each iteration of the foreach, you are creating a list that you never actually use; it exists only to force the code inside Select() to execute, where you modify the isDuplicated property. This is not only expensive, it is also an abuse of LINQ, which is not designed for modifying data that way.
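Just to illustrate that point: if you did want to keep your approach, the mutation reads much more honestly as a plain loop over the query results, with no throwaway list. This sketch keeps your names and assumes it sits inside your existing foreach (so packet and packetList are the ones from your code above); it fixes the style problem, not the performance one:

        //same search as before, but the marking is an ordinary loop
        //instead of a side effect hidden inside Select()
        var duplicates = packetList
            .Where(p => !p.isDuplicated
                && Math.Abs((packet.timestamp - p.timestamp).TotalSeconds) < 1
                && p.data.SequenceEqual(packet.data))
            .ToList();

        foreach (var p in duplicates)
        {
            p.isDuplicated = true;
        }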

I take this opportunity to mention that although LINQ is fantastic for expressing code in a very compact and readable form, we must be careful not to assume that fewer lines of code automatically mean better performance. In fact, even when used correctly, LINQ typically makes performance somewhat worse. In most cases the difference is negligible and it's worth it for the simpler code.

I propose an algorithm that starts by ordering your list by date. Once the list is sorted, you can take advantage of that to avoid iterating the entire list inside the loop: you can stop the search as soon as you reach a packet that is a second or more in the future, because the ordering guarantees there is nothing left to find after that point. This should greatly reduce both the total number of iterations and the number of byte comparisons.

var newListPackets = new List<PacketData>();
var orderedPacketList = packetList; //.OrderBy(p => p.timestamp).ToList(); -- not needed if the list is already sorted by timestamp.
for (int x = 0; x < orderedPacketList.Count; x++)
{
    var packetX = orderedPacketList[x];
    bool isDuplicate = false;
    for (int y = x + 1; y < orderedPacketList.Count; y++)
    {
        var packetY = orderedPacketList[y];
        if ((packetY.timestamp - packetX.timestamp).TotalSeconds >= 1)
        {
            break;
        }

        if (packetX.data.SequenceEqual(packetY.data))
        {
            isDuplicate = true;
            break;
        }
    }

    if (!isDuplicate)
    {
        newListPackets.Add(packetX);
    }
}

In passing, you'll notice that this algorithm neither uses nor needs the isDuplicated property in your PacketData class.
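If after this change the SequenceEqual calls still dominate, one further idea you could experiment with (a sketch of mine, not something the algorithm above requires, and ComputeDataHash is a hypothetical helper) is to precompute a cheap hash of each data array once, and only run the byte-by-byte comparison when two hashes match:

//hypothetical helper: a cheap content hash, computed once per packet;
//equal arrays always get equal hashes, so SequenceEqual only runs
//on hash matches (real duplicates plus the rare collision)
static int ComputeDataHash(byte[] data)
{
    unchecked
    {
        int hash = 17;
        foreach (var b in data)
            hash = hash * 31 + b;
        return hash;
    }
}

//before the loops:
var hashes = orderedPacketList.Select(p => ComputeDataHash(p.data)).ToArray();

//inside the inner loop, guard the expensive comparison:
if (hashes[x] == hashes[y] && packetX.data.SequenceEqual(packetY.data))
{
    isDuplicate = true;
    break;
}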

Edit

I removed the OrderBy, since in the comment below you mention that the list is already sorted by timestamp.

    
answered 10.03.2017 at 03:59