Problem: 'System.OutOfMemoryException' when loading a flat file


I am trying to load a large flat file (250 MB) into a DataTable, but after loading a considerable number of records (4 million) I get the error 'System.OutOfMemoryException'.

DataTable listaTX = new DataTable("listaTX");
listaTX.Columns.Add("CodDep", typeof(int));
listaTX.Columns.Add("CodMun", typeof(int));
listaTX.Columns.Add("CodZon", typeof(int));
listaTX.Columns.Add("CodPue", typeof(String));
listaTX.Columns.Add("Mesa", typeof(int));
listaTX.Columns.Add("CodJal", typeof(int));
listaTX.Columns.Add("Comunicado", typeof(int));
listaTX.Columns.Add("CodCirc", typeof(int));
listaTX.Columns.Add("CodPar", typeof(int));
listaTX.Columns.Add("CodCan", typeof(int));
listaTX.Columns.Add("Votos", typeof(int));
listaTX.Columns.Add("CodTX", typeof(int));

string linea;

using (FileStream fs = File.Open(ruta, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    while ((linea = sr.ReadLine()) != null)
    {
        DataRow fila = listaTX.NewRow();
        fila[0] = linea.Substring(0, 2);    // CodDep
        fila[1] = linea.Substring(2, 3);    // CodMun
        fila[2] = linea.Substring(5, 2);    // CodZon
        fila[3] = linea.Substring(7, 2);    // CodPue
        fila[4] = linea.Substring(9, 6);    // Mesa
        fila[5] = linea.Substring(15, 2);   // CodJal
        fila[6] = linea.Substring(17, 4);   // Comunicado
        fila[7] = linea.Substring(21, 1);   // CodCirc
        fila[8] = linea.Substring(22, 3);   // CodPar
        fila[9] = linea.Substring(25, 3);   // CodCan
        fila[10] = linea.Substring(28, 8);  // Votos
        fila[11] = linea.Substring(36, 7);  // CodTX
        listaTX.Rows.Add(fila);
    }
}

If I disable the "Prefer 32-bit" option in the project's property page it works for me, but when I ran the test on a 32-bit machine the same error appeared.

Is there a way to optimize the code, or another method that would let me load this large file into a DataTable?

    
asked by Luis Carlos Donado Avella on 11.01.2018 at 22:38

2 answers


To build a lighter structure, you could create a class that internally stores only the string and exposes the different values through properties:

class Registro
{
    readonly string _value;

    private Registro(string value)
    {
        _value = value;
    }

    #region Implicit conversion to and from string

    public static implicit operator string(Registro d)
    {
        return d._value;
    }

    public static implicit operator Registro(string d)
    {
        return new Registro(d);
    }

    #endregion

    #region Properties that return the fields

    public int CodDep => int.Parse(_value.Substring(0, 2));
    public int CodMun => int.Parse(_value.Substring(2, 3));
    public int CodZon => int.Parse(_value.Substring(5, 2));
    public string CodPue => _value.Substring(7, 2);
    public int Mesa => int.Parse(_value.Substring(9, 6));
    public int CodJal => int.Parse(_value.Substring(15, 2));
    public int Comunicado => int.Parse(_value.Substring(17, 4));
    public int CodCirc => int.Parse(_value.Substring(21, 1));
    public int CodPar => int.Parse(_value.Substring(22, 3));
    public int CodCan => int.Parse(_value.Substring(25, 3));
    public int Votos => int.Parse(_value.Substring(28, 8));
    public int CodTX => int.Parse(_value.Substring(36, 7));

    #endregion
}
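As a quick illustration of how the class behaves, a line can be assigned directly thanks to the implicit conversion. The 43-character fixed-width sample line below is made up for the example; it just follows the same offsets the properties parse:

```csharp
// Hypothetical fixed-width line matching the offsets used by Registro
Registro r = "0108301AB0000120700012015003000001200000123";

Console.WriteLine(r.CodDep); // 1
Console.WriteLine(r.CodMun); // 83
Console.WriteLine(r.CodPue); // AB
Console.WriteLine(r.Votos);  // 120
```

Note that nothing is parsed until a property is read, so the per-record cost in memory is just the original string.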

Thus, the file-reading method could return a list of Registro elements, much lighter than a DataTable:

private static IEnumerable<Registro> GetRecords(string ruta)
{
    var listaTx = new List<Registro>();

    using (StreamReader sr = new StreamReader(ruta))
    {
        while (sr.Peek() >= 0)
        {
            Registro linea = sr.ReadLine();
            if (!string.IsNullOrEmpty(linea)) listaTx.Add(linea);
        }
    }
    return listaTx;
}

To retrieve all the records in the file, simply:

var data = GetRecords(rutaFichero);

Although if the objective is to filter this data to generate a new file, it would be better to do the filtering while reading the file; that way you would not need to keep the complete data set in memory, but only the records you are actually going to use.

For this you could create an overload of the reading method that accepts a filtering condition:

private static IEnumerable<Registro> GetRecords(string ruta, Func<Registro, bool> condicion)
{
    var listaTx = new List<Registro>();

    using (StreamReader sr = new StreamReader(ruta))
    {
        while (sr.Peek() >= 0)
        {
            Registro linea = sr.ReadLine();
            if (linea != null && condicion(linea)) { listaTx.Add(linea); }
        }
    }
    return listaTx;
}

In this way, for example, to obtain all the records with CodMun = 83, it would be enough to do:

var data = GetRecords(rutaFichero, r => r.CodMun == 83);

Much faster, and with much lower memory consumption.
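Going one step further, the reading method itself can be turned into an iterator with yield return, so that not even the filtered results have to be materialized in a List all at once. This is a sketch under the same assumptions as the methods above (the name GetRecordsLazy is made up for the example):

```csharp
// Lazily yields only the records matching the condition;
// at no point is the whole file, or even the whole result set, in memory.
private static IEnumerable<Registro> GetRecordsLazy(string ruta, Func<Registro, bool> condicion)
{
    using (StreamReader sr = new StreamReader(ruta))
    {
        string linea;
        while ((linea = sr.ReadLine()) != null)
        {
            Registro registro = linea; // implicit conversion from string
            if (condicion(registro)) yield return registro;
        }
    }
}
```

Bear in mind that the file stays open while the result is being enumerated, so the sequence should be consumed (or copied with ToList) before the file is moved or deleted.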

    
answered on 12.01.2018 at 13:26

You could use an iterator with yield to read line by line and avoid loading the entire file into memory.

EDITED

I will be more specific: earlier I was at work and could not give much of an example. I have now created an example that reads a plain text file with 1,000,000 lines (250 MB) and inserts the contents into a text box.

The code that loads the file completely into memory would be something like this:

string[] lineas = File.ReadAllLines(cPruebaTxt); // reads every line at once
foreach (string line in lineas)
{
    richTextBox1.Text += line + Environment.NewLine;
}

This code uses several GB of memory.

On the other hand, using the yield statement, the code would look like this:

private void LoadFile_Click(object sender, EventArgs e)
{
    using (StreamReader sr = new StreamReader(cPruebaTxt))
    {
        foreach (string line in GetDataLines(sr))
        {
            richTextBox1.Text += line + Environment.NewLine;
        }
    }
}

private IEnumerable<string> GetDataLines(StreamReader sr)
{
    string line;

    while ((line = sr.ReadLine()) != null)
    {
        yield return line;
    }
}

This code never went above 50 MB of memory usage, although it takes longer to run if all we are going to do is read. However, if we process the lines as we read them, the times are roughly equal; and with very large files the first option throws an OutOfMemoryException while the second achieves its objective.
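As a side note, the base class library already ships with a lazy line iterator that behaves like the GetDataLines method above: File.ReadLines returns an IEnumerable&lt;string&gt; that reads the file on demand, unlike File.ReadAllLines, which loads everything at once. Using the same cPruebaTxt and richTextBox1 from the snippets above:

```csharp
// Lines are read one at a time as the loop advances,
// so memory usage stays flat regardless of file size.
foreach (string line in File.ReadLines(cPruebaTxt))
{
    richTextBox1.Text += line + Environment.NewLine;
}
```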

I leave you a small example: link

    
answered on 12.01.2018 at 12:23