optimize reading and inserting records from a .txt to sql in Java

1

good morning companions

I have a code that is responsible for processing a .txt file and inserting it into a SQL database; all this in java using netbeans

my code is as follows:

 String line = null;
 int exito=0;
 int numeroBatch=0;
 int counter = 0;
 DateFormat dateFormat = new SimpleDateFormat("yyyy_MM_dd");
 Date date = new Date();
 Calendar c = Calendar.getInstance(); 
 c.setTime(date); 
 c.add(Calendar.DATE, -3);
 date = c.getTime();
 String fechaBuscar=dateFormat.format(date);

 BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream("Tmp\unzip\"+tempRead+".txt"),"iso-8859-1"));
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
                                cadenaConexion="jdbc:sqlserver://localhost\MSSQLSERVER;databaseName=test";
                                usuarioDB="test";
                                passDB="testing123";
                    Connection con=DriverManager.getConnection(cadenaConexion,usuarioDB,passDB);   
                    Statement stmt=con.createStatement(); 

                    boolean empiezaArchivo=false;

                    while((line = bufferedReader.readLine()) != null) {

                        counter++;
                        if(counter >= 4&&!"EOF".equals(line)&&empiezaArchivo!=false)
                        {
                          String lineaParsea=line;
                          String[] data = lineaParsea.split("\|");
                          System.out.println("generando registro "+(counter-3)+" de archivo "+(tempRead+1)+" para base de datos...");

                          try
                            {
                             con=DriverManager.getConnection(cadenaConexion,usuarioDB,passDB);


                                String query="insert into datos (dato,fecha_bajada) values('"+data[0]+"','"+fechaBuscar+"')"; 


                                stmt.addBatch(query);
                                if(numeroBatch==2000)
                                {
                                    stmt.executeBatch();
                                    numeroBatch=0;

                                    stmt=con.createStatement();
                                }
                                numeroBatch++;

                                exito= 1;
                            }
                            catch(Exception ex)
                            {
                                Logger.getLogger(BajarArchivosYProcesa.class.getName()).log(Level.SEVERE, null, ex);
                                exito= 0;

                            } 


                          if(exito==0)
                          {
                              break;
                          }
                        }
                        else if("EOF".equals(line))
                        {


                            System.out.println("Se termino de procesar el registro del archivo "+(tempRead+1));
                            stmt.executeBatch();
                                    numeroBatch=0;

                                    stmt=con.createStatement();
                            break;
                        }
                        else
                        {
                            //nada... con las primeras lineas no hacemos nada.. 
                            String inicio="DATOQUENECESITO|OTRO|OTRO2";
                            if(line.toLowerCase().contains(inicio.toLowerCase()))
                            {
                                empiezaArchivo=true;
                            }
                        }
                    }

As you will see the txt file I have to walk to get only the data I want (the record comes like this, I DO NOT NEED ANOTHER | ANOTHER and I only take the first data) and then I use a batch of 2000 records to process the insert.

The code as such serves me; but the problem is that it takes a lot, the txt file has more than 15 million records, and a total of 5 txt files ; for each file it takes more less a day.

Is there a better way to process the txt and insert them in the SQL database?

Thanks for your support

    
asked by jorge marquez 23.01.2018 в 17:29
source

1 answer

1

Apparently, what slows down your code is the execution of one insert per line, which results in 15 million inserts executed per file. This could be fixed if instead of doing an insert per line, insert several values in the same insert, as for example, as follows:

BufferedReader br = new BufferedReader(new FileReader(filename));

String line;
StringBuffer sb = new StringBuffer();
Date hoy = new Date();
String hoyString = new SimpleDateFormat("yyyy-MM-dd").format(hoy);
int VALORES_POR_INSERT=2000;
List<String> lInserts = new ArrayList<>(15000000/VALORES_POR_INSERT); //Asi no tiene que estar creciendo mucho

sb.append("INSERT INTO A(id, fecha) VALUES ");
int iteracion = 0;
while((line = br.readLine()) != null) {
    String[] split = line.split("\|");
    sb.append("('")
        .append(split[0])
        .append("', '")
        .append(hoyString)
        .append("'),");

    ++iteracion;

    if(iteracion == VALORES_POR_INSERT) {
        iteracion = 0;
        sb.deleteCharAt(sb.length()-1); //Quitamos la ultima ","

        lInserts.add(sb.toString());
        sb.setLength(0);
        sb.append("INSERT INTO A(id, fecha) VALUES ");
    }
}

I have generated a file of 15 million lines, and this algorithm (which can surely be improved) takes approximately 5 seconds (or what is the same, a temporary improvement of a 17000% approximately) to read it. Using a variation of this, and executing the queries at the end of the reading, you should improve your performance drastically. Another thing to try with different values for N_LINEAS that at the end is the one that says how many lines per insert we have. It is set to 2000 , but it can surely increase a lot, and therefore improve performance.

PD: In this code, I used the lInserts array as if it were stmt.addBatch(query); that is, an array of queries that will then be executed. You have to make the necessary changes to match your problem.

IMPORTANT This code assumes that you can read from line 0 of the file, make the necessary changes for your own configuration.

    
answered by 23.01.2018 / 20:02
source