Parsing a CSV with problematic quotes

0

I have to read the following csv:

,codigo,nom,cognom
,111,michael,salinas
,222,"luis","doh, \”jik"
,333,ram,"Lak""\""""\""""\"" , ""\""“one"

There should be 4 columns, but I have problems with the last row. To read the CSV I am using a matcher:

Matcher m = Pattern.compile("\"([^\"]+?)\"|(?<=,|^)([^,]*)(?=,|$)").matcher(line);

But I cannot get the last line to split correctly; it comes out like this:

  • 333
  • ram
  • "Lak"
  • "\"
  • "\"
  • "\"
  • ","
  • "\"
  • "" one "

Any ideas on how to fix this?

    
asked by Green_Sam 27.04.2017 в 18:44

1 answer

1

Judging by the syntax used in your CSV, embedded quotes are escaped by doubling them ( "" ), and a backslash ( \ ) is not treated as a special character.

To match a quoted field that allows "" inside it, match a run of characters other than " , optionally followed by any number of repetitions of "" plus another run of such characters. That is:

"[^"]*(?:""[^"]*)*"
  • Although there are shorter patterns for this, this one is the most efficient way to do it; it uses a technique known as unrolling the loop.
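
To see how this pattern behaves on its own, here is a minimal sketch that tests it against whole strings with matches(). The class name QuotedFieldDemo, the helper isQuotedField and the sample strings are made up for illustration; they are not part of the question's code.

```java
import java.util.regex.Pattern;

public class QuotedFieldDemo {
    // Opening quote, a run of non-quote characters, then any number of
    // (escaped quote "" + run of non-quote characters), then the closing quote.
    static final Pattern QUOTED = Pattern.compile("\"[^\"]*(?:\"\"[^\"]*)*\"");

    // Hypothetical helper: true only if the WHOLE string is one quoted field.
    static boolean isQuotedField(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"luis\"",                // "luis"
            "\"say \"\"hi\"\", ok\"",  // "say ""hi"", ok"
            "\"\"",                    // ""  (empty quoted field)
            "\"broken \" quote\"",     // "broken " quote"  (lone quote inside)
            "plain"                    // no quotes at all
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isQuotedField(s));
        }
    }
}
```

The first three samples are accepted; the last two are rejected, because a lone " inside the field ends the match early and an unquoted string never matches the opening quote.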

The complete regex, for elements with or without quotes, would be:

,("[^"]*(?:""[^"]*)*"|[^,]+)


Code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String[] csv = new String[] {      // Lines from the question
            ",codigo,nom,cognom",
            ",111,michael,salinas",
            ",222,\"luis\",\"doh, \\”jik\"",
            ",333,ram,\"Lak\"\"\\\"\"\"\"\\\"\"\"\"\\\"\" , \"\"\\\"\"“one\""
        };

        final String regex = ",(\"[^\"]*(?:\"\"[^\"]*)*\"|[^,]+)";
        final Pattern pattern = Pattern.compile(regex);

        for (String line : csv) {  // loop over each line
            System.out.println("Línea: " + line);

            final Matcher m = pattern.matcher(line);

            while (m.find()) {  // loop over each match (each element, without the comma)
                // Print group 1 (the part matched inside the parentheses)
                System.out.println("  Elemento: " + m.group(1));
            }
        }
    }
}

Output

Línea: ,codigo,nom,cognom
  Elemento: codigo
  Elemento: nom
  Elemento: cognom
Línea: ,111,michael,salinas
  Elemento: 111
  Elemento: michael
  Elemento: salinas
Línea: ,222,"luis","doh, \”jik"
  Elemento: 222
  Elemento: "luis"
  Elemento: "doh, \”jik"
Línea: ,333,ram,"Lak""\""""\""""\"" , ""\""“one"
  Elemento: 333
  Elemento: ram
  Elemento: "Lak""\""""\""""\"" , ""\""“one"

Demo on ideone
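
Note that group 1 still includes the surrounding quotes and the doubled "". If you also need the unescaped value, one possible follow-up step is sketched below. The class CsvFields and its methods unquote and split are hypothetical helpers, not part of the code above, and they assume the same format as the question: every field is preceded by a comma and no field is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvFields {
    // Same regex as in the answer: a comma, then a quoted field or a run of non-commas.
    private static final Pattern FIELD =
            Pattern.compile(",(\"[^\"]*(?:\"\"[^\"]*)*\"|[^,]+)");

    // Strips the enclosing quotes (if present) and turns each "" into a single ".
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field; // unquoted field: returned unchanged
    }

    // Splits one line into its unescaped field values.
    static List<String> split(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(unquote(m.group(1)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample line invented for illustration: ,222,"luis","say ""hi"", ok"
        for (String value : split(",222,\"luis\",\"say \"\"hi\"\", ok\"")) {
            System.out.println("[" + value + "]");
        }
    }
}
```

Because the pattern uses [^,]+ for unquoted fields, an empty field between two commas is skipped rather than returned as an empty string; if your data can contain empty fields, that part of the pattern would need to be adapted.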

    
answered by 08.05.2017 / 19:02