Split text on commas, except commas inside quotes, with regular expressions

4

I have a string that I need to split on commas, except when the comma is inside quotes.

My code splits on every comma, including the commas that are inside the quotes.

package testSplit;

import java.util.StringTokenizer;

/**
 * Created by J Michael on 26/04/2017.
 */
public class Test2 {

    public static void main(String[] args) {
        String text = ",10010222,\"The Royal Bank of Scotland, Niederlassung Deutschland\",10105,Berlin";
        StringTokenizer tokens = new StringTokenizer(text, ",|^[\",\"]$", false);
        System.out.println(tokens.countTokens());
        while(tokens.hasMoreTokens()){
            System.out.println(tokens.nextToken());
        }

    }
}

Other examples

To be more precise, I am adding these new examples, where I need 5 columns for each line.

String s7 = ",12070024,Deutsche Bank Privat und Geschäftskunden,01968,\"Senftenberg, NL\"";
--> column0=(empty),
    column1=12070024,
    column2=Deutsche Bank Privat und Geschäftskunden,
    column3=01968,
    column4=Senftenberg, NL
String s8 = ",12070024,Deutsche Bank Privat\" und\" Geschäftskunden,16856,\"Kyrätz, Prägnitz\"";
--> column0=(empty),
    column1=12070024,
    column2=Deutsche Bank Privat" und" Geschäftskunden,
    column3=16856,
    column4=Kyrätz, Prägnitz
    
asked by Michael Salinas Rios 26.04.2017 at 04:15

2 answers

3

The key is not to split the text, but to match each element.

The trick is to prepend a comma to the text; then every element is simply a comma followed by either "[^"]*" or [^,]* .

This is a reliable way to obtain each element, because empty elements at the beginning or end of the text are kept as well.


Regular Expression

,("[^"]*"|[^,]*)
  • , - matches a literal comma.

  • ("[^"]*"|[^,]*) - the parentheses capture whatever matches, so it can be retrieved with matcher.group(1). Inside the group there are two alternatives, separated by | :

      • "[^"]*" - an opening quote, followed by any number of characters that are not quotes, and a closing quote.
      • [^,]* - any number of characters that are not commas.


Code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = ",(\"[^\"]*\"|[^,]*)";
final String text  = ",10010222,\"The Royal Bank of Scotland, Niederlassung Deutschland\",10105,Berlin";

final Pattern pattern = Pattern.compile(regex);
// prepend a comma to the text so that the first element also matches
final Matcher matcher = pattern.matcher("," + text);

int n = 0; // only used to print the element number (optional)

while (matcher.find()) {
    System.out.print  ("Element " + ++n + ": ");
    System.out.println(matcher.group(1));
}

Result

Element 1: 
Element 2: 10010222
Element 3: "The Royal Bank of Scotland, Niederlassung Deutschland"
Element 4: 10105
Element 5: Berlin



Option 2: omit the quotes in the result

If an element is in quotation marks and you want the result without the quotes, add one more capturing group that contains only the text between the quotes. That is, add one more pair of parentheses:

,("([^"]*)"|[^,]*)

In the code, check whether matcher.group(2) is non-null; it is only set when the element was quoted.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = ",(\"([^\"]*)\"|[^,]*)";
final String text  = ",12070024,Deutsche Bank Privat\" und\" Geschäftskunden,16856,\"Kyrätz, Prägnitz\"";

final Pattern pattern = Pattern.compile(regex);
// prepend a comma to the text so that the first element also matches
final Matcher matcher = pattern.matcher("," + text);

String elemento;
int n = 0;

while (matcher.find()) {
    System.out.print("Element " + ++n + ": ");

    if (matcher.group(2) != null) {
        // quoted element: take the text without the quotes
        elemento = matcher.group(2);
    } else {
        // unquoted element: take the whole match
        elemento = matcher.group(1);
    }
    System.out.println(elemento);
}
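
The loop above can also be wrapped in a small helper that returns the columns as a list. The following is only a sketch: the class name CsvLineSplitter and the method splitLine are hypothetical names chosen for illustration, and the inputs are the s7 and s8 examples from the question (with ä written as \u00e4 so the source stays ASCII).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class CsvLineSplitter {
    // Same pattern as Option 2: group(2) is set only for a fully quoted element.
    private static final Pattern FIELD = Pattern.compile(",(\"([^\"]*)\"|[^,]*)");

    public static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        // Prepend a comma so that the first element also matches.
        Matcher matcher = FIELD.matcher("," + line);
        while (matcher.find()) {
            String quoted = matcher.group(2);
            fields.add(quoted != null ? quoted : matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // s7 and s8 from the question; \u00e4 is the letter ä.
        String s7 = ",12070024,Deutsche Bank Privat und Gesch\u00e4ftskunden,01968,\"Senftenberg, NL\"";
        String s8 = ",12070024,Deutsche Bank Privat\" und\" Gesch\u00e4ftskunden,16856,\"Kyr\u00e4tz, Pr\u00e4gnitz\"";
        System.out.println(splitLine(s7).size() + " columns: " + splitLine(s7));
        System.out.println(splitLine(s8).size() + " columns: " + splitLine(s8));
    }
}
```

Both lines should yield 5 columns, with an empty first column. The quotes inside column 2 of s8 are kept, because that element does not start with a quote and is therefore matched by [^,]* .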



Option 3: Allow escaped quotes inside quotes

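To allow quotes escaped with a \ inside a quoted element, exclude the backslash from the characters that may appear between the quotes, and separately allow a backslash followed by any character. The regular expression is ,("[^\\"]*(?:\\.[^\\"]*)*"|[^,]*) and, in a Java string literal, every backslash of the expression has to be written twice:

final String regex = ",(\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"|[^,]*)";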
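
As a minimal sketch of how this option behaves, the example below applies the pattern ,("[^\\"]*(?:\\.[^\\"]*)*"|[^,]*) with each regex backslash doubled for the Java string literal. The class name EscapedQuoteSplitter, the method splitLine and the input line are made up for illustration; they do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not part of the original answer.
public class EscapedQuoteSplitter {
    // Raw regex: ,("[^\\"]*(?:\\.[^\\"]*)*"|[^,]*)
    // In the Java literal, each regex backslash is written as two backslashes.
    private static final Pattern FIELD =
            Pattern.compile(",(\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"|[^,]*)");

    public static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        // Prepend a comma so that the first element also matches.
        Matcher matcher = FIELD.matcher("," + line);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up input, as raw text:  a,"b \"c\", d",e
        String line = "a,\"b \\\"c\\\", d\",e";
        for (String field : splitLine(line)) {
            System.out.println(field);
        }
    }
}
```

With this input the quoted element, including its escaped quotes and its comma, should come back as a single element between a and e. The surrounding quotes and the backslashes are left in place; removing them would be a separate step.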


    
answered by 26.04.2017 / 06:47
2

Use the following regular expression:

(?<=")[^"]*(?=",|"$)|[^,"]+

Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// quoted text (without the quotes) followed by ", or by " at the end, or else a run without commas or quotes
private static final Pattern REGEX_PATTERN = 
        Pattern.compile("(?<=\")[^\"]*(?=\",|\"$)|[^,\"]+");

public static void main(String[] args) {
    String input = ",10010222,\"The Royal Bank of Scotland, Niederlassung Deutschland\",10105,Berlin";
    Matcher matcher = REGEX_PATTERN.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}

Output:

10010222
The Royal Bank of Scotland, Niederlassung Deutschland
10105
Berlin
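
Note that the output has four lines although the input has five columns: the empty first column produces no match, because [^,"]+ needs at least one character. A small sketch to check this, using the same pattern and input (the class name LookaroundCheck and the method matches are hypothetical, added only for this check):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check class, not part of the original answer.
public class LookaroundCheck {
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=\")[^\"]*(?=\",|\"$)|[^,\"]+");

    // Collects every match of the pattern in the input.
    public static List<String> matches(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = REGEX_PATTERN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = ",10010222,\"The Royal Bank of Scotland, Niederlassung Deutschland\",10105,Berlin";
        List<String> fields = matches(input);
        // Prints the number of matches; the empty leading column is not among them.
        System.out.println(fields.size() + " matches: " + fields);
    }
}
```

If empty columns have to be preserved, as in the question's 5-column examples, an approach that matches a comma followed by the element keeps them.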
    
answered by 26.04.2017 at 04:33