Splitting a text into words and joining them back while preserving the whitespace, in Java

1

What is the best way to split a text into words when it can contain runs of consecutive spaces, tabs, and line breaks?

I want to process the resulting array of words and then join them back together, keeping the whitespace exactly as it was in the original text.

Dummy text:

Lorem ipsum dolor  sit amet, consectetur is adipiscing elit.
    
asked by Webserveis 26.06.2016 at 21:11

2 answers

4

One way is to use a regular expression with lookahead and lookbehind:

    String[] list = st.split("((?<=\\s)(?!\\s))|((?<!\\s)(?=\\s))");

This returns an array in which delimiters (runs of whitespace) alternate with words (runs of non-whitespace). Concatenating its elements reproduces the original string.
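The round trip can be checked with a small sketch like the one below. The class name `BlankSplitDemo`, the helper `splitKeepingBlanks`, and the sample text are illustrative and not part of the original answer; it assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;

class BlankSplitDemo {
    // Zero-width delimiter: matches every boundary between whitespace and non-whitespace.
    static final String BOUNDARY = "((?<=\\s)(?!\\s))|((?<!\\s)(?=\\s))";

    // Hypothetical helper: returns alternating word and whitespace tokens.
    static String[] splitKeepingBlanks(String text) {
        return text.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor  sit\tamet,\n\nconsectetur";
        String[] tokens = splitKeepingBlanks(text);

        // Each token is either a word or a run of whitespace.
        System.out.println(Arrays.toString(tokens));

        // Joining with an empty separator should give back the original text.
        String rebuilt = String.join("", tokens);
        System.out.println(rebuilt.equals(text));
    }
}
```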

Explanation: we split on a "zero-width" delimiter that matches every position in the string where there is a transition from whitespace to non-whitespace, or vice versa.

The pattern is the "OR" of these two possibilities: "( )|( )"

The first half, (?<=\s)(?!\s), works as follows. First, ?<= introduces a positive lookbehind whose target is the pattern \s, so it matches when there is a whitespace character immediately "behind" (to the left of) the current position. Then ?! introduces a negative lookahead with the same pattern, which matches when there is NO whitespace character immediately "ahead" (to the right of the current position).

The second half is the reverse: a negative lookbehind followed by a positive lookahead (non-whitespace followed by whitespace).

The inner parentheses are required by the lookahead and lookbehind syntax. The outer ones group the two alternatives of the OR.
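To see what each half contributes, one can split with each half on its own and then with both combined. This is only an illustrative sketch: the class name `LookaroundHalvesDemo`, the constant names, and the sample string are made up for the example.

```java
import java.util.Arrays;

class LookaroundHalvesDemo {
    // First half: whitespace behind, no whitespace ahead (blank-to-word boundary).
    static final String BLANK_TO_WORD = "(?<=\\s)(?!\\s)";
    // Second half: no whitespace behind, whitespace ahead (word-to-blank boundary).
    static final String WORD_TO_BLANK = "(?<!\\s)(?=\\s)";

    public static void main(String[] args) {
        String text = "a  b c";

        // Splits only where a word starts: blanks stay glued to the preceding word.
        System.out.println(Arrays.toString(text.split(BLANK_TO_WORD)));

        // Splits only where a word ends: blanks stay glued to the following word.
        System.out.println(Arrays.toString(text.split(WORD_TO_BLANK)));

        // Both boundaries together: words and blanks become separate tokens.
        System.out.println(Arrays.toString(text.split(BLANK_TO_WORD + "|" + WORD_TO_BLANK)));
    }
}
```

With the first half alone the blanks stay attached to the end of each word, with the second half alone they stay attached to the start, and only the combination isolates them as tokens of their own.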

    
answered by 26.06.2016 / 22:14
0

My solution

To split the text into words while keeping the repeated spaces:

String[] words = str.split(" ");

And to join them back, use the following function:

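// Joins the elements of r, inserting the delimiter d between consecutive elements.
public static String join(String[] r, String d) {
    if (r.length == 0) return "";
    StringBuilder sb = new StringBuilder();
    int i;
    // Append every element except the last, each followed by the delimiter.
    for (i = 0; i < r.length - 1; i++)
        sb.append(r[i]).append(d);
    // The last element is appended without a trailing delimiter.
    return sb.toString() + r[i];
}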

Usage:

String str = join(words, " ");

In the tests I have run, the result matches the original string.
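One case that may be worth testing with this approach is text that ends in spaces: with the default limit, `String.split` discards trailing empty strings, so trailing spaces are not restored by the join. The sketch below compares the default behaviour with a limit of `-1`, which keeps them. The class `SpaceSplitDemo` and the helper `roundTrip` are illustrative names, and `String.join` (Java 8+) stands in for the hand-written `join`.

```java
class SpaceSplitDemo {
    // Hypothetical helper: splits on single spaces with the given limit, then joins again.
    static String roundTrip(String text, int limit) {
        String[] parts = text.split(" ", limit);
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Two spaces inside the text and one trailing space.
        String text = "Lorem  ipsum ";

        // Limit 0 (the default of split(" ")): trailing empty strings are dropped.
        System.out.println("[" + roundTrip(text, 0) + "]");

        // Limit -1: trailing empty strings are kept, so the trailing space survives.
        System.out.println("[" + roundTrip(text, -1) + "]");
    }
}
```

Note that splitting on `" "` treats tabs and line breaks as part of the words rather than as separators.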

Update

Solution based on the answer by @Leonbloy

String str = "Lorem ipsum    dolor sit amet, consectetur adipiscing elit\n Fusce erat mau[ris], pretium sed metus in, efficitur\n\nultima.";

To split the text into words, also taking tabs and line breaks into account:

String[] list = str.split("((?<=\\s)(?!\\s))|((?<!\\s)(?=\\s))");

To join the String array back into a single String:

StringBuilder builder = new StringBuilder();
for(String s : list) {
    builder.append(s);
}
String endStr = builder.toString();

System.out.println(endStr);
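Between the split and the join, the words can be processed while the whitespace tokens are copied through unchanged. The following is a minimal sketch of that idea, not a definitive implementation: the class `WordProcessDemo`, the method `transformWords`, and the word-reversing transformation are assumptions chosen only for illustration.

```java
import java.util.function.UnaryOperator;

class WordProcessDemo {
    // Hypothetical helper: applies wordFn to every word and leaves whitespace runs untouched.
    static String transformWords(String text, UnaryOperator<String> wordFn) {
        String[] tokens = text.split("((?<=\\s)(?!\\s))|((?<!\\s)(?=\\s))");
        StringBuilder out = new StringBuilder(text.length());
        for (String token : tokens) {
            // A token is either a run of whitespace or a word; only words are transformed.
            boolean isBlank = token.matches("\\s+");
            out.append(isBlank ? token : wordFn.apply(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "Lorem ipsum    dolor sit amet,\n\nultima.";
        // Example transformation: reverse the characters of each word.
        String result = transformWords(str, w -> new StringBuilder(w).reverse().toString());
        System.out.println(result);
    }
}
```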
    
answered by 26.06.2016 at 22:21