Extract part of a chain between two specific points

Question

Extract part of a chain between two specific points

Navigation

#1 by (5 votes)

4

Recently I found myself needing to extract all the values that are between two specified points of a chain, in this case everything that is inside the parentheses "()" .

What would be the most optimal way to do this?

string cadena = string.Empty, resultado = string.Empty;

I have an email that has a predefined format, in which only the values that you find between ()

change

Example cadena :

Hola, amigo X, ..........

bla bla bla bla
.......
('A','B','valorX','valorY',N...) //lo que quiero obtener.
.......
mas texto...
....

Se despide, atentamente, Pedro...

Looking for different ways of doing it, solve it using one of the following forms ":

1- Using Split :

resultado = cadena.Split('(', ')')[1];

or

resultado = cadena.Split("()".ToCharArray())[1];

2- With Regular Expressions Regex.Match :

resultado = Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value;

3- With Substring by applying a little math:

int posInicial = cadena.LastIndexOf("(") + 1;
int longitud = cadena.IndexOf(")") - posInicial;

resultado = cadena.Substring(posInicial, longitud);

With each one of those ways of doing it you get the same result:

#resultado 'A','B','valorX','valorY',N...

To sincerity I find it difficult to understand how regular expressions work, I always see them as a bunch of indecipherable hieroglyphic codes ...

Then: Which would be the most optimal or appropriate way to do this?

string c#

asked by J. Rodríguez 11.06.2018 в 16:59

source

1 answer

Duplicate data - Double inner join Current date in input html

score 5 · Accepted Answer

Just make a complexity analysis.

The most efficient algorithm in terms of memory and speed would be the fourth. Basically you have to look at the linear time and memory consumption of each algorithm.

In the first algorithm:

cadena.Split('(', ')')[1];

The string is iterated in linear time, looking for the number of characters given in the split array (passed as parameters in the method) and for each iterates the list up to N , where N is the length of the chain. Now, he will need to run the list and create M temporary variables for each character in Split , and then create a list of values by indexing which access is in constant time O(1) .

As a result you will get O((N * M) + 1) where N is the length of string and M the amount of substrings generated in each operation of Split .

The second algorithm:

cadena.Split("()".ToCharArray())[1];

Basically it is the same procedure as the first algorithm, only that here, it will consume more memory, because you will have to create an array of characters and create a temporary variable and iterate the string that in this case has been "()" .

The third algorithm:

Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value;

It's a double-edged sword. The complexity lies in the length or complexity of the rule, worth the redundancy. This should only be used if the rule is a bit complex, validate emails, addresses, number formats, mentions and hashtags, etc ... For example, if you were not to use Regex to validate mentions or hashtags in a string, you would have to create a giant algorithm and Intervals tree to obtain the indexes where each mention or hashtag is located. To work with strings of massive quantities, a lot of memory trying to get all the substrings that are mentions or hashtags in giant strings. Regular expressions should be used as a validator of complex strings, since they avoid creating a gigantic algorithm. Obviously in this case, it is the one with the most complexity and memory consumption.

For the fourth algorithm:

int posInicial = cadena.LastIndexOf("(") + 1;
int longitud = cadena.IndexOf(")") - posInicial;
resultado = cadena.Substring(posInicial, longitud);

You should iterate twice the length N of the string and then get the result in N , so the complexity would be O((2 * N) + N) .

So in a serious top:

O((2 * N) + N) the fourth algorithm.

O((N * M) + 1) the first algorithm.

O((N * M) + 1) the second algorithm. The first algorithm consumes more memory.

O(?) the fourth algorithm. Regex is the most complicated and the one that consumes more memory. In advance you can know which is the one that has the greatest complexity due to the process involved.

Keep in mind that in your example these times are insignificant (none of them reach 1ms of processing). So if you want to see the result in a better way, you would have to try it with a giant length for the chain). This answer is based on my experience in the algorithm, if someone is willing to document and contradict me or find an error, I am available to discuss it.

You can read the documentation for the analysis of Algorithms Understanding Big O Notation or This link is more complete .