Regex to remove accented vowels and standalone accent marks from a text file

1

I would like to know what regular expression I can use. I have a text file in which I need to replace the accented vowels (Á, É, Í, Ó, Ú) with the same vowel without the accent, and replace standalone accent marks (such as ') with a space or underscore (_), without changing the fact that each one occupies a single character position in the file. I used this definition:

System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex("[^a-zA-Z0-9]");

reg.Replace(str, " ");

However, when it replaces an accented letter it transforms the letter and also adds a space to its right, and it does not handle the standalone accent mark.

    
asked by R Galindo 04.01.2019 at 00:55

2 answers

2

A regular expression is likely to cost more at runtime than this task needs. A simpler option is to use an encoding to replace the accented characters:

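string text = "Estó es uná cádená con tildés";
// ISO-8859-8 has no accented Latin letters, so the encoder's best-fit fallback maps each one to its base letter.
byte[] bytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(text);
// The resulting bytes are plain ASCII, so decoding them as UTF-8 yields the unaccented string.
string cleanText = System.Text.Encoding.UTF8.GetString(bytes);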

The result will be as follows:

  

Esto es una cadena con tildes

You can see a working example on Rextester.
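The encoding approach depends on that code page being available and on its best-fit behavior. As an alternative, here is a minimal sketch based on Unicode normalization, which is a different technique from the one in this answer. The class and method names (AccentStripper, StripAccents) are hypothetical, and the sketch assumes the input uses precomposed characters such as á, so that each accented vowel still occupies exactly one position after stripping. Standalone ' and ´ are replaced with a space, as the question asks.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class AccentStripper
{
    public static string StripAccents(string input)
    {
        var sb = new StringBuilder(input.Length);

        // FormD splits a precomposed letter such as 'á' into 'a' plus a combining accent.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // Drop the combining accent, keeping only the base letter.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Assumption: standalone marks (' and U+00B4 ´) become one space each,
            // so they keep occupying a single position.
            sb.Append(c == '\'' || c == '\u00B4' ? ' ' : c);
        }

        return sb.ToString();
    }
}
```

Calling AccentStripper.StripAccents("Estó es uná cádená con tildés") should give "Esto es una cadena con tildes". Note that this sketch also strips the tilde from ñ, which may not be desirable in Spanish text.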

    
answered by 04.01.2019 / 01:08
0

Another solution is to also capture any whitespace before and after the characters you want to replace. This way you eliminate runs of multiple blanks, and after the replacement you are left with only one.

It may also be a good idea to add a + quantifier to the character class that matches the illegal characters. This way, if two accent marks appear in a row, you will not end up with two adjacent spaces. For example, in: un ´árbol´

Try the following regular expression: \s*[^a-zA-Z0-9]+\s* and replace each match with a single space:

Input:

Estó es uná cádená con tildés
y esto son tildes sueltas: una ' tilde
y otra ´. Y ahora vienen más datos.

Output:

Est es un c den con tild s y esto son tildes sueltas una tilde y otra Y ahora vienen m s datos 
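For reference, the replacement described above could be written in C# roughly as follows. This is a sketch: the class and method names (WhitespaceCollapser, Collapse) are hypothetical, and only Regex.Replace comes from the .NET library. Note that this pattern removes accented vowels instead of mapping them to their base letter, so the length of the text changes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern suggested in this answer.
public static class WhitespaceCollapser
{
    public static string Collapse(string input)
    {
        // \s*            optional whitespace before the illegal characters
        // [^a-zA-Z0-9]+  one or more characters that are not ASCII letters or digits
        // \s*            optional whitespace after them
        // Each whole match is replaced with a single space.
        return Regex.Replace(input, @"\s*[^a-zA-Z0-9]+\s*", " ");
    }
}
```

Because line breaks and punctuation are also outside a-zA-Z0-9, they are collapsed into single spaces as well, which is why the output above ends up on one line.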

Demo

    
answered by 04.01.2019 at 09:15