RegExp that detects a word even when it is written with repeated letters and/or extra spaces


I'm working with Node.js and have a system that checks whether a string contains a certain blocked word. However, the system was easy to dodge. (It replaced all Unicode characters with "", which is very inefficient.)

I would like to use a RegExp if possible (I'm using regex101 to test it). The goal is to detect a blocked word even when the user deliberately writes it in a way meant to evade detection.

Suppose the word fuck is blocked. When the user writes it explicitly, the system must return a positive result. It must also do so when one or more letters are repeated (e.g. fuuck) or when one or more spaces are inserted (e.g. f uck). But it must not return a positive result for words like brainfuck.

The system will run /<regex>/.test(string), and when that returns true, the application will execute some external methods (from an external library).

Thanks in advance.

    
asked by Antonio Roman 03.03.2017 at 19:27

2 answers


Full word. To differentiate fuck from brainfuck, we use \b, which matches at the limits of a whole word (word boundaries).

/\bfuck\b/i
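As a quick check of the word-boundary behavior, here is a minimal sketch that runs this pattern against a few sample strings. The sample strings and variable names are illustrative only.

```javascript
// \b matches between a word character (\w) and a non-word character or a string edge.
const wholeWord = /\bfuck\b/i;

const samples = ["fuck", "what the FUCK!", "brainfuck", "fucking"];
const results = samples.map((text) => wholeWord.test(text));

console.log(results); // [ true, true, false, false ]
```

Note that "fucking" is not matched either, because the trailing \b requires the word to end right after the "k".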


Repeated characters. To match any number of repeated characters, we use the quantifier + (https://developer.mozilla.org/en/docs/Web/JavaScript/Guide/Regular_Expressions#special-plus), which repeats the preceding element 1 or more times. Thus, /f+/ matches one or more "f", and /f+u+/ matches "fffffffffuu". More info: Repetition.

/\bf+u+c+k+\b/i
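A short sketch of how the quantified pattern behaves, with made-up inputs:

```javascript
// Each letter may repeat, but every letter must appear at least once.
const repeated = /\bf+u+c+k+\b/i;

const repeatedResults = {
  fuuuck: repeated.test("fuuuck"),         // true
  FFuucckk: repeated.test("FFuucckk"),     // true
  fck: repeated.test("fck"),               // false: "+" needs at least one "u"
  brainfuuck: repeated.test("brainfuuck")  // false: no word boundary before "f"
};

console.log(repeatedResults);
```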


Intermediate spaces. To allow any number of spaces between the letters, we add a space followed by an asterisk (0 or more times) after each letter.

/\bf+ *u+ *c+ *k+\b/i
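The following sketch, with illustrative inputs, shows what the space-tolerant pattern accepts and where it stops: only literal spaces count as separators here.

```javascript
// " *" allows zero or more literal spaces after each group of letters.
const spaced = /\bf+ *u+ *c+ *k+\b/i;

const spacedResults = [
  spaced.test("f u    c k"), // true
  spaced.test("f  uu ck"),   // true
  spaced.test("f-u-c-k"),    // false: "-" is not a space
  spaced.test("f\tuck")      // false: a tab is not a literal space
];

console.log(spacedResults); // [ true, true, false, false ]
```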


Other intermediate characters. For the solution you're looking for, I think you should allow any non-alphanumeric character between the letters, not just spaces. \W matches any character that is not a "word character", that is, any character except [a-zA-Z0-9_]. This way, it also matches text such as "(f)(u)(c)(k)". More info: Shorthands.

/\bf+\W*u+\W*c+\W*k+\b/i

Demo:

let pruebas = [
    "test",
    "blocked word fuck",
    "the word brainfuck is fine",
    "with spaces f u    c k",
    "repeated characters fuuuucccckkk!!",
    "with symbols F::u--C**K!!!"
  ],

  regex = /\bf+\W*u+\W*c+\W*k+\b/i;

for (let string of pruebas) {
  // template literals need backticks, not single quotes
  console.log(`"${string}" -->`, regex.test(string));
}
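One gap worth checking: since \W excludes the underscore, underscores used as separators slip through. The sketch below shows this, along with one possible adjustment that uses [\W_]* as the separator. That variant is an assumption added for illustration, not part of the pattern above.

```javascript
const original = /\bf+\W*u+\W*c+\W*k+\b/i;
// Hypothetical variant: also accept "_" between letters.
const withUnderscore = /\bf+[\W_]*u+[\W_]*c+[\W_]*k+\b/i;

const underscoreResults = [
  original.test("f_u_c_k"),        // false: "_" is a word character
  withUnderscore.test("f_u_c_k"),  // true
  withUnderscore.test("f.u_c k"),  // true
  withUnderscore.test("brainfuck") // false
];

console.log(underscoreResults); // [ false, true, true, false ]
```

Even with this variant, "_fuck_" is still not matched, because \b does not see a boundary between "_" and a letter.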


Multiple words. In addition, more than one word can be included in the same regular expression by grouping the alternatives with (?:expression1|expression2). For example, to match fuck or ban:

/\b(?:f+\W*u+\W*c+\W*k+|b+\W*a+\W*n+)\b/i
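A brief sketch of the combined pattern on a few invented inputs:

```javascript
// The non-capturing group (?:...|...) lets both words share the same \b anchors.
const blocked = /\b(?:f+\W*u+\W*c+\W*k+|b+\W*a+\W*n+)\b/i;

const blockedResults = [
  blocked.test("you get a b-a-n"), // true
  blocked.test("F u c k"),         // true
  blocked.test("banana"),          // false: no boundary after "ban"
  blocked.test("urban")            // false: no boundary before "ban"
];

console.log(blockedResults); // [ true, true, false, false ]
```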
  

If you have an extremely long list, test before deploying: I don't know the limits on pattern length or on the size of a compiled regex, nor how a very large alternation affects performance.
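One way to run such a test is sketched below. It builds a large alternation from a synthetic word list and times a single call to test(). The helper toPattern, the list size, and the sample text are all assumptions chosen for illustration, and timings will vary by machine and engine.

```javascript
// Hypothetical helper: "ban" -> "b+\W*a+\W*n+" (assumes words made of \w characters only).
const toPattern = (word) => Array.from(word, (ch) => ch + "+").join("\\W*");

// Synthetic list: "word0", "word1", ... plus one real entry at the end.
const words = Array.from({ length: 2000 }, (_, i) => "word" + i).concat("ban");
const bigRegex = new RegExp("\\b(?:" + words.map(toPattern).join("|") + ")\\b", "i");

// Long harmless text with a single obfuscated hit at the very end.
const text = "some harmless text ".repeat(200) + "b-a-n";

const start = Date.now();
const found = bigRegex.test(text);
const elapsedMs = Date.now() - start;

console.log("pattern length:", bigRegex.source.length);
console.log("match:", found, "elapsed ms:", elapsedMs);
```

If the measured time grows too much with your real list, splitting the list into several smaller expressions is one option to compare against.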



Generate the expression by code. An essential point in this solution is to generate the regex dynamically. The following function takes an array of forbidden words and returns a RegExp object built with the pattern described in this answer.

function regexDePalabrasProhibidas(arrListado) {
  let exprProhibida = arrListado.reduce(function(acum, item, index) {
    // join the words with "|"
    return acum + (index ? "|" : "") +
      item.replace(/\w(?=(\w)?)/g, function(letra, tieneSiguiente) {
        // add "+" after each character and "\W*" between characters
        // (backslashes must be doubled inside a string literal)
        return letra + "+" + (tieneSiguiente ? "\\W*" : "");
      });
  }, "");
  // regex with word boundaries and a non-capturing group
  return new RegExp("\\b(?:" + exprProhibida + ")\\b", "i");
}



// --- EXAMPLE ---
let listado = [
    "fuck",
    "ban",
    "word3",
    "word4"
  ],
  regex,
  pruebas = [
    "test",
    "blocked word fuck",
    "the word brainfuck is fine",
    "with spaces f u    c k",
    "repeated characters fuuuucccckkk!!",
    "with symbols F::u--C**K!!!",
    "sentence with (b)(a)(n)"
  ];

regex = regexDePalabrasProhibidas(listado);

// template literals need backticks, not single quotes
document.body.innerHTML = `Final regex: <code>/${regex.source}/${regex.flags}</code>`;
for (let string of pruebas) {
  console.log(`"${string}" -->`, regex.test(string));
}
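Since the question targets Node.js, where document is not available, here is a minimal sketch of how a generated expression could be wired into a message check. buildBlockedRegex, onBlockedWord and moderate are hypothetical names: the first is a compact generator in the spirit of the function above, and the second stands in for the external library call mentioned in the question.

```javascript
// Hypothetical generator. Assumption: every blocked word consists only of
// letters, digits or "_" (other characters would need escaping).
function buildBlockedRegex(words) {
  const alternatives = words.map((word) =>
    Array.from(word, (ch) => ch + "+").join("\\W*")
  );
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b", "i");
}

// Build the expression once at startup, not on every message.
const blockedRegex = buildBlockedRegex(["fuck", "ban"]);

// Hypothetical stand-in for the external methods the application would call.
function onBlockedWord(message) {
  return { allowed: false, message };
}

function moderate(message) {
  return blockedRegex.test(message)
    ? onBlockedWord(message)
    : { allowed: true, message };
}

console.log(blockedRegex.source);
console.log(moderate("F-u-c-k!").allowed);  // false
console.log(moderate("brainfuck").allowed); // true
```

The expression has no g flag, so test() keeps no state between calls and the same RegExp object can be reused safely.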
    
answered 04.03.2017 at 15:15

I think the best way to approach this kind of recognition is a Naive Bayes classifier, which uses features of the text to classify strings as offensive or not. Your application would "learn" to tell them apart from previously identified patterns (machine learning).

I recommend reading about this algorithm and getting into machine learning, as it can be very useful for this kind of problem.

I quote the Wikipedia reference on Machine Learning:

  

Machine learning is the subfield of computer science and branch of artificial intelligence whose objective is to develop techniques that allow computers to learn. More specifically, it is about creating programs capable of generalizing behaviors from information provided in the form of examples. It is, therefore, a process of knowledge induction.
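To make the idea more concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. All function names and the tiny training set are invented for illustration. A real system would need far more labeled data, and probably some text normalization so that obfuscated spellings map to the same tokens.

```javascript
// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Count documents and tokens per label.
function train(examples) {
  const model = { docs: new Map(), tokens: new Map(), totals: new Map(), vocab: new Set(), n: 0 };
  for (const { text, label } of examples) {
    model.n += 1;
    model.docs.set(label, (model.docs.get(label) || 0) + 1);
    if (!model.tokens.has(label)) model.tokens.set(label, new Map());
    const counts = model.tokens.get(label);
    for (const tok of tokenize(text)) {
      counts.set(tok, (counts.get(tok) || 0) + 1);
      model.totals.set(label, (model.totals.get(label) || 0) + 1);
      model.vocab.add(tok);
    }
  }
  return model;
}

// Pick the label with the highest log P(label) + sum of log P(token | label).
function classify(model, text) {
  let best = null;
  let bestScore = -Infinity;
  for (const [label, docCount] of model.docs) {
    let score = Math.log(docCount / model.n);
    const counts = model.tokens.get(label);
    const total = model.totals.get(label) || 0;
    for (const tok of tokenize(text)) {
      // Add-one smoothing so unseen tokens do not zero out the probability.
      score += Math.log(((counts.get(tok) || 0) + 1) / (total + model.vocab.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

// Made-up training data, purely illustrative.
const model = train([
  { text: "you are a fuck", label: "offensive" },
  { text: "fuck off", label: "offensive" },
  { text: "what the fuck", label: "offensive" },
  { text: "have a nice day", label: "clean" },
  { text: "what a nice language", label: "clean" },
  { text: "you are welcome", label: "clean" }
]);

console.log(classify(model, "fuck you"));        // offensive
console.log(classify(model, "have a nice day")); // clean
```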

    
answered 03.03.2017 at 19:42