Python: convert a CamelCase string to a hyphen-separated one

Question


score 4

Hi, I need to convert a CamelCase string into one separated by hyphens. I've been trying regular expressions but I can't get anywhere. The idea is to take a CamelCase string as input:

Input:

'HolaMundoCruel'

Output:

'hola-mundo-cruel'

Thanks

python python-3.x regex

asked by Ricardo D. Quiroga 08.04.2017 at 14:29


2 answers

score 6

Let's match the beginning of each word. In PascalCase there are two kinds of words:

Words that start with an uppercase letter followed by at least one lowercase letter.

r"[A-Z][a-z]"

Here we only need to check that the capital is followed by a lowercase letter; that is all we need to know in order to put a hyphen before the word and lowercase it.

There may also be digits between the two letters, so we allow for them:

r"[A-Z]\d*[a-z]"

Acronyms (consecutive capital letters).

r"[A-Z][A-Z\d]*(?=[A-Z]|$)"

[A-Z][A-Z\d]* matches one uppercase letter followed by any number of uppercase letters or digits.

The match must also be followed by another capital letter or by the end of the text, (?=[A-Z]|$). That way we avoid consuming the first letter of the next word. For example:

It matches HTML in HTMLFormateado.
It also matches HTML in FormatoHTML.
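
As a quick check of the two sub-patterns described above, here is a minimal sketch using re.findall. The variable names palabra and sigla are ours, chosen for illustration, and are not part of the answer.

```python
import re

# Word start: one uppercase letter, optional digits, then a lowercase letter.
palabra = re.compile(r"[A-Z]\d*[a-z]")

# Acronym: uppercase letters or digits, stopping right before the next
# capital letter or at the end of the text.
sigla = re.compile(r"[A-Z][A-Z\d]*(?=[A-Z]|$)")

print(palabra.findall("HTMLFormateado"))  # ['Fo']
print(sigla.findall("HTMLFormateado"))    # ['HTML']
print(sigla.findall("FormatoHTML"))       # ['HTML']
```

Note that the acronym pattern leaves the F of Formateado alone, because the lookahead only peeks at it without consuming it.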

Combining the two expressions into one gives:

r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])"

This expression now matches every case. If we replace each match with r"-\g<0>" (a hyphen followed by the matched text), we get:

>>> import re
>>> re.sub(r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])", r"-\g<0>", "FormatoHTMLConCSS")
'-Formato-HTML-Con-CSS'

Do not insert a hyphen at the beginning of the text

To avoid inserting a hyphen at the beginning, we pass a function as the replacement argument. On each replacement it checks whether match.start() is 0. For the first word (which starts at position 0) we add no hyphen; otherwise we prepend one.

Inside the function we use str.lower() to convert the match to lowercase.

import re

patron = r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])"
pascal = re.compile(patron)

def pascal_kebab(cadena):
    def insertar_separador(match):
        return ("-" if match.start() else "") + match.group().lower()

    return pascal.sub(insertar_separador, cadena)

Final code

Convert from PascalCase to kebab-case.
This uses exactly the same logic as the previous code, written with a lambda.

Because it uses a single regex and does not rely on lookbehinds, this function performs better (30% to 100% faster) than the commonly used functions.

import re

pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")

def pascal_kebab(cadena):
    return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)

Tests:

pruebas = ['VerHTMLDePag', 'Ver2HTMLDePag', 'Ver2HTMLPag2Info', 'HTMLFomatoPag',
           'HTMLConXML',   'HTML5FomatoPag','HTML5ConXML',      'HTML5ConCSS3',
           'HTML',         'VerQ',          'A2BFormato',       'Formato',
           'SFormato'
          ]

for prueba in pruebas:
    print("%-16s => %s" % (prueba, pascal_kebab(prueba)))

Result:

VerHTMLDePag     => ver-html-de-pag
Ver2HTMLDePag    => ver2-html-de-pag
Ver2HTMLPag2Info => ver2-html-pag2-info
HTMLFomatoPag    => html-fomato-pag
HTMLConXML       => html-con-xml
HTML5FomatoPag   => html5-fomato-pag
HTML5ConXML      => html5-con-xml
HTML5ConCSS3     => html5-con-css3
HTML             => html
VerQ             => ver-q
A2BFormato       => a2b-formato
Formato          => formato
SFormato         => s-formato
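
The speed figure quoted above is the answer author's own. To measure it on your own inputs, a rough timing harness along these lines may help. This is only a sketch: the comparison function dos_pasadas (a common two-pass approach), the sample list muestras and the repetition count are assumptions of ours, and the numbers will vary by machine and Python version.

```python
import re
import timeit

pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")

def pascal_kebab(cadena):
    return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)

# Comparison candidate (assumption): the widely used two-pass substitution.
patt_p = re.compile(r"(.)([A-Z][a-z]+)")
patt_f = re.compile(r"([a-z0-9])([A-Z])")

def dos_pasadas(cadena):
    return patt_f.sub(r"\1-\2", patt_p.sub(r"\1-\2", cadena)).lower()

# Hypothetical sample inputs; replace them with strings typical of your data.
muestras = ["HolaMundoCruel", "VerHTMLDePag", "HTML5ConCSS3"]

for funcion in (pascal_kebab, dos_pasadas):
    # Time 10,000 rounds of converting every sample with this function.
    segundos = timeit.timeit(lambda: [funcion(s) for s in muestras], number=10_000)
    print(f"{funcion.__name__:<14} {segundos:.3f} s")
```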


answered on 08.04.2017 at 15:31


score 5 · Accepted Answer

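You can use re.sub to substitute each match (here, a capital letter inside the string together with the character before it) with the same text with a '-' inserted between the two. To get rid of the capitals, use the lower method of the str class:

import re

# Hyphen before a capital that is followed by lowercase letters.
pattP = re.compile(r'(.)([A-Z][a-z]+)')
# Hyphen between a lowercase letter or digit and a capital.
pattF = re.compile('([a-z0-9])([A-Z])')

def camel_a_guiones(cadena):
    # \1 and \2 put the captured text back around the inserted hyphen.
    return pattF.sub(r'\1-\2', pattP.sub(r'\1-\2', cadena)).lower()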

print(camel_a_guiones('HolaMundoCruel'))
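
To see why two passes are used, it can help to print the intermediate string for an input that contains an acronym. A minimal sketch, assuming the same two patterns; the names patt_p, patt_f, primera_pasada and resultado are ours, for illustration only.

```python
import re

patt_p = re.compile(r'(.)([A-Z][a-z]+)')
patt_f = re.compile(r'([a-z0-9])([A-Z])')

cadena = 'VerHTMLDePag'

# Pass 1: hyphen before a capital that is followed by lowercase letters.
primera_pasada = patt_p.sub(r'\1-\2', cadena)
print(primera_pasada)  # VerHTML-DePag

# Pass 2: hyphen between a lowercase letter or digit and a capital.
resultado = patt_f.sub(r'\1-\2', primera_pasada).lower()
print(resultado)  # ver-html-de-pag
```

The first pass misses the boundary before HTML (the H is not followed by a lowercase letter) and the one before Pag (the preceding e was already consumed by the previous match). The second pass picks up both.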

Another alternative uses re.finditer to split the string into words (this is also useful if we want a list of the words contained in the CamelCase string). Once we have the words, it is enough to join them again with the join() method of str:

import re

patt = re.compile(r'.+?(?:(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)')
def camel_a_guiones(cadena):
    return '-'.join(m.group(0) for m in re.finditer(patt, cadena)).lower()

print(camel_a_guiones('HolaMundoCruel'))

Output of both:

hola-mundo-cruel
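
As mentioned for the re.finditer variant, the same pattern can also return the list of words instead of the joined string. A minimal sketch; the helper name camel_a_palabras is hypothetical and not part of the answer.

```python
import re

patt = re.compile(r'.+?(?:(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)')

def camel_a_palabras(cadena):
    # Each match is one word; the lookarounds mark the word boundaries.
    return [m.group(0) for m in patt.finditer(cadena)]

print(camel_a_palabras('HolaMundoCruel'))  # ['Hola', 'Mundo', 'Cruel']
print(camel_a_palabras('VerHTMLDePag'))    # ['Ver', 'HTML', 'De', 'Pag']
```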