Python: convert a CamelCase string to one separated by hyphens

4

Hi, I need to convert a CamelCase string into one separated by hyphens. I've been trying a few regular expressions but I can't get anywhere. The idea is to take a CamelCase string as input:

Input:

'HolaMundoCruel'

Output:

'hola-mundo-cruel'

Thanks

    
asked by Ricardo D. Quiroga 08.04.2017 at 16:29

2 answers

5

You can use re.sub to substitute each match (in this case, the boundary before a capital letter inside the string) with another string that inserts a '-'. To get rid of the capitals you can use the lower method of the str class:

import re

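# First pass: split before a capitalized word (uppercase followed by lowercase letters)
pattP = re.compile(r'(.)([A-Z][a-z]+)')
# Second pass: split between a lowercase letter or digit and an uppercase letter
pattF = re.compile(r'([a-z0-9])([A-Z])')

def camel_a_guiones(cadena):
    # Keep both captured groups and put a hyphen between them
    return pattF.sub(r'\1-\2', pattP.sub(r'\1-\2', cadena)).lower()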

print(camel_a_guiones('HolaMundoCruel'))

Another alternative uses re.finditer to split the string into words (this is also useful if we want a list of the words contained in the CamelCase string). Once we have the words, it is enough to rejoin them with the join() method of str:

import re

patt = re.compile(r'.+?(?:(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)')
def camel_a_guiones(cadena):
    return '-'.join(m.group(0) for m in re.finditer(patt, cadena)).lower()

print(camel_a_guiones('HolaMundoCruel'))

Output of both:

  

hola-mundo-cruel
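As a quick check that the two variants agree, here is a minimal sketch that runs both on the question's input and on a string containing an acronym. The names with_sub, with_finditer and split_words are only illustrative, and the replacement string r'\1-\2' (keep both captured groups with a hyphen between them) is an assumption about what the first variant intends.

```python
import re

# Variant 1: two substitution passes, keeping both captured groups.
patt_p = re.compile(r'(.)([A-Z][a-z]+)')
patt_f = re.compile(r'([a-z0-9])([A-Z])')

def with_sub(cadena):
    return patt_f.sub(r'\1-\2', patt_p.sub(r'\1-\2', cadena)).lower()

# Variant 2: split into words with finditer, then join them.
patt = re.compile(r'.+?(?:(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)')

def split_words(cadena):
    # List of the words found in the CamelCase string.
    return [m.group(0) for m in patt.finditer(cadena)]

def with_finditer(cadena):
    return '-'.join(split_words(cadena)).lower()

print(split_words('HolaMundoCruel'))  # ['Hola', 'Mundo', 'Cruel']
for s in ('HolaMundoCruel', 'getHTTPResponseCode'):
    print(with_sub(s), with_finditer(s))
# hola-mundo-cruel hola-mundo-cruel
# get-http-response-code get-http-response-code
```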

    
answered by 08.04.2017 / 17:09
6

Let's try to match the beginning of each word. There are two kinds of words in PascalCase:


  • Words that start with an uppercase letter, followed by at least one lowercase letter

    r"[A-Z][a-z]"
    

    In this case, we only need to check that the uppercase letter is followed by a lowercase letter (that is all we need to know in order to put a hyphen in front and convert to lowercase).

    There could also be digits between the two letters, so we allow for them:

    r"[A-Z]\d*[a-z]"
    

  • Acronyms (consecutive capital letters).

    r"[A-Z][A-Z\d]*(?=[A-Z]|$)"
    

    It matches one uppercase letter followed by any number of uppercase letters or digits: [A-Z][A-Z\d]* .

    It must also be followed by another uppercase letter or by the end of the text: (?=[A-Z]|$) .
    That way, we avoid consuming the start of the next word. For example,

    • It matches HTML in HTMLFormateado .
    • And it also matches HTML in FormatoHTML .

  • Putting the two previous expressions together in one, we are left with:

    r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])"
    


    This expression now matches every case. If we replace each match with r"-\g<0>" (a hyphen followed by the text that matched), we get:

    >>> import re
    >>> re.sub(r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])", r"-\g<0>", "FormatoHTMLConCSS")
    '-Formato-HTML-Con-CSS'
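To see exactly what the combined expression consumes, a small sketch with findall can help. It lists the matches for the two acronym examples above and for the string used in the substitution. Note that only the start of each regular word is matched (for example Fo), which is enough to know where the hyphen goes.

```python
import re

# Combined pattern from the answer; the group is non-capturing,
# so findall returns the full text of each match.
patron = re.compile(r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])")

for texto in ("HTMLFormateado", "FormatoHTML", "FormatoHTMLConCSS"):
    print(texto, patron.findall(texto))
# HTMLFormateado ['HTML', 'Fo']
# FormatoHTML ['Fo', 'HTML']
# FormatoHTMLConCSS ['Fo', 'HTML', 'Co', 'CSS']
```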


    Avoiding a hyphen at the beginning of the text

    To avoid inserting a hyphen at the beginning, we pass a function as the replacement argument and check, for each replacement, whether match.start() is 0 . If it is the first word (it starts at position 0), we do not add a hyphen; otherwise we put a hyphen in front.


    Inside the function, we use str.lower() to convert to lowercase.

    import re
    
    patron = r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])"
    pascal = re.compile(patron)
    
    def pascal_kebab(cadena):
        def insertar_separador(match):
            return ("-" if match.start() else "") + match.group().lower()
    
        return pascal.sub(insertar_separador, cadena)
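To make the role of match.start() more concrete, the following sketch lists, for each match, its start position, the matched text and the text the replacement function would return. The helper name trace is hypothetical and only serves this illustration.

```python
import re

pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")

def trace(cadena):
    # One tuple per match: (start position, matched text, replacement text).
    # The hyphen is added only when the match does not start at position 0.
    return [
        (m.start(), m.group(), ("-" if m.start() else "") + m.group().lower())
        for m in pascal.finditer(cadena)
    ]

for fila in trace("FormatoHTMLConCSS"):
    print(fila)
# (0, 'Fo', 'fo')
# (7, 'HTML', '-html')
# (11, 'Co', '-co')
# (14, 'CSS', '-css')
```

Only the first match starts at 0, so it is the only one that gets no hyphen.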
    


    Final code

    Convert from PascalCase to kebab-case.
    We use exactly the same logic as in the previous code, but with a lambda.

    • Because it uses a single regex and does not rely on lookbehinds, this function performs better (30% to 100% faster) than commonly used functions.

    import re
    
    pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")
    
    def pascal_kebab(cadena):
        return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)
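If you want to check the performance on your own machine, a rough sketch with timeit could look like this. The baseline two_pass (a two-substitution variant), the sample list muestras and the helper bench are assumptions made for the comparison, not part of the code above, and the timings will vary with the Python version and the inputs.

```python
import re
import timeit

pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")

def pascal_kebab(cadena):
    return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)

# Assumed baseline for comparison: two substitution passes.
first = re.compile(r"(.)([A-Z][a-z]+)")
second = re.compile(r"([a-z0-9])([A-Z])")

def two_pass(cadena):
    return second.sub(r"\1-\2", first.sub(r"\1-\2", cadena)).lower()

muestras = ["VerHTMLDePag", "HTMLConXML", "Formato", "HolaMundoCruel"]

def bench(func, number=20000):
    # Total seconds to convert every sample `number` times.
    return timeit.timeit(lambda: [func(s) for s in muestras], number=number)

for func in (pascal_kebab, two_pass):
    print(f"{func.__name__}: {bench(func):.3f} s")
```

Both functions give the same result on these samples, so the comparison is like for like.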
    


    Tests:

    pruebas = ['VerHTMLDePag', 'Ver2HTMLDePag', 'Ver2HTMLPag2Info', 'HTMLFomatoPag',
               'HTMLConXML',   'HTML5FomatoPag','HTML5ConXML',      'HTML5ConCSS3',
               'HTML',         'VerQ',          'A2BFormato',       'Formato',
               'SFormato'
              ]
    
    for prueba in pruebas:
        print("%-16s => %s" % (prueba, pascal_kebab(prueba)))
    

    Result:

    VerHTMLDePag     => ver-html-de-pag
    Ver2HTMLDePag    => ver2-html-de-pag
    Ver2HTMLPag2Info => ver2-html-pag2-info
    HTMLFomatoPag    => html-fomato-pag
    HTMLConXML       => html-con-xml
    HTML5FomatoPag   => html5-fomato-pag
    HTML5ConXML      => html5-con-xml
    HTML5ConCSS3     => html5-con-css3
    HTML             => html
    VerQ             => ver-q
    A2BFormato       => a2b-formato
    Formato          => formato
    SFormato         => s-formato
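The table above can also be turned into automatic checks. This is a minimal sketch using plain assert statements; the dictionary esperado copies some of the results listed above, and the last two entries (a lowerCamelCase string and the empty string) are extra cases added here only for illustration.

```python
import re

pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")

def pascal_kebab(cadena):
    return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)

esperado = {
    "VerHTMLDePag": "ver-html-de-pag",
    "Ver2HTMLPag2Info": "ver2-html-pag2-info",
    "HTML5ConCSS3": "html5-con-css3",
    "HTML": "html",
    "VerQ": "ver-q",
    "A2BFormato": "a2b-formato",
    "SFormato": "s-formato",
    # Extra cases, not in the original table:
    "holaMundo": "hola-mundo",
    "": "",
}

for entrada, salida in esperado.items():
    assert pascal_kebab(entrada) == salida, (entrada, pascal_kebab(entrada))
print("all cases pass")
```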
    


        
    answered by 08.04.2017 at 17:31