Let's try to match the beginning of each word. There are 2 types of words in Pascal Notation :
Words started in uppercase, followed by at least one lower case
r"[A-Z][a-z]"
In this case, we are only interested in verifying that it is followed by a lowercase letter (it is the only relevant thing to put a hyphen before and lower case).
Although there could also be digits between the two letters, and we added it:
r"[A-Z]\d*[a-z]"
Acronyms (consecutive capital letters).
r"[A-Z][A-Z\d]*(?=[A-Z]|$)"
Matches 1 upper case, followed by uppercase or digits [A-Z][A-Z\d]*
.
But also, that is followed by another capital letter or the end of the text (?=[A-Z]|$)
.
That way, we avoid consuming the next word. For example,
- That matches
HTML
in HTMLFormateado
.
- But also with
HTML
in FormatoHTML
.
Putting the two previous expressions together in one, we are left with:
r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])"
This expression already matches all cases. If we replace with r"-\g<0>"
(a hyphen followed by the text that matched), we have:
>>> import re
>>> re.sub(r"[A-Z](?:[A-Z\d]*(?=[A-Z]|$)|\d*[a-z])", r"-\g", "FormatoHTMLConCSS")
'-Formato-HTML-Con-CSS'
Do not insert scripts at the beginning of the text
To avoid inserting hyphens at the beginning, we will pass a function as an argument to check, in each replacement, if match.start()
is 0
. If it is the first word (it starts at position 0), we do not use a script, otherwise we precede a script.
Within the function, we use str.lower()
to take to lowercase.
import re
patron = r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])"
pascal = re.compile(patron)
def pascal_kebab(cadena):
def insertar_separador(match):
return ("-" if match.start() else "") + match.group().lower()
return pascal.sub(insertar_separador, cadena)
Final code
Convert from PascalCase to kebab-case.
We use exactly the same logic as in the last code, with a lambda.
- When using a single regex, and not relying on lookbehinds, this feature has a better performance (30 % to 100% faster) than commonly used functions.
import re
pascal = re.compile(r"[A-Z]\d*(?:[A-Z\d]*(?=[A-Z]|$)|[a-z])")
def pascal_kebab(cadena):
return pascal.sub(lambda m: ("-" if m.start() else "") + m.group().lower(), cadena)
Tests:
pruebas = ['VerHTMLDePag', 'Ver2HTMLDePag', 'Ver2HTMLPag2Info', 'HTMLFomatoPag',
'HTMLConXML', 'HTML5FomatoPag','HTML5ConXML', 'HTML5ConCSS3',
'HTML', 'VerQ', 'A2BFormato', 'Formato',
'SFormato'
]
for prueba in pruebas:
print("%-16s => %s" % (prueba, pascal_kebab(prueba)))
Result:
VerHTMLDePag => ver-html-de-pag
Ver2HTMLDePag => ver2-html-de-pag
Ver2HTMLPag2Info => ver2-html-pag2-info
HTMLFomatoPag => html-fomato-pag
HTMLConXML => html-con-xml
HTML5FomatoPag => html5-fomato-pag
HTML5ConXML => html5-con-xml
HTML5ConCSS3 => html5-con-css3
HTML => html
VerQ => ver-q
A2BFormato => a2b-formato
Formato => formato
SFormato => s-formato
Demo:
link