Working with Pandas and regular expressions
The idea here is the definition of a dictionary list: patrones
that will contain a regular expression pattern and an anonymous replacement function. And we apply the list on dataframe
only in cases where df['TYPE'] == ' company'
. At the panda level we use str.replace()
to perform the replacement, for example if the pattern is found:
^(\d{3})(\d{3})(\d{3})(\d{1,2})$
that is, from the beginning of the chain three groups of three numbers and a group of 1 or 2 more numbers and, if there is a match, we will obtain each of these groups separately and we can format them like this:
"{0}.{1}.{2}-{3}".format(m.group(1),m.group(2),m.group(3),m.group(4))
from io import StringIO
import pandas as pd
csv = StringIO(u'''\
NIT, TYPE
8600219985, company
9001899451, company
19479647, person
19065171, company
79896134, person
87111760819, person
''')
df = pd.read_csv(csv, dtype=str, names=["NIT", "TYPE"], header=0)
patrones = [
{"patron": r"^(\d{3})(\d{3})(\d{3})(\d{1,2})$", "repl": lambda m: "{0}.{1}.{2}-{3}".format(m.group(1),m.group(2),m.group(3),m.group(4))},
{"patron": r"^(\d{3})(\d{3})(\d{2})$", "repl": lambda m: "{0}.{1}.{2}".format(m.group(1),m.group(2),m.group(3))}
]
for p in patrones:
df.loc[df['TYPE'] == ' company', 'NIT'] = df.loc[df['TYPE'] == ' company', 'NIT'].str.replace(p["patron"], p["repl"])
print(df)
NIT TYPE
0 860.021.998-5 company
1 900.189.945-1 company
2 19479647 person
3 190.651.71 company
4 79896134 person
5 87111760819 person