Formatting a Python pandas DataFrame

1

I have a set of business tax identification data.

from io import StringIO
import pandas as pd

csv =StringIO(u'''\
NIT, TYPE
8600219985, company
9001899451, company
19479647, person
19065171, company
79896134, person
87111760819, person
''')

I want to format the records where TYPE is company, so that the result looks like this:

   NIT             TYPE
860.021.998-5   company
900.189.945-1   company
19479647        person
190.651.71      company
79896134        person
87111760819     person

Any ideas on how to do this with python pandas?

    
asked by Yan Chirino 10.05.2018 at 21:46

2 answers

1

Working with Pandas and regular expressions

The idea is to define a list of dictionaries, patrones, where each entry holds a regular expression pattern and an anonymous replacement function. We apply the list to the DataFrame only on rows where df['TYPE'] == ' company' (the leading space comes from the CSV, which has a space after each comma). In pandas we use str.replace() to perform the replacement. For example, take this pattern:

^(\d{3})(\d{3})(\d{3})(\d{1,2})$

that is, starting at the beginning of the string, three groups of three digits followed by a group of one or two more digits. If there is a match, each group is captured separately and we can format them like this:

"{0}.{1}.{2}-{3}".format(m.group(1),m.group(2),m.group(3),m.group(4))

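To see the pattern and the replacement function in isolation, here is a minimal sketch that uses only the standard re module. The names PATTERN and dotted are made up for this illustration and are not part of the answer's code.

```python
import re

# Pattern from the answer: three groups of three digits plus a 1-2 digit tail.
PATTERN = r"^(\d{3})(\d{3})(\d{3})(\d{1,2})$"

def dotted(m):
    # m is the re.Match object; group(1)..group(4) hold the captured pieces.
    return "{0}.{1}.{2}-{3}".format(m.group(1), m.group(2), m.group(3), m.group(4))

formatted = re.sub(PATTERN, dotted, "8600219985")
unchanged = re.sub(PATTERN, dotted, "19065171")  # 8 digits: no match, returned as-is

print(formatted)
print(unchanged)
```

A string that does not match the pattern comes back untouched, which is why several patterns can be applied one after another without interfering with each other.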
from io import StringIO
import pandas as pd

csv = StringIO(u'''\
NIT, TYPE
8600219985, company
9001899451, company
19479647, person
19065171, company
79896134, person
87111760819, person
''')

df = pd.read_csv(csv, dtype=str, names=["NIT", "TYPE"], header=0)
patrones = [
              {"patron": r"^(\d{3})(\d{3})(\d{3})(\d{1,2})$", "repl": lambda m: "{0}.{1}.{2}-{3}".format(m.group(1),m.group(2),m.group(3),m.group(4))},
              {"patron": r"^(\d{3})(\d{3})(\d{2})$", "repl": lambda m: "{0}.{1}.{2}".format(m.group(1),m.group(2),m.group(3))}              
  ]

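# Boolean mask for the company rows (TYPE keeps the leading space from the CSV)
is_company = df['TYPE'] == ' company'

for p in patrones:
  # regex=True is needed in recent pandas versions, where a callable
  # replacement is rejected when regex=False (the default since pandas 2.0)
  df.loc[is_company, 'NIT'] = df.loc[is_company, 'NIT'].str.replace(p["patron"], p["repl"], regex=True)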

print(df)

             NIT      TYPE
0  860.021.998-5   company
1  900.189.945-1   company
2       19479647    person
3     190.651.71   company
4       79896134    person
5    87111760819    person
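As a possible variation on the approach above, the lambdas can be swapped for backreference templates (\1, \2, ...), which str.replace() accepts when regex=True. This is only a sketch on a shortened copy of the data; the names templates and is_company are assumptions for the example, and the mask strips whitespace from TYPE instead of comparing against ' company'.

```python
from io import StringIO
import pandas as pd

csv = StringIO("""\
NIT, TYPE
8600219985, company
19065171, company
79896134, person
""")
df = pd.read_csv(csv, dtype=str, names=["NIT", "TYPE"], header=0)

# Each entry pairs a pattern with a backreference template instead of a lambda.
templates = [
    (r"^(\d{3})(\d{3})(\d{3})(\d{1,2})$", r"\1.\2.\3-\4"),
    (r"^(\d{3})(\d{3})(\d{2})$", r"\1.\2.\3"),
]

# Strip the leading space left by the CSV before comparing.
is_company = df["TYPE"].str.strip() == "company"

for pattern, template in templates:
    df.loc[is_company, "NIT"] = df.loc[is_company, "NIT"].str.replace(
        pattern, template, regex=True
    )

print(df)
```

The templates are shorter than the lambdas, but a callable remains the more flexible option if the replacement ever needs logic beyond rearranging the groups.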
    
answered by 10.05.2018 / 22:32
2

We can write a function that receives the unformatted number as a string and returns it formatted. Lacking more information, I assume that the choice of format depends on whether the number has exactly 10 digits or fewer than 10:

def format_number(n):
  if len(n) == 10:
    return "{}.{}.{}-{}".format(n[0:3], n[3:6], n[6:9], n[9])
  elif len(n)<10:
    return "{}.{}.{}".format(n[0:3], n[3:6], n[6:])
  else:
    return n    # I don't know how to handle the remaining cases, so leave them as they were
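A quick check of the function on a few values may help show which branch handles what. The function is repeated so the snippet runs on its own, and the list samples is just illustrative input.

```python
def format_number(n):
    if len(n) == 10:
        return "{}.{}.{}-{}".format(n[0:3], n[3:6], n[6:9], n[9])
    elif len(n) < 10:
        return "{}.{}.{}".format(n[0:3], n[3:6], n[6:])
    else:
        return n  # longer inputs are returned unchanged

samples = ["8600219985", "19065171", "87111760819", "12345"]
results = [format_number(s) for s in samples]

for raw, out in zip(samples, results):
    print(raw, "->", out)
```

Note that the second branch accepts any length below 10, so an input shorter than seven digits (such as the 5-digit sample) ends with a trailing dot. If such values can occur in the data, an extra length check would be needed.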

Now we can apply a function (a lambda) to each row that replaces the NIT with the value returned by format_number, but only for the rows in which TYPE is "company".

from io import StringIO
import pandas as pd

csv = StringIO(u'''\
NIT, TYPE
8600219985, company
9001899451, company
19479647, person
19065171, company
79896134, person
87111760819, person
''')

# Read the csv (I fix the column names so that they
# do not contain unwanted spaces)
df = pd.read_csv(csv, dtype=str, names=["NIT", "TYPE"], header=0)

# Apply the desired transformation to the NIT column
df.NIT = df.apply(lambda x: format_number(x.NIT) if "company" in x.TYPE else x.NIT, axis=1)

print(df)

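Result:

             NIT      TYPE
0  860.021.998-5   company
1  900.189.945-1   company
2       19479647    person
3     190.651.71   company
4       79896134    person
5    87111760819    person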

    
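As an alternative to the row-wise apply, the same transformation can be sketched with a boolean mask and Series.map, which touches only the company rows. This assumes the skipinitialspace=True option of read_csv is used to drop the space after each comma, so TYPE can be compared directly with "company"; the name is_company is chosen for the example.

```python
from io import StringIO
import pandas as pd

def format_number(n):
    if len(n) == 10:
        return "{}.{}.{}-{}".format(n[0:3], n[3:6], n[6:9], n[9])
    elif len(n) < 10:
        return "{}.{}.{}".format(n[0:3], n[3:6], n[6:])
    else:
        return n  # longer inputs are returned unchanged

csv = StringIO("""\
NIT, TYPE
8600219985, company
9001899451, company
19479647, person
19065171, company
79896134, person
87111760819, person
""")

# skipinitialspace=True removes the blank that follows each comma.
df = pd.read_csv(csv, dtype=str, names=["NIT", "TYPE"], header=0,
                 skipinitialspace=True)

# Select the company rows once, then format only their NIT values.
is_company = df["TYPE"] == "company"
df.loc[is_company, "NIT"] = df.loc[is_company, "NIT"].map(format_number)

print(df)
```

The mask makes the selection explicit and avoids calling the lambda on rows that are left unchanged anyway.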
answered by 10.05.2018 at 23:22