Show only repeated elements with Pandas

1

Suppose I have a list in .csv with different columns, one of them is "teacher name", the second "subject" and the third is "year".

There may be something similar to this:

Nombre profesor  Asignatura  Año
Juan             Mates       2002
Pedro            Lengua      2003
Luisa            Mates       2005
Natalia          Inglés      2002
Juan             Inglés      2008
Natalia          Física      2004
Juan             Inglés      2018
Luisa            Mates       2018

EDITO QUESTION Is there any way to show through pandas commands only those teachers who do more than one subject ? That is to say something like this:

Nombre profesor  Count     
Juan             2     
Natalia          2

Since Juan comes out 3 times but twice his subject is the same, English and the same thing happens to Luisa, who comes out twice but the subject is the same, mates, therefore it does not count.

It occurs to me to use groupby()

datos = pd.read_csv('fichero.csv')
p = datos.groupby('Nombre_profesor','Asignatura').size().loc[lambda x: x>1]

But if I'm not mistaken this shows those teachers who repeat themselves and who do the same subject. That is, just the ones that I want to discard therefore I do not know what to do.

Thanks!

    
asked by NEA 15.11.2018 в 23:56
source

1 answer

2

First we reproduce your example:

import pandas as pd
from io import StringIO

txt = """
Nombre_profesor  Asignatura  Año
Juan             Mates       2002
Pedro            Lengua      2003
Luisa            Mates       2005
Luisa            Mates       2006
Natalia          Inglés      2002
Juan             Inglés      2008
Natalia          Física      2004"""

datos = pd.read_table(StringIO(txt), sep="\s+")

What you are looking for is to obtain teachers who have more than one different subject. Modify your example, to see a case of a teacher with the same subject in two different years (should count 1).

To obtain a dataframe with the amounts of different subjects per teacher, we can do:

p = datos.groupby('Nombre_profesor').Asignatura.nunique().reset_index()
print(p)

  Nombre_profesor  Asignatura
0            Juan           2
1           Luisa           1
2         Natalia           2
3           Pedro           1

Notice that Luisa has the same material in two different years and still the account gives 1. The following is to filter those cases where this amount is greater than 1:

print(p.loc[p['Asignatura'] > 1])

  Nombre_profesor  Asignatura
0            Juan           2
2         Natalia           2

Explanation:

  • With datos we group by Nombre_profesor by groupby()
  • Of each group using Asignatura.nunique() we count the different subjects. With reset_index() we leave Asignatura as one more column
  • Finally p.loc[p['Asignatura'] > 1] we filter only those cases that have more than one subject
answered by 16.11.2018 / 02:30
source