Group by 2 independent columns


I have a dataset with millions of records that I want to group with PySpark by two independent columns. Here is an example.

I have:

ID  Col A   Col B
1   Alicia  Madrid
2   Pepe    Barcelona
3   Pepe    Madrid
4   Juan    Cadiz
5   Alicia  Sevilla
6   Marta   Bilbao
7   Alicia  Vigo
8   Marta   Sevilla

And I want to get:

ID  Col A   Col B      Group
1   Alicia  Madrid     A
2   Pepe    Barcelona  A
3   Pepe    Madrid     A
4   Juan    Bilbao     B
5   Alicia  Sevilla    A
6   Marta   Bilbao     B
7   Jaime   Vigo       C
8   Marta   Cadiz      B


Group A contains the column A values Alicia and Pepe, because they share the value "Madrid" in column B.

Group B is formed by Juan and Marta, because they share the value "Bilbao" in column B.

Any ideas?
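
To make the grouping rule concrete: it can be read as finding connected components in a graph where every column A value and every column B value is a node and every row is an edge between them. The following is only a minimal single-machine sketch in plain Python (a union-find), not a PySpark solution. The helper name assign_groups is hypothetical, the rows are taken from the desired-output table, and groups are numbered 0, 1, 2 rather than lettered A, B, C. On a cluster, the connectedComponents method of the GraphFrames package might serve the same purpose, but that is an assumption not tested here.

```python
def assign_groups(rows):
    """Assign a group number to each row.

    rows: iterable of (row_id, col_a, col_b) tuples.
    Rows end up in the same group when they are linked, directly or
    transitively, by a shared col_a or col_b value.
    Returns a dict {row_id: group_number}.
    """
    rows = list(rows)
    parent = {}

    def find(node):
        # Return the representative of node's set (with path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Each row is an edge between its column A value and its column B value.
    # Values are tagged so that equal strings in different columns stay distinct.
    for _, a, b in rows:
        union(("A", a), ("B", b))

    # Number the components in order of first appearance.
    labels = {}
    result = {}
    for row_id, a, _ in rows:
        root = find(("A", a))
        if root not in labels:
            labels[root] = len(labels)
        result[row_id] = labels[root]
    return result


rows = [
    (1, "Alicia", "Madrid"),
    (2, "Pepe", "Barcelona"),
    (3, "Pepe", "Madrid"),
    (4, "Juan", "Bilbao"),
    (5, "Alicia", "Sevilla"),
    (6, "Marta", "Bilbao"),
    (7, "Jaime", "Vigo"),
    (8, "Marta", "Cadiz"),
]
groups = assign_groups(rows)
```

With these rows, rows 1, 2, 3 and 5 share one group number, rows 4, 6 and 8 share another, and row 7 gets its own, which corresponds to groups A, B and C in the table.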

    
asked by Antonio Asensio 19.11.2018 at 18:02

0 answers