I have a dataset with millions of records that I want to group with PySpark by two independent columns. I'll give you an example.
I have:
ID  Col A   Col B
1   Alicia  Madrid
2   Pepe    Barcelona
3   Pepe    Madrid
4   Juan    Bilbao
5   Alicia  Sevilla
6   Marta   Bilbao
7   Jaime   Vigo
8   Marta   Cadiz
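For reference, this is roughly how I build the sample above in PySpark (I dropped the spaces from the column names for convenience, so Col A / Col B become ColA / ColB):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal reproduction of the sample data shown above
df = spark.createDataFrame(
    [
        (1, "Alicia", "Madrid"),
        (2, "Pepe", "Barcelona"),
        (3, "Pepe", "Madrid"),
        (4, "Juan", "Bilbao"),
        (5, "Alicia", "Sevilla"),
        (6, "Marta", "Bilbao"),
        (7, "Jaime", "Vigo"),
        (8, "Marta", "Cadiz"),
    ],
    ["ID", "ColA", "ColB"],
)
```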
And I want to get:
ID  Col A   Col B      Group
1   Alicia  Madrid     A
2   Pepe    Barcelona  A
3   Pepe    Madrid     A
4   Juan    Bilbao     B
5   Alicia  Sevilla    A
6   Marta   Bilbao     B
7   Jaime   Vigo       C
8   Marta   Cadiz      B
Group A contains the Col A values Alicia and Pepe, because they share the value "Madrid" in Col B.
Group B is formed by Juan and Marta, because they share Bilbao. Jaime shares nothing with anyone, so he ends up alone in Group C.
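The only idea I have so far is to treat this as a connected-components problem: every Col A value and every Col B value is a node, every row is an edge between its name and its city, and each connected component is one group. Below is a rough, untested sketch using the external graphframes package, continuing from the df built above; the package coordinates, the checkpoint directory, and the prefixing scheme are my own assumptions, not something I have working.

```python
from pyspark.sql import functions as F
from graphframes import GraphFrame  # external package, added via --packages graphframes:graphframes:<version>

# connectedComponents() requires a checkpoint directory (path here is just an example)
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# Prefix the values so a name and a city with the same text can never collide as vertices
names = df.select(F.concat(F.lit("A_"), F.col("ColA")).alias("id"))
cities = df.select(F.concat(F.lit("B_"), F.col("ColB")).alias("id"))
vertices = names.union(cities).distinct()

# One edge per row, linking the name vertex to the city vertex
edges = df.select(
    F.concat(F.lit("A_"), F.col("ColA")).alias("src"),
    F.concat(F.lit("B_"), F.col("ColB")).alias("dst"),
)

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # -> columns: id, component

# Keep only the name vertices and join the component label back onto the original rows
name_groups = (
    components
    .filter(F.col("id").startswith("A_"))
    .withColumn("ColA", F.regexp_replace("id", "^A_", ""))
    .select("ColA", "component")
)
result = df.join(name_groups, on="ColA").select("ID", "ColA", "ColB", "component")
result.show()
```

If this works, `component` would come back as an arbitrary numeric label rather than A/B/C, but it should partition the names the same way; mapping it to letters would just be a cosmetic step afterwards.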
Any ideas?