I'm new to Apache Spark and I have a question about RDDs and transformations. I already have a PairRDD loaded with data. What I need now is, through a transformation (mapToPair), to produce another PairRDD containing only the tuples that meet a certain condition, while also adding extra information to each tuple (which is why I use a map and not a filter). The condition is: for each tuple (K, V), take a second key-value RDD, obtain all of its tuples containing K and all of its tuples containing V, and then compute the intersection of the two sets. The code I have come up with so far is the following:
JavaPairRDD<Long, Long> originalRDD = ...; // loaded from a dataset
JavaPairRDD<Long, Long> otro = ...;        // the tuples to evaluate come from here

JavaPairRDD<Tuple2<Long, Long>, Long> result = otro
    .mapToPair(tupla -> {
        // both filters reference originalRDD from inside the lambda
        // (this nesting of transformations is what my question below is about)
        JavaRDD<Long> aux1 = originalRDD.filter(t -> t._1().equals(tupla._1())).values();
        JavaRDD<Long> aux2 = originalRDD.filter(t -> t._2().equals(tupla._2())).values();
        JavaRDD<Long> auxFinal = aux1.intersection(aux2);
        // here would go the remaining code that adds the additional info to the tuple
        // (that code is not relevant to this question)
        // return the enriched tuple to result, e.g. new Tuple2<>(tupla, extraInfo)
    });
If I program it this way, will the executors that run this lambda tell the driver that new tasks have to be created for the filters and the intersection, or will the same executor end up doing everything itself (losing all parallelism at the job level)? I haven't found any reference that covers a case like this, i.e. what happens when transformations are nested inside other transformations. Thank you very much in advance!
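For context, the only alternative I have sketched so far avoids referencing originalRDD inside the lambda at all: group it by key, collect it to the driver and broadcast it, so each executor can do the lookup locally. This is only a rough sketch of the idea, assuming the grouped form of originalRDD is small enough to fit in memory; jsc is my JavaSparkContext, and the condition and extra info below are just placeholders for the real logic:

// group originalRDD by key on the cluster, pull the result to the driver
// and broadcast it, so no RDD is touched inside the lambda below
Map<Long, List<Long>> grouped = originalRDD
    .groupByKey()
    .mapValues(vals -> {
        List<Long> list = new ArrayList<>();
        vals.forEach(list::add);
        return list;
    })
    .collectAsMap();                                  // assumes this fits in driver memory

Broadcast<Map<Long, List<Long>>> bc = jsc.broadcast(grouped);

JavaPairRDD<Tuple2<Long, Long>, Long> result2 = otro.mapToPair(tupla -> {
    List<Long> valuesForKey = bc.value().getOrDefault(tupla._1(), Collections.emptyList());
    // placeholder condition and extra info; the real code would mirror the intersection above
    boolean matches = valuesForKey.contains(tupla._2());
    Long extraInfo = matches ? 1L : 0L;
    return new Tuple2<>(tupla, extraInfo);
});

But before going down that path, I would like to understand whether the nested version above is even valid and how Spark would schedule it.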