I have read a lot of documentation about the difference between fine-grained and coarse-grained parallelism, but I still do not fully understand it. Here is an example of what I have seen:
"An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second (...)" Source: Wikipedia.
When implementing, for example, a matrix-vector multiplication, how do a fine-grained and a coarse-grained implementation differ?
I have already written a fine-grained solution that creates one thread per row of the matrix and computes that row's result, but how would I implement a coarse-grained solution?
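Roughly, the per-row version has the shape of the sketch below. Python is used only for illustration, and `matvec_per_row` and the sample data are placeholder names, not my real code. In standard CPython the threads will not actually speed up this arithmetic, so take it as a picture of the structure only.

```python
import threading

def matvec_per_row(matrix, vector):
    """Multiply matrix by vector using one thread per row (fine-grained)."""
    result = [0] * len(matrix)

    def row_task(i):
        # Each thread computes exactly one dot product and writes one slot,
        # so no two threads ever touch the same element of result.
        result[i] = sum(a * x for a, x in zip(matrix[i], vector))

    threads = [threading.Thread(target=row_task, args=(i,))
               for i in range(len(matrix))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

# Hypothetical sample data: a 3x2 matrix and a vector of length 2.
A = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
y = matvec_per_row(A, x)
```

With N rows this starts N threads, each doing very little work, which is what makes it fine-grained.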
My idea, in my case, is to use Subramanian's equation with a blocking coefficient of 0, for example, to get the number of threads needed, and then divide the matrix dimension by that number of threads, so that I launch one thread per block of rows rather than one thread per row as in the fine-grained version.
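To make that idea concrete, here is a minimal sketch of the block version, again in Python just for illustration. It assumes Subramanian's equation has the form threads = cores / (1 - blocking coefficient), which is how I understand it, and that 0 <= coefficient < 1. The names `thread_count`, `split_rows` and `matvec_by_blocks` are made up for this example.

```python
import os
import threading

def thread_count(cores, blocking_coef):
    # Assumed form of Subramanian's equation: cores / (1 - blocking_coef),
    # valid for 0 <= blocking_coef < 1. With coefficient 0 it equals cores.
    return max(1, int(cores / (1 - blocking_coef)))

def split_rows(n_rows, n_blocks):
    """Return contiguous (start, end) ranges covering all rows."""
    if n_rows == 0:
        return []
    n_blocks = min(n_blocks, n_rows)      # never create empty blocks
    base, extra = divmod(n_rows, n_blocks)
    ranges, start = [], 0
    for b in range(n_blocks):
        # The first `extra` blocks take one additional row each.
        end = start + base + (1 if b < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def matvec_by_blocks(matrix, vector, n_threads=None):
    """Multiply matrix by vector using one thread per block of rows."""
    if n_threads is None:
        n_threads = thread_count(os.cpu_count() or 1, 0.0)
    result = [0] * len(matrix)

    def block_task(start, end):
        # One thread handles a whole range of rows.
        for i in range(start, end):
            result[i] = sum(a * x for a, x in zip(matrix[i], vector))

    threads = [threading.Thread(target=block_task, args=bounds)
               for bounds in split_rows(len(matrix), n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

# Hypothetical sample data, split across 2 threads.
A = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
y = matvec_by_blocks(A, x, n_threads=2)
```

Here the number of threads depends on the number of cores, not on the number of rows, and each thread does a larger chunk of work before finishing.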
Let's see if I can finally understand how each kind of parallelism works.