Oct 30, 2023
In my understanding:
Caching, skew, and data locality do not cause a shuffle. Shuffle and data distribution may look similar, but they are two different concepts. The scenarios you describe for caching and skew result in a spill to disk, not a shuffle. Data locality might involve fetching data from a remote node or executor, but that is not a shuffle either; it might be a side effect of a previous shuffle.
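To make the distinction concrete, here is a toy sketch in plain Python. It is not Spark code and does not reflect Spark's internals; `shuffle_by_key`, `process_with_spill`, and the `memory_limit` parameter are hypothetical names I made up for illustration. It only shows that a spill keeps records on the same partition, while a shuffle moves records across partitions by key.

```python
def shuffle_by_key(partitions, num_partitions):
    """Toy shuffle: route each (key, value) record to partition key % num_partitions."""
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_partitions
            out[dst].append((key, value))
            if dst != src:
                moved += 1  # this record crossed a partition boundary
    return out, moved


def process_with_spill(partition, memory_limit):
    """Toy spill: keep up to memory_limit records in memory, the rest go to local disk."""
    return partition[:memory_limit], partition[memory_limit:]


# Skewed input (made-up data): partition 0 holds most of the records.
partitions = [
    [(0, "a"), (0, "b"), (0, "c"), (0, "d"), (1, "e")],
    [(1, "f")],
]

# Skew plus limited memory: records overflow to disk but stay on partition 0.
in_memory, spilled = process_with_spill(partitions[0], memory_limit=3)

# A shuffle is a separate step that regroups records by key across partitions.
shuffled, moved = shuffle_by_key(partitions, num_partitions=2)
```

In this toy model the spill never changes which partition owns a record; only the explicit regrouping by key does.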
Would love your input on this.