
In my understanding:

Caching, skew, and data locality do not cause a shuffle. Shuffle and data distribution may look similar, but they are two different things. The scenarios you described for caching and skew would result in a spill to disk, not a shuffle. Data locality might involve fetching data from a remote node/executor, but that is not a shuffle either; it might be a side effect of a previous shuffle.
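
To make the distinction concrete, here is a toy sketch in plain Python. It is not Spark code and does not use any Spark API; the function names `shuffle` and `process_with_spill`, the integer keys, and the `memory_limit` parameter are all made up for illustration. It only models the idea that a spill stays inside one partition, while a shuffle moves records between partitions by key.

```python
def shuffle(partitions, num_out):
    """Toy shuffle: redistribute records across partitions by key."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Integer keys are assumed, so key % num_out is deterministic.
            out[key % num_out].append((key, value))
    return out


def process_with_spill(partition, memory_limit):
    """Toy spill: process one partition locally and count buffer overflows."""
    buffer, spills = [], 0
    for record in partition:
        buffer.append(record)
        if len(buffer) > memory_limit:
            # The buffer would be written to local disk here.
            # No record leaves this partition.
            spills += 1
            buffer.clear()
    return spills


# Skewed input: partition 0 holds far more records than partition 1.
partitions = [[(1, i) for i in range(10)], [(2, 0)]]

# Skew makes the large partition spill, but records stay where they are.
spills = [process_with_spill(p, memory_limit=4) for p in partitions]
sizes_after_spill = [len(p) for p in partitions]

# Only the shuffle changes which partition each record lives in.
shuffled = shuffle(partitions, num_out=2)
sizes_after_shuffle = [len(p) for p in shuffled]
```

In this model the skewed partition spills while the partition sizes stay the same, and only the call to `shuffle` changes where records live. That is the sense in which I would keep the two terms apart.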

Would love your input on these points.


Amit Singh Rathore

Written by Amit Singh Rathore

Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML

Responses (1)