Spark Interview Question — XIII

Amit Singh Rathore
3 min read · Feb 13, 2024

The next part of the Spark Interview Question series.

Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X | Part XI | Part XII

Explain AQE Skew Join Optimization and its limitations.

When AQE skew-join optimization is enabled, Spark uses runtime shuffle statistics to detect a skewed partition, splits it into two or more smaller partitions, and replicates the matching partition on the other side of the join so that each split still sees every row with the same keys. This spreads the work more evenly across tasks, so the join finishes sooner.
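To make the detection step concrete, here is a minimal plain-Python sketch (no Spark needed) of the rule described in the Spark configuration docs: a partition counts as skewed when it is larger than `spark.sql.adaptive.skewJoin.skewedPartitionFactor` times the median partition size and also larger than `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`. The function names and the `plan_splits` helper are hypothetical, and the defaults used here (5 and 256 MB) are the documented Spark 3.x defaults, so check them against your version. Real AQE splits along map-output boundaries, so treat the split count as a rough illustration only.

```python
from statistics import median

MB = 1024 * 1024


def find_skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return the indices of partitions that look skewed.

    Mirrors the documented rule: size > factor * median size
    AND size > threshold. Names and defaults are illustrative.
    """
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


def plan_splits(size_bytes, target_bytes=64 * MB):
    """Hypothetical helper: how many chunks of roughly target_bytes
    a skewed partition would be cut into (ceiling division)."""
    return max(1, -(-size_bytes // target_bytes))


# Four ordinary partitions and one very large one.
sizes = [64 * MB, 64 * MB, 64 * MB, 64 * MB, 2048 * MB]
skewed = find_skewed_partitions(sizes)
splits = {i: plan_splits(sizes[i]) for i in skewed}
```

Each of the resulting chunks is joined against a full copy of the matching partition from the other side, which is why the other side gets "multiplied".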

It has two main limitations. First, it does not work for complex queries where more than two shuffles are involved in the plan. Second, if Spark determines during execution that handling the skewed join would introduce an extra shuffle, it abandons the skew handling and goes with the plan that has fewer shuffles, which in this case is the original plan without skew optimization.

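Spark listeners help us detect skew: the onStageCompleted() and onTaskEnd() callbacks expose stage-level and task-level metrics, so a few tasks that run far longer than the rest of their stage stand out.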
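A real listener is registered on the JVM side, so the following is only a plain-Python sketch of the bookkeeping such a listener might do. The `SkewTracker` class, its method names, and the ratio of 5 are all assumptions made for illustration. The two methods stand in for what you would do inside onTaskEnd() and onStageCompleted() with the stage id and task duration taken from the event.

```python
from collections import defaultdict
from statistics import median


class SkewTracker:
    """Hypothetical helper that mimics the bookkeeping of a Spark listener."""

    def __init__(self, ratio=5.0):
        self.ratio = ratio  # how much slower than the median counts as skew
        self.durations = defaultdict(list)  # stage id -> task durations (ms)

    def on_task_end(self, stage_id, duration_ms):
        # In a real listener this value would come from the task-end event.
        self.durations[stage_id].append(duration_ms)

    def on_stage_completed(self, stage_id):
        """Return True if the slowest task ran more than `ratio` times
        longer than the median task of the stage."""
        durations = self.durations.pop(stage_id, [])
        if not durations:
            return False
        return max(durations) > self.ratio * median(durations)


tracker = SkewTracker()
for duration in [100, 110, 90, 105, 2000]:  # one straggler task
    tracker.on_task_end(stage_id=1, duration_ms=duration)
for duration in [100, 100, 120]:  # evenly sized tasks
    tracker.on_task_end(stage_id=2, duration_ms=duration)

stage_1_skewed = tracker.on_stage_completed(1)
stage_2_skewed = tracker.on_stage_completed(2)
```

Comparing the maximum against the median rather than the mean keeps a single straggler from hiding itself by inflating the average.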

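Options for handling the skew when AQE does not:

  • spark.sql.adaptive.forceOptimizeSkewedJoin (forces the skew-join optimization even if it introduces an extra shuffle)
  • Skew Join hints
  • Refactor the code to partition the job horizontally or vertically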
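As a small sketch of the first option, the settings can be collected in a dictionary and applied to a session. The `SKEW_JOIN_CONF` name and the `apply_conf` helper are hypothetical. The three keys are real Spark SQL settings, with `spark.sql.adaptive.forceOptimizeSkewedJoin` available from Spark 3.3 onward, and `spark` is assumed to be an existing SparkSession.

```python
# Settings related to AQE skew-join handling (values are strings, as in spark-defaults).
SKEW_JOIN_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Apply the skew optimization even when it costs an extra shuffle.
    "spark.sql.adaptive.forceOptimizeSkewedJoin": "true",
}


def apply_conf(spark, conf=SKEW_JOIN_CONF):
    """Hypothetical helper: set each key on the session's runtime config."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

With a live session this would be called as `apply_conf(spark)` before running the join. Forcing the optimization trades an extra shuffle for better balance, so it is worth comparing run times with and without it.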

How does lazy evaluation help spark optimization?

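1. Pipeline merging: Because nothing runs until an action is called, Spark can combine consecutive narrow transformations like map and filter into a single pass over each partition within one stage, so the intermediate result is never materialized or moved between them.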
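The idea can be illustrated with plain Python generators. This is only an analogy, not how Spark implements it, and every name in it (`lazy_map`, `lazy_filter`, `trace`) is made up for the example. Nothing runs until the result is consumed, and each element then flows through map and filter back to back, with no intermediate list in between.

```python
trace = []  # records the order in which the functions actually run


def lazy_map(fn, rows):
    for row in rows:
        yield fn(row)


def lazy_filter(pred, rows):
    for row in rows:
        if pred(row):
            yield row


def double(x):
    trace.append(("map", x))
    return x * 2


def is_multiple_of_four(x):
    trace.append(("filter", x))
    return x % 4 == 0


# Building the pipeline is like chaining transformations: no work happens yet.
pipeline = lazy_filter(is_multiple_of_four, lazy_map(double, range(1, 5)))
ran_before_action = list(trace)  # snapshot of the trace: still empty

# Consuming the pipeline plays the role of an action such as collect().
result = list(pipeline)
```

After the last line, `trace` shows map and filter alternating element by element, which is the single-pass behaviour that pipelining gives Spark within a stage.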

Amit Singh Rathore

Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML