Spark Interview Question — XIII

Amit Singh Rathore
3 min read · Feb 13, 2024


The next part of the Spark Interview question series

Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X | Part XI | Part XII

Explain AQE Skew Join Optimization and its limitations.

When AQE skew-join optimization is enabled, Spark detects skewed partitions at shuffle time, splits each of them into two or more smaller partitions, and replicates the matching partitions on the other side of the join for the same keys. This spreads the work evenly across tasks, so resources are consumed more uniformly and the job finishes faster.
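A minimal configuration sketch that enables AQE and its skew-join handling; the threshold values below are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-skew-join")
  .master("local[*]")
  // Enable Adaptive Query Execution (on by default since Spark 3.2)
  .config("spark.sql.adaptive.enabled", "true")
  // Enable skew-join handling within AQE
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  // A partition is considered skewed if it is at least 5x the median partition size...
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  // ...and also larger than this absolute threshold
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
  .getOrCreate()
```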

It has two main limitations. First, it does not work for complex queries where more than two shuffles are involved in the plan. Second, if Spark determines during execution that handling the skewed join would introduce an extra shuffle, it aborts the skew handling and falls back to the plan with fewer shuffles, which in this case is the original plan without the skew optimization.

Skew can also be detected with Spark listeners, by comparing task-level metrics in onStageCompleted() and onTaskEnd().
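A rough sketch of such a listener, assuming an existing SparkSession named spark (as in spark-shell). It flags a stage when its slowest task runs far longer than the average task; the 5x ratio is arbitrary, and lumping all task durations together between stage completions is a simplification:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import scala.collection.mutable.ArrayBuffer

class SkewDetectListener extends SparkListener {
  private val taskDurations = ArrayBuffer[Long]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Collect per-task wall-clock durations as tasks finish
    taskDurations += taskEnd.taskInfo.duration
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    if (taskDurations.nonEmpty) {
      val max = taskDurations.max
      val avg = taskDurations.sum / taskDurations.size
      // Flag the stage if one task ran far longer than the average task
      if (avg > 0 && max > 5 * avg)
        println(s"Possible skew in stage ${stageCompleted.stageInfo.stageId}: max task ${max} ms vs avg ${avg} ms")
      taskDurations.clear()
    }
  }
}

// Register the listener on the running SparkContext
spark.sparkContext.addSparkListener(new SkewDetectListener)
```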

Options:

  • spark.sql.adaptive.forceOptimizeSkewedJoin
  • Skew join hints
  • Refactor the code to partition the job horizontally or vertically (a salting sketch follows this list)
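As an illustration of the last option, a common manual refactor is key salting: spread a hot key on the large side across several sub-keys and replicate the small side accordingly. This is only a sketch; largeDf, smallDf, the join key id, and the salt factor are all hypothetical:

```scala
import org.apache.spark.sql.functions._

val numSalts = 8  // illustrative salt factor

// Large, skewed side: assign each row a random salt value
val saltedLarge = largeDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Small side: replicate each row once per salt value
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))

// Join on the original key plus the salt, then drop the salt column
val joined = saltedLarge.join(saltedSmall, Seq("id", "salt")).drop("salt")
```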

How does lazy evaluation help Spark optimization?

1. Pipeline merging: Spark combines consecutive transformations such as map and filter into a single operation, reducing data movement between stages.

2. Predicate pushdown: A predicate is a WHERE/filter clause that limits the number of rows returned. Because evaluation is lazy, Spark can push predicates close to the data source, reducing the amount of data that needs to be read and processed (see the sketch after this list).

3. Join optimization: Lazy evaluation lets Spark rearrange the order of operations and choose the most efficient join strategy based on data sizes.

4. Short-circuit evaluation: If an action only requires a subset of the data (for example take or limit), Spark can stop the computation early once the necessary data is available.

5. Pruning unused data: Lazy evaluation allows Spark to skip unnecessary computations and columns by analyzing the plan before executing it.
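A small illustration of points 2 and 5, assuming an existing SparkSession named spark; the Parquet path and column names are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Nothing is read yet: these are lazy transformations
val df = spark.read.parquet("/data/events")       // hypothetical path
  .filter(col("event_type") === "purchase")       // candidate for predicate pushdown
  .select("user_id", "amount")                    // candidate for column pruning

// Only the physical plan (or an action) triggers these decisions; the Parquet
// scan node in the plan shows PushedFilters and a pruned read schema.
df.explain()
```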

How we can do Upsert in Spark?

Spark does not inherently support upsert operations because it operates on immutable DataFrames, so performing an upsert requires combining other operations to simulate it. There are two common approaches:

  • Union & deduplication (sketched below)
  • Join
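A minimal, self-contained sketch of the union-and-deduplication approach, using made-up data with an id key and an updated_at column to decide which version of a row wins:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("upsert-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: (id, value, updated_at)
val existing = Seq((1, "a", 1L), (2, "b", 1L)).toDF("id", "value", "updated_at")
val updates  = Seq((2, "b2", 2L), (3, "c", 2L)).toDF("id", "value", "updated_at")

// Union both sets, then keep only the most recent row per key
val latestPerKey = Window.partitionBy("id").orderBy(col("updated_at").desc)

val upserted = existing.unionByName(updates)
  .withColumn("rn", row_number().over(latestPerKey))
  .filter(col("rn") === 1)
  .drop("rn")

upserted.show()
```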

Explain UnionByName vs Union.

union() resolves columns by position, not by name. So if the two DataFrames have the same column names and data types but in a different order, union() silently pairs up the wrong columns and produces corrupted results. unionByName() resolves columns by name, so whenever column order is not guaranteed it is the much safer choice.
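A tiny illustration with made-up column names, again assuming an existing SparkSession named spark:

```scala
import spark.implicits._

val a = Seq(("1", "alice")).toDF("id", "name")
val b = Seq(("bob", "2")).toDF("name", "id")

// union() is positional: "bob" lands under id and "2" under name
a.union(b).show()

// unionByName() matches columns by name, so the rows line up as intended
a.unionByName(b).show()
```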

Explain the importance of the order of columns in insertInto.

If the order of columns in the DataFrame does not match the order of columns in the table, we will end up inserting wrong data into the table without the application throwing any error at all.

With a partitioned table, if a high-cardinality column ends up in the position of the partitioning column, the application will spend far more time creating partitions and will therefore run much longer than it otherwise would.

While using insertInto(), we must ensure that the position of columns in our DataFrame matches the position of the columns in the table.
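A short sketch of the safe pattern, assuming a hypothetical partitioned table events(id INT, name STRING, dt STRING): reorder the DataFrame columns to the table's column order before calling insertInto(), since insertInto() resolves columns by position, not by name.

```scala
import spark.implicits._

// DataFrame whose column order differs from the table definition
val df = Seq(("alice", 1, "2024-02-13")).toDF("name", "id", "dt")

// df.write.insertInto("events") here would silently write names into id and ids into name,
// because insertInto() resolves columns by position, not by name.
// Reordering the columns to the table's order avoids that:
df.select("id", "name", "dt")
  .write
  .insertInto("events")
```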


Amit Singh Rathore

Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML