Published in Dev Genius · Apr 30
Sparkmagic + Livy + Spark: Turn notebooks into a Spark IDE
Sparkmagic is a project to interactively work with remote Spark clusters in Jupyter notebooks through the Livy REST API. It provides a set of Jupyter Notebook cell magics and kernels that turn Jupyter Notebook into an integrated Spark environment for remote clusters. Sparkmagic is…
Open Source · 3 min read
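As a taste of how Sparkmagic is pointed at a Livy endpoint, a minimal `~/.sparkmagic/config.json` might look like the fragment below; the Livy URL and session settings are placeholders for your own cluster, and the full set of keys is documented in Sparkmagic's example config:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-host:8998",
    "auth": "None"
  },
  "session_configs": {
    "driver_memory": "2g",
    "executor_cores": 2
  }
}
```

With this in place, the `%%spark` cell magics (or the PySpark/Scala kernels) create and reuse Livy sessions against that endpoint.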
Apr 13
On-call week in Data Engineering: Issues worked in the last five days at my job
Last week I was on call for the Data Platform Engineering team. In this blog, I cover the issues that came up during the week and their solutions. Pandas to pandas_on_spark: the job was scheduled from Airflow. The task running the job failed. The task log…
Data · 5 min read
Published in Dev Genius · Apr 12
Debugging a memory leak in a Spark Application: A real-life experience
Credits: Shrikant Prasad for working on this. Recently, one of the applications from HDP (Spark 2.3) was onboarded to Spark 3.2 on YARN. The first job ran and threw an error. Since the job had 4 retries, it tried and failed 4 times. The job owner looked…
Spark · 4 min read
Published in Dev Genius · Mar 17
[15 more] Signs of a professional Python programmer: Following up on the previous blog
Click here for Part I. Formatting with f-strings: an f-string is a better and preferred way of formatting strings than .format. It supports variable substitution, expression evaluation, decimal rounding, and precision. …
Python · 3 min read
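The f-string point above can be shown with a minimal comparison in plain Python (the names and values are made up for illustration):

```python
# Equivalent formatting with .format and with an f-string.
name, pi = "Ada", 3.14159

old_style = "Hello {}, pi ~ {:.2f}".format(name, pi)
f_style = f"Hello {name}, pi ~ {pi:.2f}"   # precision control inline

assert old_style == f_style == "Hello Ada, pi ~ 3.14"

# f-strings also evaluate arbitrary expressions in place:
f_expr = f"{name.upper()} squared pi: {pi ** 2:.1f}"
print(f_expr)  # ADA squared pi: 9.9
```

Both produce the same string, but the f-string keeps the value next to its format spec, which is the readability win the post argues for.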
Published in Dev Genius · Mar 15
[Solution] Spark — debugging a slow Application: A follow-up blog to fix slow jobs
This blog is a follow-up to the one where I list the reasons for a slow Spark job. Input / Source: input layout, partitioned data. The right partitioning scheme allows Spark to read only specific data. …
Spark · 5 min read
Published in Dev Genius · Mar 13
Spark — debugging a slow Application: Reasons that make an application slow
Spark has a lot of native optimization tricks (like Catalyst, CBO, AQE, Dynamic Allocation, and Speculation) up its sleeve to make jobs run faster. Still, many a time we see our jobs getting slower and slower…
Spark · 5 min read
Published in Dev Genius · Mar 12
Spark Errors — Uncluttered: Understanding Spark errors
When any Spark application fails, we should identify the errors and exceptions that caused the failure. We can find the exception messages in the Spark driver or executor logs. Useful information is also logged in the Spark UI. At times the error messages could mean different things…
Spark · 6 min read
Published in Dev Genius · Mar 11
Spark — Spill: A side effect
Spark does data processing in memory, but not everything fits in memory. When the data in a partition is too large to fit in memory, it gets written to disk. Spark does this to free up RAM for the remaining tasks within the job. It…
Apache Spark · 2 min read
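The spill mechanism described above can be sketched in plain Python as an external sort: when an in-memory buffer exceeds a budget (a stand-in for Spark's execution-memory threshold), a sorted run is written to disk, and the runs are merged at the end. This is an illustration of the idea, not Spark's actual implementation:

```python
import heapq
import os
import tempfile

def spill_sort(records, max_in_memory=3):
    """Sort records, spilling a sorted run to disk whenever the
    in-memory buffer reaches max_in_memory, then merging all runs."""
    runs, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            # "Memory" is full: write a sorted run to disk, free the buffer.
            with tempfile.NamedTemporaryFile("w", delete=False) as f:
                f.write("\n".join(sorted(buffer)))
                runs.append(f.name)
            buffer = []
    merge_inputs = [sorted(buffer)]            # whatever is left in memory
    for path in runs:
        with open(path) as f:
            merge_inputs.append(f.read().splitlines())
        os.remove(path)
    return list(heapq.merge(*merge_inputs))    # k-way merge of sorted runs

print(spill_sort(["d", "a", "e", "c", "b"]))   # ['a', 'b', 'c', 'd', 'e']
```

The side effect the post refers to is visible here: correctness is preserved, but every spill adds disk I/O that a fully in-memory sort would avoid.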
Published in Dev Genius · Mar 10
Shuffle in Spark: Data rearrangement in partitions
Shuffle is the process of re-distributing data between partitions for operations where data needs to be grouped or seen as a whole. A shuffle happens whenever there is a wide transformation. In the Spark DAG (operator graph), two stages are separated by shuffle boundaries. …
Spark · 4 min read
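The redistribution step can be illustrated in plain Python with key-hash partitioning, the same idea as Spark's default HashPartitioner; this is a sketch of the concept, not Spark internals:

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by key hash, so all
    values for a key land in the same partition -- the property a shuffle
    establishes before a groupBy / reduceByKey can run per-partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(pairs, 2)

# After the "shuffle", every occurrence of a key sits in exactly one partition.
for key in {"a", "b", "c"}:
    homes = {i for i, p in enumerate(parts) if any(k == key for k, _ in p)}
    assert len(homes) == 1
```

Moving every pair to its key's target partition is exactly the expensive network/disk exchange that makes wide transformations stage boundaries.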
Published in Dev Genius · Mar 8
Spark partitioning: Controlling the number of partitions in Spark for parallelism
A partition in Spark is a logical chunk of data mapped to a single node in a cluster. Partitions are the basic units of parallelism. Each partition is processed by a single task slot. In a multicore system, the total slots for tasks…
Spark · 5 min read
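For file-based reads, the initial partition count roughly follows from the input size and `spark.sql.files.maxPartitionBytes` (128 MB by default). A simplified back-of-the-envelope version, ignoring `spark.sql.files.openCostInBytes`, file boundaries, and default parallelism:

```python
import math

def approx_read_partitions(total_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Rough partition count for a file scan: the input is split into
    chunks of at most max_partition_bytes, with at least one partition."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))

one_gb = 1024 ** 3
print(approx_read_partitions(one_gb))  # 8 partitions of ~128 MB each
```

With 8 partitions, at most 8 task slots can work on the scan in parallel, which is why partition count directly caps parallelism.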