Amit Singh Rathore

2.3K Followers

Published in Dev Genius · Apr 30

Sparkmagic + Livy + Spark

Turn notebooks into a Spark IDE. Sparkmagic is a project for working interactively with remote Spark clusters in Jupyter notebooks through the Livy REST API. It provides a set of Jupyter Notebook cell magics and kernels that turn a notebook into an integrated Spark environment for remote clusters. Sparkmagic is…
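As a minimal sketch of how sparkmagic finds the Livy endpoint (the server URL and session settings below are placeholders, not taken from the article), the client is configured through ~/.sparkmagic/config.json:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-server:8998",
    "auth": "None"
  },
  "session_configs": {
    "driverMemory": "2g",
    "executorCores": 2
  }
}
```

Everything under session_configs is passed through to Livy when a session is created, so per-notebook Spark settings live here rather than in the cluster's spark-defaults.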

Open Source

3 min read



Apr 13

On-call week in Data Engineering

Issues worked on in the last five days at my job. Last week I was on call for the Data Platform Engineering team. In this blog, I cover the issues that came up during the week and their solutions. Pandas to pandas_on_spark: the job was scheduled from Airflow. The task running the job failed. The task log…
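A hypothetical sketch of the pandas-to-pandas_on_spark migration mentioned above (not the exact code from the incident): pyspark.pandas, available since Spark 3.2, mirrors most of the pandas API, so the migration is often just a change of import, though not every pandas behavior carries over unchanged.

```python
# Plain pandas: everything runs on a single machine.
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
total = pdf["value"].sum()  # 60

# Pandas API on Spark: the same calls, but distributed execution
# (commented out here so the sketch runs without a Spark cluster).
# import pyspark.pandas as ps
# psdf = ps.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
# total = psdf["value"].sum()
```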

Data

5 min read



Published in Dev Genius · Apr 12

Debugging a memory leak in Spark Application

Real-life experience. Credits to Shrikant Prasad for working on this. Recently, one of the applications from HDP (Spark 2.3) was onboarded to Spark 3.2 on YARN. The first job ran and threw an error. Since the job had 4 retries, it tried and failed 4 times. The job owner looked…
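For context on the four retries: on YARN the retry count comes from the application-attempt setting, which can be lowered to fail fast while debugging an onboarding. A hedged sketch (the application file name is a placeholder, not from the article):

```
spark-submit \
  --master yarn \
  --conf spark.yarn.maxAppAttempts=1 \
  app.py
```

With one attempt, the first failure surfaces immediately instead of being masked by repeated retries of the same error.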

Spark

4 min read



Published in Dev Genius · Mar 17

[15 more] Signs of a professional Python programmer

A follow-up to the previous blog. Click here for Part I. Using f-strings is a better and preferred way of string formatting than .format(). F-strings support variable substitution, expression evaluation, and decimal rounding and precision. …
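A short illustration of the f-string features the snippet lists (the variable names are arbitrary examples):

```python
name = "Spark"
version = 3.2

# Variable substitution plus expression evaluation inside the braces
s1 = f"{name} {version + 0.1:.1f}"  # "Spark 3.3"

# Decimal rounding / precision control
pi = 3.14159
s2 = f"{pi:.2f}"  # "3.14"

# The older .format() equivalent, for comparison
s3 = "{:.2f}".format(pi)
```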

Python

3 min read



Published in Dev Genius · Mar 15

[Solution] Spark — debugging a slow Application

A follow-up blog to fix slow jobs. This blog is a follow-up to the post where I listed reasons for slow Spark jobs. Input / Source: input layout and partitioned data. The right partitioning scheme allows Spark to read only the specific data it needs. …
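A hypothetical sketch of why a partitioned layout helps (the bucket and paths below are invented examples): with Hive-style key=value directories, a filter on the partition column prunes whole directories before any file is opened.

```python
# Invented example of a date-partitioned dataset on object storage.
paths = [
    "s3://bucket/events/date=2022-03-14/part-0.parquet",
    "s3://bucket/events/date=2022-03-15/part-0.parquet",
    "s3://bucket/events/date=2022-03-16/part-0.parquet",
]

# A filter like date = '2022-03-15' lets the planner keep only the
# matching directory; the other partitions are never read.
selected = [p for p in paths if "date=2022-03-15" in p]
```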

Spark

5 min read



Published in Dev Genius · Mar 13

Spark — debugging a slow Application

Reasons that make an application slow. Spark has a lot of native optimization tricks (like Catalyst, CBO, AQE, Dynamic Allocation, and Speculation) up its sleeve to make jobs run faster. Still, many times we see our jobs getting slower and slower…
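For reference, the optimizations named above are mostly switches in spark-defaults.conf; a sketch of the relevant keys (values shown are illustrative, and AQE and dynamic allocation are already on by default in recent Spark versions):

```
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.dynamicAllocation.enabled                 true
spark.speculation                               true
```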

Spark

5 min read



Published in Dev Genius · Mar 12

Spark Errors — Uncluttered

Understanding Spark errors. When a Spark application fails, we should identify the errors and exceptions that caused the failure. We can find the exception messages in the Spark driver or executor logs. Useful information is also logged in the Spark UI. At times the same error message can mean different things…
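A minimal sketch of pulling the exception class out of a driver log line (the log text below is an invented sample, not quoted from the article):

```python
import re

# Invented sample line in the style of a Spark executor log.
line = ("22/03/12 10:15:01 ERROR Executor: Exception in task 3.0 "
        "java.lang.OutOfMemoryError: GC overhead limit exceeded")

# Match a fully qualified Java class name ending in Error or Exception.
m = re.search(r"(java(?:\.\w+)+(?:Error|Exception))", line)
exc = m.group(1) if m else None
```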

Spark

6 min read



Published in Dev Genius · Mar 11

Spark — Spill

A side effect. Spark does data processing in memory, but not everything fits in memory. When the data in a partition is too large to fit in memory, it gets written to disk. Spark does this to free up RAM for the remaining tasks within the job. It…
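A back-of-the-envelope sketch of the memory pool involved, assuming the default spark.memory.fraction of 0.6 and Spark's fixed 300 MB reserved memory (the 4 GB heap is an arbitrary example): when a task's working set exceeds its share of this unified pool, the partition spills to disk.

```python
heap_mb = 4096          # executor heap, i.e. spark.executor.memory
reserved_mb = 300       # fixed reserved memory
memory_fraction = 0.6   # spark.memory.fraction default

# Unified execution + storage pool available before spilling kicks in.
unified_pool_mb = (heap_mb - reserved_mb) * memory_fraction  # ~2277.6 MB
```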

Apache Spark

2 min read



Published in Dev Genius · Mar 10

Shuffle in Spark

Data rearrangement across partitions. Shuffle is the process of redistributing data between partitions for operations where data needs to be grouped or seen as a whole. A shuffle happens whenever there is a wide transformation. In the Spark DAG (operator graph), two stages are separated by shuffle boundaries. …
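A toy illustration of how a hash shuffle assigns keys to reduce-side partitions. Spark's real HashPartitioner uses the JVM hashCode; Python's built-in hash stands in here, and the keys are invented examples.

```python
num_partitions = 4
keys = ["user_a", "user_b", "user_a", "user_c"]

# Each key is routed to partition hash(key) % num_partitions.
assignment = {k: hash(k) % num_partitions for k in keys}

# Identical keys always land in the same partition, which is what lets
# a wide transformation (e.g. a groupBy) see each group as a whole.
```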

Spark

4 min read



Published in Dev Genius · Mar 8

Spark partitioning

Controlling the number of partitions in Spark for parallelism. A partition in Spark is a logical chunk of data mapped to a single node in a cluster. Partitions are the basic units of parallelism. Each partition is processed by a single task slot. In a multicore system, the total slots for tasks…
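A rough sketch of how many input partitions Spark creates when reading a file, assuming the default spark.sql.files.maxPartitionBytes of 128 MB (the 1 GB file size is an arbitrary example; the real planner also folds in openCostInBytes and parallelism):

```python
import math

file_size_mb = 1024        # a 1 GB input file
max_partition_mb = 128     # spark.sql.files.maxPartitionBytes default

# Each partition holds at most max_partition_mb of input.
num_partitions = math.ceil(file_size_mb / max_partition_mb)  # 8
```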

Spark

5 min read



Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML

