Amit Singh Rathore

1.6K Followers


Published in Dev Genius

·3 days ago

Pyspark UDF

Expressive & reusable functions. User Defined Functions (UDFs) in Apache Spark extend the functionality of Spark and Spark SQL with custom logic. They are similar to functions in SQL: we define some logic in a function, store it in the database, and use it in queries. UDFs is that…

Spark

4 min read
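
To make the UDF teaser above concrete, here is a minimal sketch of the idea, not code from the article. The function `title_case`, the helper `register_title_case`, and the SQL name `"title_case"` are hypothetical names chosen for illustration. The custom logic is a plain Python function, and the PySpark import is deferred so that the logic can be exercised without a Spark installation.

```python
def title_case(name):
    """Custom logic: capitalise each word; pass None (SQL NULL) through."""
    if name is None:
        return None
    return " ".join(word.capitalize() for word in name.split())


def register_title_case(spark):
    """Wrap title_case as a UDF and register it for Spark SQL.

    Assumes `spark` is an existing SparkSession and pyspark is installed.
    """
    # Imported lazily so the plain function above works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # UDF object for use with the DataFrame API.
    title_case_udf = udf(title_case, StringType())
    # Same logic registered under a name that SQL queries can call.
    spark.udf.register("title_case", title_case, StringType())
    return title_case_udf
```

With a session in hand, the returned object could be used as `df.select(title_case_udf("name"))`, and the registered name in a query such as `SELECT title_case(name) FROM people`, where `people` and `name` are assumed table and column names.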



Published in Dev Genius

·3 days ago

Advance Data Structures for Data Engineering — Part I

Some data structures used by tools in the field of data engineering. Directed Acyclic Graph: a DAG is a specific type of graph. It can be seen as a graphical representation of causal effects, i.e. a node in a DAG is the result of an action/relation of its predecessor node…

Data

4 min read
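
As a rough illustration of the DAG idea in the teaser above (each node follows from its predecessors), the sketch below uses Python's standard `graphlib` module (Python 3.9+). The task names and the `has_cycle` helper are hypothetical and not taken from the article.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# A topological order runs every task after all of its predecessors.
order = list(TopologicalSorter(pipeline).static_order())


def has_cycle(graph):
    """Return True if the dependency graph is not acyclic."""
    try:
        # static_order() is lazy, so consume it to trigger the cycle check.
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
```

Schedulers that model work as a DAG depend on this property: an order in which every task can run exists only when the graph has no cycle.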



Published in Dev Genius

·3 days ago

Python Packages for Data Engineers

Common packages used in Python while working as a data engineer. Python is a versatile language when it comes to the data engineering field. Its modular approach allows us to have a specific codebase handling a specific type of work. I have worked as a data engineer for the last 5…

Python

3 min read



Published in Dev Genius

·3 days ago

25 signs of an experienced Python developer

Practices that make a good Python developer. Using a virtual environment: this shows that you isolate environments to avoid issues with dependencies and package versions. Also, if we stick to requirements & constraints files, it shows that we care about how the app should be run in another place. pip…

Python

4 min read



Published in Dev Genius

·5 days ago

Spark SQL Query Plan

Digging into query planning. Spark runs commands in distributed mode, where the code is sent to executors (closer to the data). This raises the question of how a SQL statement is converted to code and sent to executors. In this blog, we will see how Spark executes SQL. In the…

Spark

3 min read



Published in Dev Genius

·6 days ago

Running TPC-DS benchmarks for Spark

Document performance gains between Spark versions. TPC-DS (Transaction Processing Performance Council, Decision Support) is a leading benchmark for OLAP systems. Its data is modeled as a data warehouse. The TPC-DS schema is a snowflake schema. It consists of multiple dimension and fact tables. Each dimension has a single-column surrogate…

Spark

7 min read



Published in Dev Genius

·Jan 27

Hive Metastore

A bridge between files and exposed tables. The Hive Metastore service (HMS) is responsible for managing and persisting metadata, which helps us create a tabular representation of files. HMS uses a relational database to store this metadata. In embedded mode, it uses a Derby database to store the metadata. …

Apache Spark

4 min read



Published in Dev Genius

·Jan 19

Datahub — An introduction

Most trusted open-source data catalog. Modern data teams have more data & personas. The data landscape is complex. With so much data & complexity around tooling, it is very hard to find the right data. It’s even harder to trust what is found. This created the need to have a new…

Data

7 min read



Published in Dev Genius

·Jan 15

Data Build Tool (dbt)

Transformation in the modern data stack. dbt is a development framework that helps us transform raw data into meaningful data objects such as database tables, views, etc., using simple SELECT queries. With dbt we can develop, test, document, and deploy the entire data transformation pipeline as code. …

Open Source

7 min read



Published in Dev Genius

·Jan 14

Apache Superset — Intro

Telling the story of your data through visualization. Apache Superset is an open-source data exploration and visualization platform designed to be visual, intuitive, and interactive. It enables users to analyze data using its SQL editor, and easily build charts and dashboards. Superset allows us to do the following: data visualization, data exploration, and data analysis. Features: open source, lightweight …

Open Source

3 min read



Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML
