Published in Dev Genius · Nov 20 · Member-only
Spark Operator — Basics
Managing Spark Jobs as a K8s object
Spark Operator allows seamless integration between Apache Spark and Kubernetes. It follows the K8s operator pattern to manage the lifecycle of Spark applications. When using this, a Spark application is declared using YAML files. Components & Architecture: The Kubernetes Operator for Apache Spark comprises several key…
Apache Spark · 8 min read

Published in Dev Genius · Nov 12 · Member-only
Spark Interview Question — XI
Next installment of the interview series
Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X
What is Arrow & how does it improve Python UDFs in Spark? Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing. Before Arrow…
Apache Spark · 4 min read

Published in Dev Genius · Nov 11
Advance Data Structures for Data Engineering — Part II
Probabilistic data structures used in big data
Read Part I here.
1. HyperLogLog: Counting unique items usually requires an amount of memory proportional to the number of items we want to count, because we need to remember the elements we have already seen in order to avoid counting them…
Data Engineering · 4 min read

Nov 6
Data Engineering Best Practices
Compilation of some good practices for DE
Data engineering is a critical aspect of any data-driven organization. Building robust and efficient data pipelines requires a combination of expertise, well-defined processes, and best practices. …
Data Engineering · 5 min read

Published in Dev Genius · Nov 2 · Member-only
Spark Interview Questions — X
Next part of the interview series
Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X
What aggregate strategies does Spark provide, and how does it choose one? (Hash vs Sort aggregate) The Sort Aggregate requires the rows to be sorted by the grouping key so that…
Spark · 4 min read

Published in Dev Genius · Oct 24
Spark Threat Modelling
Security considerations for a Spark cluster & its components
The Spark ecosystem is an integral part of most analytics workloads at big companies. Although it is generally deployed in private network zones, its security sanitization is still needed to avoid any data breach. …
Apache Spark · 4 min read

Published in Dev Genius · Oct 24
Spark Structured Streaming
Process streaming data using the Spark DataFrame API
Stream processing can mean different things to different people. Some look at it through the lens of real-time vs scheduled processing and associate it with latency. Some people link it to continuous, ongoing processing of records. While a majority of people link streaming to…
Apache Spark · 7 min read

Published in Dev Genius · Oct 24 · Member-only
Spark Interview Questions — IX
Next blog in the Spark interview series.
Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X
What are the different types of window operations available in Spark Streaming? There are four different types of window operations available in Spark Streaming: 1. Tumbling…
Apache Spark · 5 min read

Published in Dev Genius · Oct 16
Spark Interview Questions — VIII
Another part of the Spark interview series.
Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X
What is the difference between Select vs SelectExpr in Spark? selectExpr() is a powerful method for column selection and transformation when you need to…
Apache Spark · 6 min read

Published in Dev Genius · Oct 8
Spark Interview Questions — VII
The next part of the series.
Part I | Part II | Part III | Part IV | Part V | Part VI | Part VII | Part VIII | Part IX | Part X
What happens when we give join hints on both sides of a join? When hints are specified on both sides of the join, Spark selects the hint…
Spark · 4 min read