Apache DataFusion — Modern query engine for performance
LLVM for modern data products
Modern data products are, increasingly, being built on robust frameworks that improve data analytics throughput. One such framework is Apache DataFusion. It is a high-performance query engine that leverages the Apache Arrow in-memory format. In this blog, we will explore some of the open-source data products that utilize DataFusion. Examples are from different categories (unique business challenges) and how DataFusion solves them.
Understanding DataFusion
Apache Arrow DataFusion is a high-performance, extensible query execution framework written in Rust. It provides a modern, in-memory, distributed SQL engine that powers efficient analytical workloads. Its primary goal is to process large datasets stored in columnar storage formats like Apache Parquet. Unlike traditional systems, DataFusion leverages modern hardware capabilities such as SIMD (Single Instruction, Multiple Data) for faster computations.
Technical details for DataFusion
- Built with Rust, DataFusion ensures memory safety without garbage collection overhead, making it ideal for high-performance applications. It also enables thread safety for concurrent query execution.DataFusion leverages Apache Arrow’s in-memory columnar format, which enhances CPU cache efficiency and speeds up query execution.
- DataFusion processes data in columnar batches, which aligns well with modern storage formats like Parquet and ORC.
- Columnar execution minimizes memory usage and speeds up operations by reducing I/O overhead.
- It uses SIMD instructions to perform operations on multiple data points simultaneously, significantly boosting performance for aggregation and filtering tasks.
- Zero-copy data interchange between components.
- Its TableProvider trait allows seamless integration with various data formats like CSV, JSON, AVRO, etc.
- DataFusion can be extended using UDF, UDAF, TableProvider, CatalogProvider or by creating custom data sources.
- Easy to embed and customize.
By utilizing DataFusion, developers can focus on building unique features without having to reinvent the wheel regarding query execution and optimization.
SQL & DataFrame API, full query planner and a multi-threaded execution engine
Optimizations done by DataFusion Query engine:
- LogicalPlan(DataFlow Graph) rewrites like projection pushdown, filter pushdown, limit pushdown, expression simplification, common subexpression elimination, join predicate extraction, correlated subquery flattening, and outer-to-inner join conversion.
- ExecutionPlan rewrites like eliminating unnecessary sorts, maximizing parallel execution, and determining specific algorithms such as Hash or Merge joins.
- Async execution engine to avoid blocking I/O
- Each operator will produce partitions in parallel
- Heavily optimized multicolumn sorting implementation.
- Two-phase parallel partitioned hash grouping implementation with vectorized execution.
- Reorder Joins, Predicate pushdown through join, transitive join predicate, sin-memory hash joins
- Normalized Sort Keys / RowFormat
- Leveraging Pre-existing Sort Order & Streams for partially sorted inputs
- Incremental Window Functions evaluation
- Pushdown and Late Materialization
Open-Source Data Products Built on DataFusion
Distributed SQL Engine
- DataFusion Ballista leverages DataFusion’s capabilities to execute queries in parallel across distributed environments and hence allows it to scale horizontally across multiple nodes. Alternative to Apache Spark. It is a full-fledged runtime for DataFusion on top of Apache Arrow.
- Dask SQL uses DataFusion’s Python bindings for SQL parsing, query planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
- DataFusion Ray is another distributed query engine that uses DataFusion’s Python bindings.
Cube.js: The Semantic Layer
Cube uses Apache DataFusion to analyze incoming SQL queries and find the best query plan out of a wide variety of possible plans to execute. It also uses DataFusion for post-processing any regular query.
DataFusion Comet (Spark Accelarator)
Apache DataFusion Comet is an Apache Spark plugin that uses Apache DataFusion as a native runtime to achieve improvement in terms of query efficiency and query runtime.
Blaze (Spark + Native engine)
Blaze accelerates query processing for Apache Spark by leveraging DataFusion’s native vectorized execution engine. It focuses on improving performance for Spark-based analytical workloads.
Arroyo (Streaming Solution)
Arroyo uses DataFusion to power its SQL engine, particularly for parsing SQL queries and generating execution plans.
InfluxDB (Time Series Database)
Query processing in InfluxDB 3.0 uses DataFusion to query data via three different language frontends (an internal gRPC API, SQL, and InfluxQL) as well as write and merge Parquet files (“reorg”).
CnosDB (Time series DB for IoT)
CnosDB uses DataFusion for efficient querying and analytics on large-scale time-stamped data, making it ideal for IoT applications.
OpenObserve
An observability platform offering real-time monitoring, alerting, and dashboards. OpenObserve uses DataFusion to query Parquet files in object storage, reducing costs and simplifying deployment compared to traditional observability tools like Splunk or DataDog.
LanceDB (Vector Embedding store)
LanceDB uses DataFusion to enable SQL querying across its custom columnar data format, Lance, providing fast retrievals and efficient data versioning.
SpiceAI (AI Agent Platform)
SpiceAI integrates DataFusion for query federation across data lakes and warehouses. It supports semantic data models and caching to reduce latency in LLM responses.
Open Lake (Rust Implementation of Delta & Iceberg)
Both Databricks and Apache Iceberg use Apache Datafusion as their backend data processing engine for their respective rust-based tooling implementation (Delta-Rust, Iceberg-Rust)
Databend (MPP OLAP similar to Snowflake)
Databend uses DataFusion as its query engine to optimize performance across cloud-native environments.
Vectorized Log Observability (VectorDB)
VectorDB provides structured logging and observability solutions by integrating DataFusion for efficient querying of log data at scale, suitable for debugging distributed systems.
Conclusion
Apache DataFusion has matured beyond early adopters and is now a viable choice for many for building highly performant analytic systems.