Iceberg + Dremio = Open Lakehouse
Why should we try this?
A lakehouse aims to combine the best of a data lake (handling structured and unstructured data at low storage cost) with the best of a data warehouse (good query performance and ACID guarantees).
A Data Lakehouse has 5 key components:
- Storage Layer (S3, HDFS)
- File Format (Parquet, ORC)
- Table Format (Iceberg, Delta Lake)
- Catalog (Hive, Nessie, Unity)
- Lakehouse Engine (Spark, Dremio)
In this blog, we will look at the major features of Iceberg and Dremio; together they can form the foundation of a lakehouse platform.
Iceberg
Apache Iceberg is an open-source table format designed for data lakehouse architectures. It provides an abstraction layer over the data lake, allowing us to leverage database-like features on top of it.
An Apache Iceberg table follows a three-layer architecture:
- Iceberg Catalog — Maps table names to metadata locations and must support atomic operations to update the referenced pointer. Catalog options include Hive, Nessie, Glue, LakeFS, REST, and DynamoDB.
- Metadata Layer (metadata files, manifest lists, and manifest files) — Stores information about the table schema, the partitioning configuration, and so on.
- Data Layer — Holds the raw data files.
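As a concrete illustration, here is a minimal sketch using the PyIceberg library to walk these three layers; the catalog name "default" and the table name db.events are placeholders, not a prescribed setup.

```python
# Minimal sketch with PyIceberg (pip install pyiceberg); names are placeholders.
from pyiceberg.catalog import load_catalog

# Catalog layer: resolves the table name to its current metadata file.
catalog = load_catalog("default")        # reads connection config, e.g. ~/.pyiceberg.yaml
table = catalog.load_table("db.events")  # hypothetical table

# Metadata layer: schema, partition spec, and snapshot history.
print(table.metadata_location)           # e.g. s3://bucket/events/metadata/00002-....metadata.json
print(table.schema())
print(table.spec())
for snap in table.metadata.snapshots:
    print(snap.snapshot_id, snap.manifest_list)

# Data layer: the raw Parquet/ORC files referenced by the manifests.
for task in table.scan().plan_files():
    print(task.file.file_path)
```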
The write operation
- The writer visits the catalog to get the current metadata file location.
- The writer reads the metadata file to understand the table’s current schema and partition scheme in preparation for writing the data.
- The writer writes new data files following the partition scheme.
- The writer creates manifest files in Avro format. A manifest file contains the data file location plus the file’s statistics.
- The writer creates the manifest list to keep track of the manifest files. This file contains the manifest files’ location, the number of data files/rows added or deleted, the lower and upper bounds of the partition columns, etc.
- The writer writes the new metadata file with the latest snapshots and all previous snapshots. This file includes the table base location, manifest list location, snapshot ID, sequence number, updated timestamp, etc. The writer also marks the newly created snapshot as the current snapshot.
- The writer updates the catalog’s current pointer to point to the newly created metadata file.
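The sketch below shows one way these steps are triggered in practice, assuming a PySpark session configured with the Iceberg runtime and a catalog named lake (all names are illustrative):

```python
# Hedged PySpark sketch: assumes the Iceberg Spark runtime is on the
# classpath and a catalog named "lake" is configured for the session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write").getOrCreate()

# Steps 1-2: Spark asks the catalog for the current metadata file of
# lake.db.events and reads the schema and partition spec from it.
# Steps 3-6: the INSERT writes data files, manifest files, a manifest list,
# and a new metadata file with the new snapshot marked as current.
# Step 7: the commit atomically swings the catalog pointer to that file.
spark.sql("""
    INSERT INTO lake.db.events
    VALUES (1, 'click', TIMESTAMP '2024-01-15 10:00:00')
""")
```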
The read operation
- The reader first visits the catalog to find the table’s current metadata file location.
- The reader collects the table’s schema to check how the data is organized.
- The reader retrieves the current snapshot. An older snapshot can be chosen for a time-travel query.
- The reader will locate the manifest list associated with that snapshot.
- The reader reads the manifest list to locate the manifest files. It also collects the lower and upper bound values of the partition columns for each manifest file, so it can apply partition filters to prune unnecessary manifest files.
- The reader opens each manifest file. A single manifest file contains information about all the data files it tracks.
- If applicable, the reader applies partition pruning using the lower/upper bound partition values of each entry to skip unneeded data files.
- The reader reads the data files using the file paths collected from the manifest files.
- The result is returned to the client.
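A matching read-side sketch (same assumed Iceberg-enabled session, illustrative table and columns) that exercises this pruning:

```python
# The engine resolves the current snapshot through the catalog, reads its
# manifest list, and uses the stored partition bounds to skip manifests and
# data files that cannot contain rows from the requested day.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session
spark.sql("""
    SELECT event_type, count(*) AS events
    FROM lake.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-15 00:00:00'
      AND event_ts <  TIMESTAMP '2024-01-16 00:00:00'
    GROUP BY event_type
""").show()
```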
The following features of Iceberg make a compelling case for it as the table format for the lakehouse.
Partition & Schema Evolution
Apache Iceberg’s partition evolution allows users to modify a table’s partitioning scheme at any time without rewriting the entire table. It also keeps such changes easy to revert: rolling back to a previous snapshot restores the earlier partitioning scheme. This flexibility is a considerable advantage when managing large-scale data efficiently.
Schema evolution gives us the ability to add, drop, rename, reorder, and update columns seamlessly, without unforeseen side effects and without rewriting the table in full.
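Both evolutions are metadata-only DDL statements. A hedged Spark SQL sketch (Iceberg SQL extensions enabled; table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session

# Partition evolution: move from daily to hourly partitioning. Existing
# files keep the old spec; only new writes use the new one.
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD hours(event_ts)")

# Schema evolution: add and rename columns without rewriting data files.
spark.sql("ALTER TABLE lake.db.events ADD COLUMNS (country STRING)")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN event_type TO action")
```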
Hidden Partitioning
Apache Iceberg introduces a feature called hidden partitioning: tables are partitioned on the transformed value of a column, with the transformation tracked in the metadata, so no physical partitioning column is needed in the data files. We can filter directly on the original column and still benefit from partitioning, as Iceberg automatically skips unnecessary files to speed up query execution.
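A hedged sketch of how such a table might be declared and queried (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session

# The table is partitioned by a transform of event_ts; no derived
# partition column appears in the schema or in the data files.
spark.sql("""
    CREATE TABLE lake.db.events (
        id BIGINT,
        event_type STRING,
        event_ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Filters on the raw column are mapped onto the days(event_ts) transform,
# so partition pruning happens without a special partition-column predicate.
spark.sql("""
    SELECT * FROM lake.db.events
    WHERE event_ts > TIMESTAMP '2024-01-15 00:00:00'
""")
```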
Versioning & Time Travel
Versioning is an invaluable feature that facilitates isolating changes, executing rollbacks, simultaneously publishing numerous changes across different objects, and creating zero-copy environments for experimentation and development. While each table format records a single chain of changes, allowing for rollbacks, Apache Iceberg uniquely incorporates branching, tagging, and merging as integral aspects of its core table format.
Iceberg supports Nessie, an open-source project that extends these versioning capabilities to commits, branches, tags, and merges at the multi-table catalog level.
These advanced versioning features are accessible through ergonomic SQL interfaces, making them user-friendly and easy to integrate into data workflows. They also enable time travel: we can quickly switch between table versions, compare changes, and roll back to a previous version if errors are introduced.
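Hedged Spark SQL sketches of these capabilities (snapshot IDs, branch names, and timestamps below are placeholders; branch DDL availability depends on the Iceberg version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session

# Time travel: query the table as of a timestamp or a specific snapshot.
spark.sql("SELECT * FROM lake.db.events TIMESTAMP AS OF '2024-01-15 00:00:00'")
spark.sql("SELECT * FROM lake.db.events VERSION AS OF 8744736658442914487")

# Branching: create an isolated line of snapshots for experimentation.
spark.sql("ALTER TABLE lake.db.events CREATE BRANCH etl_test")

# Rollback: restore a previous snapshot if a bad write was committed
# (requires Iceberg's stored procedures to be available in the catalog).
spark.sql("CALL lake.system.rollback_to_snapshot('db.events', 8744736658442914487)")
```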
Lakehouse Management
Apache Iceberg supports table maintenance operations such as compaction, sorting, and snapshot cleanup. Many vendors, including Dremio, Tabular, Upsolver, AWS, and Snowflake, support Iceberg, each providing varying levels of table management features.
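In Spark, these maintenance tasks are exposed as stored procedures; a hedged sketch (table name and thresholds are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session

# Compaction: rewrite many small files into larger, sorted ones.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_ts ASC NULLS LAST')
""")

# Snapshot cleanup: expire old snapshots and reclaim unreferenced files.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00')
""")
```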
Snapshot Isolation
Iceberg provides snapshot isolation for data operations, enabling consistent and repeatable reads. This feature allows for concurrent reads and writes without impacting the integrity of the data being analyzed.
Incremental Updates
Iceberg supports atomic and incremental updates, deletions, and upserts, facilitating more dynamic and fine-grained data management strategies. This capability is crucial for real-time analytics.
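For example, an upsert can be expressed as a single MERGE statement (assuming an updates table or view with matching columns; names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed Iceberg-enabled session

# Upsert: matched rows are updated, unmatched rows inserted; the whole
# statement commits as one atomic snapshot.
spark.sql("""
    MERGE INTO lake.db.events t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.event_type = u.event_type
    WHEN NOT MATCHED THEN INSERT *
""")
```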
Open source
Apache Iceberg is open source. The ecosystem is expanding daily, with an increasing number of tools offering both read and write support, reflecting growing enthusiasm among vendors.
Dremio
Dremio is an attempt at building a data lakehouse platform that consolidates numerous functionalities, typically offered by separate vendors, into a single solution:
- Data virtualization
- Semantic layer for data integration & transformation
- Unified SQL query engine for diverse data sources
- Support for the Nessie data catalog
The following features of Dremio differentiate it from other offerings.
Apache Arrow
Dremio’s SQL query engine is based on Apache Arrow, an in-memory columnar data format increasingly recognized as the de facto standard for analytical processing. Arrow’s transport protocol, Apache Arrow Flight, significantly reduces serialization/deserialization bottlenecks within distributed systems.
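A minimal sketch of fetching query results from Dremio over Arrow Flight with pyarrow; the endpoint and credentials are placeholders (Dremio’s Flight port is conventionally 32010):

```python
# Results stream back as Arrow record batches, avoiding row-oriented
# serialization/deserialization on the wire.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")  # placeholder endpoint
token = client.authenticate_basic_token("user", "password")   # placeholder credentials
options = flight.FlightCallOptions(headers=[token])

info = client.get_flight_info(
    flight.FlightDescriptor.for_command("SELECT 1 AS probe"), options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())
```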
Columnar Cloud Cache
Dremio implements a Columnar Cloud Cache (C3), which caches frequently accessed data on the NVMe storage of nodes within the Dremio cluster. This caching speeds up access to data during subsequent query executions that need the same information.
Reflections
Dremio has a concept of Reflections, which simplifies query acceleration. Reflections are pre-computed results of datasets stored in an optimized format. A Reflection can be defined on any table or view within Dremio, materializing either the raw rows or aggregate calculations over the dataset.
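A hedged sketch of defining an aggregation reflection through Dremio’s SQL DDL, submitted over Arrow Flight as in the previous example; dataset, reflection, and column names are illustrative, and the exact DDL can vary across Dremio versions:

```python
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")  # placeholder endpoint
options = flight.FlightCallOptions(
    headers=[client.authenticate_basic_token("user", "password")])

# Illustrative reflection DDL; verify against the docs for your Dremio version.
ddl = """
    ALTER DATASET lake.db.events
    CREATE AGGREGATE REFLECTION events_by_type
    USING DIMENSIONS (event_type)
    MEASURES (id (COUNT))
"""
info = client.get_flight_info(flight.FlightDescriptor.for_command(ddl), options)
client.do_get(info.endpoints[0].ticket, options).read_all()
```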
Semantic Layer
This feature allows users to organize and document data from all sources into a single, coherent layer, facilitating data discovery. Whether our data is in cloud storage like AWS S3, Azure Data Lake, on-premises in HDFS, or in various SQL and NoSQL databases, Dremio can connect to these sources.
Dremio enables robust data governance through role-based, column-based, and row-based access controls, ensuring users can only access the data they are permitted to view. It also offers features like data masking, where sensitive information can be hidden from certain users, and encryption, ensuring data is securely stored and transmitted.
Hybrid Architecture
Dremio can access on-premises data sources in addition to cloud data, allowing it to unify on-premises and cloud data behind a single interface.
In summary, Apache Iceberg and Dremio together offer a robust foundation for data management, enabling collaborative analytics, wide accessibility, and better efficiency.