5 Trends in the Data Lakehouse Space
September 01, 2024
- Free Copy of Apache Iceberg: The Definitive Guide
- Free Apache Iceberg Crash Course
- Free Copy of Apache Iceberg: The Definitive Guide
- Free Apache Iceberg Crash Course
The data lakehouse is emerging and evolving as the next iteration of analytical data architecture. It builds on previous approaches by integrating the data lake and data warehouse, which were traditionally separate, and reimagining the tightly coupled components of a data warehouse (storage, table format, catalog, processing) into a modular, deconstructed form. In this new architecture, the data lake is the storage layer, allowing you to stitch together diverse, interoperable components modularly. Let’s explore the current trends in the lakehouse space to see where we stand today.
1. Storage Vendors Becoming Data Warehouses
Before data lakehouses became the dominant trend, the industry was marked by a rush to the cloud, driven by the elastic scalability offered by vendors like AWS, Azure, and Google Cloud. Each of these vendors had its own data warehouse solution (Redshift, Synapse, and BigQuery, respectively). In the early days of the lakehouse, these vendors enhanced their object storage products—already used in data lakes to store both structured and unstructured data—to better support lakehouse use cases.
However, as regulations around the globe and the rising costs of elastic cloud infrastructure at scale prompted a reevaluation of on-premises and hybrid scenarios, a new trend emerged. Object storage vendors like Minio, Vast Data, NetApp, and Pure Storage began to pair their existing products with software and hardware innovations to become full-fledged data analytics platforms. The missing piece was a robust processing layer that could operate both in the cloud and on-premises, a gap that Dremio, the Lakehouse Platform, filled, enabling these products to offer comprehensive data lakehouse solutions in the cloud, on-premises, and in hybrid environments, democratizing the market.
This competition is beneficial for consumers, as it fosters innovation. The data lakehouse shift is driving significant advancements in how data is stored, retrieved, and processed.
2. The Evolution of Table Formats
Table formats are crucial to ensuring interoperability within the data lakehouse ecosystem. Interoperability hinges on tools supporting various metadata standards and the unique features each format offers. Initially, three primary formats emerged: Apache Iceberg, Delta Lake, and Apache Hudi. Recently, Apache Paimon, born out of the Apache Flink project, has entered the scene.
Over time, Apache Iceberg has solidified its position as the industry “default” for consuming large datasets at scale with maximum interoperability, thanks to its extensive ecosystem. This status has been reinforced by the support and strategic moves from major players like Snowflake and Databricks. While Delta Lake maintains a popular option due to its deep integration with Databricks and its comprehensive Python support—an area where PyIceberg is rapidly catching up. Iceberg and Delta Lake have emerged as the primary choices for data consumption. Iceberg stands out for its rich ecosystem, robust SQL support, and unique partitioning features, while Delta Lake is also used for its seamless Databricks integration and Python support.
Apache Hudi has carved out a niche in the low-latency streaming ingestion space, particularly for scenarios involving frequent updates and deletes. Apache Paimon addresses a similar use case in streaming ingestion, so it will be interesting to see how Hudi and Paimon coexist in this segment of the market.
The distinctions between these formats are becoming increasingly blurred as tools like Apache XTable (incubating) enable conversion between formats, and Delta Lake’s UniForm feature allows for limited co-location of Iceberg and Hudi metadata with tables that are natively Delta Lake.
3. Iceberg Table Management
One of the unique features of Apache Iceberg is its independence from any major vendor control, which sets it apart from other table formats. While many tools can read and write to both Apache Iceberg and Delta Lake, the Iceberg ecosystem also extends to table management, offering a wide variety of options for optimizing the performance and storage of Iceberg tables. Previously, Tabular was the main player in this space, but after its acquisition by Databricks, it ceased taking on new customers. This shift has created an opportunity for other solutions from companies like Dremio, Upsolver, AWS, and Snowflake to step in and provide enterprise-grade solutions for automating the management of Apache Iceberg lakehouses.
4. The Catalog Wars Have Begun
As the focus on table formats has settled, a new battle has emerged: the catalog wars. Table Formats are crucial for recognizing a group of Apache Parquet files as a single table, complete with statistics to enable efficient querying, and catalogs are key to tools discovering them. Catalogs play a key role in listing available tables on your lakehouse, making them essential for tools like Dremio and Apache Spark to discover and query your data. Beyond this, catalogs could solve another major issue: the portability of governance. Traditionally, data governance involved securing files at the storage layer and separately governing tables across different tools and engines, as table formats lack inherent security mechanisms. Catalogs can potentially centralize governance rules that can be enforced across multiple tools.
In this space, attention is focused on solutions like Nessie, Apache Polaris (incubating), Gravitino, and Unity Catalog, as they address this governance challenge. Nessie, in particular, offers unique value with its “git-for-data” approach, allowing for git-like versioning semantics across multiple tables. This enables multi-table transactions, rollbacks, and easy isolation of data workloads for experimentation.
5. The Data Lakehouse Platform
The Lakehouse architecture is composed of various deconstructed components that come together to create data warehouse-like functionality. Any modular system like this creates a demand for a platform that can integrate all these pieces into a unified, user-friendly experience while retaining as much modularity as possible. As data platforms transition to the lakehouse model, they are evolving their feature sets to become comprehensive lakehouse platforms. Currently, Dremio stands out as a leading platform in this space, offering the following features:
- Compatibility with your choice of storage layer, whether in the cloud or on-premises
- Support for multiple table formats (Read/Write/Manage support for Iceberg, Read support for Delta Lake)
- Integrated Apache Iceberg Catalog with Automated Table Management, Role-Based Access Control (RBAC), and Git-For-Data features
- A catalog that supports connections from any tools compatible with Nessie, allowing you to use different engines alongside Dremio
- Support for working with various other Apache Iceberg catalogs
- The ability to enrich your Lakehouse data with data from databases, data warehouses, and file-based datasets on your data lake.
Dremio provides a fast, open, and easy-to-use lakehouse platform that allows you to leverage your existing databases, data lakes, and data warehouses in the cloud or on-prem from a single interface. It also offers flexible deployment options in the cloud or on-premises. With its flexibility and data virtualization capabilities, Dremio is well-positioned to serve as the gateway to the Lakehouse.
Conclusion
The evolution of the data lakehouse marks a significant shift in how we approach analytical data architecture. By merging the strengths of data lakes and data warehouses, the lakehouse model offers a modular, interoperable framework that can adapt to diverse business needs. As we’ve seen, storage vendors are expanding into full-fledged data lakehouses, table formats like Apache Iceberg and Delta Lake are maturing, and the landscape of table management and catalog solutions is rapidly evolving.
These trends indicate that the data lakehouse is not just a passing phase but a robust and flexible architecture that will continue to grow and influence the future of data management. With platforms like Dremio leading the way, offering comprehensive solutions that integrate storage, table management, and cataloging, the data lakehouse is well on its way to becoming the standard for modern data architectures. As the ecosystem around the lakehouse continues to innovate, businesses will find new opportunities to optimize their data strategies, ensuring that they can scale, adapt, and thrive in an increasingly data-driven world.