Smarter Data Layout — Sorting and Clustering Iceberg Tables

August 05, 2025

So far in this series, we’ve focused on optimizing file sizes to reduce metadata and scan overhead. But how data is laid out within those files can be just as important as the size of the files themselves.

In this post, we’ll explore clustering techniques in Apache Iceberg, including sort order and Z-ordering, and how these techniques improve query performance by reducing the amount of data that needs to be read.

Why Clustering Matters

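Imagine a query that filters on customer_id. If your data is randomly distributed, each file's value range covers almost every customer, so every file needs to be scanned. But if the data is sorted or clustered, each file covers a narrow range of values, and the engine can use per-file and per-row-group min/max statistics to skip entire files or row groups, reducing I/O and speeding up execution.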

Clustering benefits:

Fewer files and rows scanned
Better compression ratios
Faster joins and aggregations
More effective min/max pruning of data files and row groups
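
To see why layout matters for skipping, here is a small, self-contained Java sketch. It is a toy model, not Iceberg code: FileStats, writeFiles, and filesToScan are hypothetical names, and each "file" is just a fixed-size slice of an array with its min and max recorded, standing in for the column bounds an engine reads from table metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Toy model of min/max file skipping. Illustrative only, not Iceberg code. */
final class FileSkippingSketch {

    /** Hypothetical stand-in for the per-file column bounds kept in table metadata. */
    static final class FileStats {
        final long min;
        final long max;

        FileStats(long min, long max) {
            this.min = min;
            this.max = max;
        }

        /** True if the stats alone cannot rule out that the file holds this value. */
        boolean mayContain(long value) {
            return min <= value && value <= max;
        }
    }

    /** Splits values into consecutive "files" of rowsPerFile rows and records min/max for each. */
    static List<FileStats> writeFiles(long[] values, int rowsPerFile) {
        List<FileStats> files = new ArrayList<>();
        for (int start = 0; start < values.length; start += rowsPerFile) {
            int end = Math.min(start + rowsPerFile, values.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            files.add(new FileStats(min, max));
        }
        return files;
    }

    /** Counts the files an equality filter still has to open. */
    static long filesToScan(List<FileStats> files, long value) {
        return files.stream().filter(f -> f.mayContain(value)).count();
    }

    public static void main(String[] args) {
        // 10,000 made-up customer ids in [0, 1000), written 1,000 rows per file.
        long[] customerIds = new Random(42).longs(10_000, 0, 1_000).toArray();
        long[] sorted = customerIds.clone();
        Arrays.sort(sorted);

        System.out.println("unsorted: " + filesToScan(writeFiles(customerIds, 1_000), 500L));
        System.out.println("sorted:   " + filesToScan(writeFiles(sorted, 1_000), 500L));
    }
}
```

With random input like this, the unsorted layout usually leaves every file as a candidate, because each file's range spans almost the whole key space, while the sorted layout narrows the lookup to one or two files.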

Sorting in Iceberg

Iceberg supports sort order evolution, which lets you define how data should be physically sorted as it’s written or rewritten.

You can set a table-level sort order, which engines that honor it apply during writes and compaction:

// `table` is an org.apache.iceberg.Table already loaded from your catalog.
// replaceSortOrder() swaps in a new default sort order for the table.
table.replaceSortOrder()
  .asc("customer_id")
  .desc("order_date")
  .commit();
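
If it helps to picture what that order means for the rows themselves, the following plain-Java sketch applies the same rule (customer_id ascending, then order_date descending within each customer) to an in-memory list. The Order class and the sample values are made up for illustration and have nothing to do with Iceberg's own classes.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrates the row order implied by asc(customer_id), desc(order_date). */
final class SortOrderSketch {

    /** Hypothetical row type, used only for this illustration. */
    static final class Order {
        final long customerId;
        final LocalDate orderDate;

        Order(long customerId, LocalDate orderDate) {
            this.customerId = customerId;
            this.orderDate = orderDate;
        }

        @Override
        public String toString() {
            return customerId + " " + orderDate;
        }
    }

    /** customer_id ascending; ties broken by order_date descending (newest first). */
    static final Comparator<Order> CUSTOMER_ASC_DATE_DESC = (a, b) -> {
        int byCustomer = Long.compare(a.customerId, b.customerId);
        return byCustomer != 0 ? byCustomer : b.orderDate.compareTo(a.orderDate);
    };

    /** Returns a sorted copy and leaves the input untouched. */
    static List<Order> sorted(List<Order> rows) {
        List<Order> copy = new ArrayList<>(rows);
        copy.sort(CUSTOMER_ASC_DATE_DESC);
        return copy;
    }

    public static void main(String[] args) {
        List<Order> rows = new ArrayList<>();
        rows.add(new Order(2, LocalDate.of(2025, 1, 5)));
        rows.add(new Order(1, LocalDate.of(2025, 1, 1)));
        rows.add(new Order(1, LocalDate.of(2025, 3, 9)));
        sorted(rows).forEach(System.out::println);
    }
}
```

All rows for a customer end up adjacent, newest order first, which is what lets a filter on customer_id touch a small, contiguous slice of the data.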

Use Cases for Sorting

Time-series data: sort by event_time to improve range queries
Dimension filters: sort by commonly filtered columns like region, user_id
Joins: sort by join keys, which can help sort-merge joins and join-key pruning where the engine takes advantage of the layout

Z-order Clustering

Z-ordering is a multi-dimensional clustering technique that co-locates related values across multiple columns. It’s ideal for exploratory queries that filter on different combinations of columns.

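Example, as a Spark rewrite action (in Iceberg, Z-order is applied when data files are rewritten rather than declared through the table's sort order):

import org.apache.iceberg.spark.actions.SparkActions;

SparkActions.get(spark)
  .rewriteDataFiles(table)
  .zOrder("customer_id", "product_id", "region")
  .execute();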

Z-ordering works by interleaving bits from multiple columns to keep related rows close together. This increases the chance that queries filtering on any subset of these columns can benefit from data skipping.
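
The bit interleaving itself is easy to sketch. The Java below is a minimal illustration for two small non-negative integers, not Iceberg's implementation; a real engine also has to map strings, negative numbers, and other types to an order-preserving form before interleaving, which this sketch skips. The class and method names are our own.

```java
/** Minimal Z-value (Morton code) sketch for two non-negative ints. */
final class ZOrderSketch {

    /**
     * Interleaves the low 16 bits of x and y:
     * bit i of x lands on bit 2i of the result, bit i of y on bit 2i + 1.
     */
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) ((x >>> i) & 1) << (2 * i);
            z |= (long) ((y >>> i) & 1) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        // Prints a 4x4 grid of Z-values (rows are y = 0..3, columns are x = 0..3):
        //   0  1  4  5
        //   2  3  6  7
        //   8  9 12 13
        //  10 11 14 15
        for (int y = 0; y < 4; y++) {
            StringBuilder row = new StringBuilder();
            for (int x = 0; x < 4; x++) {
                row.append(String.format("%3d", interleave(x, y)));
            }
            System.out.println(row);
        }
    }
}
```

Sorting rows by this value walks the grid in a recursive "Z" pattern, so consecutive values stay inside small square blocks and neither column dominates the layout.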

Note: Z-ordering is supported by Iceberg through integrations like Dremio’s Iceberg Auto-Clustering and Spark jobs using RewriteDataFiles.

Choosing Between Sort and Z-order

Use Case	Best Technique
Filtering on one key column	Simple Sort
Range queries on timestamps	Sort on time
Multi-column filtering	Z-order
Joins on a key column	Sort on join key
Complex OLAP-style filters	Z-order
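
To make the table above concrete, here is a toy comparison in plain Java. It lays out every (x, y) pair on a 64 by 64 grid into 64 "files" of 64 rows, once with a linear sort on (x, y) and once in Z-order, then counts how many files a min/max check leaves for a filter on each column. Everything here (the grid, the file size, the names) is an assumption chosen for illustration; real tables are far less uniform.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy comparison of linear sort vs Z-order for single-column filters. */
final class LayoutComparisonSketch {
    static final int SIDE = 64;          // x and y each take values 0..63
    static final int ROWS_PER_FILE = 64; // 4096 rows -> 64 "files"

    /** Linear sort: x first, then y. */
    static final Comparator<int[]> LINEAR =
        Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]);

    /** Z-order: sort by the interleaved bits of x and y. */
    static final Comparator<int[]> Z_ORDER =
        Comparator.<int[]>comparingLong(r -> interleave(r[0], r[1]));

    /** Bit i of x goes to bit 2i, bit i of y to bit 2i + 1 (low 16 bits only). */
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) ((x >>> i) & 1) << (2 * i);
            z |= (long) ((y >>> i) & 1) << (2 * i + 1);
        }
        return z;
    }

    /** Sorts all (x, y) rows with the given order; returns per-file {minX, maxX, minY, maxY}. */
    static List<int[]> layout(Comparator<int[]> order) {
        List<int[]> rows = new ArrayList<>();
        for (int x = 0; x < SIDE; x++) {
            for (int y = 0; y < SIDE; y++) {
                rows.add(new int[] {x, y});
            }
        }
        rows.sort(order);

        List<int[]> files = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += ROWS_PER_FILE) {
            int[] b = {Integer.MAX_VALUE, Integer.MIN_VALUE, Integer.MAX_VALUE, Integer.MIN_VALUE};
            for (int[] r : rows.subList(start, start + ROWS_PER_FILE)) {
                b[0] = Math.min(b[0], r[0]);
                b[1] = Math.max(b[1], r[0]);
                b[2] = Math.min(b[2], r[1]);
                b[3] = Math.max(b[3], r[1]);
            }
            files.add(b);
        }
        return files;
    }

    /** Files whose bounds on the given column (0 = x, 1 = y) include the value. */
    static long filesToScan(List<int[]> files, int column, int value) {
        return files.stream()
            .filter(b -> b[2 * column] <= value && value <= b[2 * column + 1])
            .count();
    }

    public static void main(String[] args) {
        List<int[]> linear = layout(LINEAR);
        List<int[]> zOrder = layout(Z_ORDER);
        System.out.println("linear  x=10: " + filesToScan(linear, 0, 10)); // 1
        System.out.println("linear  y=10: " + filesToScan(linear, 1, 10)); // 64
        System.out.println("z-order x=10: " + filesToScan(zOrder, 0, 10)); // 8
        System.out.println("z-order y=10: " + filesToScan(zOrder, 1, 10)); // 8
    }
}
```

In this toy setup the linear sort is unbeatable for the leading column and useless for the second one, while Z-order gives up some precision on the first column to make both filters skip most files. That is the trade the table summarizes.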

When to Apply Clustering

Clustering is typically applied:

During initial writes, if the engine supports it
As part of compaction jobs, using RewriteDataFiles with sort order
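
In Spark, you can specify a sort order in the rewrite action:

// SortOrder is org.apache.iceberg.SortOrder; SparkActions is org.apache.iceberg.spark.actions.SparkActions.
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .sort(SortOrder.builderFor(table.schema()).asc("region").asc("event_time").build())
  .execute();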

Make sure the sort order aligns with your most frequent query patterns.

Tradeoffs

While clustering helps query performance, it comes with tradeoffs:

Sorting increases job duration: Sorting is more expensive than just rewriting files
Clustering can become outdated: Evolving data patterns may require adjusting sort orders
Not all engines respect sort order: Make sure your query engine leverages the layout

Summary

Smart data layout is essential for fast queries in Apache Iceberg. By leveraging sorting and Z-order clustering:

You reduce the volume of data scanned
You make file and row-group skipping more effective for your filters
You optimize performance for a wide variety of workloads

In the next post, we’ll look at another silent performance killer: metadata bloat, and how to clean it up using snapshot expiration and manifest rewriting.