Liquid Clustering in Databricks

Ashish Shukla

10/23/2023 (2 min read)

Partitioning is one of the most important optimization techniques in big data. It segregates data based on partitioning keys so that queries can skip irrelevant data and read faster. With traditional partitioning, however, data is often distributed unevenly across partitions, which leads to under-partitioning (a few very large partitions) or over-partitioning (many very small ones). Changing an existing partitioning key is also tedious, because it forces the developer to rewrite the complete table. Traditional partitioning therefore comes with some well-known problems.
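To make the problem concrete, here is a minimal sketch of a traditionally partitioned Delta table. The table and column names (events_partitioned, event_id, event_date, country) are hypothetical and used only for illustration.

```sql
-- Hypothetical table, for illustration only.
-- The partitioning key is fixed in the table definition.
CREATE TABLE events_partitioned (
  event_id BIGINT,
  event_date DATE,
  country STRING
)
USING DELTA
PARTITIONED BY (event_date);
```

If most queries later filter on country rather than event_date, switching the partitioning key generally means rewriting the whole table, which is the pain point described above.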

Databricks introduced liquid clustering to address these issues. As the name suggests, "liquid" refers to the added flexibility it offers over traditional partitioning. Liquid clustering is available in Databricks Runtime 13.3 and above. It has the following features:

· Faster writes, with read performance similar to a well-tuned partitioned table.

· Self-tuning

· Handles under-partitioning and over-partitioning automatically.

· Automatic partial clustering of new data.

· Liquid clustering is resistant to skew, which produces consistent file sizes and low write amplification.

· Liquid clustering lets the user change the clustering columns of an existing table without rewriting all of the data.

· It also provides better concurrency while reading the data.

Syntax with liquid clustering:

The following statements enable liquid clustering on a new table:

-- Create an empty table

CREATE TABLE table1(col0 int, col1 string) USING DELTA CLUSTER BY (col0);

-- Using a CTAS statement

CREATE EXTERNAL TABLE table2 CLUSTER BY (col0) -- specify clustering after table name, not in subquery
LOCATION 'table_location'
AS SELECT * FROM table1;

-- Using a LIKE statement to copy configurations

CREATE TABLE table3 LIKE table1;

Change Liquid Clustering Keys on Existing Clustered Table:

ALTER TABLE table_name CLUSTER BY (new_column1, new_column2);
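The same ALTER TABLE form can also be used to turn liquid clustering on for an existing unpartitioned Delta table, or to turn it off again. The following is a minimal sketch, assuming a hypothetical table named sales with an order_date column; neither name comes from the examples above.

```sql
-- Hypothetical table and column names, for illustration only.
-- Enable liquid clustering on an existing, unpartitioned Delta table.
ALTER TABLE sales CLUSTER BY (order_date);

-- Remove the clustering keys; newly written data is no longer clustered.
ALTER TABLE sales CLUSTER BY NONE;
```

In both cases existing data files are not rewritten immediately, which is what keeps the change of keys cheap.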

For best performance, Databricks recommends scheduling regular OPTIMIZE jobs to cluster data.

OPTIMIZE table_name;

Liquid clustering is incremental, meaning that data is only rewritten as necessary to accommodate data that needs to be clustered. Data files with clustering keys that do not match data to be clustered are not rewritten.
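One way to observe this behavior is to inspect the table before and after an OPTIMIZE run. The sketch below reuses the table_name placeholder from the statements above; the exact columns in the output may vary by runtime version.

```sql
-- table_name is the same placeholder used in the statements above.
-- Show table details, including the current clustering columns.
DESCRIBE DETAIL table_name;

-- Cluster any data that has not been clustered yet.
OPTIMIZE table_name;

-- List recent operations on the table, including OPTIMIZE runs.
DESCRIBE HISTORY table_name;
```

Running OPTIMIZE a second time with no new data in between should have little or nothing left to rewrite, since already clustered files are left untouched.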

Limitations of Liquid Clustering:

The following limitations exist:

· You can only specify columns with statistics collected for clustering keys. By default, the first 32 columns in a Delta table have statistics collected.

· You can specify up to 4 columns as clustering keys.

· Structured Streaming workloads do not support clustering-on-write.
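If a column you want to cluster by falls outside the first 32 columns, one option is to raise the number of columns for which Delta collects statistics. This is a sketch only: table_name is a placeholder and the value 40 is an arbitrary example, not a recommendation.

```sql
-- Assumption: table_name is a placeholder and 40 is an arbitrary example value.
-- Collect statistics on the first 40 columns instead of the default 32.
ALTER TABLE table_name SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');
```

Changing this property affects statistics collected for data written afterwards; it does not by itself recompute statistics for files that already exist.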
