
amaaiia
Helper V

Delta Lake VACUUM, how does it work?

I'm not very clear on how the VACUUM command works. Reading the documentation, I see that this command removes parquet files that are no longer used by versions newer than the retention period, but I'm testing it and it's not working. From what I have read, if a table A has 7 versions, version 1's timestamp is 2024-04-09 10:00:00, version 2's timestamp is 2024-04-16 13:20:00, and I run the VACUUM command with a 7-day retention period at 2024-04-16 13:29:00, the parquet files of versions 0 and 1 that are not referenced by any higher version should be removed. Isn't that right?

 

This is what I have done (a rough SQL sketch of these steps follows the list):

1. V0-1, 2024-04-09 10:00:00: write new table

2. V2, 2024-04-16 13:20:00: optimize table

3. V3-4, 2024-04-16 13:21:00: VACUUM table (no parquet files were removed; I assumed because it's the latest version)

4. V5, 2024-04-16 13:25:00: overwrite of the table

5. V6-7, 2024-04-16 13:29:00: VACUUM table
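In Spark SQL, the steps above look roughly like this (just a sketch with placeholder table names, not the exact commands I ran):

%%sql
-- Step 2 (V2): compact the table's small files
OPTIMIZE my_table;

-- Steps 3 and 5 (V3-4, V6-7): vacuum with the default 7-day threshold (168 hours)
VACUUM my_table RETAIN 168 HOURS;

-- Step 4 (V5): overwrite the table's contents (my_source is a placeholder)
INSERT OVERWRITE my_table SELECT * FROM my_source;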

amaaiia_0-1713276308809.png

 

 

Then, if we check the parquet files that each version uses, we can see that the parquet file ending with fb.parquet is only used by version 1 and earlier, whose timestamp is more than 7 days old:

amaaiia_0-1713276254426.png
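(As a side note on how one might reproduce this check: time travel combined with Spark's built-in input_file_name() function lists the data files a given version reads. The table name below is a placeholder; this isn't necessarily how the screenshot above was produced.)

%%sql
-- List the parquet files that version 1 of the table reads from
SELECT DISTINCT input_file_name() AS data_file
FROM my_table VERSION AS OF 1;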

 

 

So, why hasn't this parquet file been removed?

 

Doc: https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table

amaaiia_0-1713277442881.png

 

 

7 REPLIES
frithjof_v
Skilled Sharer
(Accepted solution)

I'm new to Delta Lake, but I think this link explains what the vacuum retention threshold means:

https://docs.databricks.com/en/delta/vacuum.html#example-syntax-for-vacuum

 

"Use the RETAIN keyword to specify the threshold used to determine if a data file should be removed. The VACUUM command uses this threshold to look back in time the specified amount of time and identify the most recent table version at that moment. Delta retains all data files required to query that table version and all newer table versions."

 

My interpretation:
When we vacuum a table at 2024-04-16 13:29:00 with a retention threshold of 7 days, we basically tell the vacuum operation to look back to 2024-04-09 13:29:00 and identify the most recent (the "active") table version at that point in time.

In your case, at that timepoint, I think your version 1 was the most recent version of the table. So the parquet files needed to query version 1 of the table will not be deleted.
If there existed any parquet files which were only needed for querying version 0 (but not needed for querying version 1 or more recent versions), then I think those parquet files would have been deleted. 
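
By the way, the Delta utility doc linked above also shows a DRY RUN option, which should let you list the files a vacuum would delete without actually deleting them (table_name is a placeholder):

%%sql
-- Show which files would be removed with a 7-day (168-hour) threshold, without deleting anything
VACUUM table_name RETAIN 168 HOURS DRY RUN;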

  

I think if you go to the Lakehouse explorer, and click on the ellipsis (...) next to the table name, and then "View files", you should see the actual parquet files which are still existing.


If any parquet files were deleted by the vacuum operation, then I think you will not be able to find those parquet files here anymore.

frithjof_v_0-1713300631056.png

However, I think the version history of your delta table may still contain references to deleted parquet files.
This depends on the log history retention period, if I understand correctly (see next comments).

 

When you use 

 

 

%%sql
DESCRIBE HISTORY table_name;

 

 

I think the history can still include references to some parquet files which are now deleted, because the history log is not deleted by the vacuum operation (see next comments). However the actual data files (parquet files) in your lakehouse table's file directory should be deleted by vacuum operations (according to vacuum retention threshold).

That's true. I was confused about the version timestamps: VACUUM keeps the files needed by the version that was active at that point in time, while I thought it only kept the files of versions with a more recent timestamp. I've tried with another table with more versions and it works as you say.

If some parquet files have been deleted by vacuuming, I think you could get an error if you try to query the table versions which reference the deleted parquet files.

 

For example, if some parquet files needed to query table version 0 have been deleted, then this query should give an error:

 

 

%%sql
SELECT * FROM table_name VERSION AS OF 0

 

 

 

I think such an error will contain this text:
"It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved."

Reading this article, it seems that the log history is kept for 30 days by default:

https://docs.databricks.com/en/delta/history.html

 

And the vacuum retention threshold is 7 days by default.

 

I haven't tested this, but if I understand this article correctly, then the ability to do time travel is actually limited by both the log history retention and the vacuum operation's retention threshold.

 

The way I understand it, if you do vacuuming with 7 days retention threshold, you can only do time travel within the last 7 days. 

If you don't do vacuuming, I think you can only do time travel within the number of days your log history is kept (default is 30 days). 
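
So a time travel query like the one below (a sketch; the timestamp is just an example) should only work if both the log entries and the data files for that point in time still exist:

%%sql
SELECT * FROM table_name TIMESTAMP AS OF '2024-04-10 00:00:00';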

 

As the article says:

"Databricks does not recommend using Delta Lake table history as a long-term backup solution for data archival. Databricks recommends using only the past 7 days for time travel operations unless you have set both data and log retention configurations to a larger value."

 

There seems to be something called checkpoints, which seem to affect the actual time when the log history (older than the log retention period) gets deleted.
https://docs.databricks.com/en/delta/history.html#what-are-transaction-log-checkpoints

 

Anyway, if we do vacuuming with 7 days retention threshold, then I guess vacuuming will be the limiting factor for the time travel, as the log retention period is 30 days by default.

 

I guess we can adjust the vacuum retention threshold and log retention period, if we want to expand (or limit) the possibility for time travel.
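
If I read the Delta Lake documentation correctly, these retention periods are table properties, so adjusting them would look something like this (the property names come from the Delta docs, and the values are just example values):

%%sql
-- Example values: keep log history for 60 days and unreferenced data files for 14 days
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 60 days',
  'delta.deletedFileRetentionDuration' = 'interval 14 days'
);

-- Inspect the current table properties
SHOW TBLPROPERTIES table_name;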

So, if that's true about checkpoints, how is it possible that I can get VERSION 1 of a table whose first checkpoint was created after that version's timestamp?

 

This is the HISTORY of my table. As you can see, version 1 is from 2024-03-09:

amaaiia_0-1713336129842.png

Then I check the files and see there's a checkpoint from 2024-03-14:

amaaiia_1-1713336216406.png

As I've understood it, even if the log files still exist, once there's a checkpoint the previous logs shouldn't be available. So why can I get VERSION 1 of the table data if the first checkpoint points to VERSION 10?

amaaiia_3-1713336539538.png

 

I'm not sure, but I think the deletion of log history happens at the time when the checkpoint is created, and we need to take the log retention period into account.

 

It seems this checkpoint was created on March 14.

 

So I think, when this checkpoint was created on March 14, any log history which is older than mid February (March 14 minus 30 days) would be deleted (provided the log retention period is 30 days).

 

When a new checkpoint is created, it will delete log history older than 30 days counting from the time when the checkpoint gets created.

 

If you want all log history before the checkpoint to be deleted, then I think you need to set log retention period equal to 0.

 

--

 

On a side note: I'm not entirely sure if we can force a checkpoint to be created, or if the creation of checkpoints is managed entirely by the delta lake for us. I am guessing that if you make frequent changes to your table (meaning more frequent delta log files) then the checkpoints will also be created more frequently.

The checkpoints optimize query speed: they make it possible for the engine to avoid traversing too many delta log files at query time. At this moment I am just guessing.
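
(For what it's worth, the open-source Delta Lake docs describe a delta.checkpointInterval table property that is supposed to control how many commits pass between checkpoints. I haven't verified whether Fabric honours it, so treat the following purely as an assumption:)

%%sql
-- Assumption: ask Delta to write a checkpoint every 10 commits (property documented for OSS Delta Lake)
ALTER TABLE table_name SET TBLPROPERTIES ('delta.checkpointInterval' = '10');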

 

--

 

I think it is something like that, but I'm not sure. It would be great if someone could confirm 😀
