Differences between Data Warehouse and Lakehouse

DennesTorres · ‎10-02-2023

Hi,

I noticed some differences between the lakehouse and the data warehouse and I'm not much sure why these differences are there.

The lakehouse:

The files are compacted (snappy format).
The files are created based on a maximum size of 1GB, avoiding the small files problem
The files use the VORDER format
The folder with the table contains an _delta_log folder and the files

The Data Warehouse:

The files are not compacted
The files are prone to the small files problem (the sample fact_sale table has 26 files in a warehouse and only 1 in a lakehouse).
The files are not in the VORDER format (or at least, this is not defined in the metadata).
The folder with the table contains an _delta_log folder and another folder, the files are inside the 2nd folder

The questions about these differences:

How the differences are positive for each one, lakehouse and data warehouse, although they are the opposite? For example, why the warehouse is not affected by small files (or it is?) ? Why the warehouse doesn't need the VORDER ?
After an update in the data warehouse table, new parquet files were created, as expected. But the delta_log files were not updated. Why?
After creating a shortcut from the Data Warehouse to a lakehouse, the VACUUM didn't work. I imagine it doesn't because the new parquet files were not updated in the delta_log, am I correct? How to fix this?
The Tables format are different, I mean, there is a sub-folder in the data warehouse, while this sub-folder doesn't exist in the lakehouse. Is this why the VACUUM doesn't work?
Should the VACUUM work, or is there some other method in place to replace its need ?

Thank you in advance !

jcvega · ‎05-12-2024

@v-nikhilan-msft @DennesTorres Hi, Dennes.. now May 2024.. do you have more clarity about your original questions? can you share this here pls.. after read all post.. It is still not clear if there is a different storage between DWH and LH and the impact in our designs

DennesTorres · ‎05-12-2024

Hi,

During the evolution happening these months, my conclusion is that the Data Warehouse is like a PaaS version of a lakehouse.

The Data Warehouse has features to auto-manage the underlying onelake storage. It has some internal structures we don't have access to and this auto-manage of the onelake storage, which we shouldn't try to handle directly.

The lakehouse, on the other hand, makes the onelake storage an open book to us and some management tasks becomes our responsibility.

For example, in the lakehouse, we can make time-travel as we wish and the maintenance to improve the performance - and prevent time-travel - is up to us. In the Data Warehouse, the time-travel is already limited to 7 days, because the underlying maintenance is automatically applied by the data warehouse.

You can check the time-travel details here: https://www.youtube.com/watch?v=ZVgizgPQRS8&list=PLNbt9tnNIlQ5TB-itSbSdYd55-2F1iuMK&index=9

In the end, a lakehouse is great for a Bronze layer because the Files area, a Data Warehouse is great for the Gold layer because the support to multiple schemas and RLS. The rest, depends a lot.

Kind Regards,

Dennes

v-nikhilan-msft · ‎10-03-2023

Hi @DennesTorres ,
Thanks for using Fabric Community.

The differences between the lakehouse and the data warehouse are due to the fact that they are different types of data storage systems with different design goals and trade-offs.

The lakehouse is designed to provide a unified data platform that combines the benefits of data lakes and data warehouses. It uses a combination of Delta Lake and Azure Synapse Link to provide fast and efficient querying of large datasets, while also supporting real-time data ingestion and processing.

The data warehouse, on the other hand, is designed to provide a scalable and highly available data warehousing solution that can handle large volumes of structured data. It uses a columnar storage format and a massively parallel processing architecture to provide fast query performance and high scalability.

1) How the differences are positive for each one, lakehouse and data warehouse, although they are the opposite? For example, why the warehouse is not affected by small files (or it is?) ? Why the warehouse doesn't need the VORDER ?

The data warehouse is not as sensitive to the small files problem as the lakehouse because it uses a different storage mechanism. The data warehouse stores files in blobs, which are large containers that can hold multiple files. This means that the number of files in the data warehouse does not have as big of an impact on performance as it does in the lakehouse.
You cannot use VORDER for tables in Warehouse. VORDER is a file format that is optimized for columnar storage, but it is not supported by Warehouse.Warehouse uses a different storage mechanism for structured data, which is optimized for performance for complex queries on structured data.

2) After an update in the data warehouse table, new parquet files were created, as expected. But the delta_log files were not updated. Why?

Regarding this issue there is already an ongoing issue in Fabric and there is a bug created for the same.
You can check the issues in this link:
https://support.fabric.microsoft.com/en-IN/known-issues//

3) After creating a shortcut from the Data Warehouse to a lakehouse, the VACUUM didn't work. I imagine it doesn't because the new parquet files were not updated in the delta_log, am I correct? How to fix this?

Yes you are correct. As I already mentioned there is an ongoing bug regarding delta_log, the VACUUM command may not work because the new parquet files were not updated in the delta_log.

Conclusion

The lakehouse and data warehouse formats have different strengths and weaknesses. The lakehouse format is designed for a wide range of workloads, including large data queries, incremental updates, and machine learning. The data warehouse format is designed for workloads that require high performance
Hope this helps. Please do let us know if you have any further queries.

DennesTorres · ‎10-09-2023

Hi,

I'm missing something about your answer, or you are revealing something extremely new, almost opening the pandora box.

In order to explain, I need to remind two main points:

- The concept broadly accepted about onelake is that it's a single storage for all the different tools we will use. It doesn't mind if it's a lakehouse or data warehouse, the data will be stored in delta.

- It goes to the level I watched a speaker mentioning in one official microsoft session "The choice depends only on the kind of tool you like to use, because both are stored in delta on onelake"

But on your answer you are revealing:

"The data warehouse is not as sensitive to the small files problem as the lakehouse because it uses a different storage mechanism. The data warehouse stores files in blobs, which are large containers that can hold multiple files."

This makes a huge difference, changes the base concept of onelake about having everything in a single place and format.

The fact the warehouse uses a different storage format is critical because we need to understand this format. Different formats leads to different results and this affects the architectural changes we should make. It's much more than the basic technical differences we already know about.

The Kusto database I already know from another answer here and other sources as well it uses a custom format and put a shortcut in the onelake to be acessible. But the warehouse keep the files in the onelake, we can see the files in the onelake explorer. If the files are kept in the onelake, it's not clear how in one moment or another it doesn't endup suffering the same problems related to this kind of storage - Delta.

Is there a deeper documentation about this format?

Am I misunderstanding something ?

Kind Regards,

Dennes

v-nikhilan-msft · ‎10-12-2023

Hi @DennesTorres ,

Apologies for the delay in reply from our side.
I have taken help from the internal team and got the information as below:

Firstly, the Data Warehouse files will be compacted shortly and they are in fact VOrdered (this was a recent change we introduced a few weeks ago). The reason you're observing no updates to the delta log when a warehouse table is updated is because we do not currently reflect updates and deletes in the delta log, but we will very soon. We do not support VACUUM but we will do our own garbage collection shortly as well. So, this concludes that Warehouse uses VORDER.

To understand better about the differences between Lakehouse and Warehouse please refer this document: Choosing between Lakehouse and Warehouse in Microsoft Fabric | LinkedIn
Hope this helps. Please let us know if you have any further queries.

DennesTorres · ‎10-12-2023

Hi,

There are some additional welcome information, there are some promisses of features to come, but it's still very confusing.

I understand the fact the updates and deletes are not reflected on delta logs is about to change, ok. But the fact it's happening now means there is another internal storage supporting the data warehouse.

A delta table with the delta logs not reflecting updates and deletes would be showing wrong results. But the results are correct, so these updates and deletes are reflected somewhere else, in an internal structure which the lakehouse doesn't have.

The article you are pointing to me is very basic and exactly the kind of dangerous article I mentioned. One of the items for the choice is "Language preference for data management".

In this way, the article is considering the underlying data has the same storage, delta. However, as I pointed on previous paragraph, delta is not the only and main storage of the warehouse, otherwise it would not work on the conditions it is today (not updating the delta log).

I understand probably the information we are touching is under NDA. The problem is: if the data warehouse has an additional layer of storage and management, as all the clues and your own information suggest, this means it will perform differently than the lakehouse, in different scenarios. It's important for data engineers to understand the technical details of the differences otherwise we may endup on the wrong choice and only discovering when it's too late (and expensive) to change.

I hope my suggestions could help to expose a bit more of the data warehouse architecture so we don't need to rely on articles like this one you pointed to me, which makes everyone believes both (lakehouse and warehouse) have the same technical internal architecture and will behave in the same way so it's just a matter of the "Language preference for data management"

Thank you!

v-nikhilan-msft · ‎10-17-2023

Hi @DennesTorres ,
Apologies for the delay in response.
Regarding the Warehouse data log, it is an internal format but is very similar to Delta and we publish logs that are compliant with the Delta Lake Log. Results from Warehouse and Lakehouse will be the same but it is a different engine, so performance and language options could be different. Currently the Delta logs for a Warehouse will only contain inserts that have occurred - updates and deletes will also be reflected shortly.

Appreciate if you could share the feedback on our feedback link , which would be open for the user community to upvote & comment on. This allows our product teams to effectively prioritize your request against our existing feature backlog and gives insight into the potential impact of implementing the suggested feature.
Feedback Link : Home (microsoft.com)

Hope this helps. Please let us know if you have any further queries.

v-nikhilan-msft · ‎10-04-2023

Hi @DennesTorres ,
We haven’t heard from you on the last response and was just checking back to see if your query has been resolved. Please let us know if you have any further queries.

v-nikhilan-msft · ‎10-05-2023

Hi @DennesTorres ,
Following up to check if your query has been resolved. Please do let us know if you have any further queries.

Differences between Data Warehouse and Lakehouse

Helpful resources

New forum boards available in Synapse

New forum boards available in Real-Time Intelligence.

Fabric Monthly Update - May 2024

Jumpstart your career with the Fabric Career Hub

Differences between Data Warehouse and Lakehouse

Helpful resources

New forum boards available in Synapse

New forum boards available in Real-Time Intelligence.

Fabric Monthly Update - May 2024