v-cyu
Employee

Ways to Efficiently Load Large Amounts of Data from Cosmos Structured Streams?

We are looking for a way to efficiently load data from Cosmos structured streams into a Fabric Lakehouse/KQL Database.

 

For a day's worth of data, there are 12 structured streams, each roughly 1 TB and ~2 billion rows.
So far, we have tested at the scale of 2 hours, 1 day, and 7 days' worth of data.

 

Current Fabric Capacity: F256


We have been relying on a Data Pipeline Copy activity to copy each Cosmos structured stream into a Lakehouse table or KQL database table.

Questions:

1. Are there ways to configure a copy activity to load multiple structured streams at once?

Right now, it seems that one copy activity can only be configured to copy from one structured stream.

Because of that, to load one day of data, for example, we need to combine the 12 structured streams belonging to the same day into one structured stream and load that combined stream.
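A minimal sketch of what loading all 12 streams in one pass could look like, assuming the day's streams could first be staged as parquet files under a common OneLake/ADLS folder (a big assumption; the paths and table name below are hypothetical):

```python
# Sketch: read all 12 staged streams for one day as a single DataFrame
# and append them to one Lakehouse table. Assumes the structured streams
# have already been exported as parquet; paths and table name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

day = "2024-04-01"
stream_paths = [
    f"abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/cosmos/day={day}/stream_{i:02d}/"
    for i in range(12)
]

# spark.read.parquet accepts multiple paths, so the 12 streams load as one DataFrame.
df = spark.read.parquet(*stream_paths)

df.write.mode("append").saveAsTable("daily_events")
```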

 

2. Is there a Partition option when copying from a Cosmos structured stream?

[Screenshot: performance tip shown while the copy activity runs]

While running the copy activity, we are seeing the above performance tip.

However, the "Partition option" setting does not seem to exist in the copy activity for a Cosmos structured stream source.
There is a Partition index option, but that limits the copy activity to only the rows with that particular index, which is not what we are looking for.

[Screenshot: copy activity source settings showing the Partition index option]
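If the Partition index setting turned out to be the only lever, one fallback would be to fan the copy out over all indexes in parallel from a small driver script. A sketch, using a hypothetical copy_partition helper that stands in for whatever actually triggers the per-index copy (for example, a parameterized pipeline run); the index count is also assumed:

```python
# Sketch: run one copy per partition index concurrently.
# copy_partition is a hypothetical placeholder for whatever triggers the
# per-index copy; the number of partition indexes is assumed.
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITION_INDEXES = 12  # assumption: number of partition indexes in the stream

def copy_partition(index: int) -> None:
    # Placeholder: kick off the copy restricted to this partition index.
    print(f"Copying partition index {index}")

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(copy_partition, range(NUM_PARTITION_INDEXES)))
```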

 


3. What would be the best practice to efficiently load data this big?

To test capacity, we extended the test to load 7 days' worth of data using the above method.

The load to a Lakehouse table took 32 hours, and the load to the KQL database took 21 hours.

However, the load to the KQL database caused us to exceed our capacity limit, bringing the whole environment to a halt.
We are eventually looking at loading 18 months of data.

Any pointers, recommendations, or documentation would be helpful.
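One commonly suggested pattern for backfills of this size is to load one day at a time into a date-partitioned Delta table, so each run stays well under the capacity limit and failed days can be retried individually. A minimal sketch, again assuming each day's data has been staged as parquet (paths and table name are hypothetical):

```python
# Sketch: backfill day by day into a date-partitioned Delta table.
# Assumes each day's data is staged as parquet; paths/table name are hypothetical.
import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

start = datetime.date(2024, 1, 1)
days = [start + datetime.timedelta(days=n) for n in range(7)]  # widen the range for 18 months

for day in days:
    path = (
        "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
        f"MyLakehouse.Lakehouse/Files/cosmos/day={day:%Y-%m-%d}/"
    )
    (
        spark.read.parquet(path)
        .withColumn("event_date", F.lit(str(day)).cast("date"))
        .write.mode("append")
        .partitionBy("event_date")
        .saveAsTable("events_history")
    )
```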

4 REPLIES
v-gchenna-msft
Community Support

Hi @v-cyu,

Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this.
We will update you once we hear back from them.

Hi @v-cyu,

We got a response from the internal team:

This Cosmos structured stream source is not Cosmos DB, correct? It sounds like the legacy internal Microsoft tool called Cosmos. If that's the case, there's not much we can do on that front. Nonetheless, 29 TB is not that big in the grand scheme of things, so KQL Database should have no issue with it. What I suspect is going on here is a "small files" problem: if that 29 TB is comprised of billions of small files stored on ADLS Gen1, simply reading in all those individual items will lead to memory consumption issues. In other words, if these were nicely compressed parquet files, we would have better luck. Can you confirm this hypothesis? Can you check the individual file sizes and the overall number of files to ingest?


Can you please share the above details?
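If it helps, here is a minimal sketch of how the file count and sizes could be checked from a Fabric notebook, assuming the stream output has been staged to an ADLS Gen2/OneLake location the notebook can list (the root path below is hypothetical, and the exact FileInfo attribute names may vary slightly):

```python
# Sketch: walk a staged folder and report file count, total size, and average
# file size, to test the "small files" hypothesis. The root path is hypothetical.
from notebookutils import mssparkutils

root = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/cosmos_export/"

file_count = 0
total_bytes = 0
stack = [root]
while stack:
    current = stack.pop()
    for item in mssparkutils.fs.ls(current):
        if item.isDir:
            stack.append(item.path)
        else:
            file_count += 1
            total_bytes += item.size

avg_mb = (total_bytes / file_count / 1e6) if file_count else 0
print(f"{file_count} files, {total_bytes / 1e12:.2f} TB total, {avg_mb:.1f} MB average file size")
```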

Hi @v-cyu,

We haven't heard back from you on the last response and wanted to check whether you have had a chance to look into it.

Thanks

Hi @v-cyu,

We still haven't heard back from you on the last response and wanted to check again whether you have had a chance to look into it.

Thanks
