v-cyu
Employee

Ways to Efficiently Load Large Amounts of Data from Cosmos Structured Streams?

We are looking for a way to efficiently load data from Cosmos structured streams into a Fabric Lakehouse/KQL Database.

 

For a day's worth of data, there are 12 structured streams, each roughly 1 TB and ~2 billion rows.
So far, we have tested at the scale of 2 hours, 1 day, and 7 days' worth of data.

 

Current Fabric Capacity: F256


We have been relying on a Data Pipeline Copy activity to copy each Cosmos structured stream into a Lakehouse table or KQL database table.

Questions:

1. Are there ways to configure a copy activity to load multiple structured streams at once?

Right now, it seems that one copy activity can only be configured to copy from one structured stream.

Because of that, to load one day of data, for example, we need to combine the 12 structured streams belonging to the same day into one structured stream and load that combined stream.
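A minimal sketch of what loading all 12 streams in one pass could look like, assuming the day's streams could first be staged as parquet files under a common OneLake/ADLS folder (a big assumption; the paths and table name below are hypothetical):

```python
# Sketch: read all 12 staged streams for one day as a single DataFrame
# and append them to one Lakehouse table. Assumes the structured streams
# have already been exported as parquet; paths and table name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

day = "2024-04-01"
stream_paths = [
    f"abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/cosmos/day={day}/stream_{i:02d}/"
    for i in range(12)
]

# spark.read.parquet accepts multiple paths, so the 12 streams load as one DataFrame.
df = spark.read.parquet(*stream_paths)

df.write.mode("append").saveAsTable("daily_events")
```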

 

2. Is there a Partition option when copying from a Cosmos structured stream?

[Screenshot: performance tip shown while the copy activity runs]

While running the copy activity, we are seeing the above performance tip.

However, the "Partition option" setting does not seem to exist in the copy activity for a Cosmos structured stream source.
There is a Partition index option, but that limits the copy activity to only the rows with that particular index, which is not what we are looking for.

[Screenshot: copy activity source settings showing the Partition index option]
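If the Partition index setting turned out to be the only lever, one fallback would be to fan the copy out over all indexes in parallel from a small driver script. A sketch, using a hypothetical copy_partition helper that stands in for whatever actually triggers the per-index copy (for example, a parameterized pipeline run); the index count is also assumed:

```python
# Sketch: run one copy per partition index concurrently.
# copy_partition is a hypothetical placeholder for whatever triggers the
# per-index copy; the number of partition indexes is assumed.
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITION_INDEXES = 12  # assumption: number of partition indexes in the stream

def copy_partition(index: int) -> None:
    # Placeholder: kick off the copy restricted to this partition index.
    print(f"Copying partition index {index}")

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(copy_partition, range(NUM_PARTITION_INDEXES)))
```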

 


3. What would be the best practice to efficiently load data this big?

To test capacity, we extended the test to load 7 days' worth of data using the above method.

The load to a Lakehouse table took 32 hours, and the load to the KQL database took 21 hours.

However, the load to the KQL database caused us to exceed our capacity limit, bringing the whole environment to a halt.
We are eventually looking at loading 18 months of data.

Any pointers, recommendations, or documentation would be helpful.
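One commonly suggested pattern for backfills of this size is to load one day at a time into a date-partitioned Delta table, so each run stays well under the capacity limit and failed days can be retried individually. A minimal sketch, again assuming each day's data has been staged as parquet (paths and table name are hypothetical):

```python
# Sketch: backfill day by day into a date-partitioned Delta table.
# Assumes each day's data is staged as parquet; paths/table name are hypothetical.
import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

start = datetime.date(2024, 1, 1)
days = [start + datetime.timedelta(days=n) for n in range(7)]  # widen the range for 18 months

for day in days:
    path = (
        "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
        f"MyLakehouse.Lakehouse/Files/cosmos/day={day:%Y-%m-%d}/"
    )
    (
        spark.read.parquet(path)
        .withColumn("event_date", F.lit(str(day)).cast("date"))
        .write.mode("append")
        .partitionBy("event_date")
        .saveAsTable("events_history")
    )
```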

4 REPLIES
v-gchenna-msft
Community Support

Hi @v-cyu,

Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this.
We will update you once we hear back from them.

Hi @v-cyu,

We got a response from the internal team:

This Cosmos structured stream source is not Cosmos DB, correct? It sounds like the legacy internal Microsoft tool called Cosmos. If that's the case, there's not much we can do on that front. Nonetheless, 29 TB is not that big in the grand scheme of things, so KQL Database should have no issue with it. What I suspect is going on here is a "small files" problem: if that 29 TB is comprised of billions of small files stored on ADLS Gen1, simply reading in all those individual items will lead to memory consumption issues. In other words, if these were nicely compressed parquet files, we would have better luck. Can you confirm this hypothesis? Can you check the individual file sizes and the overall number of files to ingest?


Can you please share the above details?
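If it helps, here is a minimal sketch of how the file count and sizes could be checked from a Fabric notebook, assuming the stream output has been staged to an ADLS Gen2/OneLake location the notebook can list (the root path below is hypothetical, and the exact FileInfo attribute names may vary slightly):

```python
# Sketch: walk a staged folder and report file count, total size, and average
# file size, to test the "small files" hypothesis. The root path is hypothetical.
from notebookutils import mssparkutils

root = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/cosmos_export/"

file_count = 0
total_bytes = 0
stack = [root]
while stack:
    current = stack.pop()
    for item in mssparkutils.fs.ls(current):
        if item.isDir:
            stack.append(item.path)
        else:
            file_count += 1
            total_bytes += item.size

avg_mb = (total_bytes / file_count / 1e6) if file_count else 0
print(f"{file_count} files, {total_bytes / 1e12:.2f} TB total, {avg_mb:.1f} MB average file size")
```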

Hi @v-cyu,

We haven't heard back from you on the last response and wanted to check whether you have had a chance to look into it.

Thanks

Hi @v-cyu,

We still haven't heard back from you on the last response and wanted to check again whether you have had a chance to look into it.

Thanks
