
Erokor
Resolver II

Any suggestions for data load best practices when source-system impact is a concern?

Hello,

I've got a data source that is accessed via ODBC import, and the impact on the source must be negligible.

I have been able to build a dataset and used bootstrapping techniques to partition the data load into it. The challenge that has become apparent is that not everyone in the business requires the full 10 years of history (hundreds of millions of records), but I cannot tax the ODBC data source to build another dataset.
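For context, the bootstrapping approach applies empty partitions to the dataset and then refreshes them one at a time over the XMLA endpoint with a TMSL command like the one below (database, table, and partition names are placeholders):

```json
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "SalesModel",
        "table": "FactSales",
        "partition": "FactSales_2014"
      }
    ]
  }
}
```

Running one partition per refresh keeps each query against the ODBC source small and lets me spread the historical load over several nights.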

 

I've thought of using dataflows, but I cannot load all of the historical data into a dataflow in a partitioned manner (the way datasets can), and I would need that capability to mitigate the impact on the source system. One option I was entertaining is to alter the data source's SQL view to pull in data year by year, but that isn't ideal either, as it would require manual steps every year.

 

Is there some hidden method and/or best practice for managing data at this scale in a partitioned manner? I would really love the ability to load partitions of data into a dataflow manually, but I don't see that as possible currently. In addition, being able to access the partitions of the dataset endpoint would be another great way to let report creators adjust how much data they want in a dataset further upstream (importing parts of an existing dataset, for example).

 

Any help or direction would be greatly appreciated.

 

 

2 Replies
collinq
Super User

Hey @Erokor ,

 

If your main goal is to minimize the load on the original data source AND give users the ability to pick which year(s) of data they want AND retain some control, then my solution would be laborious but effective: make 10 separate dataflows, each using only one year of data. You can pull each year in, one at a time, into its own dataflow. That way, you don't have to refresh them anymore; once you have the old data, you are good (I am assuming your old data is static).

This is particularly effective and easy if you are using SQL (which I am inferring from your other comments). Just write the query on each dataflow to pull only the year you are looking for.
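As a sketch, each dataflow's Power Query could push the year filter down to the source with a native ODBC query, so only one year of rows ever crosses the connection per refresh (the DSN and view names below are placeholders):

```powerquery
let
    // Placeholder DSN and view name; the WHERE clause executes at the source,
    // so the ODBC connection only ever serves one year of data
    Source = Odbc.Query(
        "dsn=MyErpDsn",
        "SELECT * FROM vSalesHistory
         WHERE OrderDate >= '2015-01-01' AND OrderDate < '2016-01-01'"
    )
in
    Source
```

You would copy this query into each of the 10 dataflows, changing only the date literals, then disable scheduled refresh on every historical dataflow once it has loaded.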

In the end, you can leave them all separate, or you can create one dataflow that references each of the other dataflows to pull the data together.

And this means that the only one refreshing is the current-year dataflow, which you can set to refresh just once a day, in the middle of the night or at some other time that keeps the original data source from being overwhelmed.








Thanks @collinq, this is definitely something we've been considering; it will work flawlessly as long as the data stays static. I'd like to use DirectQuery/hybrid tables and composite modelling; however, the data source we're using (Progress OpenEdge) struggles on this front. We're also validating some of the incremental refresh requirements to see if there might be additional performance savings. Any other input would be appreciated!
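For the incremental refresh validation, a minimal sketch of the source query, assuming an OrderDate column and placeholder DSN/view names: Power BI binds the reserved RangeStart/RangeEnd datetime parameters and issues one such query per partition, which keeps each hit on the ODBC source small.

```powerquery
let
    // RangeStart and RangeEnd are the reserved datetime parameters Power BI
    // supplies for incremental refresh; DSN and view names are placeholders
    Source = Odbc.Query(
        "dsn=MyErpDsn",
        "SELECT * FROM vSalesHistory WHERE OrderDate >= '"
            & DateTime.ToText(RangeStart, "yyyy-MM-dd") & "'"
            & " AND OrderDate < '"
            & DateTime.ToText(RangeEnd, "yyyy-MM-dd") & "'"
    )
in
    Source
```

Because the date filter is embedded in the native query, the source only scans the requested window; whether OpenEdge handles this well is exactly what the incremental refresh validation should confirm.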
