cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
najmead
Frequent Visitor

De-duplicating incrementally refreshed data

I have an issue with slow dataset loads, and have been exploring the use of incremental refreshes to improve performance.  However, my data is quickly getting out of sync with source.  My understanding of the incremental refresh process is that assuming I set a refresh window of 1 month, if a record that is created two months ago gets updated in source, it will essentially get loaded a second time -- since the original record has now fallen outside the refresh window.  This would certainly explain why my data is getting out of sync.

 

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem -- in order to extend it sufficiently to cover my needs, performance starts to degrade again.

 

Consequently, is there any reason I can't de-duplicate the data with a step in Power Query?  eg, if I add steps after the filter for RangeStart and RangeEnd to sort by last_mod_date and then deduplicate by the primary key... will that work?  Or does the Power Query only apply on the partition that is being refreshed?

2 REPLIES 2
lbendlin
Super User
Super User

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem 

You may be able to find a balance there that works good enough.

 

In any case, you are hitting on the true core of the issue here - incremental refresh is exactly that - incremental.  It expects data to be immutable once it has been added.  What you expect to get (differential refresh) is not what incremental refresh is about, and is not what it can deliver.

 

There is a small remedy in using the "detect data changes"  feature but it comes with a gigantic price tag. I'll let you find out what that is.

 

What you want to do is continue with your incremental refresh, and  occasionally (once a month for example) force a reprocess of a couple of the "older" partitions as well.  Doesn't even have to be a complete flush and fill of the dataset , you can use the XMLA endpoint to target individual partitions.

najmead
Frequent Visitor


@lbendlin wrote:

There is a small remedy in using the "detect data changes"  feature but it comes with a gigantic price tag. I'll let you find out what that is.

I'm sorry, but I have no idea what this is referring to.

 

Helpful resources

Announcements
September Update

Check it Out!

Click here to learn more about the September 2022 updates!

Power BI Show episode 9

The Power BI Community Show

Watch the playback when Priya Sathy and Charles Webb discuss Datamarts! Kelly also shares Power BI Community updates.

Power BI Dev Camp Session 26

New Date - Check it Out!

Mark your calendars and join us on Thursday, October 6 at 11a PDT for a great session with Ted Pattison!

Top Solution Authors
Top Kudoed Authors