Re: Incremental refresh causing duplicate rows

Dayna · ‎02-23-2022

Hello,

I've implemented incremental refreshes on one of our fact tables, called 'Transactions'.

I can see on a given day there's been a change to some rows, and when this dataflow is consumed within the desktop, I'm receiving duplicate rows but the 'modified' and 'created' dates are different. One shows 30/01/2022 the other is 01/02/2022.

My incremental refresh is configured to store data for the last two years, and refresh rows for the last three months based on 'created' date, and automatically detect changes is enabled and set to watch 'modified' date.

I know if I do a manual refresh, it'll likely fix the issue as it'll do a normal load from the report, but I don't want to do this, I'd rather incremental worked as I would hope.

When I query the row within the service, only the one row is returned - I add this step after the 'canary load' step at the end of all the steps on the workspace.

Can anyone assist, or is this expected behaviour?

Many thanks,

Dayna

gustavofreitas · ‎08-29-2023

Hi Dyana,

My solution for this case was make a buffer for the sorted set by the modified date (descending), and then remove the duplicates.

...

#"Linhas classificadas" = Table.Buffer ( Table.Sort(#"Colunas removidas", {{"Id", Order.Ascending}, {"Modificado", Order.Descending}})),

#"Duplicatas removidas" = Table.Distinct(#"Linhas classificadas", {"Id"}) ...

Obs.: without applying the buffer, power bi does not guarantee that the first row of sorting will be kept.

Dayna · ‎02-28-2022

Hello,

I'm not seeing duplicates in the service though, only when consumed within the report. Is this expected behaviour for the incremental refresh?

Thanks,
Dayna

lbendlin · ‎02-28-2022

Can you confirm that your M query is shaped according to the rules I mentioned?

Dayna · ‎02-28-2022

I've not configured this other than in the UI within the PowerBI service, it has automatically created this:

Table.SelectRows(#"Changed column type 2", each DateTime.From([pro2created]) >= RangeStart and DateTime.From([pro2created]) < RangeEnd)

lbendlin · ‎02-28-2022

That looks accurate. How do you know that records are duplicated?

Dayna · ‎02-28-2022

When I load in this dataflow into a report, my numbers for a specific day are inflated. When digging deeper, I can see multiple records for the same ID, which should be impossible as it's unique.

When I consume the dataflow in a new report, it's duplicated when filtering on one of these ID codes. Doing the same in the service though only brings back one result, which is strange...

hmorrow · ‎06-20-2022

Did you get a resolution on this? I'm having the same issue when setting incremental refresh on Power BI Online premium workspace. Any insight you can share would be helpful. Thank you.

Dayna · ‎06-21-2022

Sadly not, I ended up disabling the incremental data refreshes.

hmorrow · ‎06-21-2022

That's too bad. Only way around I found is to remove duplicates and keep the most recent when bringing in the dataflow to a dataset. However, that's not the best option since you'd have you do that for each report you build...

lbendlin · ‎02-26-2022

Duplication can only happen when partitions store overlapping date ranges. You must make sure that of the RangeStart and RangeEnd parameters one is inclusive and one is exclusive

For example

Date >=RangeStart and Date < RangeEnd

Date >RangeStart and Date <=RangeEnd

are both possible (but not at the same time!)

However

Date >=RangeStart and Date <=RangeEnd

will lead to duplicate rows and must be avoided.

YuriMN · ‎04-15-2024

This worked for me, thank you!

Jeanxyz · ‎08-16-2023

I have a similar issue. The data refreshing failed because Power BI service identified duplicate Issue IDs (which is not allowed in data model because issue ID is my primary key column). When I checked the source data (Jira system), there is no dulicate I have set up the filter query as below:

Last_Update_Date >=RangeStart and Last_Update_Date < RangeEnd

My guess is during the incremental data refreshing, someone has updated an issue in Jira system causing the Last_Update_Date to be changed during the incremental refreshing process . As a result, this issue is included both in archive data set and refresh data set, causing the same issue to be loaded twice.

Not sure if this is the root cause and if there is a solution to the issue.

@lbendlin

lbendlin · ‎08-16-2023

This scenario is described in the documentation. Keep in mind that Incremental Refresh expects immutable data. If your data changes after the fact (which would technically require a differential refresh) then you need to supplement the incremental refresh with an occasional Full refresh of all partitions.

Jeanxyz · ‎08-18-2023

I read the documentations, indeed I need to enable 'get the latest data with direct query' in refresh setup page. This is not practical because 1) I will need to update my PB license to a premium per user one and the same for all users who need access to the Power BI report on service, 2) I need to change the storage mode of relevant tables into 'direct query' mode, this also means I need to rewrite all the DAX measures from scratch.

I'm wondering if there is any other solution. My only need is to shorten the data refreshing cycle from 25 minutes to 5-10 minutes. I'm connecting Power BI to Jira data using a odata connector.

lbendlin · ‎08-18-2023

Your data source is not compatible with Microsoft's implementation of Incremental refresh. Run a process outside of Power BI (for example on SQL Server) that can properly handle data updates. You can then connect to that SQL server in Direct Query mode for close to real time performance.

DanielAus · ‎08-08-2023

Thank you - this was causing my issue, I never would have thought to look for this.

Incremental refresh causing duplicate rows

Helpful resources

Fabric certifications survey

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly