Solved: Never-ending incremental refresh in the Service

PeKaDe · ‎10-11-2021

Hi Guys,

Could you share some experience, perhaps some tips & tricks, concerning situations when [full/first] incremental refresh in the Service is taking very long?

We built a new report recently, which refreshes in between 20 and 30 minutes in Power BI Desktop (2 x Azure SQL sources, 20M each). This report is configured as incremental and after publishing, when I start the refresh for the first time, it times-out after 5 hours of refreshing.

My first suspicion was obviously the DB but I no longer think that, since I can refresh fully in Power BI Desktop in the reasonable time. When I run the same SQL in Azure Data Studio, it's even quicker (around 15 min. to finish both queries).

Last week I ran an initial incremental refresh in the Service for a different report and it loaded 33M records in 11 minutes. Same workspace, same SQL Server, same DB. Same report author.

My queries are not complicated at all, it's "select * from Table with (nolock)" which obviously is pretty intensive but on the other hand it needs to be done at least once before the incremental refresh kicks in. I introduced (nolock) hoping to speed up the query (there are multiple updates on records in my table all the time) but the report still times-out in the service after 5 hours (with or without a (nolock)).

Is Azure SQL DB at 100% during my query execution? Well, it obviously is, but the same is true when this query is run from either Desktop or Data Studio - so 100% doesn't seem to be a problem, it's just that the query never finishes. Is the query stuck? Well, it's not blocked, sometimes it waits with NULL wait_type, sometimes with IO_QUEUE_LIMIT, sometimes with PAGEIOLATCH_EX - but again, the same is true when I run this query from elsewhere and it finishes nicely every time.

One thing I see which troubles me a bit, is that when I check:

SELECT text, wait_type, *

FROM sys.dm_exec_requests 

CROSS APPLY sys.dm_exec_sql_text(sql_handle)

... it always shows my main query without any parameters. I actually thought that the initial incremental refresh goes year by year.

It's not the first report we built or deployed to a given workspace, configured as incremental and fully refreshed from the same SQL Server. Such a situation only happened once in the past when I would mark "Detect data changes" checkbox and the report would never finish refreshing. But that issue is long gone, I no longer try any checkboxes 🙂

Any ideas what else I could check and try? Or how to even isolate the problem?

I couldn't find any logs/reports that would give my any visibility of how the actual machines behind the Power BI Service are performing.

I would appreciate any hints.

Thank you,

PKD

EDIT: Actually, turning off the second data source (as an attempt to isolate the problem) did not help at all. Report was re-published and the refresh timed-out as usual after being stuck for a couple of hours on select * from...

EDIT 2: I managed to look behind the scenes a bit. Having connected the logs from the Power BI Service, I can see that during the 4 hours (this is a time it has been refreshing so far) 16 one-year-big partitions have been processed (successfully). I don't know what's happening but it seems to me, that during this full refresh Power BI generated 100 queries, each one for a year, that look like this:

let __AS_Query__ = let RangeStart=DateTime.FromText("19220101T00:00:00"),  RangeEnd=DateTime.FromText("19230101T00:00:00")  in  PolicyBased_<I replaced some id here>, __AS_Table__ = Table.FromValue(__AS_Query__), __AS_Compact__ = Table.RemoveColumns(__AS_Table__, Table.ColumnsOfType(__AS_Table__, { type table, type record, type list })), __AS_Effective__ = Table.TransformColumnNames(__AS_Compact__, Text.Clean), __M = #shared[[List.ConformToPageReader],[Graph.Nodes],[Value.Lineage]]?, __AS_TransformClassification__ = (value) => Value.ReplaceType(Table.Distinct(List.Accumulate(__M[Graph.Nodes](__M[Value.Lineage](value)), #table({"ColumnName", "LabelId"}, {}), (current, node) => Table.Combine({current, Table.SelectColumns(Table.ExpandRecordColumn(Table.SelectRows(Value.Traits(node), each ([Provider] = "SQL" or [Provider] = "MicrosoftInformationProtection") and [Identifier] = "Classification"), "Value", {"LabelId"}), {"ColumnName", "LabelId"}, MissingField.UseNull)}))), type table [ColumnName=nullable text, LabelId = text]), __AS_WithClassification__ = __M[List.ConformToPageReader]({__AS_Effective__, __AS_TransformClassification__}), __AS_Result__ = if __M[List.ConformToPageReader] = null then __AS_Effective__ else __AS_WithClassification__ in __AS_Result__

It looks like Power BI is querying year by year for 100 years. This 100 is caused by a setting which I introduced as part of the incremental refresh to tell Power BI that it should retain data for a 100-year-old period. It may sound silly, but there's no setting to say "take all data". Anyway, the thing is that my DB only contains data for 6 years (2015-2021) so obviously all the queries for years 1921 to 2014 should complete in no time. Provided that there is an index on ModifiedAt column which there is (ModifiedAt is used for incremental refresh and the whole RangeStart and RangeEnd processing).

Does that help? Can anyone suggest what can be going on here?

PeKaDe · ‎10-19-2022

Hi @Federico , good point. Let me note down some insights I've had since:

as suggested below, incremental refresh might not work at all with direct query/direct SQL. This said, it shouldn't be a problem. I learned since, that I don't need a select statement, and [no lock] is not really required. So you should be well off connecting to your SQL source in a standard way.
please check my other thread here: https://community.powerbi.com/t5/Desktop/Incremental-Refresh-is-there-a-way-to-handle-record-updates.... It basically shows that incremental refresh cannot be used in some very common scenarios of updating past data. If you have frequent updates to old data, incremental refresh will simply not work and it may take you months to realise why. Some details are in the other thread.
from my observations, if you use "detect data changes", provide many partitions (e.g. daily or weekly partition) and the nature of your data updates is that many partitions may be affected, then incremental refresh can turn out to be slower than the full refresh each time. That's not scientifically proven, just my experience.
and finally, rebuild your indexes (needless to say, make sure that you have proper indexes on the column used for the incremental refresh AND for detect data changes if you use it). This turned out to be the main reason for slowliness in my case.

My final point is (again speaking from experience): connecting 2 tables with 25+M records in a relationship is going to kill your CPU & memory. Not even a premium capacity is going to help you. I ended up denormalizing the data and solving all the memory problems by that. This may be irrelevant in your case, I'm only saying this because after I "solved" the incremental refresh problem (i.e. turned it off), I immediatelly found myself having memory problems. There might be better solutions of course.

Good luck 🙂

View solution in original post

federicogulle · ‎10-18-2022

Hi, I would like to know if you can solve your problem and if you acn share the solution you found. Is happening the same to me. I have 25 million registers from two different origins and it keeps refreshing until the time-out error occurs.

Thank you

PeKaDe · ‎10-19-2022

Hi @Federico , good point. Let me note down some insights I've had since:

as suggested below, incremental refresh might not work at all with direct query/direct SQL. This said, it shouldn't be a problem. I learned since, that I don't need a select statement, and [no lock] is not really required. So you should be well off connecting to your SQL source in a standard way.
please check my other thread here: https://community.powerbi.com/t5/Desktop/Incremental-Refresh-is-there-a-way-to-handle-record-updates.... It basically shows that incremental refresh cannot be used in some very common scenarios of updating past data. If you have frequent updates to old data, incremental refresh will simply not work and it may take you months to realise why. Some details are in the other thread.
from my observations, if you use "detect data changes", provide many partitions (e.g. daily or weekly partition) and the nature of your data updates is that many partitions may be affected, then incremental refresh can turn out to be slower than the full refresh each time. That's not scientifically proven, just my experience.
and finally, rebuild your indexes (needless to say, make sure that you have proper indexes on the column used for the incremental refresh AND for detect data changes if you use it). This turned out to be the main reason for slowliness in my case.

My final point is (again speaking from experience): connecting 2 tables with 25+M records in a relationship is going to kill your CPU & memory. Not even a premium capacity is going to help you. I ended up denormalizing the data and solving all the memory problems by that. This may be irrelevant in your case, I'm only saying this because after I "solved" the incremental refresh problem (i.e. turned it off), I immediatelly found myself having memory problems. There might be better solutions of course.

Good luck 🙂

PeKaDe · ‎10-17-2021

Hi @v-chenwuz-msft,

Thank you for your answer. I was not aware that Direct Query is not accepted in the incremental refresh. The article you quoted is clear on that (thanks!) - it's such a pity that Power BI Desktop is not. Yes, there is a warning that the query might not be foldable. But being a user it's hard to imagine what would be wrong with "select * from <table> with (nolock)" in terms of foldability. It would be really simple to just say in the Desktop that incremental and direct query don't work together. Well... I know now.

Back to a problem: can you actually explain what is going on in the background? The fact that the engine needs to pull all the rows, even year by year, does not scare me too much. I'm refreshing those 20M rows in a couple of minutes. But what is the reason that the full refresh is taking so long, especially when it's querying "partitions" that have no data at all? I'm seeking to understanding if there's something wrong with my data source/data set or can I just assume that Power BI is completely misbehaving if you use direct query with incremental. Again - it's not slow, there's clearly something wrong, because in 5 hours it managed to finish processing 6 yearly partitions for which there was no data at all (because data starts in 2015).

Thanks in advance for helping me understand the real reason behind this behaviour.

v-chenwuz-msft · ‎10-14-2021

Hi @PeKaDe ，

I guess you have read this article already.

As you said ,there is no parameters in your main qurey, so I think query folding isn’t occurring in the Power BI Desktop model.

Incremental refresh can't use a native SQL query. So, it would force the Power Query mashup engine to retrieve all source rows, and then apply filters to determine incremental changes.

(https://docs.microsoft.com/en-us/power-bi/guidance/power-query-folding#guidance )

Maybe when you create your model had do something prevent folding

https://docs.microsoft.com/en-us/power-query/power-query-folding#transformations-that-prevent-foldin...

I recommend you to check your model when it can be foled. hope this will help you.

Best Regards

Community Support Team _ chenwu zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.