Hi community,
I'm currently looking to improve my incremental refresh setup, and hopefully some of you can help me.
Here is the context:
We have one dataflow entity sourced from a SQL Server view (created with the native wizard; the view is well prepared, meaning there are no M transformations at all in the entity and query folding is ensured).
The view itself is a bunch of inner joins and sum/group by aggregations. I have integrated into the view the two columns required for incremental refresh: the partition timestamp and the change-detection (CDC) timestamp.
With SQL Profiler running, I can see many queries (one per partition) against the database, of the following form:
select top 1
    field1,
    field2,
    field3
from (
    select
        field1,
        field2,
        field3,
        cdctimestamp
    from
        hereismysourceview
    where
        partitiontimestamp >= convert(datetime2, '2020-05-01')
        and partitiontimestamp < convert(datetime2, '2020-06-01')
) as t
order by
    cdctimestamp desc
We can see it takes the main query, adds a WHERE clause for the current partition, and restricts the result to the most recent CDC timestamp within that partition.
The issue is that each of these queries hits the fact table: the query has to perform all the aggregates, joins and filtering just to produce a single record. Meanwhile the CDC timestamp is also available in a side table, as we keep a record of all data warehouse refreshes in a control table. It would be far more efficient to query that table alone, instead of querying my complex view again and again.
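For illustration, the change-detection query I'd like the service to issue against the control table alone could look something like this (table and column names are hypothetical, just to show the shape):

```sql
-- Hypothetical: fetch the latest CDC timestamp for one partition window
-- directly from the refresh control table, without touching the fact view.
select max(cdctimestamp) as lastchange
from dwh_refresh_log
where partitiontimestamp >= convert(datetime2, '2020-05-01')
  and partitiontimestamp <  convert(datetime2, '2020-06-01');
```

A scan of a small control table like this would be orders of magnitude cheaper than re-running the aggregating view once per partition.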
I have seen this in the dataset documentation: https://docs.microsoft.com/en-us/power-bi/service-premium-incremental-refresh#custom-queries-for-det.... Using the XMLA endpoint (still in preview, and seemingly only for datasets), we could have much more sophisticated management of the partitions, which sounds really interesting. And there is a really cool feature: adding a custom query to detect data changes. Does anyone here have experience with this? Any idea if this is available for dataflows?
Or any idea how I could improve the situation with the currently available tools? The performance here is really poor: I'm checking changes on 25 partitions, and the check on each partition takes almost as long to run as loading the full dataset(!).
Any advice or brilliant idea is more than welcome 😉
Cheers
@tac any chance you found the answer to this? I have the exact same situation and have been unable to find any information on it.
Hi joelt,
We managed the limitations of the dataflow by..... removing dataflows....
The main concern was the one described in my original message, but we also noticed a big limitation when trying to enable incremental refresh on a dataset that consumes data from a dataflow. As this was not possible, we decided to redesign our flows without dataflows. We now have a direct link SQL Server => PBI datasets, and it works well with incremental refresh once a custom polling expression is set up, meaning it does not query the fact table to decide whether a partition should be refreshed. The key thing when setting up the polling expression is to encapsulate everything into one M expression rather than using shared expression references; I was getting some weird errors when I tried to do so.
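For anyone landing here later, a minimal sketch of what such a custom polling expression can look like (applied to the partition refresh policy via the XMLA endpoint; the server, database, table and column names below are hypothetical). Note that everything is inlined in a single M expression, with no shared expression references:

```m
// Hypothetical polling expression: return the latest CDC timestamp for the
// partition window from the control table, so the fact view is never queried.
let
    Source = Sql.Database("myserver", "mydwh"),
    ControlTable = Source{[Schema = "dbo", Item = "dwh_refresh_log"]}[Data],
    InWindow = Table.SelectRows(
        ControlTable,
        each [partitiontimestamp] >= RangeStart and [partitiontimestamp] < RangeEnd
    ),
    MaxChange = List.Max(InWindow[cdctimestamp])
in
    MaxChange
```

RangeStart and RangeEnd are the standard incremental-refresh parameters; the service compares the returned scalar with the value stored for the partition to decide whether a refresh is needed.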
Hope it helps.
Good luck.
Hi @tac ,
You may need to improve the dataflow performance; see: https://community.powerbi.com/t5/Service/SQLServer-performace-issue-will-dataflows-help/td-p/783770.
Or use incremental refresh for the dataflow; see more detail: https://docs.microsoft.com/en-us/power-bi/service-dataflows-incremental-refresh.
Best Regards,
Amy
Community Support Team _ Amy
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
Hi @v-xicai ,
Thanks for your message and the links to the documentation. I'd say I have already made most of the possible improvements on the dataflow side, given that there is more or less no transformation in the entity subject to incremental refresh.
However, when reading the documentation, something caught my attention:
We could read:
"
Tip
The current design requires that the column to detect data changes be persisted and cached into memory. You may want to consider one of the following techniques to reduce cardinality and memory consumption:
"
But how could we do this? At the moment, I'm returning the CDC timestamp for each row, but I would be happy to replace it with an M function or something similar. But how?
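One of the techniques the docs hint at is reducing the cardinality of the change-detection column, for example by truncating the timestamp so fewer distinct values have to be cached. A sketch in M, assuming a source step named Source and a column named cdctimestamp (both hypothetical):

```m
// Hypothetical: truncate the CDC timestamp to the day so that far fewer
// distinct values need to be persisted and cached for change detection.
TruncatedCdc = Table.TransformColumns(
    Source,
    {{"cdctimestamp", each DateTime.From(Date.From(_)), type datetime}}
)
```

The trade-off is granularity: with day-level truncation, several changes within the same day collapse to one value, which is usually fine for detecting "did anything change since the last refresh".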
Thanks for your help