
hoosha_11
Helper I

Defragment not working

Hi, 

As explained by @cpwebb in this article: https://blog.crossjoin.co.uk/2022/11/14/why-you-should-defragment-your-fact-tables-if-youre-using-in...

 

I used "Defragment" for a table in a dataset and that significantly reduced the dictionary size.

I did the same for another table in a different dataset multiple times, but that did not change the dictionary size at all. 

Any idea how I can reduce this size? Does it have anything to do with disabling MDX?

 

hoosha_11_0-1714447602044.png

 

Thanks a lot!

10 REPLIES
cpwebb
Employee

I guess the reason scenario (1) works is that, initially, there's no data in your model, and since the tables are partitioned, that indirectly reduces the amount of parallelism. If you are OK with your model not being queryable during refresh, you could set up incremental refresh but always do a refresh of type ClearValues (which deletes all the data from your model) first. That feels a bit hacky though, and I suspect there's something going on here that needs investigation.

The Parallel Loading Of Tables setting in Power BI Desktop is the easy way to control the amount of parallelism at the table level: https://blog.crossjoin.co.uk/2022/10/31/speed-up-power-bi-dataset-refresh-performance-in-premium-or-... Setting this to One will ensure all your tables are refreshed sequentially.

 

What do you mean by "child" tables? Are these dimension tables? Are you deriving them from your fact tables somehow, maybe in Power Query?

Thanks again @cpwebb, I followed the same steps to update the Parallel Loading Of Tables setting in Power BI Desktop.

 

Since the PeakMemory for the refresh as a whole is around 3.5 GB and I still face memory capacity errors (even after tuning the Power Query for a table with a very high MashupPeakMemory), I looked into the calculated columns again and found something interesting:

 

We have 2 fact tables, and each fact table is linked to multiple dimension tables:

Fact table A with 350,000 rows.

Fact table B with 9 million rows.


There was a calculated column in Fact table A which referenced both Fact table B and a dimension table linked to table B.

Before removing that specific calculated column:

PeakMemory for the refresh as a whole: 3.5 GB

PeakMemory for Fact table A: 1.95 GB

PeakMemory for Fact table B: 2.85 GB

PeakMemory for the dimension table: 1.8 GB

After removing that specific calculated column:

PeakMemory for the refresh as a whole: 2.4 GB

PeakMemory for Fact table A: 1.4 GB

PeakMemory for Fact table B: 1.6 GB

PeakMemory for the dimension table: 175 MB 

 

This shows how one calculated column could easily consume a lot of memory.
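For anyone wondering what such a column can look like, here is a purely hypothetical sketch (the table and column names below are invented, not the actual column from this model): a calculated column on Fact table A that, for each of its 350,000 rows, filters the 9-million-row Fact table B and reads an attribute from a dimension related to B. The per-row scan of the large table is the kind of thing that drives the extra peak memory during refresh.

-- Hypothetical example only; the real column isn't shown in this thread.
-- Assumes 'Fact B' has an active relationship to 'Dim' so RELATED works
-- in the row context over 'Fact B'.
B Amount For This Row =
CALCULATE (
    SUM ( 'Fact B'[Amount] ),
    FILTER (
        'Fact B',
        'Fact B'[Key] = 'Fact A'[Key]
            && RELATED ( 'Dim'[Category] ) = "Something"
    )
)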

I'm going to schedule a refresh over the weekend (every 30 minutes) to see the results and whether I still face refresh failures.

 

If I still see refresh failures, considering the PeakMemory of around 2.4 GB and the A2 capacity that comes with 5 GB RAM, do you think I need to remove more calculated columns?

 

Thanks a lot!

cpwebb
Employee
Employee

If defragmenting your table doesn't reduce its size, then there probably isn't anything else you can do. Are you trying to reduce the size of your table? If so then you will need to look at things like splitting datetime columns into separate date and time columns or rounding numbers in decimal columns, both of which can reduce the cardinality of your columns. See articles like this for more details https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/
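To make the cardinality point concrete, here is a minimal sketch in DAX with hypothetical column names; in practice you would do the split or the rounding upstream (in Power Query or the source) and remove the original column, so the high-cardinality column never reaches the model at all.

-- Hypothetical column names, shown in DAX only to illustrate the arithmetic.
-- Date part: the integer portion of a datetime is the date.
Order Date = INT ( 'Fact'[OrderDateTime] )

-- Time part: the fractional portion is the time of day.
Order Time = 'Fact'[OrderDateTime] - INT ( 'Fact'[OrderDateTime] )

-- Rounding reduces the number of distinct values, and so the dictionary size.
Amount Rounded = ROUND ( 'Fact'[Amount], 2 )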

Thanks a lot @cpwebb 

Yes, I'm currently trying to reduce the size of the tables. Despite removing all unused columns, optimizing calculated columns, and even moving some columns to Power Query and SQL Server, I continue to face the following error, although DAX Studio's VertiPaq metrics show only 800 MB of memory used and we have 5 GB of RAM with our embedded capacity:

"Resource Governing: This operation was canceled because there wasn't enough memory to finish running it."

In the Power BI service, I've scheduled a refresh every 30 minutes with incremental refresh enabled. Interestingly, I sometimes don't see this error for up to 10 consecutive hours (which means 20 refreshes). But, intermittently, the refreshes begin to fail, sometimes even failing twice in a row.

These failed refreshes consistently die between the 3-minute and 3-minute-30-second mark. So, if a refresh runs past 4 minutes, it usually completes successfully.

I'm struggling to identify the root cause of this issue. If it is indeed related to calculated columns, I will transfer more columns to SQL and Power Query. But, I'm really not sure where to look.

 

Thanks for your time!

You're running into something called the Command Memory Limit, which is documented here: https://learn.microsoft.com/en-gb/power-bi/enterprise/troubleshoot-xmla-endpoint#resource-governing-... It's also something I will address in an upcoming blog post in a series I started here https://blog.crossjoin.co.uk/2024/04/28/power-bi-semantic-model-memory-errors-part-1-model-size/ (reading this post will provide some useful background).

 

If you're using an A2 with 5GB of RAM and your existing model is 800MB, then you have 5GB minus (800MB plus some extra memory used by queries and sessions) left for your refresh, so say 4GB.

The next thing you should do is run a Profiler trace against your model (see https://blog.crossjoin.co.uk/2020/03/02/connecting-sql-server-profiler-to-power-bi-premium/ for how to do this) and capture the Command End events when you run a refresh. From this you'll be able to see the peak memory used during the refresh as a whole and for just Power Query (see https://blog.crossjoin.co.uk/2023/04/30/measuring-memory-and-cpu-usage-in-power-bi-during-dataset-re...) and for just Power Query for individual table partitions (see https://blog.crossjoin.co.uk/2023/04/02/identifying-cpu-and-memory-intensive-power-query-queries-dur...).

My guess is that the peak memory for the refresh as a whole is going over 4GB. If so, then you need to look at the Power Query memory usage numbers for individual partitions and see if there are any memory-hungry Power Query queries - if so, they need to be tuned. If not, then it's likely that your calculated columns are still the problem. Reducing the amount of parallelism during refresh may also help reduce the peak memory usage (see https://blog.crossjoin.co.uk/2022/10/31/speed-up-power-bi-dataset-refresh-performance-in-premium-or-...).

 

If peak memory for the refresh is a lot less than 4GB then it's possible that the problem isn't the refresh, but that users are running memory-hungry queries during the refresh. Running a Profiler trace during a refresh and looking for Query Begin/End events will tell you if queries are being run. There is a new Profiler event coming very soon which will tell you more about query memory usage, but until then it's impossible to know how much memory a query consumes in the Service; you can, however, see it easily in Power BI Desktop by watching Task Manager while you view or interact with a report. If this is the problem then you'll need to tune your model and measures to reduce memory usage. This antipattern is a very common example of DAX that can cause memory spikes: https://xxlbi.com/blog/power-bi-antipatterns-9/
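As a generic illustration (not necessarily the exact pattern in the linked post, and the table and measure names below are invented), a measure that passes a whole fact table to FILTER() inside CALCULATE can force the engine to materialise that table, whereas an equivalent column-level filter usually stays in the storage engine:

-- Memory-hungry shape: FILTER iterates the whole Sales table.
Sales Amount (heavy) =
CALCULATE (
    SUM ( Sales[Amount] ),
    FILTER ( Sales, Sales[Quantity] > 1 )
)

-- Leaner shape: a column-level (boolean) filter only touches Sales[Quantity].
-- Note that the two versions are not always semantically identical.
Sales Amount (leaner) =
CALCULATE (
    SUM ( Sales[Amount] ),
    Sales[Quantity] > 1
)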

 

If you are able to post some screenshots of your Profiler traces and the memory usage numbers please let me know - I'm curious to see what you find!

 

HTH,

 

Chris

Thank you very much @cpwebb. Very thorough and great information. I really appreciate it! 

I used Profiler for two different datasets, and the following are the results:

 

1) The refresh failed for this dataset:

hoosha_11_0-1714522176146.png

hoosha_11_1-1714522298881.png

PeakMemory: 4.5 GB, MashupPeakMemory: 1.49 GB

I checked the Progress Report End events for all the tables and observed that the MashupPeakMemory for most tables falls between 120 MB and 250 MB, except for one table, which was 1.4 GB!

hoosha_11_2-1714522929603.png

 

2) This is another dataset with an incremental refresh, and the refresh failed:

hoosha_11_3-1714523066827.png

hoosha_11_4-1714523110036.png

PeakMemory: 3.95 GB, MashupPeakMemory: 2.12 GB

And it's the same story with this dataset: the MashupPeakMemory for one of the tables was 1.65 GB!

hoosha_11_5-1714523202176.png

 

I refreshed the same dataset again, and it refreshed successfully with a PeakMemory of 3.82GB, as follows:

hoosha_11_8-1714524109181.png

 

As you mentioned, I think I need to tune the queries in Power Query for this table that is eating lots of memory, correct?

This is the query for that table:

hoosha_11_7-1714523670433.png

Thanks again! Much appreciated!

Thanks for the detailed information. I'm still learning a lot about this topic and it helps to have a real-life example to work through!

 

It does sound like there's some tuning needed in your Power Query queries. Bear in mind that for the Command End for the refresh you're seeing peak values for all operations, so if you're refreshing 6 tables in parallel then that peak value could be 6 * (a reasonable amount of memory for a single table). From my limited experience I would say that 200-400MB is a reasonable value for memory used by Power Query for a single table.

 

That said, a MashupPeakMemory of 1.4GB or 1.6GB is definitely too high. For the Power Query query you posted, how much of that folds? Do the Merge and the Group operations fold? They would be the obvious memory-hungry operations; if they do, the next thing to investigate would be the steps where you're changing data types; moving them to the end of the query, or at least after the Merge, could help.

Thanks again @cpwebb for your valuable comment.

I removed all the transformations and started from scratch, and noticed that the Group By step (Table.Group) is not folding. Also, as you mentioned, changing data types needs to be done after the Merge.

I tried a lot, but I wasn't able to get the Group By step to fold, so I created a view in SQL Server for the same table with the same steps and columns. After refreshing the dataset with this view and tracing it with Profiler, the MashupPeakMemory dropped from 1.4 GB to 140 MB for that table, and the PeakMemory in Command End for all operations was 3.4 GB.

 

I followed the same steps for another small dataset, and even though the MashupPeakMemory for that specific table dropped from 1.65 GB to 120 MB, the refresh failed again. After some tests, I found that removing 4 calculated columns resolved the issue (I had another table with 9 million rows and 4 calculated columns).

Below is one of the calculated columns that was causing the issue (all the other columns use exactly the same expression):

hoosha_11_0-1714623675431.png

I had even optimized this column before by changing CALCULATE to MAXX!

 

By removing these 4 calculated columns the PeakMemory for all operations dropped to 2 GB.

 

I will try reducing the amount of parallelism to see how much PeakMemory changes.

I guess the challenge is to identify which calculated columns are consuming the most memory. With a larger dataset containing more columns, it becomes difficult to pinpoint the memory-hungry calculated columns.

 

Thanks again!

I'm glad to hear you've made progress - pushing transforms as far upstream as possible is always a good thing.

 

Regarding your calculated columns: any DAX expression that uses FILTER() on a whole table always carries a risk of a memory spike. How many rows are in the FollowUps table? It could be that an expression something like this would be better from the point of view of memory:

 

Response(patient contact)_ =
VAR CurrentInspectionId = SELECTEDVALUE ( FollowUps[InspectionId] )
RETURN
    CALCULATE (
        MAX ( FollowUps[Response] ),
        FollowUps[InspectionId] = CurrentInspectionId,
        FollowUps[QuestionId] = "973"
    )

 Apologies for any typos/syntax errors, or if this is exactly what you had before you optimised.
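For contrast, here is a hedged guess at what a FILTER-based version of that column might look like (the actual column is only visible in a screenshot, and 'HostTable' is a placeholder for whichever table the calculated column lives in). Iterating the whole FollowUps table once per row of the host table is exactly the shape that tends to cause the memory spikes described above:

-- Hypothetical FILTER-based version, for illustration only;
-- 'HostTable' stands for the table hosting the calculated column and is
-- assumed to have an InspectionId column.
Response(patient contact)_ =
MAXX (
    FILTER (
        FollowUps,
        FollowUps[InspectionId] = 'HostTable'[InspectionId]
            && FollowUps[QuestionId] = "973"
    ),
    FollowUps[Response]
)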

 

I don't think there's a way of identifying how much memory each calculated column in a table uses during a refresh, so it will have to be a process of elimination - you'll need to add each calculated column individually, refresh, and see how the memory usage is affected.

Hi and thank you again, @cpwebb. I've continued working on this and managed to create a SQL Server view for the table that previously had the query folding issue. We also have 2 large tables, each containing around 9 million rows, and I've transferred almost all of the calculated columns for these tables into SQL Server. I also adjusted the maximum parallelism in Power BI Desktop to 3 and then to 1 before publishing.

Eventually, the PeakMemory dropped to around 3.4GB-3.6GB, but the refreshes still fail. 

It's quite confusing, as I'm not sure where to look next.

 

Here are some findings from my tests:

1- When I define an incremental refresh policy and publish the dataset, the first refresh, which creates the partitions, succeeds (tested multiple times), but subsequent refreshes fail.

2- Without defining an incremental refresh policy, the first refresh fails. 

3- The CPU Time for the 2 large tables is 229000 and 103000. Even the manual refresh in PBI desktop for these 2 tables takes a considerable amount of time to complete.

4- Refreshed each table in SSMS and checked Profiler. We have 2 fact tables, one with 300K rows and a PeakMemory of 1.95 GB and the other with 9 million rows and a PeakMemory of 2.9 GB. There are 2 other child tables with PeakMemory of 1.7 GB and 1.3 GB, and the remaining tables are around 200 MB.

5- Any time I publish the semantic model to the Power BI service, I observe that the dictionary size for the ID columns of the 2 large tables is very big and I have to defragment, which sometimes works and reduces the dictionary size and sometimes doesn't.

 

Additionally, I would like to know: is it possible to refresh the tables sequentially in the Power BI service? Does setting maximum parallelism (say, to one) affect the tables, or only the partitions?

I've created a PowerShell script in Azure Runbooks to refresh the tables, but setting "maxParallelism" results in an error (perhaps I'm not defining it correctly).

 

Thanks again, your ideas and comments are really really appreciated!
