Solved: Unable to load model due to reaching capacity limi...

dbeavon3 · ‎04-10-2024

We've started to be severely affected by error messages in our capacity. They look like so:

Previously the service would slow down but would not generate errors. Can someone tell me if the rules have changed as they relate to these errors? Is Microsoft actively encouraging more customers to upgrade their capacity?

I have P1, and we have always had brief spikes in CPU usage, but it never caused these errors to be displayed until now. We generally have plenty of excess capacity for 95% of the work day. I review our capacity dashboard on a regular basis, and we expected to need to upgrade to P2 in a year or so, but certainly not right now.

Please let me know if the logic for this (the error rules) has changed. I am pretty convinced that the display of this error is working differently than in the past. If it changed, we would like to go back to the old rules and/or customize the rules causing the error to be displayed. Our tolerance is pretty high. If users are impacted by delays for a minute or two, we don't want errors to be displayed since that leads to questions and support incidents.

lbendlin · ‎04-10-2024

You basically have two activities, background and interactive.

When things go south, they are affected differently, in this order:

1. Interactive gets throttled

2. Interactive gets rejected (your scenario)

3. Background gets rejected

In my personal opinion this is the wrong thing to do (background should be rejected first).

>> Is Microsoft actively encouraging more customers to upgrade their capacity?

They want you to move to Fabric and a consumption-based pay model. They will say things like "billable" and "Auto-Scale" etc.

>> If users are impacted by delays for a minute or two, we don't want errors to be displayed since that leads to questions and support incidents.

If you are at the interactive rejection phase and your background operations continue to pile on CU debt, you can assume that your entire capacity will be dead for the next six hours at least. You need to move everything off the capacity and onto your spare capacity until the burndown is completed. You have a spare capacity, right?

dbeavon3 · ‎04-10-2024

Hi @lbendlin
This is very helpful. Thanks for the reply.

I see what you are saying about Microsoft's desire to "protect" me from getting all the way to phase 3. However it seems that the difference between phase 1 (throttling) and phase 2 (error/rejection) is ONLY based on an arbitrary threshold that is selected by Microsoft. We could continue to throttle longer, and queue up requests for another minute or two more in order to avoid a lot of the errors.

Do you know what threshold rules are used to get from phase 1 to phase 2? Is it configurable?

>> You have a spare capacity, right

uhhh

We are in discussions about either upgrading to P2 or creating a secondary P1. My preference is to have another P1. In most cases it is report developers on the desktop who are unwittingly causing problems. And they only do this because they are relying on P1-specific features (things not available in "shared capacity"... like importing PBI datasets as an MDX source into a new PQ model via the Analysis Services connector).

Having another P1 will allow us to "firewall" the developers away from our important workloads. I.e., there will be a "high priority" capacity for production purposes, which should have minimal usage. And then the developers can feel free to swamp the "low priority" capacity without creating an impact on real business users.

lbendlin · ‎04-11-2024

We don't have a spare capacity. What we did is designate one of our capacities as the "dog house". Any misbehaving workspaces go there, and any innocent bystanders are moved to other capacities. Similar to your firewall approach. We then work with the offenders to make their refreshes more efficient.

As for the thresholds - I recommend raising a ticket with Microsoft to ask them about the mechanics. The Fabric Capacity Metrics app (especially the current V17) has a lot (A LOT) of room for improvement.

dbeavon3 · ‎04-25-2024

Hi @lbendlin

I've been speaking with someone who appears to be an AI support at PBI? 😅

It seems to be regurgitating the internal wikis from the support organization. The information they are sending me seems to be more than what is found in the docs.

Based on the info, you will see below that there are named "policy limits" corresponding to the phases you shared for when things go south (for example, "interactive gets rejected" is known as the "Interactive Rejection" policy limit per the AI-generated table below).

Carry Forward

Here is the table of Policy Limits in text format, for the sake of googling:

Usage	Policy Limits	Platform Policy Experience Impact
Usage <= 10 minutes	Overage protection	Jobs can consume 10 minutes of future capacity use without throttling.
10 minutes < Usage <= 60 minutes	Interactive Delay	User-requested interactive jobs are delayed 20 seconds at submission.
60 minutes < Usage <= 24 hours	Interactive Rejection	User-requested interactive jobs are rejected.
Usage > 24 hours	Background Rejection	All requests are rejected.
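
For anyone who wants to apply the table above in a script, here is a minimal Python sketch. The function name `policy_limit` and the idea of passing in the usage as a number of minutes are assumptions made for illustration; the thresholds are simply copied from the table and do not come from any Microsoft API.

```python
def policy_limit(usage_minutes: float) -> str:
    """Map usage (minutes of future capacity already consumed) to the
    policy limit named in the table above. Hypothetical helper."""
    if usage_minutes <= 10:
        return "Overage protection"      # no throttling yet
    if usage_minutes <= 60:
        return "Interactive Delay"       # interactive jobs delayed 20 seconds
    if usage_minutes <= 24 * 60:
        return "Interactive Rejection"   # interactive jobs rejected
    return "Background Rejection"        # all requests rejected


if __name__ == "__main__":
    for minutes in (5, 45, 90, 2000):
        print(minutes, "->", policy_limit(minutes))
```

The boundaries are inclusive on the upper end, matching the "<=" signs in the table.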

The AI also allowed me to share the following "carry forward" graph, which is used to determine when the "Interactive Rejection" phase is reached. This graph is derived from a very unintuitive set of math formulas, and it runs contrary to the behavior that customers would expect out of a "fixed" P1 capacity.

You can see the accrued/accumulated quantity of carry-forward. It accumulates at certain specific points of time and has a potential impact over an unpredictable amount of time in the future! This type of thing is impossible to manage without having visibility in real-time. I don't know what an I.T. department could do, aside from referring our users to the Microsoft AI bots, and asking for those bots to send the latest "carry-forward" chart. Hopefully that is automated... as long as Microsoft is using AI, the customers might as well benefit from receiving these carry-forward charts on demand!

I'm not thrilled with the way that Microsoft is changing the meaning of a P1 capacity at will, (and is simultaneously changing what it means to receive tech support as well).

lbendlin · ‎04-25-2024

Thank you for digging this up, very interesting. In general I think this is just putting artisanal sea salt into the wounds (instead of regular salt). As a capacity operator I don't care about Carry-Forward. At that point (when stuff goes south) I am already frantically redistributing workspaces to other capacities. What would be nice would be to know if the burndown estimates are accurate enough to predict when the capacity becomes usable again.
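
To make the burndown question concrete, here is a rough back-of-the-envelope sketch in Python. It is not Microsoft's estimate: it assumes (my assumption) that the accumulated debt is expressed in minutes of future capacity, that the capacity keeps running at a steady `utilization` fraction, and that only the unused fraction pays the debt down. The function name and parameters are hypothetical.

```python
import math


def burndown_minutes(debt_minutes: float,
                     threshold_minutes: float = 60.0,
                     utilization: float = 0.0) -> float:
    """Estimate minutes until the debt drops back to the threshold.

    Assumption: each wall-clock minute burns down (1 - utilization)
    minutes of debt. Returns math.inf if there is no spare capacity.
    """
    if debt_minutes <= threshold_minutes:
        return 0.0                       # already under the threshold
    spare = 1.0 - utilization
    if spare <= 0:
        return math.inf                  # nothing left to pay the debt with
    return (debt_minutes - threshold_minutes) / spare


if __name__ == "__main__":
    # 90 minutes of debt, capacity still 50% busy, 60-minute threshold
    print(burndown_minutes(90, 60, 0.5))
```

Under these assumptions the estimate is only as good as the guess about future utilization, which is exactly the part an operator cannot see in real time.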

dbeavon3 · ‎04-25-2024

Hi @lbendlin

I haven't found any tools that give capacity information in real time. I.e., the dashboard can be used for seeing what went wrong on a prior day. But I don't have anything like an OS task manager to find out how my CPU is used right now.

I agree that in general it is better to get in front of the problem. But my standards are pretty low where Power BI is concerned. Given that we've only invested in a single P1, it is very predictable that the system will be sluggish at times. The main thing that causes alarm is when error messages are popping. That is the time when users start calling us!

The hardest part about understanding the error messages is the fact that you need to understand this business related to "carry-forward".

I recently learned that this stuff (interactive rejection policy for carry-forward) does have its own graph exposed to customers in the capacity dashboard:

The graph shows the accumulation of "carry forward", and will go over 100% when the service is transitioning to interactive rejection mode.

In order to get back to normal, you have to wait for "burn down" or whatever...
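
Here is how I picture the accumulation and burn down, as a toy Python sketch. This is a simplified model based only on the description in this thread, not the actual formulas used by the service: it assumes usage is measured per fixed time window, that anything above the capacity of a window is carried forward, and that unused capacity in later windows pays it back. All names, the window size, and the 60-minute threshold parameter are assumptions for illustration.

```python
def simulate_carry_forward(usage_per_window, capacity_per_window,
                           window_minutes=0.5, threshold_minutes=60.0):
    """Return the carry-forward level after each window, as a percentage
    of the interactive rejection threshold (toy model, see assumptions)."""
    carry = 0.0          # work owed, in the same units as usage
    percentages = []
    for used in usage_per_window:
        # Overage adds to the debt; idle capacity burns it down (never below 0).
        carry = max(0.0, carry + used - capacity_per_window)
        # Express the debt as minutes of future capacity needed to repay it.
        debt_minutes = carry / capacity_per_window * window_minutes
        percentages.append(100.0 * debt_minutes / threshold_minutes)
    return percentages


if __name__ == "__main__":
    # One big spike followed by idle windows (1-minute windows, capacity 1.0)
    levels = simulate_carry_forward([62, 0, 0, 0], 1.0, window_minutes=1.0)
    for i, pct in enumerate(levels):
        state = "rejecting interactive" if pct > 100 else "ok"
        print(f"window {i}: {pct:.1f}% ({state})")
```

In this model a single spike pushes the level over 100% right away, and it only comes back down as later windows leave capacity unused.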

This stuff is all overly complex, IMHO, and not what anyone would expect from a "fixed capacity". I really hope the support team gets totally buried in tech support incidents from confused customers. It is up to Microsoft, but IMHO the PG needs to go back to the drawing board on some of these decisions. The scary part is that Microsoft thinks they can keep changing the meaning & implementation of a premium capacity any time they want. What stops them from unilaterally removing 10% of our CPU cycles and calling that a "P1" for the sake of the next month's invoice? It would be a case of shrinkflation. I doubt there is any auditor that reviews a "P1" to make sure it means what people think it means. When you buy ice-cream at the grocery store, you can look at the side of the box to see exactly how much you are getting, but with a P1 you really don't know!

BTW there is no distinct CPU allocated for background vs foreground. Used to be four cores each. All the simple concepts we used to be familiar with have been twisted into something unrecognizable. It started with removing the term "CPU" and replacing with "CU". 😉

Unable to load model due to reaching capacity limits
