Reply
Regular Visitor
Posts: 19
Registered: ‎04-06-2016
Accepted Solution

Query Dependencies

I have 2 main queries that all other queries rely on being up-to-date.  Some queries use those queries as Source, others use them simply as lookups.  It's impreative those 2 queries are updated before any other query.

 

I've looked everywhere and can't find any mention of query dependencies.

 

I'm experiencing issues where query A & B are stale when other queries run producing invalid reports.

 

Is there any way to ensure query A gets updated first, then query B, then all the others in parallel?

 

Thanks-

Mike

 

 

 


Accepted Solutions
Highlighted
Senior Member
Posts: 252
Registered: ‎12-07-2015

Re: Query Dependencies

@mhardy very interesting experiment. I have to check what's going on inside the Queries Editor that prevents refresh all to really "refresh all" queries. I vaguely recall some bug that may already be fixed.

 

 

 

You are right that query loading (when you click Apply) and query editing (Queries Editor preview etc) have different refresh models. The source of truth is the loaded result, and we may have some optimization running in the queries editor window to help with user interactivity. 

 

First let's talk about the "source of truth" - query loading. One rule of thumb to keep in mind: top level queries in PowerQuery are loaded in isolated evaluations. Unless due to secondary grouping effects (like on disk cache been shared), they have no knowledge of each other during evaluation time. If I am to simplify things down a lot, then you can have this mental model about query dependency: a query's dependents are flattened pre-loading and treated as one query. So if you have

A(JSONDocument)

B(JSONDocument, A)

These will get evaluated in order and you will likely see JSONDocument evaluated TWICE, so in both query A and B it is up-to-date. 

Of course things aren't that simple in practice and we have on-disk caching that help with the twice issue. So in reality whoever hits the data source first gets to fill the cache and the reminding load session won't request for it again. But that doesn't change the fact that the evaluation model will need JSONDocument twice. If you are to purge the cache at the right spot, for example, or to evaluate A and B in parallel, then you are likely to see JSONDocument been pulled twice. In any case, this model guarantees freshness of your data source each time LOAD is requested.

 

Coming back to the queries editor. There things are a little bit different. In order to help interactivity we prefetch a bunch of data in the background (there's an option to disable this in the options dialog). If the data is available when the user demands it (switching to a query for example), then we will not request it again. This bg data fetching mechanism will honor dependencies and tries its best to figure out which query and which step to evaluate first. Note the big difference here is we will now treat each STEP (i.e., let variable in the query) as a separate evaluation. So you may actually see requests been made as if there is no dependency since we are evaluating a step above the dependency been declared. On top of this, there is the same disk cache in effect. In short, things are much more complicated when you are inside the queries editor. But the principle is the same: evaluation are in isolation and "refresh" clears the disk cache.

 

This is a simplified model of how to think about PowerQuery dependencies. Much of this is hidden away from the user. Recently I've seen a lot more similar discussions here so maybe we should consider an UI to surface these things.

 

If you have more questions about your specific case, you can post the full M document here.

View solution in original post


All Replies
Super User
Posts: 10,536
Registered: ‎07-11-2015

Re: Query Dependencies

You can try this. Open the query editor and put your first two queries in a group by themselves and all the other queries in their own group. Move the group with the 2 queries to the top. I can't say this will fix your issue, but worth a shot. My gut says no, but you never know...


Did I answer your question? Mark my post as a solution!

Proud to be a Datanaut!


Regular Visitor
Posts: 19
Registered: ‎04-06-2016

Re: Query Dependencies

Unfortunately that had no effect.  Fiddler still shows the request I need to happen first, happening second.

 

Thanks though-

Super User
Posts: 2,146
Registered: ‎08-11-2015

Re: Query Dependencies

[ Edited ]

It seems they refresh in order of creation (not alphabetically or in the order organized/grouped in Query Editor)

 

Could you have somehow altered one of the main 2 (later on) which disrupted that order? Pure speculation...

 

I would have thought the software should recognize those dependencies though - strange!

 

Keep us posted if you find a solution.

 

Regular Visitor
Posts: 19
Registered: ‎04-06-2016

Re: Query Dependencies

After some investigation I think I've uncovered the issue. It's not in the main window, but the Query Editor (where I've spent most of my time lately) that's problematic.

 

The gist of it is: hitting Refresh Preview in Query Editor never hits the data source - only the cache.

 

Even with the Cache set to 0, sometimes you must Refresh Preview twice to see fresh data.

 

For anyone interested in my main window tests...

 

For test purposes we have 3 queries:

  • _Fields (Souce=Json.Document)
  • Lists (Souce=Json.Document)
  • Items (Source=Json.Document)

Lists & Items require _Fields to run first and be up-to-date.

 

Typical refresh as monitored by Fiddler shows:
Lists, _Fields, Items

 

Showing _Fields running second. However, in the M query Lists & Items do not actually access _Fields until after the initial Source=Json.Doc line. Is Lists using the old _Fields and Items the refreshed copy?

 

Forcing _Fields to be stale, then doing a full Refresh showed Lists being correct, using the fresh version of _Fields despite _Fields still running second. Here's the interesting part, to verify further I cleared the cache and set to 0. Now a full refresh looks like this:

Lists, _Fields, Items, _Fields, Lists, Items

 

So it's definitely seeing that dependency on _Fields, suspending execution, refreshing _Fields and continuing.

 

Highlighted
Senior Member
Posts: 252
Registered: ‎12-07-2015

Re: Query Dependencies

@mhardy very interesting experiment. I have to check what's going on inside the Queries Editor that prevents refresh all to really "refresh all" queries. I vaguely recall some bug that may already be fixed.

 

 

 

You are right that query loading (when you click Apply) and query editing (Queries Editor preview etc) have different refresh models. The source of truth is the loaded result, and we may have some optimization running in the queries editor window to help with user interactivity. 

 

First let's talk about the "source of truth" - query loading. One rule of thumb to keep in mind: top level queries in PowerQuery are loaded in isolated evaluations. Unless due to secondary grouping effects (like on disk cache been shared), they have no knowledge of each other during evaluation time. If I am to simplify things down a lot, then you can have this mental model about query dependency: a query's dependents are flattened pre-loading and treated as one query. So if you have

A(JSONDocument)

B(JSONDocument, A)

These will get evaluated in order and you will likely see JSONDocument evaluated TWICE, so in both query A and B it is up-to-date. 

Of course things aren't that simple in practice and we have on-disk caching that help with the twice issue. So in reality whoever hits the data source first gets to fill the cache and the reminding load session won't request for it again. But that doesn't change the fact that the evaluation model will need JSONDocument twice. If you are to purge the cache at the right spot, for example, or to evaluate A and B in parallel, then you are likely to see JSONDocument been pulled twice. In any case, this model guarantees freshness of your data source each time LOAD is requested.

 

Coming back to the queries editor. There things are a little bit different. In order to help interactivity we prefetch a bunch of data in the background (there's an option to disable this in the options dialog). If the data is available when the user demands it (switching to a query for example), then we will not request it again. This bg data fetching mechanism will honor dependencies and tries its best to figure out which query and which step to evaluate first. Note the big difference here is we will now treat each STEP (i.e., let variable in the query) as a separate evaluation. So you may actually see requests been made as if there is no dependency since we are evaluating a step above the dependency been declared. On top of this, there is the same disk cache in effect. In short, things are much more complicated when you are inside the queries editor. But the principle is the same: evaluation are in isolation and "refresh" clears the disk cache.

 

This is a simplified model of how to think about PowerQuery dependencies. Much of this is hidden away from the user. Recently I've seen a lot more similar discussions here so maybe we should consider an UI to surface these things.

 

If you have more questions about your specific case, you can post the full M document here.