Solved: Data Protection Firewall - Why 'TOP 1000'?

Ben-Dev · ‎01-24-2022

Hello,

The "Data privacy analysis" section of "Why does my query run multiple times?" says:

Data privacy does its own evaluations of each query to determine whether the queries are safe to run together. This evaluation can sometimes cause multiple requests to a data source. A telltale sign that a given request is coming from data privacy analysis is that it will have a “TOP 1000” condition (although not all data sources support such a condition). [...]

However, "Behind the scenes of the Data Privacy Firewall" doesn't seem to mention anything about the firewall needing to fetch actual data rows (e.g. to do a "TOP 1000"). Rather, from the description given, it sounds like the firewall can compute whether two data sources can fold together based on data source-level (not data set-level) information, principally by looking at the configured privacy level settings for the sources, which doesn't require fetching actual row data.

Can anyone shed light on why the firewall might need to do a "TOP 1000" fetch?

Thank you,
Ben

Ehren · ‎01-26-2022

The mashup engine doesn't have any static info about what data sources are being accessed by a given query or step. That info is surfaced at runtime, when a given query is being evaluated. So in order for the firewall to perform the logic described in the "Behind the scenes" article, it pulls the first 1,000 rows for any partition it's interested in. Doing so results in the partition's data source(s) being surfaced.
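The idea above can be illustrated with a toy model. This is only a sketch in Python, not the mashup engine's actual implementation: the `Runtime` class, `discover_sources`, and the source names are all hypothetical. It shows how data sources that are unknown ahead of time become visible once a partition is evaluated, and how capping the evaluation at 1,000 rows avoids reading the whole data set.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import islice
from typing import Callable, Iterable, Iterator

PREVIEW_ROWS = 1000  # the row cap mentioned in the thread


@dataclass
class Runtime:
    """Hypothetical stand-in that records which sources get touched."""
    accessed: set[str] = field(default_factory=set)
    rows_read: int = 0

    def read(self, source: str, rows: Iterable[dict]) -> Iterator[dict]:
        # Nothing is recorded until a consumer actually starts iterating,
        # mirroring "that info is surfaced at runtime".
        self.accessed.add(source)
        for row in rows:
            self.rows_read += 1
            yield row


def discover_sources(partition: Callable[[Runtime], Iterator[dict]]) -> Runtime:
    """Evaluate at most PREVIEW_ROWS rows so the sources surface."""
    runtime = Runtime()
    # Consume and discard the capped preview; only the side effects matter.
    list(islice(partition(runtime), PREVIEW_ROWS))
    return runtime


def sales_with_targets(rt: Runtime) -> Iterator[dict]:
    """Example partition combining two made-up sources."""
    targets = {t["id"] for t in rt.read("file://targets.xlsx", [{"id": 1}])}
    for row in rt.read("sql://contoso/sales", ({"id": i} for i in range(5000))):
        yield {**row, "has_target": row["id"] in targets}


result = discover_sources(sales_with_targets)
print(sorted(result.accessed))
print(result.rows_read < 5000)
```

In this model, nothing about `sales_with_targets` reveals its sources until it runs, which is the situation described above. The capped evaluation is enough to learn both source names while reading only a fraction of the 5,000 sales rows.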


Ben-Dev · ‎01-26-2022

@Ehren, any chance you could weigh in on this to confirm/shed light? Thanks!

Ben-Dev · ‎01-27-2022

Thanks, @Ehren!

Ben-Dev · ‎01-26-2022

Thanks for digging into this, @Watsky.

Watsky · ‎01-26-2022

Hey @Ben-Dev,

I'll be honest and tell you that I don't know the actual answer, but I'll take a stab at what might be occurring. After you posted this, I did what you probably did, which was scour the web, and that turned up nothing for me. So I ran some tests and came to a hypothesis about what is happening.

The next section, following "Data privacy analysis", is the background analysis section, which reads:

Similar to the evaluations performed for data privacy, the Power Query editor by default will download a preview of the first 1000 rows of each query step.

This tells me that this is a default action which is occurring. So, what is it being used for? If you run the diagnostics within Power Query and open the Diagnostics Partition, you will find a couple of columns that give us some clues. The Firewall Group Column is:

Categorization that explains why this partition has to be evaluated separately, including details on the privacy level of the partition.

This is our reason for the evaluation being done on the partition.

The Expression Column:

The expression that gets evaluated on top of the partition's query/step. In several cases, it coincides with the query/step.

This is where you will see the top 1000 being used:

For the few tests I ran against different data sources, I found that all of them appear to be pulling the top 1,000 values of metadata. It must be validating each step by looking at the expression results. Interestingly, each step that is being evaluated has the exact same expression (essentially top 1,000 metadata for all fields). I'm wondering if this is also what contributes to slower refreshes with data privacy turned on. To sum up why: it's used to evaluate partition validity (step validity). At least, that's my theory.
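If you want to look for these requests in captured query text rather than by eye, something like the following could work. This is a rough Python sketch, assuming the query text has already been exported as plain strings; `has_top_1000` is a hypothetical helper, and the sample queries are made up. Keep in mind that the background analysis preview also fetches the first 1000 rows, so a match is a hint rather than proof that the request came from data privacy analysis, and not every data source expresses the limit as TOP.

```python
import re

# Matches "TOP 1000" and "TOP (1000)", case-insensitively, but not "TOP 10000".
TOP_1000 = re.compile(r"\bTOP\s*\(?\s*1000(?!\d)", re.IGNORECASE)


def has_top_1000(query_text: str) -> bool:
    """Return True when the query text contains a TOP 1000 condition."""
    return bool(TOP_1000.search(query_text))


# Made-up examples of query text sent to a SQL source.
captured = [
    "select top 1000 [id], [amount] from [dbo].[Sales]",
    "select [id], [amount] from [dbo].[Sales]",
    "SELECT TOP (1000) * FROM [dbo].[Targets]",
]

suspects = [q for q in captured if has_top_1000(q)]
print(len(suspects))
```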
