Re: Queries over large data tables

jPinhao · ‎04-01-2016

We have a large dataset of unstructured data (Azure Blob) and have started noticing that refreshing our model gets quite slow after a few thousand records are being loaded.

Our current query structure is:

#"Load Data"
    Loads data from the Azure Blob, ~1000 files
    Parses the files into a table with 3 columns (of list/record types which can be further expanded), ~700k rows

#"Sessions"
    Reference #"Load Data"
    Expand all 'Session' related columns

#"Users"
    Reference #"Load Data"
    Expand all 'User' related columns

#"Events"
    Reference #"Load Data"
    Expand all 'Event' related columns

#"Events By Name"
    Reference #"Events"
    Groups by 'event.name'- generates a column of tables to each event type's events and properties (these vary between events)

#"Event Name1" (2, 3, etc. one table per event type)
    Reference #"Events by Name"
    Expands that event name's Table, and generates a table with event.id and each of the properties for that event type

While running this and watching the resource monitor, the memory usage goes through the roof, and eventually tons of Hard Faults leading to Disk usage. From looking at the query execution popup, it seems a bunch of queries kick-off and run in parallel.

If I load the data from a local folder, they seem to all be fetching data, going through the files and loading the referenced common queries in parallel. I believe this is what's causing the memory usage to go haywire, the disk to kick in, and the queries to take hours to run.

I assumed referenced queries would run once first, and then have their resulting tables referenced by individual queries using it, but that doesn't seem to be the case. I've also tried using Table.Buffer as the last step of #"Load Data" and #"Events", in an attempt to make those queries be computed once and then shared across dependents, but that only seemed to make it worse. Are there ways to:

* Make a query only run once, and have it's result passed forward to any queries referencing it

* Prevent queries from running in parallel, and run sequentially instead

Am I just looking at this the wrong way? A lot of 'performance' articles I found only mention structuring your queries to allow Query Folding. However this is not a possibility for our current case, as the Azure Blob storage really just stores 'blob' files which have to be loaded and parsed locally.

It's being a real struggle to get these queries running on our current 700k test events, and we expect it to go up to millions in the real environment. Is our only option to treat the blobs and push the data into an SQL database and link our model to that instead?

pqian · ‎04-01-2016

@jPinhao I think in the last SU we turned on parallel loading of queries. What we didn't realize is in some context it actually hurts the performance. I believe we can add an UI option to go back to sequential loading. I'll follow up with the BI team internally about this.

jPinhao · ‎04-04-2016

Thanks for the info @pqian. For what it;s worth, I was having similar performance issues with the previous version (February update).

It might be nice having that control over linear / parallel loading, another option would be for Power BI to be able to predict memory usage and better manage execution mode and resource usage (although this would undoubtely be more complex, if even possible in some cases).

On the note of referenced queries seemingly running repeatedly as part of their users, can you confirm whether this is the case? Would this be a by-product of parallel evaluation, or has it always been the case with Power BI? Is there a way to prevent that from happening?

pqian · ‎04-04-2016

@jPinhao Referenced queries will execute once per query loaded, so it could be possible that they will hit the data source multiple times - in parallel loading cases they certainly will. It's a function of how each referencing queries ended up folding to the source and how the data is been cached.

For a more detailed explanation, see this post:

http://community.powerbi.com/t5/Desktop/How-to-Improve-Query-Reference-performance-for-large-tables/...

pqian · ‎04-14-2016

@jPinhao

While we are discussing options to turn off parallel processing with the teams involved, there's a workaround for your scenario. You can set an environment variable in the PBIDesktop.exe process to go back to sequential loading.

For example, in the command prompt, you can:

> SET PBI_EnableMultithreadedDataLoad=0

> PBIDesktop.exe

jPinhao · ‎04-21-2016

Thanks pqian!

This was quite helpul to better follow the execution order, and even helped us pinpoint a few inneficiencies in our queries 🙂 As it stands I don't think what we're trying to do with Power BI is quite possible, but it helps to know we do need an extra layer to process our data!

Queries over large data tables

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly