Improving Summarize Efficiency

jambrose · ‎05-18-2017

I'm trying to pull information from a large (~20 mil records, 48 colmuns) table, and I have a working method, but it's on the verge of what my machine can handle, and so I'm worried it won't scale up as I potentially need to expand the data. What I currently use is:

Summary Table = SUMMARIZE(Big_Table, Big_Table[ID], "First Value",
     MINX(TOPN(1, Big Table, [Sort1], ASC, [Sort2], ASC, [Sort3], ASC), [Value]) )

So essentially, from records with the same ID, I need to pull a single value based on the first record after sorting by three different columns. There are at most around 6000 records under one ID. The machine can do one of these operations at a time, but if I try to do 2 (for example, pulling both first and last value) in the same Summarize, I get out of memory errors. The machine has 16gb RAM, and I've checked to confirm it's actually capping out. I am new to the system, so bear with me if this is a dumb question, but why is this a taxing operation in the first place? I could understand a long processing time but not the huge memory load compared to the ~3gb size of the file. Second question is - is there a more efficient way to accomplish what I'm trying to do? I may need to do this on an even larger set of records, so any improvements that scale up are welcome. The three sorting columns are a date, an alphanumeric, and an integer, and I would prefer to avoid adding another column unless absolutely necessary.

Greg_Deckler · ‎05-18-2017

Generally any time you see a MINX it throws up warning flags around performance but in your case, you are only returning a 1 row table to it so it only has to iterate over 1 row. So, my guess is that it is all of the sorts you have going on. I wonder if it might be better for you to pull this in via another query and do what you are doing in DAX in M instead.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

jambrose · ‎05-19-2017

I'd be open to trying it that way, but I haven't so far been able to find a way to do the same process (pull a top value based on sorting 3 other columns) on the query end of things. Any ideas?

jambrose · ‎05-24-2017

Well, I gave the query side a shot, and was able to accomplish the same task with the following code:

= Table.Group(#"Big_Table", {"ID"}, 
   {{"First Value", 
      each Table.First(Table.Sort(_, 
          {{"Sort1", Order.Ascending}, {"Sort2", Order.Ascending}, {"Sort3", Order.Ascending}})
      )[Value] }})

However, this still runs into memory load errors and fails to run on the full data set. Not sure what else to try here. Unless anyone else has ideas, I think I'll try moving to R for the processing steps and then import back to Power BI to generate reports.

v-huizhn-msft · ‎05-24-2017

Hi @jambrose,

You use the SSMS connect your Power BI model. Then profilie Power BI Desktop Model using SQL Server Profiler using the following steps. More details, please review this article.

Thanks,
Angelia

dedelman_clng · ‎05-19-2017

(CAVEAT: This is just an observation - I don't have much experience with SUMMARIZE and TOPN)

Have you tried doing a FILTER on BigTable inside the TOPN function? It looks like the TOPN is looking over the whole table, not just the part of the table that has the ID you are looking for. And if it's anything like SQL, the TOPN function has to do all kinds of distincts and interations in the execution plan, which is probably what is chewing up memory.

So something like this:

Summary Table =
SUMMARIZE (
    Big_Table,
    Big_Table[ID],
    "First Value", MINX (
        TOPN (
            1,
            FILTER ( Big_Table, Big_Table[ID] = EARLIER ( Big_Table[ID] ) ),            [Sort1], ASC,
            [Sort2], ASC,
            [Sort3], ASC
        ),
        [Value]
    )
)

If you get an error that there is no "EARLIER" to be referenced, try wrapping the MINX function in CALCULATE.

Hope this helps.

David

jambrose · ‎05-19-2017

I gave this a shot, and it didn't appear to make a significant difference in the memory load or execution time. My understanding is that having it nested inside the SUMMARIZE implicitly filters in some similar manner anyway. Or at least, the results I get (from either version) find the same value as desired - the top value out of only the rows with the given ID. Thanks for taking a stab though!

v-huizhn-msft · ‎05-19-2017

Hi @jambrose,

The momery consumption is mainly by your large amount of resource table. While I guess memory load is not only caused by Power BI operation but also other threads in your local computer based on my understanding. Please check your computer's tasks and memory.

After search, there are some article about high performance Power BI and DAX, maybe it's helpful.

Optimizing DAX expressions involving multiple measures.
Performance Tip for Power BI; Enable Load Sucks Memory Up.
Power BI Performance Tips and Techniques

Best Regards,
Angelia

Improving Summarize Efficiency

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly