Anonymous
Not applicable

Terrible performance when normalizing a flat table in Power Query

Hi,

My question follows on from the very popular topic of Flat Table vs Star Schema, so in part this could be a two-part answer.

 

https://powerpivotpro.com/2016/02/data-modeling-power-pivot-power-bi/

Having read the above link, I take the 'rule of thumb' to be: create a Star Schema data model in Power BI. Why? Even though the in-memory engine can handle a large Flat Table, some benefits of a Star Schema are:

1) Partitioning attributes into common groups (Dimensions) allows for clarity in the reporting interface.

2) You are able to use a single Dimension (e.g. Country and its hierarchical attributes) against more than one Fact table.

3) Going as far as a Snowflake Schema is overkill: since the in-memory engine can handle a Flat Table, a Star Schema is no problem, and extending it to a Snowflake Schema just adds more joins, which has a negative effect.

 

Do you agree with my points so far?  Please correct me if I am wrong and/or add more.

 

Looking to normalize a very large Flat Table (19 million rows) into a single Fact table with 3 Dimension tables, I take the following steps in Power Query:

1) Disable the 'Enable Load' option for the initial, very large, Flat Table.

2) I then 'Reference' the same Flat Table 4 times (naming the resulting tables FACT, dimCandidate, dimExam, dimCentre).

3) I then remove the columns that don't belong to the particular dimension (e.g. I remove centre-related columns from the dimCandidate table), remove the duplicates from each dimension table, and finally create an Index.

4) I replace the dimension columns in the FACT table with the Index from each dimension table, using an Inner Join (merge) to achieve this.
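
In M terms, a dimension query and the matching FACT merge end up looking roughly like this (a sketch only - the column names here are illustrative, not my real ones, and I'm assuming the base query is called FlatTable):

// dimCandidate: keep the candidate columns, de-duplicate, add a surrogate key
let
    Source = FlatTable,
    CandidateColumns = Table.SelectColumns(Source, {"CandidateID", "CandidateName", "DateOfBirth"}),
    DistinctCandidates = Table.Distinct(CandidateColumns),
    WithIndex = Table.AddIndexColumn(DistinctCandidates, "CandidateKey", 1, 1)
in
    WithIndex

// FACT: merge the surrogate key back in and drop the original dimension columns
let
    Source = FlatTable,
    Merged = Table.NestedJoin(
        Source, {"CandidateID", "CandidateName", "DateOfBirth"},
        dimCandidate, {"CandidateID", "CandidateName", "DateOfBirth"},
        "dimCandidate", JoinKind.Inner),
    Expanded = Table.ExpandTableColumn(Merged, "dimCandidate", {"CandidateKey"}),
    Slimmed = Table.RemoveColumns(Expanded, {"CandidateID", "CandidateName", "DateOfBirth"})
in
    Slimmed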

 

So now you are aware of how I went about normalizing the Flat Table.

 

When I refresh the data in Power Query, when I click 'Close & Apply' from Power Query, or when I click Refresh in Power BI, all 4 derived tables (the FACT table and the dimCandidate, dimExam, dimCentre tables) are evaluated and loaded again. If the initial Flat Table had 19 million rows, then each of the 4 derived tables appears to have the full 19 million rows loaded into it again. If the Flat Table took 25 minutes to load, based on the SQL query used, then my effort of normalizing it now takes 4 x 25 minutes (actually it takes even longer - I'm not sure why, maybe due to the joins when merging the dimension tables back into the FACT table).

 

Now to my questions.
1) If what I have done is the correct way to go about normalizing a Flat Table to create a Star Schema, should I be expecting this very long, time-consuming loading period? If yes, does anyone actually implement what I have described, and why?

 

2) Is there anything I can do to speed this whole process up? Do I need to go about creating a Star Schema in a different way? Do I need to implement some extra measures (query folding, Table.Buffer, etc.)?

 

3) Does anyone bother creating a Star Schema, or do they find what I'm finding and choose to stick with a Flat Table, letting the in-memory engine handle the very large Flat Table?

 

Any advice or experiences you can share will be very much appreciated.

Apologies for the long-winded post, but I wanted to set the scene. Thanks.


6 REPLIES

@Anonymous Love the detailed post; I apologize in advance for the brevity of my response, but I don't quite have the time right now to go through a point-by-point answer. First, breaking the Fact table out into separate dimensions makes sense if you need to add additional ancillary tables that the dimensions would relate to. I don't want to get into star vs. flat, as there is a use case for both.

In your particular issue, I think it is best described in this blog by Chris Webb -> https://blog.crossjoin.co.uk/2016/11/20/referenced-queries-and-caching-in-power-bi-and-power-query/ It would appear that your loading times are due to PQ not being able to store the output of the initial query in a cache to be referenced by your downstream queries. As a result, your load time explodes. In your case, I would push the split back upstream and have separate queries or views built out in the database to pull the distinct values for each dimension, and not rely on "Reference" in PQ because of the behavior you describe.
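
From the PQ side, each dimension would then be its own query against the source rather than a Reference of the flat-table query - roughly something like this (a sketch only; the server, database, and column names are placeholders):

let
    Source = Sql.Database("MyServer", "MyDatabase"),
    ExamResults = Source{[Schema = "dbo", Item = "ExamResults"]}[Data],
    // keep only the candidate-related columns, then de-duplicate;
    // against a SQL source these steps should fold to a single
    // SELECT DISTINCT on the server, so only the distinct rows come across
    CandidateColumns = Table.SelectColumns(ExamResults, {"CandidateID", "CandidateName", "DateOfBirth"}),
    DistinctCandidates = Table.Distinct(CandidateColumns)
in
    DistinctCandidates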

To me this is a "loading data / Power Query" issue, and not really an "is a star schema valid or worth it" issue, but that could just be me picking at semantics. Hope that helps; I found the blog very insightful.


Looking for more Power BI tips, tricks & tools? Check out PowerBI.tips, the site I co-own with Mike Carlo. Also, if you are near SE WI, join our PUG: Milwaukee Brew City PUG.
Anonymous
Not applicable

@Seth_C_Bauer - thanks for your response. I'm aware that you're one of the main contributors on this forum, so it's nice to have you chipping in. I have already seen the link you provided - I like to do my research before troubling people on here - but to be sure, I wanted to ask the question for my specific case.

 

I was hoping to have more responses before I asked this follow-up question, but being keen for the answer I'll ask the question now.

 

So I've explained the problem I'm facing in normalizing a Flat Table of 19 million rows into 1 FACT and 3 DIMENSION tables. It does sound like pushing the work upstream is a solution - I guess we should aim to have all processes as close to the source as possible. This brings me on to my follow-up question.

 

Let's say I created an SSAS Tabular cube that executed the exact same query, thus bringing into SSAS Tabular the same single Flat Table of 19 million rows. Let's then say I applied the same steps to normalize the Flat Table into 1 Fact table and 3 Dimension tables, and applied the necessary joins/relationships between the Fact and Dimension tables. So the exact same steps are being followed in Power Query, however we are now completing the process in SSAS Tabular rather than Power BI.

Would the SSAS Tabular version be quicker? If yes, why? Or, these being the same underlying technologies (i.e. Power Query and the VertiPaq in-memory engine), would the process take the same time (or, in my case, never complete)?

 

Also bear in mind that the above question assumes Power BI and SSAS Tabular are being run from the same laptop (so the same hardware providing the same power).

 

And... if running my steps results in the same completion time for Power BI and SSAS Tabular when executed from the same laptop, would the SSAS Tabular approach still be the winner, given that you can install SSAS Tabular on a more powerful server to complete the steps mentioned (i.e. normalizing to a Star Schema), whereas Power BI can only be run on less powerful end-user laptops?

 

Phew.... apologies for the questions but I wanted to cover each angle.

 

Again, it would be great to have plenty of feedback/experiences shared on this post/topic. Thanks.

@Anonymous I hope others do chime in, it's a good topic. From my understanding of things, PQ and the VertiPaq engine are two separate entities. The issue you have rests in PQ - the data hasn't even been loaded into the final model in the VertiPaq engine yet. So I would expect the same issues with your other approach. Again, I would push this work upstream so that all PQ has to do is connect to the data source and extract the 19 million rows for the fact table once, and the other tables would be much smaller as they are distinct lists of relevant values for the dimensions.

Based on my reading and understanding of PQ, it sounds like it is extracting the full 19 million records for each reference, at which point you are just spending unnecessary cycles with that approach.

Maybe @ImkeF can shed some additional light or offer an opinion. Otherwise, I know @DataChant is an uber PQ user as well. They might have dissenting opinions or other methods they know of to try.


Looking for more Power BI tips, tricks & tools? Check out PowerBI.tips, the site I co-own with Mike Carlo. Also, if you are near SE WI, join our PUG: Milwaukee Brew City PUG.

Hi there,

a bit busy currently, so will keep my answer short:

1) In my experience, the only dimension table you really need in Power BI is a calendar/date table. All other dimension tables are only needed if you have to work with multiple fact tables that you want to filter or slice and dice centrally.

 

2) Unfortunately, the caching in PQ lies totally in the dark, so it's often difficult to predict. BUT: if you're accessing a SQL source, your (simple) commands should fold back to the server, meaning that for the dimension tables the DISTINCT should be done at the SQL-server level and not in PBI. That indicates that you have some transformations in between that stop query folding from happening. Google a bit for "Power Query folding" or post your M code here so that I can check it out. Otherwise, a straightforward method would be to write the SQL command for your dimension table directly in PBI:

 

[Screenshot: the SQL Server connector dialog with the dimension query entered in the 'SQL statement' box]
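
Something along these lines, for example (a sketch - the server, database, and SQL here are placeholders):

let
    // hand the DISTINCT to the server as a native query;
    // note that steps added on top of a native query generally won't fold any further
    Source = Sql.Database("MyServer", "MyDatabase",
        [Query = "SELECT DISTINCT CentreID, CentreName, Region FROM dbo.ExamResults"])
in
    Source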

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Anonymous
Not applicable

@Seth_C_Bauer - Yes, I would expect the same outcome through SSAS Tabular as I've had through Power BI when using Power Query for my normalizing process. However, I'm still going to implement the SSAS Tabular version and view the result. I'll also post the result here, once I get round to it, so others will know too.

 

@ImkeF - Yes, with the power/speed of the in-memory VertiPaq engine I can see a single large flat table being acceptable when only working with a single fact table. However, the requirements may start out with a single fact table, but down the line there could be a requirement for another fact table, which would mean you need dimensions. I guess it's a toss-up between creating a Star Schema up front or creating one once you know you will be working with multiple fact tables (I know you know this - just saying it in case it prompts you to point anything else out).

Also, in my scenario, there are additional transformations in my fact/dimension tables (removing columns, changing types, etc.), so I guess this is preventing the 'pay once' approach when referencing the original, single flat table.

 

As for the suggestion of creating the fact and dimension tables upstream and saving them on the source server, it's quite often the case that I do not have credentials to save any objects (e.g. tables) on the source server. So the approach of using the SQL statement window in the SQL Server connection is a good idea. It will at least prompt me to practise my SQL.

 

Thanks for your responses.

Hi @Anonymous,

Could you please tell me if your problem has been solved? If it is, could you please mark the helpful replies as Answered?

 

Regards,

Daniel He

Community Support Team _ Daniel He
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
