Anonymous
Not applicable

Star Schema From Flat Files - Performance

Hello All,

 

Most of my Power BI reports are based on Excel files generated from our ERP systems. They come as one big flat table; there is no way for us to export facts and dimensions separately.

I keep hearing that we should always use a star schema, so I have a few questions related to that. When building a star schema, one of my goals would be to remove the text columns from the fact table and move them into dimension tables.

It is nicely described here - https://www.youtube.com/watch?v=VV3tYTudNBY

 

1) Should I create the star schema from the flat files in Power Query ("Transform Data")? Is that efficient from a memory/performance perspective? As I understand it, all of these queries will run every time I refresh the data. Is that efficient? Or would it be better to push those steps to dataflows?

 

2) Would it be better to build the star schema outside the report with another ETL tool (Alteryx), output it as fact and dimension files, and then load those CSVs into Power BI?

 

The operating model I have is that the system delivers files to SharePoint every day (Sales 26102021.xlsx, Sales 27102021.xlsx, etc.) and I connect Power BI to the SharePoint folder. The reports are refreshed on a daily schedule as well.

 

3) I would like some clarification regarding calculated columns vs. measures. I understand that calculated columns are computed when the data is refreshed, whereas measures are calculated on the fly when a user opens the report. However, I have seen the tip to always try to use measures rather than columns. Is that really accurate? If my report is refreshed once per day, overnight, I don't really care how long the refresh takes. I only care that the report works faster for the end users. So it seems that calculated columns > measures. Am I right? (Assuming we are talking about calculations that can be done both ways, which is not always the case.)

 

Example: 

Plan Revenue    Percentage Of Completion
50000           10%
2000            80%
30000           50%

 

In the above scenario, I want to calculate the revenue recognized (which is Plan Revenue * Percentage Of Completion). I can do it either with a calculated column (Plan Revenue * POC) or with a SUMX measure, which seems pretty heavy...
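
For illustration, the two options would look roughly like this in DAX (I'm assuming the table is called Sales and the column names match the example above, so adjust to the real model):

-- Option A: calculated column, computed once during the nightly refresh
Revenue Recognized = Sales[Plan Revenue] * Sales[Percentage Of Completion]

-- ...plus a trivial measure that just sums the stored column
Total Revenue Recognized = SUM ( Sales[Revenue Recognized] )

-- Option B: measure only; the multiplication runs at query time for every visual
Total Revenue Recognized (SUMX) =
    SUMX ( Sales, Sales[Plan Revenue] * Sales[Percentage Of Completion] )

Both return the same numbers; the difference is whether the work happens at refresh time or at query time.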

 

What are your thoughts?


7 REPLIES
Anonymous
Not applicable

Ok, now that is 100% clear. What I missed is that a dataflow still needs a dataset on top of it anyway, so it is one extra step. Therefore (I follow you) it makes more sense, in my scenario, to publish the dataset to the Service and connect Power BI Desktop to it. Awesome!

 

PS: How would you approach question 3 from my post? Is that something you can advise on?

@Anonymous Sure. My opinion on that subject follows your reasoning, and I have gone back and forth between columns vs. measures several times. The answer is that it depends. However, if you have no need for dynamic interactivity in your calculations, then I tend to believe that columns are better. One, the report is faster: you don't have complex calculations running in measures every time a visual is displayed, just base aggregations of columns. Two, columns tend to be safer, especially for self-service BI. With measures you can't control the context in which users use them, so you can end up with "incorrect" results. This is especially true with measure totals, since totals need to be constructed with a specific context in mind. So I'm with you on the 3rd item; in all likelihood, columns are the way to go in your case.
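
As a rough sketch of the totals issue (reusing the Sales table and column names from your example, so just placeholders): a measure built from separate aggregations can show a grand total that isn't the sum of its rows, while a calculated column (or a row-by-row SUMX) always adds up.

-- Risky: total Plan Revenue times the average POC, so the grand total
-- does not equal the sum of the individual rows users see
Revenue Recognized (naive) =
    SUM ( Sales[Plan Revenue] ) * AVERAGE ( Sales[Percentage Of Completion] )

-- Matches what a calculated column would store: evaluated row by row, then summed
Revenue Recognized (rowwise) =
    SUMX ( Sales, Sales[Plan Revenue] * Sales[Percentage Of Completion] )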


Anonymous
Not applicable

These are great tips @Greg_Deckler - thanks again!

 

Sorry for being a pain, but could you help me understand why publishing a single dataset (and connecting Power BI to it) is a better option than a dataflow? Isn't it the same from a performance perspective? I may be ignorant, but it seems to me that in both cases the queries and transformations are executed in the cloud. Am I right?

 

To clarify, I am not planning to reuse these datasets/dataflows for other reports.

@Anonymous None of this has to do with performance, only reusability. The way dataflows work is that the dataflow transforms the data from the source systems and stores the transformed data in Azure Data Lake Storage Gen2. Power BI Desktop files then use import mode datasets to import the transformed data from that storage. So there are potentially some performance benefits for the desktop file, in that it isn't doing the transformation steps itself. However, the big win with dataflows is that the source system is only hit once, by the dataflow, rather than by a bunch of individual PBIX files all importing the same data. In addition, the dataflow can be used by lots of reports, but you only have to build the transformation steps once. A dataflow is simply a reusable query; it uses Power Query, the same technology as the Power Query Editor.

 

Live datasets take all of this a step further and make the dataset itself reusable by a large number of PBIX files. That dataset could use its own Power Query queries or dataflows to import the data. In either case, the source system is only hit once per refresh.

 

All of this said, if you aren't planning to reuse the datasets/queries, then you might as well just create the queries in the PBIX itself. Once you publish to the Service, refreshes run in the Service just like dataflows, so there is no difference in performance; both are executed in the Service using the exact same technology (Power Query). Now, some will argue that best practice is to use dataflows regardless, because they can be reused even if you currently have no plans to reuse them. Yes and no. If you use dataflows, remember that you now have an additional import step: to get the latest data, first your dataflow has to refresh, and then your dataset needs to refresh from the refreshed dataflow data. So this can add a delay to your data refresh process.


Greg_Deckler
Super User

@Anonymous Dataflows are good. But if you use Power Query in the Desktop, create a single query that imports the file, right-click it and choose "Disable load", then right-click it again and choose "Reference". Now that base query is loaded and cached once and reused by the other queries that reference it, without reloading all of the data for every individual query.


Anonymous
Not applicable

That is super smart! Thank you @Greg_Deckler. If I may ask, should I do this in dataflows (vs. PBI Desktop)? Would it improve performance, or is it not really that big of a change?

@Anonymous Dataflows are really about reusability versus performance. So, if you expect to create multiple desktop PBIX files using the same data, then go with dataflows. However, you might be better off creating a "golden dataset" using straight Power Query queries (non-dataflows) and publishing just the dataset. Then, connect live to that dataset from other PBIX files.


