Hi,
I have been given a SQL query, by the solution team, to retrieve data needed to satisfy a report. I execute this query using a SQL Server connection in Import mode. The query joins on 6 tables with minimal functions taking place in the query. The returned dataset is 20 columns wide and 19 million rows deep.
I'm thinking a table 19 million rows deep warrants a best-practice star schema. I see the model being a single FACT table and 4 DIMENSION tables.
The original flat table has its Enable Load setting disabled. I then reference that same flat table 5 times, once for each of the tables (1 FACT table and 4 DIMENSION tables).
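The referencing setup described above can be sketched in Power Query M. This is a minimal sketch only; the base query name `FlatData` and the column names are hypothetical placeholders, not from the original post:

```
// Dimension query that references the load-disabled flat query.
// "FlatData" and the column names below are assumptions for illustration.
let
    Source = FlatData,
    // Keep only the columns that belong to this dimension
    CustomerCols = Table.SelectColumns(Source, {"CustomerKey", "CustomerName", "Region"}),
    // Collapse the 19 million source rows down to distinct dimension members
    DimCustomer = Table.Distinct(CustomerCols)
in
    DimCustomer
```

Each of the 5 referenced queries would follow this pattern, selecting its own column subset before deduplicating.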
So here's the question: each of these 5 tables has to have the full 19 million rows loaded into it, which is seriously time consuming. Yes, the dimension tables will have duplicate rows removed and therefore end up considerably smaller than 19 million rows, but the initial loading of those 19 million rows into each of the 5 tables happens every time the data is refreshed (within Power Query or through the Power BI Service). Correct me if I'm wrong.
How do you weigh up the Pros and Cons of creating a Star Schema (or Snow-Flake Schema for that matter)?
Where do you draw the line and say a flat table of 20 columns and 19 million rows loaded once is better than a Star Schema that has to have 19 million rows loaded 5 times to populate each table?
Would the quickness of the slicing/filtering on the analytics/measures outweigh the repetitive loading time?
What are your thoughts?
Thanks.
Thanks @v-yulgu-msft @SjoerdV for your responses. I'm going to close this post and re-ask the question in a different way with a different post title. I feel I may get more view/responses this way. Plus, the topic will be useful for pretty much everyone that does data model within Power BI.
Kind of the same situation as mine, only with much more records.
What I'm (currently) doing is importing 2 CSV fact tables with a lot of columns. I do some ETL work on those 2 tables and then disable loading them. From there I create a single fact table, together with some dimension tables, by extracting data from the 2 original CSV fact tables.
I don't know if this is the most efficient way, but it saves me from creating a wide fact table. From what I can see so far, we're doing exactly the same thing here. I therefore wonder if the refresh is actually loading the dim tables 5 times...
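The two-CSV setup described above might look like the following in Power Query M. This is a sketch under assumptions: the staged query names (`SalesCsv2022`, `SalesCsv2023`) and column names are invented for illustration:

```
// Hypothetical fact query built from two staged CSV queries,
// both of which have Enable Load disabled.
let
    // Stack the two staged tables into one
    Combined = Table.Combine({SalesCsv2022, SalesCsv2023}),
    // Keep only the narrow set of fact columns (keys + measures)
    Fact = Table.SelectColumns(Combined, {"DateKey", "CustomerKey", "SalesAmount"})
in
    Fact
```

The dimension queries would then each reference the same staged queries and apply `Table.Distinct` to their own column subsets.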
Hi @Anonymous,
Here are some links for your reference:
Star or Snowflake for datamodel?
Star Schema vs Snowflake Schema and Why You Should Care
Pros and cons of the snowflake schema
Regards,
Yuliana Gu