Hi,
I have been given a SQL query, by the solution team, to retrieve data needed to satisfy a report. I execute this query using a SQL Server connection in Import mode. The query joins on 6 tables with minimal functions taking place in the query. The returned dataset is 20 columns wide and 19 million rows deep.
I'm thinking a table 19 million rows deep warrants a best-practice star schema. I see the model being a single FACT table and 4 DIMENSION tables.
The original flat table has its Enable Load setting disabled. I then reference that same flat table 5 times, once for each of the tables (1 FACT table and 4 DIMENSION tables).
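The referencing setup described above can be sketched in Power Query M. This is a minimal sketch only; the base query name `FlatData` and the column names are hypothetical placeholders, not from the original post:

```
// Dimension query that references the load-disabled flat query.
// "FlatData" and the column names below are assumptions for illustration.
let
    Source = FlatData,
    // Keep only the columns that belong to this dimension
    CustomerCols = Table.SelectColumns(Source, {"CustomerKey", "CustomerName", "Region"}),
    // Collapse the 19 million source rows down to distinct dimension members
    DimCustomer = Table.Distinct(CustomerCols)
in
    DimCustomer
```

Each of the 5 referenced queries would follow this pattern, selecting its own column subset before deduplicating.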
So here's the question: each of these 5 tables has to have the full 19 million rows loaded into it, which is seriously time consuming. Yes, the dimension tables will have duplicate rows removed and therefore end up considerably smaller than 19 million rows, but the initial loading of those 19 million rows into each of the 5 tables happens every time the data is refreshed (within Power Query or through the Power BI Service). Correct me if I'm wrong.
How do you weigh up the Pros and Cons of creating a Star Schema (or Snow-Flake Schema for that matter)?
Where do you draw the line and say a flat table of 20 columns and 19 million rows loaded once is better than a Star Schema that has to have 19 million rows loaded 5 times to populate each table?
Would the quickness of the slicing/filtering on the analytics/measures outweigh the repetitive loading time?
What are your thoughts?
Thanks.
Thanks @v-yulgu-msft @SjoerdV for your responses. I'm going to close this post and re-ask the question in a different way with a different post title. I feel I may get more view/responses this way. Plus, the topic will be useful for pretty much everyone that does data model within Power BI.
Kind of the same situation as mine, only with much more records.
What I'm (currently) doing is importing 2 CSV fact tables with a lot of columns. I do some ETL work on those 2 tables and then disable loading them. From there I create a single fact table, together with some dimension tables, by extracting data from the 2 original CSV fact tables.
I don't know if this is the most efficient way, but it saves me from creating a wide fact table. From what I can see so far, we're doing exactly the same thing here. I therefore wonder if the refresh is actually loading the dim tables 5 times...
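The two-CSV setup described above might look like the following in Power Query M. This is a sketch under assumptions: the staged query names (`SalesCsv2022`, `SalesCsv2023`) and column names are invented for illustration:

```
// Hypothetical fact query built from two staged CSV queries,
// both of which have Enable Load disabled.
let
    // Stack the two staged tables into one
    Combined = Table.Combine({SalesCsv2022, SalesCsv2023}),
    // Keep only the narrow set of fact columns (keys + measures)
    Fact = Table.SelectColumns(Combined, {"DateKey", "CustomerKey", "SalesAmount"})
in
    Fact
```

The dimension queries would then each reference the same staged queries and apply `Table.Distinct` to their own column subsets.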
Hi @Anonymous,
Here are some links for your reference:
Star or Snowflake for datamodel?
Star Schema vs Snowflake Schema and Why You Should Care
Pros and cons of the snowflake schema
Regards,
Yuliana Gu