Hi,
I have a very large table of around 600M rows. It mainly has two columns: the file name and the last-used date of the file, so it is constantly updated with new dates and other values. Using DirectQuery makes my report slow, so I want to go with Import mode.
I am stuck on how to refresh this data daily. I can't refresh all 600M rows every day; that would not be efficient. I have read about the "detect data changes" option, but I do not have two date columns for that. I only have one date column, "Last date used", which changes constantly, so I can't use the "detect data changes" option.
Is there a way around this?
Thanks !
Hi @Anonymous,
Could you please tell me whether your problem has been solved?
If yes, you could accept the helpful answer as the solution, or share your own solution here. For now, no solution has been marked in the thread. If you still need help, please share more details with us.
Best Regards,
Community Support Team _ Eason
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
hi @Greg_Deckler ,
Each row in the table I am using is unique, and the row count of that table is around 1200M+, so I don't think aggregations help much in my case.
Is there any other solution?
Thanks
@Anonymous - Let me explain the purpose of aggregations. Say I have a table:
ID Category Dimension Value
1 Red One 1
2 Blue One 10
3 Green One 12
4 Red Two 5
5 Blue Two 5
Aggregation table(s) store:
By Category:
Red 6
Blue 15
Green 12
By Dimension:
One 23
Two 10
So, if you want the sum of Red, you get it from the aggregation table instead of hitting the millions of rows in the fact table. This is just an example. The key is to avoid measures or other operations that hit your main fact table and in essence cause a table scan. It's hard to be more specific because your request is very vague, but this is the general route to fast performance against billion- and trillion-row datasets.
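The idea above can be sketched in plain Python (an illustration of the concept only; in Power BI you would build these as aggregation tables in your model or data source, not in Python):

```python
from collections import defaultdict

# Base fact table from the example above: (ID, Category, Dimension, Value)
fact = [
    (1, "Red", "One", 1),
    (2, "Blue", "One", 10),
    (3, "Green", "One", 12),
    (4, "Red", "Two", 5),
    (5, "Blue", "Two", 5),
]

# Precompute small aggregation tables once, at refresh time
agg_by_category = defaultdict(int)
agg_by_dimension = defaultdict(int)
for _id, category, dimension, value in fact:
    agg_by_category[category] += value
    agg_by_dimension[dimension] += value

# A query for "sum of Red" now reads one tiny row
# instead of scanning the whole fact table.
print(agg_by_category["Red"])   # 6
print(agg_by_dimension["One"])  # 23
```

The point is that the expensive group-by happens once per refresh, so each visual's query touches only the small precomputed table.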
@Greg_Deckler: let's take the same example. What if I want the sum of values whose category is Red and dimension is One?
Then I should be aggregating with both columns (category and dimension) in the 'group by', right? That again produces essentially the same table!
Do you see my point?
@Anonymous - I understand your point. As I stated, the goal is to avoid operations that hit all rows in the fact table. You are asking me for advice specific to your situation, but you are:
1. Not giving me any sense of your data
2. Not providing any details about how you want to present that data
If you want a report that simply lists a billion rows of data, then I've got bad news for you: nothing short of a supercomputer is going to do that quickly. What possible use would listing out a billion rows be anyway? No human being could comprehend a billion rows of data displayed to them. So, figure out how you want to present those billion rows, build the aggregations, and use the aggregations in your visuals. Otherwise all hope is lost; you cannot expect any program to quickly list out a billion rows of information.
class_name user_name file_name category date time_s & other column
first guru file1 H 8/5/20 40
first guru file2 H 8/5/20 50
first john file1 H 8/5/20 25
second guru file1 F 6/5/20 60
My data is similar to this, where each row is unique and the filters are 'date range', 'class_name', and 'category'.
I want to display the top 10 rows based on the 'time_s' column as filtered by the user; I am using a Top N filter for that.
Now do you get the idea of the data and what I want?
@Anonymous - Right, so you are not truly aggregating anything and will always have to hit all rows in the base fact table. So that means that you are pretty much hosed. Only thing that I could think of would be to create a table that contains the top N rows for every possible combination of what you will allow them to filter on.
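The precomputation Greg describes could be sketched like this (a Python illustration only, using the sample rows from the thread; in practice you would materialize this lookup table in your data source at refresh time):

```python
from itertools import groupby

# Sample rows: (class_name, user_name, file_name, category, date, time_s)
rows = [
    ("first", "guru", "file1", "H", "8/5/20", 40),
    ("first", "guru", "file2", "H", "8/5/20", 50),
    ("first", "john", "file1", "H", "8/5/20", 25),
    ("second", "guru", "file1", "F", "6/5/20", 60),
]

TOP_N = 2  # top 10 in the real report; 2 here so the sample shows the cut

# Build a lookup: (class_name, category) -> top-N rows by time_s.
# groupby requires the input to be sorted by the grouping key.
key = lambda r: (r[0], r[3])
top_n = {
    k: sorted(group, key=lambda r: r[5], reverse=True)[:TOP_N]
    for k, group in groupby(sorted(rows, key=key), key=key)
}

# The visual now reads a tiny precomputed table instead of
# scanning the full fact table.
print(top_n[("first", "H")])  # file2 (50) then file1 (40)
```

Note this only covers the categorical filters; adding an arbitrary date range multiplies the number of combinations, which is exactly the difficulty raised next in the thread.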
Yeah, I thought of this option, but I also have the date column, right? So I couldn't go that way!
@Anonymous Well, that depends. I am probably doing this incorrectly but I believe your table would be:
n1 * n2 * n3 * n4 * n5 * 10
So, distinct number of values from each of your columns multiplied together * 10
So, let's say you have 3 years worth of data, that's roughly 1000 different dates. Then for each of the other categories you have 5 choices. For 5 columns like that you would have:
1000 * 5 * 5 * 5 * 5 * 10, or 6,250,000 rows
Better than a billion.
Yeah, but one of the columns has around 1 million unique values, and I have about 9 years of data.
@Anonymous - Now I have to understand the use case here. What exactly is this data? Why is the top 10 of rows for any combination of values important? How do you end up with a million different values in a column if it isn't an index or something? And how is a user going to choose between a million different things anyway; are they going to scroll through a million choices looking for the one they want?
The use case is to view the top 10 files based on time for a particular class_name within the selected date range.
This class_name column happens to have about 1M unique values. Each user only cares about some of the class_name values, so they know what they want to select and just type it instead of scrolling through all of them.
Honestly, even I don't know why class_name has about 1M unique values.
After optimizing in every way possible, the visual renders in around 1 minute, and sometimes 2 or 3 minutes. That seems reasonable, but I am still trying to make it faster.
@Anonymous - I guess what I was going for was the industry, type of data, etc. In other words, this is manufacturing data about the time it takes to manufacture a component and thus it is important for whatever reason to figure out the items that take the most time to produce, etc. Not saying that is what this data is but trying to wrap my head around the use case as sometimes there are alternate ways to achieve something.
Hi, @Anonymous
Please check if this post helps:
https://community.powerbi.com/t5/Service/Refresh-large-datasets-on-Power-BI-service/m-p/1154745
Have you considered using incremental refresh, if possible?
https://docs.microsoft.com/en-us/power-bi/admin/service-premium-incremental-refresh
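The idea behind incremental refresh is to partition the table by date and reload only the most recent partitions on each daily refresh, instead of all 600M rows. Power BI handles this through a refresh policy on the dataset; the sketch below is only a Python illustration of the partition-selection logic, not anything you would run in Power BI:

```python
from datetime import date

def partitions_to_refresh(today: date, refresh_months: int):
    """Return the (year, month) partitions a daily refresh would reload,
    newest first, under a monthly partitioning scheme."""
    parts = []
    y, m = today.year, today.month
    for _ in range(refresh_months):
        parts.append((y, m))
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return parts

# With a 2-month refresh window, only two partitions are touched each day;
# older partitions stay as already-imported data.
print(partitions_to_refresh(date(2020, 8, 5), 2))  # [(2020, 8), (2020, 7)]
```

One caveat for this thread's scenario: because the "last used date" of old files keeps changing, rows outside the refresh window can still change, which is the limitation the original poster ran into with "detect data changes".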
Best Regards,
Community Support Team _ Eason
@Anonymous - Go with DirectQuery and use Aggregation Tables to speed things up. That's how Microsoft demonstrates Power BI against trillion row datasets.
Hi @Greg_Deckler, thanks for the reply.
Sorry, I did not quite get your point; can you explain in detail, please?
What do you mean by aggregation tables, and how do I set them up (on the SQL side or on the Power BI side)?
Thanks