swethabonthu
Frequent Visitor

Refresh large datasets on Power BI service

Hi,

 

I have data imported from BigQuery using the Simba ODBC driver. My organisation has Power BI Premium capacity. There are billions of rows in the views, as the data is in long format.

 

Data structure preview: around 50,000 SKUs are tracked across 40 regions on a daily basis. This is around 2 million rows with dates as column headers. Since this wide structure is not suitable for visualisation, the data is unpivoted, which takes the row count up to 300 million for 5 months of data. I'm concerned about the 2 million rows being added daily, and the dashboard is expected to be in place for at least a year.
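For reference, a minimal Power Query (M) sketch of the unpivot step described above, using a toy two-date table in place of the real BigQuery view (all names and values here are invented for illustration):

    let
        // Wide format: one column per date, as described above
        Source = #table(
            {"SKU", "Region", "2020-01-01", "2020-01-02"},
            {
                {"SKU-001", "EMEA", 10, 12},
                {"SKU-002", "APAC", 7, 9}
            }
        ),
        // Long format: every non-key column becomes a (Date, Value) pair
        Unpivoted = Table.UnpivotOtherColumns(Source, {"SKU", "Region"}, "Date", "Value"),
        // Type the new columns so the model stores dates and numbers rather than text
        Typed = Table.TransformColumnTypes(Unpivoted, {{"Date", type date}, {"Value", type number}})
    in
        Typed

With ~50,000 SKUs x 40 regions, each day of new date columns becomes roughly 2 million new rows after this step, which is where the 300-million-row figure for 5 months comes from.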

 

Here is the major issue: I want to refresh the data on a daily basis, and below are the roadblocks:

  • Refresh on the service fails with a timeout after 2 hours. I couldn't increase the command timeout, as that is not supported by the BigQuery import.
  • DirectQuery is very, very slow.
  • Data is present from 1 January 2020, and incremental refresh is not preferred because the historic data has to be updated from the start date.
  • I have tried a dataflow as well. It takes around 6 hours to refresh, with no transformations other than appending tables in Power BI.

 

Could someone please help me schedule a successful daily refresh for this huge dataset? Any suggestions would be greatly appreciated.

 

Regards,

Swetha

1 ACCEPTED SOLUTION
v-jayw-msft
Community Support

Hi @swethabonthu ,

 

You may follow these tips, based on this document, to reduce the size of the dataset or optimize the data model; note that some of these tips may not reduce refresh time.

 

  • Remove unused tables or columns, where possible.
  • Avoid distinct counts on fields with high cardinality – that is, millions of distinct values.
  • Take steps to avoid fields with unnecessary precision and high cardinality. For example, you could split highly unique datetime values into separate columns – for example, month, year, date, and so on. Or, where possible, use rounding on high-precision fields to lower cardinality (for example, 13.29889 -> 13.3) – see the sketch after this list.
  • Use integers instead of strings, where possible.
  • Be wary of DAX functions that need to test every row in a table – for example, RANKX – in the worst case, these functions can exponentially increase run-time and memory requirements given linear increases in table size.
  • When connecting to data sources via DirectQuery, consider indexing columns that are commonly filtered or sliced against. Indexing greatly improves report responsiveness.
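As a minimal illustration of the cardinality tips above, here is a Power Query (M) sketch that splits a datetime down to its date part and rounds an over-precise number; the table and column names are invented for the example:

    let
        // Toy table with a high-precision datetime and an over-precise measure (hypothetical data)
        Source = #table(
            {"EventTime", "Price"},
            {
                {#datetime(2020, 1, 1, 9, 41, 23), 13.29889},
                {#datetime(2020, 1, 2, 14, 5, 7), 27.51234}
            }
        ),
        // Keep only the date portion, dropping the high-cardinality time component
        Dated = Table.TransformColumns(Source, {{"EventTime", DateTime.Date, type date}}),
        // Round to one decimal place to lower cardinality (13.29889 -> 13.3)
        Rounded = Table.TransformColumns(Dated, {{"Price", each Number.Round(_, 1), type number}})
    in
        Rounded

Both transforms shrink the column dictionaries the import engine has to build, which is what actually reduces model size.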

Alternatively, you can increase the timeout value in the connector function.
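A minimal sketch of that pattern, assuming the import goes through Odbc.DataSource with a DSN for the Simba driver (the DSN name "BigQueryDSN" is hypothetical); as the question notes, the BigQuery import may not honour a longer command timeout, so treat this as the general pattern rather than a confirmed fix:

    let
        // Ask the driver to allow up to 5 hours per command instead of the default
        Source = Odbc.DataSource(
            "dsn=BigQueryDSN",  // hypothetical ODBC data source name
            [CommandTimeout = #duration(0, 5, 0, 0), HierarchicalNavigation = true]
        )
    in
        Source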

 

Best Regards,

Jay

 

Community Support Team _ Jay Wang

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.


7 REPLIES
colivier
Advocate I

Hi @swethabonthu 

 

You can use PBI Spy (www.pbispy.com) to quickly identify columns that are not being used.

Reach out to me if you need any assistance doing this.

Anonymous
Not applicable

OMG! I wish I had known about this before!

I've been given a report with 5 tables, each with about 350 columns, and I was going crazy trying to figure out what data was really used.

 

THANK YOU!

Tables with 350 columns? I wonder, do you use the large dataset storage format on Premium capacities? Is model performance okay? I mean, can a P2 or P3 Premium capacity maintain good performance for large datasets with 350+ column tables?

Glad to be of assistance. This is the first real public release of PBI Spy. If anything does not work or is unclear, please don't hesitate to contact me. I'm trying to make it as good as it can be, but it's a journey.

lbendlin
Super User

Please explain the reasoning for not using incremental refresh in a bit more detail. Do your data rows come with a "last modified date" tag?

The data imported from BigQuery is produced by complex machine learning techniques, so the historic data is not fixed. Also, I don't have a "last modified date" tag.
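For context on what the incremental refresh suggestion would involve: it hinges on filtering the source by two datetime parameters that must be named exactly RangeStart and RangeEnd, so the service can load the data in partitions and refresh only the recent ones. A minimal sketch with toy data (the parameters are defined inline only so the snippet evaluates on its own; in a real model they are Power Query parameters):

    let
        // In a real model, RangeStart/RangeEnd are datetime parameters, not let-bindings
        RangeStart = #datetime(2020, 1, 1, 0, 0, 0),
        RangeEnd = #datetime(2020, 6, 1, 0, 0, 0),
        // Toy long-format table standing in for the unpivoted BigQuery view (hypothetical data)
        Source = #table(
            {"SKU", "Region", "Date", "Value"},
            {{"SKU-001", "EMEA", #datetime(2020, 1, 15, 0, 0, 0), 10}}
        ),
        // The service partitions the table on exactly this filter
        Filtered = Table.SelectRows(Source, each [Date] >= RangeStart and [Date] < RangeEnd)
    in
        Filtered

Because the historic partitions here are rewritten by the upstream machine-learning process, plain incremental refresh would miss those changes; and without a last-modified column, the "detect data changes" option can't be used to pick them up either.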
