Maggi029
Helper II

How does Power BI handle large datasets?

I'm using Elasticsearch as my database, and I need to build historical data visualisations. Every week I push about 30 MB of JSON data into the Elasticsearch index, and I import the data from Elasticsearch into Power BI using Python scripting; everything was working fine. After a few weeks the data in Elasticsearch has grown to around 3 GB, and Power BI Desktop now throws the following exception when trying to refresh the data.

(screenshot of the error attached: Maggi029_0-1634616979193.png)

 

What is the best way to handle this scenario? Will I need to change the input method, or is there a better way to handle this?

 

Code:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

# Connect to the local Elasticsearch cluster
client = Elasticsearch([{'host': 'localhost'}])
my_datasets1 = []

# Loop over every index that matches the wildcard pattern
for index in client.indices.get('mydata_*'):
    if 'myindexname_' in index:
        # scan() streams every document of the index via the scroll API
        s = Search(using=client, index=index)
        my_datasets1.append(pd.DataFrame([hit.to_dict() for hit in s.scan()]))

# Combine the per-index frames into the single dataset Power BI imports
metrics_dataset = pd.concat(my_datasets1)

 

 

@nicpenning, can you please look into this?

6 Replies
nicpenning
Helper I

Sorry Maggi029, I am not sure how Python ties into your Power BI integration.

I query Elasticsearch directly with Power BI and use pagination to get all results on GBs of data.
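Purely for illustration, here is a rough sketch of that pagination idea written against the Python client this thread already uses, based on search_after (the index pattern, page size, and sort field are assumptions, not my actual Power Query setup):

from elasticsearch import Elasticsearch

client = Elasticsearch([{'host': 'localhost'}])

page_size = 1000
search_after = None   # cursor returned by the previous page
all_hits = []

while True:
    body = {
        'size': page_size,
        # a stable sort is required; in practice add a unique tiebreaker field
        'sort': [{'@timestamp': 'asc'}],
    }
    if search_after is not None:
        body['search_after'] = search_after
    resp = client.search(index='myindexname_*', body=body)
    page = resp['hits']['hits']
    if not page:
        break
    all_hits.extend(page)
    search_after = page[-1]['sort']  # sort values of the last hit become the next cursor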

 

However, I recommend looking into Elasticsearch Data Transforms, which allow you to aggregate large sets of data into a new index containing just the values you need, instead of working with raw documents.

https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-overview.html

For example, say you have 10,000 documents such as:

{"computer_name": "computer1", "organization": "Finance", "country": "United States of America"}

{"computer_name": "computer2", "organization": "Finance", "country": "United States of America"}

...
{"computer_name": "computer10000", "organization": "Finance", "country": "United States of America"}

You can use transforms to get counts, averages, etc., and reduce the data down to just 3 documents:

{"unique_computer_names": "700", "organization": "Finance", "country": "United States of America"}
{"unique_computer_names": "100", "organization": "Human Resources", "country": "Nigeria"}
{"unique_computer_names": "200", "organization": "Information Technology", "country": "France"}

At scale, an approach like this can take you from billions of documents down to thousands once you narrow in on the specific data you want to surface as high-level metrics in Power BI.

This might be a lot of work and research, but you can dramatically shrink large datasets by creating summaries of the data and displaying only what is needed. It's not a quick solution by any means, but it helps Power BI by working with summarized data.
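As an illustration, here is a minimal sketch of such a transform created through the Python client (assuming Elasticsearch 7.5+ and its transform API; the transform id, index names, and field names below are hypothetical):

from elasticsearch import Elasticsearch

client = Elasticsearch([{'host': 'localhost'}])

transform_body = {
    'source': {'index': 'computers-*'},
    'dest': {'index': 'computers-summary'},
    'pivot': {
        'group_by': {
            'organization': {'terms': {'field': 'organization.keyword'}},
            'country': {'terms': {'field': 'country.keyword'}},
        },
        'aggregations': {
            # approximate distinct count of computer names per group
            'unique_computer_names': {'cardinality': {'field': 'computer_name.keyword'}},
        },
    },
}

client.transform.put_transform(transform_id='computer-summary', body=transform_body)
client.transform.start_transform(transform_id='computer-summary')

Power BI would then read the small computers-summary index instead of the raw documents.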

Hi @nicpenning 

 

Thanks for looking into this.

 

Previously, as you suggested (Solved: Re: Getting Data from Elastic Stack with REST with... - Page 2 - Microsoft Power BI Communit...), I used the REST API to import the data from Elasticsearch. However, I was unable to refresh the data in the Power BI online service using this method, because the recursive calls turn the data source into a dynamic data source, and Power BI fails to refresh dynamic data sources. As a result, I rely on the Python method. Do you believe my requirement can still be met using the REST API, and if so, could you please elaborate?

I am not familiar with dynamic data sources or the online Power BI service, so I won't be much help here. I hope you can find a solution soon!

TomMartens
Super User

Hey @Maggi029 ,

 

from my perspective, you are facing (or will be facing) some challenges. These are:

  • Power BI file size limitations for upload: these limits are ~2 GB for Pro licensing and ~10 GB for Premium per-capacity licensing
  • Using Python inside Power BI for data wrangling/data shaping

This is how I would tackle it:

  • Separate the content from the data by using two different pbix files: one that only publishes the data model to the service, and a second one that publishes the content. The content pbix uses the Power BI dataset as its data source. One advantage of this approach is that you gain more flexibility if architectural changes happen.
  • I would not use Python for data shaping/data wrangling inside Power BI, for a simple reason: using Python (or R) inside Power Query requires the on-premises gateway in personal mode if you are planning scheduled data refresh. Some organizations (e.g. the one I'm working for) do not allow gateways in personal mode, and requesting one can lead to discussions with your Power BI service administrators.

Some architectural considerations

To overcome the upload file size limit, I would consider moving the Python code to Azure, e.g. creating an Azure Automation runbook or writing an Azure Function. This code then creates JSON documents, each containing the data for a day or a month, whatever you consider useful. The JSON documents can be stored in an Azure blob container and picked up from there by Power Query. You can create your data model using just a small number of JSON documents, publish the pbix file to the service, copy all the JSON documents to the blob container, manually refresh the dataset, done!
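A minimal sketch of that export step, assuming the azure-storage-blob package and hypothetical connection details (storage connection string, container name, index pattern, and @timestamp field):

import json
from datetime import date

from azure.storage.blob import BlobServiceClient
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Hypothetical connection details -- replace with your own
client = Elasticsearch([{'host': 'localhost'}])
blob_service = BlobServiceClient.from_connection_string('<storage-account-connection-string>')
container = blob_service.get_container_client('powerbi-extracts')

# Pull one day's documents and write them to the container as a single JSON file
day = date.today().isoformat()
s = Search(using=client, index='myindexname_*').filter(
    'range', **{'@timestamp': {'gte': f'{day}||/d', 'lt': f'{day}||+1d/d'}}
)
docs = [hit.to_dict() for hit in s.scan()]

container.upload_blob(
    name=f'metrics_{day}.json',
    data=json.dumps(docs, default=str),
    overwrite=True,
)

Power Query then only has to combine the JSON files in the container, and the heavy extraction never runs inside the pbix file.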

 

Hopefully, this provides some ideas on how to overcome your challenges.

 

Regards,

Tom




Hi @TomMartens 

 

Thanks for your reply. I think your suggestion has some useful benefits, but I have a few doubts that need clarifying, since I'm new to this method.

 

1. Separate the content from the data by using two different pbix files - what is the use case for this method, and does it also help with size management?

2. Azure Automation script - as I understand it, you're suggesting writing an Azure Automation script that pulls data from Elasticsearch, converts it into small JSON documents, and saves them in blob storage, which I can then connect to Power BI via the Azure Blob Storage connector. Correct me if I'm wrong.

 

Thanks

 

 

Hey @Maggi029 ,

 

Regarding 1:
Besides being able to develop the data model and the content in parallel, it's important to know that you can't download the pbix file once you have enabled incremental refresh.
Incremental refresh does not directly address large file sizes, but it reduces the duration of a data refresh, as only the increment is added to the model. This becomes a little more complex if the source is files, but there are many posts out there describing the proper configuration. The one by @GilbertQ is my favorite: How you can incrementally refresh any Power BI data source (This example is a CSV File) - Reporting/...

 

Regarding 2:

Yes, accessing files from a blob store is simpler than publishing a single large pbix file to the Power BI service.

 

Regards,

Tom



