vyacheslavg
Helper II

Python performance/benchmark - very slow - how to improve?

Hi All,

I've done several load/reload tests comparing several pieces of software: Power BI, Python (pandas, pyarrow), Qlik Sense, and a combination of Power BI + Python.

Unfortunately, the performance shown by Power BI Desktop is very slow.

 

I took a CSV data file from Citi Bike (July 2017, approx. 350 MB, approx. 1.75M rows).

https://www.citibikenyc.com/system-data

 

As you can see from the table below, this file can be loaded from disk into RAM in approx. 1.5 seconds on a moderately specced PC with a fast, but not the fastest, SSD (Samsung 960 EVO). I can get under 1 second using an uncompressed format called Feather, but the resulting file is huge, even bigger than the original CSV. Parquet, on the other hand, is quite compact.

 

A regular reload of the same file in Power BI Desktop takes 47 seconds.

Qlik Sense out of the box shows much better results, especially when using its columnar data storage (QVD): the reload from start (columnar storage on disk) to finish (columnar data model in RAM) takes about 3 seconds.

 

My hope was to load the file from disk and pass a DataFrame to Power BI in RAM using the new (preview) Python integration.

But that was extremely slow and took approx. 93 seconds.

 

I'm seeking any help or suggestions on how to improve load times into Power BI.

 

 

 

Software                                  Load time, in seconds
Parquet load (pyarrow, pandas 0.23)         1.5
Qlik Sense columnar reload (QVD)            3
Python load (pandas 0.23) of CSV            4.5
Qlik Sense reload of CSV                   23
Qlik Sense Desktop initial load of CSV     41
Power BI load/reload of CSV via pure M     47
Parquet reload via Power BI Python         93
CSV load via Power BI Python              104

 

 

Stachu
Community Champion

Are you looking for a way specific to this dataset, or a more general approach?
If it's the former - how do you plan to structure your data model? Also, how often do you plan to refresh the report?

In this article you can find a few tips related to compression; I'm not sure this can impact loading times, though:

https://powerpivotpro.com/2016/01/restructure-data-table-improved-compression/




I'm looking for a general approach to speeding up load times in the ETL area (data load, queries).

I don't have problems with DAX performance once the data is loaded into the columnar data model. It works quite fast; I don't see any difference with Qlik in that area. (Pandas is not a competitor here, because it still(?) does not have columnar processing implemented: http://wesmckinney.com/blog/apache-arrow-pandas-internals/)

 

But I do have datasets which change a lot and I need to reload these into RAM.

My hope was that I could use a combo of pandas/Parquet and M - but this is very, very slow - even slower than a pure M approach.

 

I don't have any problems with Qlik Sense or pandas performance otherwise, i.e. when I invest in good hardware (or rent a performant server in the cloud), I get a considerable boost in reload times, as expected.

 

 

So, in summary, I'm looking for a general approach to speeding up load times via M (and Python or R combos).
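One general lever I can at least sketch for any Python/M combo: shrink the frame before the hand-off to Power BI, since the exchange cost grows with the serialized size. This is a sketch with placeholder columns mimicking the Citi Bike schema, not a measured fix:

```python
import pandas as pd

df = pd.DataFrame({
    "tripduration": [680, 1320, 540],
    "start_station_name": ["W 52 St & 11 Ave", "E 17 St", "W 52 St & 11 Ave"],
})

# Downcast integers to the smallest width that fits the values,
# and encode low-cardinality strings as categoricals.
df["tripduration"] = pd.to_numeric(df["tripduration"], downcast="integer")
df["start_station_name"] = df["start_station_name"].astype("category")
print(df.dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")
```

Whether this helps depends on how Power BI serializes the frame across the process boundary; if it round-trips through text, the dtype savings may not survive the trip.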

 
