vyacheslavg
Helper II

Python performance/benchmark - very slow - how to improve?

Hi All,

I've done several load/reload tests comparing several pieces of software: Power BI, Python (pandas, pyarrow), Qlik Sense, and a combination of Power BI + Python.

Unfortunately, the performance shown by Power BI Desktop is very slow.

 

I took a CSV data file from Citi Bike (July 2017, approx. 350 MB, approx. 1.75M rows).

https://www.citibikenyc.com/system-data

 

As you can see from the table below, this file can be loaded from disk into RAM in approx. 1.5 seconds on a moderately specced PC with a fast, but not the fastest, SSD (Samsung 960 EVO). I can get under 1 second using an uncompressed format called Feather, but the resulting file is huge, even bigger than the original CSV. Parquet, on the other hand, is quite compact.

 

A regular reload of the same file in Power BI Desktop takes 47 seconds.

Qlik Sense out of the box shows much better results, especially when using its columnar data storage (QVD): the reload from start (columnar storage on disk) to finish (columnar data model in RAM) takes about 3 seconds.

 

My hope was to load the file from disk and pass a DataFrame to Power BI in RAM using the new (preview) Python integration.

But that was extremely slow and took approx. 93 seconds.

 

I'm seeking any help or suggestions on how to improve load times into Power BI.

 

 

 

Software                                  Load time, in seconds
Parquet load (pyarrow, pandas 0.23)         1.5
Qlik Sense columnar reload (QVD)            3
Python load (pandas 0.23) of CSV            4.5
Qlik Sense reload of CSV                   23
Qlik Sense Desktop initial load of CSV     41
Power BI load/reload of CSV via pure M     47
Parquet reload via Power BI Python         93
CSV load via Power BI Python              104

 

 

Stachu
Community Champion

Are you looking for a way specific to this dataset, or a more general approach?
If it's the former - how do you plan to structure your data model? Also, how often do you plan to refresh the report?

In this article you can find a few tips related to compression; I'm not sure this can impact loading times, though:

https://powerpivotpro.com/2016/01/restructure-data-table-improved-compression/




I'm looking for a general approach to speeding up load times in the ETL area (data load, queries).

I don't have problems with DAX performance once the data is loaded into the columnar data model. It works quite fast; I don't see any difference with Qlik in that area. (Pandas is not a competitor here, because it still(?) does not have columnar processing implemented: http://wesmckinney.com/blog/apache-arrow-pandas-internals/)

 

But I do have datasets which change a lot and I need to reload these into RAM.

My hope was that I could use a combo of pandas/Parquet and M - but this is very, very slow - even slower than a pure M approach.

 

I don't have any problems with Qlik Sense or pandas performance otherwise, i.e. when I invest in good hardware (or rent a performant server in the cloud), I get a considerable boost in reload times, as expected.

 

 

So, in summary, I'm looking for a general approach to speeding up load times via M (and Python or R combos).
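One general lever I can at least sketch for any Python/M combo: shrink the frame before the hand-off to Power BI, since the exchange cost grows with the serialized size. This is a sketch with placeholder columns mimicking the Citi Bike schema, not a measured fix:

```python
import pandas as pd

df = pd.DataFrame({
    "tripduration": [680, 1320, 540],
    "start_station_name": ["W 52 St & 11 Ave", "E 17 St", "W 52 St & 11 Ave"],
})

# Downcast integers to the smallest width that fits the values,
# and encode low-cardinality strings as categoricals.
df["tripduration"] = pd.to_numeric(df["tripduration"], downcast="integer")
df["start_station_name"] = df["start_station_name"].astype("category")
print(df.dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")
```

Whether this helps depends on how Power BI serializes the frame across the process boundary; if it round-trips through text, the dtype savings may not survive the trip.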

 
