mjohannesson
Advocate I

Refresh fails for large datasets using Spark connector

Our reports and datasets import data from Databricks Spark Delta tables into our Premium P1 capacity using the Spark connector. We're using incremental refresh for the larger (fact) tables, but we're having trouble with the initial refresh after publishing the pbix file. When refreshing large datasets, the refresh often fails after 30-60 minutes with various errors like the ones below. How can we make the refresh more stable so that we can use large datasets in Power BI?

 

Data source error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 588.0 failed 4 times, most recent failure: Lost task 5.3 in stage 588.0 (TID 2196, 10.139.64.4, executor 24): ExecutorLostFailure (executor 24 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace:'.. The exception was raised by the IDbCommand interface.

Data source error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 19 tasks (4.2 GB) is bigger than spark.driver.maxResultSize (4.0 GB)'.. The exception was raised by the IDbCommand interface.

Data source error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 309.0 failed 4 times, most recent failure: Lost task 8.3 in stage 309.0 (TID 966, 10.139.64.4, executor 13): java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3236) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:654) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace:'.. The exception was raised by the IDbCommand interface.

Data source error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 622.0 failed 4 times, most recent failure: Lost task 5.3 in stage 622.0 (TID 2269, 10.139.64.4, executor 10): java.lang.OutOfMemoryError: Java heap space Driver stacktrace:'.. The exception was raised by the IDbCommand interface.

 

3 REPLIES
v-shex-msft
Community Support

Hi @mjohannesson,

>>Job aborted due to stage failure: Total size of serialized results of 19 tasks (4.2 GB) is bigger than spark.driver.maxResultSize (4.0 GB)'.. The exception was raised by the IDbCommand interface.

Please take a look at the following document about the maxResultSize issue:

Apache Spark job fails with maxResultSize exception 
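For reference, here is a minimal sketch of how that limit can be raised, assuming you control the cluster configuration. The 8g/16g sizes are placeholders, not recommendations; on Databricks you would normally enter the same properties in the cluster's Spark Config box (e.g. "spark.driver.maxResultSize 8g") rather than in code, because spark.driver.* properties are only read when the driver JVM starts.

```python
from pyspark.sql import SparkSession

# Sketch only: the property names are standard Spark settings, but the sizes
# below are placeholders to tune for your own workload. On Databricks, put
# these in the cluster's "Spark Config" instead, since spark.driver.* cannot
# be changed after the driver JVM has started.
spark = (
    SparkSession.builder
    .appName("maxResultSize-example")
    .config("spark.driver.maxResultSize", "8g")  # Spark default is 4g; "0" means unlimited
    .config("spark.driver.memory", "16g")        # the driver heap must also hold the collected results
    .getOrCreate()
)

# Confirm the value the running session actually uses.
print(spark.conf.get("spark.driver.maxResultSize", "not set"))
```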

In addition, you can take a look at the timeout settings on both the database connection side and the Power BI service side to choose a suitable refresh timeout. (This may reduce the issues when Power BI refreshes large datasets.)

BTW, please also check the following link about the optional parameter 'batch size' to see if it helps in your scenario.

Turbo boost data loads from Spark using SQL Spark connector 
Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Hi @v-shex-msft,

 

Thanks for the input. In the Advanced Options section of the Azure Databricks cluster I use for data refresh, the Spark Config is already set to "spark.driver.maxResultSize 0", which should mean it is unlimited, but some of the error messages still claim that it's set to 4.0 GB. Why is that?

 

You also mentioned the timeout properties for database connection and Power BI service. Where can I find these?

 

Thanks,
Magnus

Hi @promagnus,

#1: It sounds like the Power BI connector is still using its default setting instead of the data source setting; perhaps you can contact the Power BI team to confirm this. (A quick way to check what the cluster itself reports is sketched after this list.)
#2: It refers to the maximum available timeout of the data source connection; you can check the cluster configuration to find the related session timeout properties.
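
Regarding #1, here is a minimal check you can run in a notebook attached to the same cluster that Power BI refreshes against. It is only a sketch, assuming the standard PySpark APIs; in a Databricks notebook the spark session already exists, so the builder line just returns it.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook "spark" is already defined; getOrCreate() simply
# returns that existing session.
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# If the cluster's Spark Config really applied "spark.driver.maxResultSize 0",
# this should print "0"; if it still prints the 4g default, the setting never
# reached the cluster, regardless of what the connector sends.
print(conf.get("spark.driver.maxResultSize", "4g (Spark default)"))

# List any timeout-related properties set on the cluster (key names vary by
# driver and connector, so this simply filters for "timeout" in the key).
for key, value in conf.getAll():
    if "timeout" in key.lower():
        print(key, "=", value)
```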

Cluster configurations#spark-config 

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accepting it as the solution to help other members find it more quickly.
