
amaaiia
Resolver I

Duplicated rows between notebook and SQL Endpoint

I'm having trouble counting the number of rows in my Lakehouse table.

 

If I count the rows from the SQL endpoint, I get 221555 rows, but if I read the table from a notebook and count the rows (df.count()), I get exactly twice that number: 443110.

 

How can this be possible?

1 ACCEPTED SOLUTION
Scott_Powell
Advocate III

Hi all, there's apparently a bug where the metadata tracking which parquet file is the latest and greatest can get hosed up between the lakehouse SQL endpoint and the notebooks. I worked a ticket with Microsoft and they had me run the following code. After running it, wait 30 minutes or so before retrying the notebook. In my case this completely fixed the issue.

 

# Unmount the default lakehouse, then reset the notebook's Fabric (Trident)
# runtime context so the table metadata is re-read in the next session.
mssparkutils.fs.unmount("/default", {"scope": "default_lh"})
sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.reset()
sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.personalizeSession()
 
Hope this helps,
Scott

View solution in original post

10 REPLIES

Hi @Scott_Powell 
Thanks for sharing the solution here. Please continue using Fabric Community for any help regarding your queries.

frithjof_v
Continued Contributor

I created an idea: https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=cea60fc6-93f2-ee11-a73e-6045bd7cb2b6 

 

Please vote if you want this issue to be fixed.

amaaiia
Resolver I

Sure, when I get the number of rows from the SQL endpoint:

[screenshot: amaaiia_3-1708332935251.png — SQL endpoint row count]

And when I count them through the notebook:

[screenshot: amaaiia_0-1708334633234.png — notebook df.count() result]

Hi @amaaiia 
I tried checking the same with my Lakehouse tables, but the counts match. I checked two tables.

[screenshot: vnikhilanmsft_0-1708333813868.png]

[screenshot: vnikhilanmsft_1-1708333859160.png]

The discrepancy in row counts between the SQL endpoint and the notebook could be due to several reasons:

Data Duplication: There might be duplicate rows in your data. When you read the data into a DataFrame and call df.count(), it counts all rows, including duplicates. If this is the case, you can remove duplicates with df.dropDuplicates().

Data Inconsistency: There might be inconsistencies between the data seen by your SQL endpoint and the data seen by your notebook. This could be due to issues with data ingestion, data updates, or data synchronization. You can verify this by comparing a subset of your data in both the SQL endpoint and the notebook.

Caching Issues: Sometimes Spark caches a DataFrame, and if the underlying data changes, the cached DataFrame might not reflect those changes. You can clear the cache with spark.catalog.clearCache() in your notebook.

Hope this helps. Please let me know if you have any further questions.

Hi @amaaiia 
We haven't heard back from you since the last response and wanted to check whether your query got resolved. If not, please reply with more details and we will try to help.
Thanks

Hi,

I didn't find a solution. As I was in a dev environment trying a demo, I just deleted the table and loaded the data again. I'm not seeing duplicates now; I hope this won't happen again.

frithjof_v
Continued Contributor

@amaaiia Thanks for sharing!

Did you use Dataflow Gen2 to ingest data into your Lakehouse?

Here is a similar issue:

https://community.fabric.microsoft.com/t5/General-Discussion/Duplicated-Rows-In-Tables-Built-By-Note...

I'm experiencing a similar issue:
My Dataflow Gen2 stores data in a lakehouse. Once in a while, the 'replace' table setting in the Dataflow Gen2 doesn't seem to take effect, and the same data ends up copied twice. It only seems to affect smaller tables. If I delete the table, it works for a few days, but then suddenly there are duplicates again.

v-nikhilan-msft
Community Support

Hi @amaaiia 
Thanks for using Fabric Community.
Can you please provide the screenshots for the SQL code and the notebook code? This would help me to understand the question better. 
Thanks
