
amaaiia
Resolver I

Duplicated rows between notebook and SQL Endpoint

I'm having trouble counting the number of rows in my Lakehouse table.

 

If I count the rows from the SQL endpoint, I get 221555 rows, but if I read the table from a notebook and count the rows (df.count()), I get exactly twice that number: 443110.

 

How can this be possible?

1 ACCEPTED SOLUTION
Scott_Powell
Advocate III

Hi all, there's apparently a bug where the metadata tracking which parquet file is the latest and greatest can get hosed up between the lakehouse SQL endpoint and the notebooks. I worked a ticket with Microsoft and they had me run the following code. After running it, wait 30 minutes or so before retrying the notebook. In my case this completely fixed the issue.

 

# Unmount the default lakehouse, then reset the notebook's Fabric (Trident)
# runtime context so the table metadata is re-read in the next session.
mssparkutils.fs.unmount("/default", {"scope": "default_lh"})
sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.reset()
sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.personalizeSession()
 
Hope this helps,
Scott

View solution in original post

10 REPLIES

Hi @Scott_Powell 
Thanks for sharing the solution here. Please continue using Fabric Community for any help regarding your queries.

frithjof_v
Continued Contributor

I created an idea: https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=cea60fc6-93f2-ee11-a73e-6045bd7cb2b6 

 

Please vote if you want this issue to be fixed.

amaaiia
Resolver I

Sure, when I get the number of rows from the SQL endpoint:

[screenshot: amaaiia_3-1708332935251.png — SQL endpoint row count]

And when I count them through the notebook:

[screenshot: amaaiia_0-1708334633234.png — notebook df.count() result]

Hi @amaaiia 
I tried checking the same with my Lakehouse tables, but the counts match. I checked two tables.

[screenshot: vnikhilanmsft_0-1708333813868.png]

[screenshot: vnikhilanmsft_1-1708333859160.png]

The discrepancy in row counts between the SQL endpoint and the notebook could be due to several reasons:

Data Duplication: There might be duplicate rows in your data. When you read the data into a DataFrame and call df.count(), it counts all rows, including duplicates. If this is the case, you can remove duplicates with df.dropDuplicates().

Data Inconsistency: There might be inconsistencies between the data seen by your SQL endpoint and the data seen by your notebook. This could be due to issues with data ingestion, data updates, or data synchronization. You can verify this by comparing a subset of your data in both the SQL endpoint and the notebook.

Caching Issues: Sometimes Spark caches a DataFrame, and if the underlying data changes, the cached DataFrame might not reflect those changes. You can clear the cache with spark.catalog.clearCache() in your notebook.

Hope this helps. Please let me know if you have any further questions.

Hi @amaaiia 
We haven't heard back from you since the last response and wanted to check whether your query got resolved. If not, please reply with more details and we will try to help.
Thanks

Hi,

I didn't find a solution. As I was in a dev environment trying a demo, I just deleted the table and loaded the data again. I'm not seeing duplicates now; I hope this won't happen again.

frithjof_v
Continued Contributor

@amaaiia Thanks for sharing!

Did you use Dataflow Gen2 to ingest data into your Lakehouse?

Here is a similar issue:

https://community.fabric.microsoft.com/t5/General-Discussion/Duplicated-Rows-In-Tables-Built-By-Note...

I'm experiencing a similar issue:
My Dataflow Gen2 stores data in a lakehouse. Once in a while, the 'replace' table setting in the Dataflow Gen2 doesn't seem to take effect, and the same data ends up copied twice. It only seems to affect smaller tables. If I delete the table, it works for a few days, but then suddenly there are duplicates again.

v-nikhilan-msft
Community Support

Hi @amaaiia 
Thanks for using Fabric Community.
Can you please provide the screenshots for the SQL code and the notebook code? This would help me to understand the question better. 
Thanks
