Solved: deleting duplicate rows in lakehouse

scabral · ‎03-14-2024

Hi,

i developed a data pipeline in Fabric Data Factory that copy Parquet file data into a fabrick lakehouse table. After loading the data, i need to remove duplicates from the lakehouse table.

I tried to add a script step in the pipeline, but it seems to only allow connections to warehouse and not lakehouse. I also tried to create a notebook using spark sql and run a sql script against the lakehouse table like this:

;WITH
Dups
AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cmal.Id ORDER BY cmal.CreationTime) AS [Rank]
FROM dbo.AuditLog cmal)

DELETE Dups
FROM Dups
WHERE Dups.Rank > 1
GO

but i get the following error:

no viable alternative at input 'with'

so i don't think spark sql recognizes the 'with' funciton. Does anyone know of another way to delete duplicates from a lakehouse table in Fabric?

thanks

scott

v-nikhilan-msft · ‎03-14-2024

Hi @scabral
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3.

After dropping the duplicates you can write this dataframe into the table in lakehouse.

Hope this helps. Please let me know if you have any further queries.

View solution in original post

v-nikhilan-msft · ‎03-14-2024

Hi @scabral
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3.

After dropping the duplicates you can write this dataframe into the table in lakehouse.

Hope this helps. Please let me know if you have any further queries.

scabral · ‎03-19-2024

thanks, i will try it out and reply with my findings

v-nikhilan-msft · ‎03-21-2024

Hi @scabral
We haven’t heard from you on the last response and was just checking back to see if your query got resolved. Otherwise, will respond back with the more details and we will try to help.
Thanks

scabral · ‎03-21-2024

thanks,

here is the code that I created using pyspark to remove duplicates from my table in lakehouse after loading:

df = spark.sql("Select Id, CreationTime from stgadminactivitieslog")

df1 = df.dropDuplicates(['Id', 'CreationTime'])

df1.write.format("delta").mode("overwrite").save("Tables/stgadminactivitieslog")

Scott

v-nikhilan-msft · ‎03-21-2024

Hi @scabral
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.

deleting duplicate rows in lakehouse

Helpful resources

Fabric Monthly Update - April 2024