Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
scabral
Helper IV
Helper IV

deleting duplicate rows in lakehouse

Hi,

 

i developed a data pipeline in Fabric Data Factory that copy Parquet file data into a fabrick lakehouse table.  After loading the data, i need to remove duplicates from the lakehouse table.

 

I tried to add a script step in the pipeline, but it seems to only allow connections to warehouse and not lakehouse.  I also tried to create a notebook using spark sql and run a sql script against the lakehouse table like this:

 

;WITH
Dups
AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cmal.Id ORDER BY cmal.CreationTime) AS [Rank]
FROM dbo.AuditLog cmal)

DELETE Dups
FROM Dups
WHERE Dups.Rank > 1
GO

 

but i get the following error:

 

no viable alternative at input 'with'

 

so i don't think spark sql recognizes the 'with' funciton.  Does anyone know of another way to delete duplicates from a lakehouse table in Fabric?

 

thanks

scott

1 ACCEPTED SOLUTION
v-nikhilan-msft
Community Support
Community Support

Hi @scabral 
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3. 

vnikhilanmsft_0-1710439172869.png


After dropping the duplicates you can write this dataframe into the table in lakehouse.

vnikhilanmsft_1-1710439260549.png


Hope this helps. Please let me know if you have any further queries.



View solution in original post

5 REPLIES 5
v-nikhilan-msft
Community Support
Community Support

Hi @scabral 
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3. 

vnikhilanmsft_0-1710439172869.png


After dropping the duplicates you can write this dataframe into the table in lakehouse.

vnikhilanmsft_1-1710439260549.png


Hope this helps. Please let me know if you have any further queries.



thanks, i will try it out and reply with my findings

Hi @scabral 
We haven’t heard from you on the last response and was just checking back to see if your query got resolved. Otherwise, will respond back with the more details and we will try to help.
Thanks

thanks,

 

here is the code that I created using pyspark to remove duplicates from my table in lakehouse after loading:

 

df = spark.sql("Select Id, CreationTime from stgadminactivitieslog")
df1 = df.dropDuplicates(['Id', 'CreationTime'])
df1.write.format("delta").mode("overwrite").save("Tables/stgadminactivitieslog")
 
Scott

Hi @scabral 
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.

Helpful resources

Announcements
April Fabric Update Carousel

Fabric Monthly Update - April 2024

Check out the April 2024 Fabric update to learn about new features.

Top Kudoed Authors