
DebbieE
Community Champion

Fabric Pyspark Help. Adding a Min into a group by agg code in Notebooks

It would be really useful if we had a Pyspark forum

 

I'm SQL through and through, and learning PySpark is a NIGHTMARE.

I have the following code that finds all the contestants with more than one record in a list that should be unique:

from pyspark.sql.functions import *

dfcont.groupBy('contestant')\
        .agg(count('contestant').alias('TotalRecords'))\
        .filter(col('TotalRecords')>1).show(1000)
 
There are 4 contestants that have more than one record in the list.
 
I also want to bring the minimum contestantKey through into the above result set.
 
In SQL it would be

SELECT MIN(CustomerID), Customer, COUNT(*)
GROUP BY Customer
HAVING COUNT(*) > 1
 
I find it unbelievably frustrating that I can just type that in a couple of seconds, and yet I have been struggling to do it in PySpark for... well, too long.
 
Any help would be appreciated.  
1 ACCEPTED SOLUTION
v-gchenna-msft
Community Support

Hi @DebbieE ,

Thanks for using Fabric Community.
As I understand it, you want the minimum key for each group alongside the duplicate count.

 



Spark SQL Code:


 

df = spark.sql("""
    SELECT MIN(CustomerID), CompanyName, COUNT(*)
    FROM gopi_lake_house.customer_table1
    GROUP BY CompanyName
    HAVING COUNT(*) > 1
""")
display(df)

 


PySpark Code:


 



Can you please try the code below:
# Import only what is needed; "from pyspark.sql.functions import *"
# would shadow Python's built-in min() and count-like names.
from pyspark.sql.functions import col, count, min as spark_min

result = dfcont.groupBy('CompanyName')\
        .agg(spark_min('CustomerID').alias('minCustomerID'),
             count('CustomerID').alias('TotalRecords'))\
        .filter(col('TotalRecords') > 1)

# Note: .show() returns None, so keep it separate from the assignment
# if you want to reuse the filtered DataFrame afterwards.
result.show(1000)

 



Hope this is helpful. Please let me know in case of further queries.

 


3 REPLIES
DebbieE
Community Champion

Yay, it worked, thank you so much! The exercise is to try to do everything I usually do, in PySpark, so having these extra examples is gold too.

Hi @DebbieE ,

Glad to know that your query got resolved. Please continue using Fabric Community for your further queries.
