DebbieE
Community Champion

Fabric Pyspark Help. Adding a Min into a group by agg code in Notebooks

It would be really useful if we had a PySpark forum.

I'm SQL through and through, and learning PySpark is a NIGHTMARE.

I have the following code that finds all the contestants with more than one record in a list that should be unique:

from pyspark.sql.functions import *

dfcont.groupBy('contestant')\
        .agg(count('contestant').alias('TotalRecords'))\
        .filter(col('TotalRecords')>1).show(1000)
 
There are 4 contestants that have more than one value in the list
 
I also want to bring through the min contestantKey into the above result set.
 
In SQL it would be something like
SELECT Min(CustomerID), Customer, Count(*)
FROM MyTable
GROUP BY Customer
HAVING Count(*) > 1
 
I find it unbelievably frustrating that I can just type that in a couple of seconds, and yet I have been struggling to do it in PySpark for... well, too long.
 
Any help would be appreciated.  
1 ACCEPTED SOLUTION
v-gchenna-msft
Community Support

Hi @DebbieE ,

Thanks for using Fabric Community.
As I understand it, you want the minimum key column brought into the grouped result. There are two ways to do this, shown below.
Spark SQL Code:

df = spark.sql("SELECT Min(CustomerID), CompanyName, Count(*) FROM gopi_lake_house.customer_table1 group by CompanyName having count(*)>1")
display(df)

 


Pyspark Code:


Can you please try the code below:

 

 

from pyspark.sql.functions import col, count, min

# Note: show() returns None, so keep the DataFrame and call show() separately
result = dfcont.groupBy('CompanyName')\
        .agg(min('CustomerID').alias('minCustomerID'),
             count('CustomerID').alias('TotalRecords'))\
        .filter(col('TotalRecords') > 1)
result.show(1000)
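For readers without a Spark session handy, here is a minimal pure-Python sketch of what the groupBy/agg/filter above computes; the company names and IDs are invented for illustration:

```python
# Pure-Python illustration of the PySpark aggregation above:
# group rows by company, keep min(CustomerID) and a count per group,
# then filter to groups with more than one record (the HAVING step).
rows = [
    ("Acme", 103),
    ("Acme", 101),
    ("Bravo", 201),
    ("Cargo", 301),
    ("Cargo", 305),
]

groups = {}
for company, customer_id in rows:
    min_id, total = groups.get(company, (customer_id, 0))
    groups[company] = (min(min_id, customer_id), total + 1)

# Keep only the groups with duplicates, i.e. count > 1
duplicates = {c: g for c, g in groups.items() if g[1] > 1}
print(duplicates)  # {'Acme': (101, 2), 'Cargo': (301, 2)}
```

The PySpark version does the same thing, just distributed: one min and one count per group, then a filter on the count.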

 



Hope this is helpful. Please let me know in case of further queries.

 


3 REPLIES
DebbieE
Community Champion

Yay, it worked! Thank you so much. The exercise is to try to do everything I usually do with PySpark, so having these extra examples is gold too.

Hi @DebbieE ,

Glad to know that your query got resolved. Please continue using Fabric Community for your further queries.
