
DebbieE
Community Champion

Fabric Pyspark Help. Adding a Min into a group by agg code in Notebooks

It would be really useful if we had a Pyspark forum

 

I'm SQL through and through, and learning PySpark is a NIGHTMARE.

I have the following code that finds all the contestants with more than one record in a list that should be unique:

from pyspark.sql.functions import *

dfcont.groupBy('contestant')\
        .agg(count('contestant').alias('TotalRecords'))\
        .filter(col('TotalRecords')>1).show(1000)
 
There are 4 contestants that have more than one record in the list.
 
I also want to bring the minimum contestantKey through into the above result set.
 
In SQL it would be

SELECT MIN(CustomerID), Customer, COUNT(*)
GROUP BY Customer
HAVING COUNT(*) > 1
 
I find it unbelievably frustrating that I can just type that in a couple of seconds, and yet I have been struggling to do it in PySpark for... well, too long.
 
Any help would be appreciated.  
1 ACCEPTED SOLUTION
v-gchenna-msft
Community Support

Hi @DebbieE ,

Thanks for using Fabric Community.
As I understand it, you want the minimum key for each group alongside the duplicate count.

 



Spark SQL Code:


 

df = spark.sql("""
    SELECT MIN(CustomerID), CompanyName, COUNT(*)
    FROM gopi_lake_house.customer_table1
    GROUP BY CompanyName
    HAVING COUNT(*) > 1
""")
display(df)

 


PySpark Code:


 



Can you please try the code below:
# Import only what is needed; "from pyspark.sql.functions import *"
# would shadow Python's built-in min() and count-like names.
from pyspark.sql.functions import col, count, min as spark_min

result = dfcont.groupBy('CompanyName')\
        .agg(spark_min('CustomerID').alias('minCustomerID'),
             count('CustomerID').alias('TotalRecords'))\
        .filter(col('TotalRecords') > 1)

# Note: .show() returns None, so keep it separate from the assignment
# if you want to reuse the filtered DataFrame afterwards.
result.show(1000)

 



Hope this is helpful. Please let me know in case of further queries.

 


3 REPLIES
DebbieE
Community Champion

Yay, it worked, thank you so much! The exercise is to try to do everything I usually do, in PySpark, so having these extra examples is gold too.

Hi @DebbieE ,

Glad to know that your query got resolved. Please continue using Fabric Community for your further queries.
