
From Pandas and SQL to Power Query Code

This is an original article from the LaDataWeb blog that I wanted to share with the community in English.

Many times, I have found people working with Python in data engineering or data science who need to use a Microsoft data movement tool (Data Factory, Logic Apps, etc.) or a reporting tool (Power BI, Excel, etc.) because of a company requirement. The truth is that those people hate working without coding in Python. They don't want to manage a low-code environment by configuring and clicking their way through transformations.

Today, Microsoft is pushing Power Query to its limits, integrating the language into many services and tools for data transformations. It can now be used with SSIS, SSAS, Power Automate, Logic Apps, Power BI and Dataflows (in Power Platform and Data Factory). Work with Power Query is growing fast, reaching not only BI roles but also data engineers and data scientists.

That's why I have created this post: to show how to write the basic SQL and Pandas (the Python data library) transformations in Power Query.

First, they need to understand that in the most common scenarios (dataflows and Power BI) Power Query works in steps. Each line takes the result of the previous one, applies a new transformation, and keeps everything in order. So whenever you see #"Last Step", it refers to the table returned by the previous step.
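For Python readers, this step chaining is similar to assigning each intermediate result to its own variable, where every new step reads the previous one. A minimal pandas sketch of the idea (table, column and step names are made up for illustration):

```python
import pandas as pd

# A small sample table
source = pd.DataFrame({"city": ["Barcelona", "Madrid", "Sevilla"],
                       "sales": [30, 20, 10]})

# Each "step" consumes the result of the previous one,
# just like #"Last Step" does in Power Query
filtered_rows = source[source["sales"] > 15]        # like #"Filtered Rows"
sorted_rows = filtered_rows.sort_values("sales")    # like #"Sorted Rows"

print(sorted_rows)
```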

Let's check the examples. They are listed in the following order: SQL, Python Pandas, Power Query.

Show the first 5 rows of the dataset

In SQL:

SELECT TOP 5 * FROM table1

In Pandas:

df.head()

In Power Query:

Table.FirstN(#"Last Step", 5)

Count rows

SELECT COUNT(*) FROM table1
df.shape[0]
Table.RowCount(#"Last Step")
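Note that df.shape is an attribute, not a method: it returns a (rows, columns) tuple, so the row count is df.shape[0] (len(df) also works). A quick check with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

# shape is a (rows, columns) tuple; index 0 holds the row count
rows = df.shape[0]
print(rows)
```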

Select the columns of a table

SELECT column1, column2 FROM table1
df[["column1", "column2"]]
#"Last Step"[[column1],[column2]]

//Or it might be:

Table.SelectColumns(#"Last Step", {"column1", "column2"} )

Filter rows

SELECT column1, column2 FROM table1 WHERE column1 = 2
df[['column1', 'column2']].loc[df['column1'] == 2]
Table.SelectRows(#"Last Step", each [column1] = 2 )

Note that the equality operator in Power Query is a single =, not ==.

Filtering with more than 1 column

SELECT * FROM table1 WHERE column1 > 1 AND column2 < 25
df.loc[(df['column1'] > 1) & (df['column2'] < 25)]

# Or using OR and NOT operators

df.loc[(df['column1'] > 1) | ~(df['column2'] < 25)]
Table.SelectRows(#"Last Step", each [column1] > 1 and [column2] < 25 )

//Or using OR and NOT operators

Table.SelectRows(#"Last Step", each [column1] > 1 or not ([column2] < 25 ) )
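One pandas gotcha worth showing: &, | and ~ bind tighter than the comparison operators, so each condition needs its own parentheses. A small runnable sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"column1": [0, 2, 5], "column2": [10, 30, 20]})

# Parentheses are required: & binds tighter than > and <
both = df.loc[(df["column1"] > 1) & (df["column2"] < 25)]

# OR / NOT versions use | and ~
either = df.loc[(df["column1"] > 1) | ~(df["column2"] < 25)]

print(both)
print(either)
```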

Filters with complex operators

SELECT * FROM table1 WHERE column1 BETWEEN 1 AND 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'
df.loc[(df['column1'].between(1,5)) & (df['column2'].isin([20,30,40,50])) & (df['column3'].str.contains('arcelona'))]
Table.SelectRows(#"Last Step", each ([column1] >= 1 and [column1] <= 5) and List.Contains({20,30,40,50}, [column2]) and Text.Contains([column3], "arcelona") )

Note that BETWEEN and pandas between() include both endpoints, so the Power Query version uses >= and <=.
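The same complex filter can be verified on a toy table (the data is invented for illustration); str.contains is case sensitive by default:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 3, 5, 7],
                   "column2": [20, 30, 99, 40],
                   "column3": ["Barcelona", "barcelona", "Madrid", "Bilbao"]})

# between() is inclusive of both endpoints, like SQL BETWEEN
result = df.loc[df["column1"].between(1, 5)
                & df["column2"].isin([20, 30, 40, 50])
                & df["column3"].str.contains("arcelona")]

print(result)
```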

Join tables

SELECT t1.column1, t2.column1 FROM table1 t1 LEFT JOIN table2 t2 ON t1.column_id = t2.column_id

In Pandas, there are two functions that can help us in this process: merge and join.

df_joined = df1.merge(df2, left_on='lkey', right_on='rkey', how='left')
df_joined = df1.join(df2, on='column_id', how='left')  # join matches df1['column_id'] against df2's index

# Then we select the two columns
df_joined[['column1_df1', 'column1_df2']]

In Power Query, we first set the join conditions and then select the columns with the ExpandTableColumn function.

#"Source" = Table.SelectColumns(#"Last Step", {"column1_t1", "column_t1_id"})
#"Join Step" = Table.NestedJoin(#"Source", {"column_t1_id"}, table2, {"column_t2_id"}, "Prefix", JoinKind.LeftOuter)
#"Expansion" = Table.ExpandTableColumn(#"Join Step", "Prefix", {"column1_t2"}, {"Prefix_column1_t2"})
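A runnable left-join sketch with merge (keys, names and amounts are invented): every row of the left table survives, and unmatched rows get NaN, just like the LEFT JOIN above.

```python
import pandas as pd

customers = pd.DataFrame({"column_id": [1, 2, 3],
                          "name": ["Ana", "Bob", "Eva"]})
orders = pd.DataFrame({"column_id": [1, 1, 3],
                       "amount": [10, 20, 30]})

# Left join: every customer row survives, unmatched rows get NaN
joined = customers.merge(orders, on="column_id", how="left")
print(joined)
```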

Group By

SELECT column1, count(*) FROM table1 GROUP BY column1
df.groupby('column1')['column1'].count()
Table.Group(#"Last Step", {"column1"}, {{"Count alias", each Table.RowCount(_), type number}})

Filtering a grouped table

SELECT store, sum(sales) FROM table1 GROUP BY store HAVING sum(sales) > 1000
df_grouped = df.groupby('store')['sales'].sum()
df_grouped.loc[df_grouped > 1000]
#"Grouping" = Table.Group(#"Last Step", {"store"}, {{"Sum alias", each List.Sum([sales]), type number}})
#"Final" = Table.SelectRows( #"Grouping" , each [Sum alias] > 1000 )
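A runnable version of this HAVING pattern, with invented store data:

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B", "C"],
                   "sales": [600, 700, 500, 2000]})

# GROUP BY store, SUM(sales) ...
grouped = df.groupby("store")["sales"].sum()

# ... HAVING SUM(sales) > 1000
result = grouped.loc[grouped > 1000]
print(result)
```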

Sort descending by column

SELECT * FROM table1 ORDER BY column1 DESC
df.sort_values(by=['column1'], ascending=False)
Table.Sort(#"Last Step",{{"column1", Order.Descending}})

UNION two tables with the same structure

SELECT * FROM table1 UNION ALL SELECT * FROM table2

In Pandas we have two known options, append and concat. Note that plain SQL UNION removes duplicate rows, while concat and Table.Combine keep them, like UNION ALL.

df.append(df2)  # removed in pandas 2.0; use concat instead
pd.concat([df1, df2])
Table.Combine({table1, table2})
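Since append was removed in pandas 2.0, concat is the safe choice today. A small sketch with invented data, showing how to mimic both UNION ALL and UNION:

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1, 2]})
df2 = pd.DataFrame({"column1": [2, 3]})

# Keeps duplicates, like SQL UNION ALL (and Table.Combine)
union_all = pd.concat([df1, df2], ignore_index=True)

# Dropping duplicates afterwards mimics plain SQL UNION
union = union_all.drop_duplicates()

print(union_all)
print(union)
```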

Transformations

The following transformations are only shown for Pandas and Power Query because they are not as common in query languages like SQL.

Analyze table content

df.describe()
Table.Profile(#"Last Step")

Check for unique values in columns

df.nunique()
Table.Profile(#"Last Step")[[Column],[DistinctCount]]
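In pandas, nunique() returns one distinct count per column, which is what the DistinctCount column of Table.Profile reports; value_counts() on a single column gives the frequency of each value instead. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 1, 2], "column2": ["a", "b", "b"]})

# One distinct count per column, like DistinctCount in Table.Profile
distinct = df.nunique()
print(distinct)

# Frequency of each value in a single column
print(df["column2"].value_counts())
```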

Generate a test table with manually entered data

df = pd.DataFrame([[1, "Boris Yeltsin"], [2, "Mikhail Gorbachev"]], columns=["CustomerID", "Name"])
Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})

Note that pd.DataFrame takes a list of rows, so each inner list holds one CustomerID and one Name.

Remove column from dataset

There are two ways in Pandas:

df.drop(columns=['column1'])
df.drop(['column1'], axis=1)
Table.RemoveColumns(#"Last Step",{"column1"})

Apply transformations to a column

df['column1'].apply(lambda x: x + 1)
Table.TransformColumns(#"Last Step", {{"column1", each _ + 1, type number}})
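To actually replace the column (which is what Table.TransformColumns does), assign the result back. A runnable sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3]})

# Transform the column in place, like Table.TransformColumns
df["column1"] = df["column1"].apply(lambda x: x + 1)
print(df)
```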

Alright, that's all folks! It was a long ride and it could be longer, but these are the basics we need. I hope this helps some people find Power Query more user-friendly, since they can just write code directly.