Hi everyone,
I have a large table of customers and I want to identify, using DAX, whether any customer in that table appears in a separate keyword list of distinct customers. I received support from the forums on a custom M code solution, but performance is extremely slow for 200,000+ records.
Alternatively, I have used Power Query's fuzzy matching feature, which is significantly faster than the custom M code but not as accurate at identifying matches.
I understand that buffering the list with List.Buffer() can significantly improve performance, but I am not sure I am using it correctly.
let
Source = Table.Buffer(SharePoint.Files("https://rbcom.sharepoint.com/sites/CustomerScope", [ApiVersion = 15])),
#"PBI_Customer_Master_Tables csv_https://rbcom sharepoint com/sites/CustomerScope/Shared Documents/1 Customer Master Cleansing/3 PowerBI/0 Power BI Master Files/0 Test Data/" =
Source{[Name="PBI_Customer_Master_Tables.csv",#"Folder Path"="https://rbcom.sharepoint.com/sites/CustomerScope/Shared Documents/1. Customer Master Cleansing/3. PowerBI/0. Power BI Master Files/0. Test Data/"]}[Content],
#"Imported CSV" = Csv.Document(#"PBI_Customer_Master_Tables csv_https://rbcom sharepoint com/sites/CustomerScope/Shared Documents/1 Customer Master Cleansing/3 PowerBI/0 Power BI Master Files/0 Test Data/",[Delimiter=",", Columns=17, Encoding=65001, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Customer", type text}, {"Data Source", type text}, {"Customer Text", type text}, {"Data Origin", type text}, {"IAG", type text}, {"IAG Text", type text}, {"IBG", type text}, {"IBG Text", type text}, {"Trade Type Text", type text}, {"Trade Type", type text}, {"Parent Customer Code", type text}, {"Group Company Code", type text}, {"Record Timestamp", type date}}),
#"Appended Query" = Table.Combine({#"Changed Type", DB_CustomerMasterData_History}),
#"Removed Other Columns" = Table.SelectColumns(#"Appended Query",{"Customer", "Data Source", "Customer Text", "Data Origin", "IAG", "IAG Text", "IBG", "IBG Text", "Trade Type Text", "Trade Type", "Parent Customer Code", "Group Company Code", "Record Timestamp"}),
#"Changed Type3" = Table.TransformColumnTypes(#"Removed Other Columns",{{"Trade Type", type text}}),
#"Inserted Text After Delimiter" = Table.AddColumn(#"Changed Type3", "Text After Delimiter", each Text.AfterDelimiter([Customer], "/"), type text),
#"Added Custom" = Table.AddColumn(#"Inserted Text After Delimiter", "eComm Keyword", each List.Accumulate
(
List.Buffer(#"Tbl_eComm"[Keyword]),
"",
(state, current) =>
if List.Contains(Text.Split([Customer Text], " "), current, Comparer.OrdinalIgnoreCase) and state = ""
then state & " " & current
else state
))
in
#"Added Custom"
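(Editor's note: for readers less familiar with M, the List.Accumulate step above is roughly equivalent to the following Python sketch. The function name and sample strings are illustrative only, not part of the query.)

```python
def first_keyword_match(customer_text: str, keywords: list[str]) -> str:
    """Rough analogue of the List.Accumulate step: return the first keyword
    (in list order) that appears as a whole word in customer_text,
    case-insensitively. The M code prefixes the match with a space via
    `state & " " & current`, so the result carries a leading space."""
    words = {w.lower() for w in customer_text.split(" ")}
    state = ""
    for current in keywords:
        # `state == ""` ensures only the first matching keyword is kept,
        # mirroring the `and state = ""` condition in the M code.
        if current.lower() in words and state == "":
            state = state + " " + current
    return state
```

Note that every keyword is still visited even after a match is found, exactly as List.Accumulate does; the accumulator simply stops changing once `state` is non-empty.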
Best regards,
Ami
Try buffering the list before List.Accumulate
let
Source = Table.Buffer(SharePoint.Files("https://rbcom.sharepoint.com/sites/CustomerScope", [ApiVersion = 15])),
#"PBI_Customer_Master_Tables csv_https://rbcom sharepoint com/sites/CustomerScope/Shared Documents/1 Customer Master Cleansing/3 PowerBI/0 Power BI Master Files/0 Test Data/" =
Source{[Name="PBI_Customer_Master_Tables.csv",#"Folder Path"="https://rbcom.sharepoint.com/sites/CustomerScope/Shared Documents/1. Customer Master Cleansing/3. PowerBI/0. Power BI Master Files/0. Test Data/"]}[Content],
#"Imported CSV" = Csv.Document(#"PBI_Customer_Master_Tables csv_https://rbcom sharepoint com/sites/CustomerScope/Shared Documents/1 Customer Master Cleansing/3 PowerBI/0 Power BI Master Files/0 Test Data/",[Delimiter=",", Columns=17, Encoding=65001, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Customer", type text}, {"Data Source", type text}, {"Customer Text", type text}, {"Data Origin", type text}, {"IAG", type text}, {"IAG Text", type text}, {"IBG", type text}, {"IBG Text", type text}, {"Trade Type Text", type text}, {"Trade Type", type text}, {"Parent Customer Code", type text}, {"Group Company Code", type text}, {"Record Timestamp", type date}}),
#"Appended Query" = Table.Combine({#"Changed Type", DB_CustomerMasterData_History}),
#"Removed Other Columns" = Table.SelectColumns(#"Appended Query",{"Customer", "Data Source", "Customer Text", "Data Origin", "IAG", "IAG Text", "IBG", "IBG Text", "Trade Type Text", "Trade Type", "Parent Customer Code", "Group Company Code", "Record Timestamp"}),
#"Changed Type3" = Table.TransformColumnTypes(#"Removed Other Columns",{{"Trade Type", type text}}),
#"Inserted Text After Delimiter" = Table.AddColumn(#"Changed Type3", "Text After Delimiter", each Text.AfterDelimiter([Customer], "/"), type text),
BufferedList = List.Buffer(#"Tbl_eComm"[Keyword]),
#"Added Custom" = Table.AddColumn(#"Inserted Text After Delimiter", "eComm Keyword", each List.Accumulate
(
BufferedList,
"",
(state, current) =>
if List.Contains(Text.Split([Customer Text], " "), current, Comparer.OrdinalIgnoreCase) and state = ""
then state & " " & current
else state
))
in
#"Added Custom"
regards
Phil
Proud to be a Super User!
I agree you should buffer outside the List.Accumulate, but re-reading your original post, I believe this would perform faster in DAX. If you can provide some example data for the two tables/columns, I can suggest a column or measure expression.
Regards,
Pat
To learn more about Power BI, follow me on Twitter or subscribe on YouTube.
@imranamikhan Thanks for the sample data. This DAX column expression returned all the same results as the M approach. I called your tables Data and Keyword, so you'll need to change the expression to use your actual table names. There is no relationship between the tables. It runs fast on this small dataset, but please try it on your actual data.
NewColumn =
VAR thiscustomer = Customer[Customer Text]
VAR matchtable =
    FILTER (
        VALUES ( Keyword[Keyword] ),
        SEARCH ( Keyword[Keyword], thiscustomer, 1, 0 ) > 0
    )
VAR matchkeyword =
    MAXX ( matchtable, Keyword[Keyword] )
RETURN
    IF ( ISBLANK ( matchkeyword ), "#No Match", matchkeyword )
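(Editor's note: the column's logic can be sketched in Python as follows; this analogue is illustrative and not part of the thread.)

```python
def dax_style_match(customer_text: str, keywords: list[str]) -> str:
    """Rough analogue of the DAX column: SEARCH performs a case-insensitive
    substring test, MAXX returns the alphabetically greatest matching
    keyword, and IF/ISBLANK falls back to '#No Match'."""
    matches = [k for k in keywords if k.lower() in customer_text.lower()]
    return max(matches) if matches else "#No Match"
```

Two behavioral details worth noting: SEARCH matches substrings, whereas the M version split on spaces and matched whole words only; and when several keywords match, MAXX keeps the alphabetically greatest one rather than the first in list order.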
Regards,
Pat
Thanks @mahoneypat
This is a super impressive formula, and I am getting the exact same number of matches for 200k+ records as with the M query. Refresh performance is equal to the M query version, so I now have a choice between both options.
Thanks again all for the support.
Hello @mahoneypat, I have a question. Instead of returning "matchkeyword" as the result, I would like to return the values of a "result" column, which correspond to the segment of that data.

Keyword[Keyword]   Keyword[result]    Customer[Customer Text]
factura            Consultor Junior   Duplicidad de factura
Cobro              Consultor Senior   Cobro no aplicado
Devoluciones       Consultor Medio    Cambio de devoluciones anual no a aplicado
Good to hear from you, and thanks @PhilipTreacy - this is significantly faster (down from 40 minutes to 15). I believe the rest of the time is spent loading the other queries I have running.
Hi @mahoneypat - I have added some sample data to OneDrive: Onedrive Link
@lbendlin - thank you for the explanation.
Hi @lbendlin - yes, I only need to run it once per Power Query refresh. Whenever new data comes in, I refresh Power Query to see the latest results.
I do not understand what you meant by re-use; an explanation would be appreciated.
List.Buffer() and Table.Buffer() are pointless unless you plan to re-use their data in the same query multiple times. All they do is pre-spool the data into memory, so that you can reference it later at better speed. If you don't reference the list or the table later in the query then you are wasting memory.
You only get a performance boost if you re-use your buffer (table or list). If you declare it and then don't use it at least twice, or, worse, as in your case, recreate the buffer inside the row-level function so it is rebuilt for every row's List.Accumulate call, then performance will be worse, not better.
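(Editor's note: the buffering principle described above can be illustrated in Python; the function names and sample data below are hypothetical.)

```python
def slow(rows, fetch_keywords):
    """Anti-pattern: re-fetch/rebuild the keyword list once per row,
    like evaluating List.Buffer(...) inside the row-level function."""
    return [any(k in r for k in fetch_keywords()) for r in rows]

def fast(rows, fetch_keywords):
    """Buffering done right: materialize the list once, before the
    per-row loop, then reuse it for every row."""
    keywords = list(fetch_keywords())  # the List.Buffer analogue
    return [any(k in r for k in keywords) for r in rows]
```

Both functions return identical results; the only difference is how many times the keyword source is evaluated, which is exactly why moving List.Buffer outside the Table.AddColumn step pays off.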