vjnvinod Member

Power query tricky question

Dear community.

 

I have a table whose columns are below:

 

Source.Name, Fiscal year, Account, AccountChannel, AccountSegment, AccountSubSegment, AccountSector, AccountSubSector1, Area1, Region, GCSP1, Country, ServiceLine, SubServiceLine, TER_FYTD_USD, TER_FYTD_Constant, RevenuePlanFY_Constant, NER_FYTD_Constant, MarginFYTD_Constant, MarginPFYTD_Constant, TER_PFYTD_Constant, SaleFYTD_constant, PipelineWeightedOpen_constant, Region

 

I need to create an additional column with a condition: if all of my row data matches except for the Category column, then I want the row to be flagged as a duplicate.

For instance, if there are 3 duplicate rows, it should show 2 rows as duplicate and 1 row as original.

 

Let me know how to achieve this in Power Query.

12 REPLIES
adetogni Established Member

Re: Power query tricky question

As a rule of thumb, each "row" in Power Query cannot reference "other rows"; in particular, there is no concept of "next" and "previous" rows.
BUT if your objective is to remove duplicate rows, you can simply remove the Category column and then group the data, as in the sketch below.
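
A minimal sketch of that approach, assuming YourTable is the name of the existing query and the differing column is literally named "Category" (adjust both names to your model):

let
    // assumption: YourTable is the existing query/table to de-duplicate
    Source = YourTable,
    // drop the column that is allowed to differ between otherwise identical rows
    #"Removed Category" = Table.RemoveColumns(Source, {"Category"}),
    // keep only one copy of each remaining combination of values
    #"Removed Duplicates" = Table.Distinct(#"Removed Category")
in
    #"Removed Duplicates"

Note this removes the duplicates rather than flagging them, which is what the grouping approach below addresses.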

vjnvinod Member

Re: Power query tricky question

@adetogni

 

So you are telling me there is no way we can achieve this?

My objective is to identify the rows which are duplicates and flag them in a separate column as "Duplicate" or "Original", but I don't want to remove them, for obvious reasons.

adetogni Established Member

Re: Power query tricky question

Well, for something like that I would suggest DAX, as you can easily create a calculated column with a FILTER over all the rows of a group and then use a RANK to order them. I don't know exactly how you would decide which one of the three is the original and which ones are duplicates, though.

ElenaN Member

Re: Power query tricky question

Hello,

 

A solution would be the one below; however, I am not sure how practical it is considering how many columns you have to group by.

 

The steps would be:

1. Create an index for the rows in your table (ideally with the rows sorted so that the 1st row is the original one, in case that matters or there is a rule for it).

2. Group by all columns except the category one and return the row count as well as the previously created index from the nested table, so you know which rows are duplicated (the ones with count > 1).

3. Group again by the new count column and all the columns except the category one, and compute the minimum index within each group of duplicated rows.

4. Create a new column that compares the initial index with the minimum index of its group, to check which row is the 1st one, and assign the text "Original" or "Duplicate" accordingly.

5. Remove the fields that are no longer needed.

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUUoEYkelWJ1oJSMgKwmIncA8YyArGYidwTyYSicUHlAuFgA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", Int64.Type}, {"Column2", type text}, {"Column3", type text}}),
    #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1),
    #"Grouped Rows" = Table.Group(#"Added Index", {"Column1", "Column2"}, {{"Count", each Table.RowCount(_), type number}, {"Details", each _, type table}}),
    #"Subtracted from Column" = Table.TransformColumns(#"Grouped Rows", {{"Count", each _ - 1, type number}}),
    #"Expanded Details" = Table.ExpandTableColumn(#"Subtracted from Column", "Details", {"Column3", "Index"}, {"Details.Column3", "Details.Index"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Details",{{"Details.Index", "Duplicates Index"}}),
    #"Grouped Rows1" = Table.Group(#"Renamed Columns", {"Column1", "Column2", "Count"}, {{"Min", each List.Min([Duplicates Index]), type number}, {"Details", each _, type table}}),
    #"Expanded Details1" = Table.ExpandTableColumn(#"Grouped Rows1", "Details", {"Details.Column3", "Duplicates Index"}, {"Details.Details.Column3", "Details.Duplicates Index"}),
    #"Renamed Columns1" = Table.RenameColumns(#"Expanded Details1",{{"Min", "Min Index"}, {"Details.Duplicates Index", "Index"}}),
    #"Added Conditional Column" = Table.AddColumn(#"Renamed Columns1", "Is Duplicate", each if [Min Index] = [Index] then "Original" else "Duplicate"),
    #"Removed Columns" = Table.RemoveColumns(#"Added Conditional Column",{"Index", "Min Index", "Count"})
in
    #"Removed Columns"

 

I tested on the following data set:

dataset.png

where Column1 and Column2 can be duplicated across rows, while Column3 can have different values (like the category column you mentioned).

 

Hope it helps,

ElenaN 

vjnvinod Member

Re: Power query tricky question

@adetogni

 

Can you help with the DAX? I am not familiar with DAX.

Also, regarding which row is the duplicate: it doesn't matter at all. For instance, if you have 3 rows repeating the same values, we can mark any 2 of them as duplicates; it doesn't matter which.

vjnvinod Member

Re: Power query tricky question

@ElenaN

 

Can you share the PBIX file?

ElenaN Member

Re: Power query tricky question

adetogni Established Member

Re: Power query tricky question

Maybe Elena's solution will work anyway.

 

In DAX, create a calculated column:

 

Ordinal =
VAR thisColumn1=YourTable[Column1]
VAR thisColumn2=YourTable[Column2]
-- create a var for each column that you have to group by

 

RETURN
RANKX(
     FILTER(YourTable,
     YourTable[Column1]=thisColumn1 &&
     YourTable[Column2]=thisColumn2

     -- && repeat for each variable 
     ), YourTable[Category]
)

This will create a number from 1 to N for each row within the same "group". Then you add a second conditional column: if this column is 1, the row is the valid (original) one; otherwise it is a duplicate.

 

There is also another solution that should produce exactly the result you asked for:

IsDuplicate =
VAR thisColumn1 = YourTable[Column1]
VAR thisColumn2 = YourTable[Column2]
VAR thisCategory = YourTable[Category]
-- create a var for each column that you have to group by
RETURN
    IF (
        CALCULATE (
            MIN ( YourTable[Category] ),
            FILTER (
                ALL ( YourTable ),
                YourTable[Column1] = thisColumn1
                    && YourTable[Column2] = thisColumn2
                -- && repeat for each variable
            )
        )
            = thisCategory,
        "Original",
        "Duplicate"
    )

 

These solutions, however, probably won't work if categories can be identical within a group, as RANKX and MIN will find the same values. If so, follow Elena's suggestion and create an Index column, then apply RANKX to that column instead of the Category one.

vjnvinod Member

Re: Power query tricky question

@ElenaN @adetogni

 

Super helpful, and both solutions work.

 

I have another tricky question on the same topic.

 

I have a table "cross-Domain extract file" and the 3 columns stated below:

1) "ClientSectorDesc"

2) "Sector as per Account list"

3) "Account Name_Lookup"

 

In the table "cross-Domain extract file":

Filter on Banking & Capital Market in the column "ClientSectorDesc", and

select null in the "Sector as per Account list" column, then replace null with BCM.

With the same filters on, also replace null in the column "Account Name_Lookup" with Others.

After the changes are made, unfilter everything (like we do in Excel).

 

How do I achieve this? I wasn't successful using the find-and-replace method.
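
One approach that may work, as a minimal M sketch: instead of filtering, do the replacement conditionally per row with Table.ReplaceValue, so only the Banking & Capital Market rows with nulls are changed and there is nothing to unfilter afterwards. The step name #"cross-Domain extract file", the column names above, and the literal text "Banking & Capital Markets" are assumptions to adjust to the real data:

let
    // assumption: #"cross-Domain extract file" is the query/step holding the table
    Source = #"cross-Domain extract file",
    // replace null in "Sector as per Account list" with "BCM", but only on rows where
    // ClientSectorDesc is "Banking & Capital Markets"; all other rows are left untouched
    #"Replaced Sector" = Table.ReplaceValue(
        Source,
        each [#"Sector as per Account list"],
        each if [ClientSectorDesc] = "Banking & Capital Markets"
                and [#"Sector as per Account list"] = null
             then "BCM"
             else [#"Sector as per Account list"],
        Replacer.ReplaceValue,
        {"Sector as per Account list"}
    ),
    // same pattern for "Account Name_Lookup": replace null with "Others" on those rows
    #"Replaced Account" = Table.ReplaceValue(
        #"Replaced Sector",
        each [#"Account Name_Lookup"],
        each if [ClientSectorDesc] = "Banking & Capital Markets"
                and [#"Account Name_Lookup"] = null
             then "Others"
             else [#"Account Name_Lookup"],
        Replacer.ReplaceValue,
        {"Account Name_Lookup"}
    )
in
    #"Replaced Account"

Because nothing is ever filtered out, the full table is preserved, which mirrors the "filter, replace, unfilter" sequence from Excel in a single step per column.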