Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
dhannaa
Helper IV
Helper IV

Power Query : removing duplicates and merging duplicate rows selectively into one

Hi all!

 

What would be the smart way to remove duplicates here?

 

I  need the first value from the column b and last value from column c. There can be even more than two rows for the same ID.

 

a.PNG

Thanks in advance for any ideas,

Jenni

1 ACCEPTED SOLUTION

For the sake of easy understanding of the code, I have written the power query step by step. But you can combine several steps and make the query much denser.

 

let Source = Table.FromRecords(
        {
            [id = 1, date = #date(2018,2,3), b = 513, c = 423],
            [id = 2, date = #date(2018,5,1), b = 534, c = 432],
            [id = 2, date = #date(2019,1,18), b = 345, c = 5434],
            [id = 3, date = #date(2019,2,2), b = 354, c = 4534],
            [id = 4, date = #date(2019,4,6), b = 543, c = 544],
            [id = 5, date = #date(2019,5,5), b = 654, c = 432]
        }
    ),
    
    ChangeColumnTypes = Table.TransformColumnTypes(Source,
        {
            {"date", type date}, 
            {"id", Int64.Type}, 
            {"b", Int64.Type}, 
            {"c", Int64.Type}
        }
    ),

TempTable = ChangeColumnTypes,
ResultTable = ChangeColumnTypes, 

MinDate = (x as table, y as number ) as date => List.Min(
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

MaxDate = (x as table, y as number ) as date => List.Max(
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

AddIsFirstDateColumn = Table.AddColumn(ResultTable,
        "IsFirstDate",
        each Record.Field(_,"date") = MinDate(TempTable,Record.Field(_,"id"))
),

AddIsLastDateColumn =Table.AddColumn(AddIsFirstDateColumn,
        "IsLastDate",
        each Record.Field(_,"date") = MaxDate(TempTable,Record.Field(_,"id"))
),

getFirstValue = (x as table,y as number) as number => List.Min(
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsFirstDate") = true
    ),"b")
),

FirstValue = Table.AddColumn(
    AddIsLastDateColumn,"FirstValue",
    each getFirstValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

getLastValue = (x as table,y as number) as number => List.Min(
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsLastDate") = true
    ),"c")
),

LastValue = Table.AddColumn(
    FirstValue,"LastValue",
    each getLastValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

RemoveDuplicates = Table.SelectRows(LastValue,each [IsLastDate] = true),
RemoveColumns = Table.RemoveColumns(RemoveDuplicates,{"IsLastDate","IsFirstDate","b","c"})
in
    RemoveColumns

View solution in original post

8 REPLIES 8
sreenathv
Solution Sage
Solution Sage

Two inputs required.

 

1) If there are two entries with the same date and same id, what should be the result?

2) If there is more than one entry for an id with different dates, after removing the duplicate rows, what is the date you want to show in the date field of the result table? First date or last date?

1) If there are two entries with the same date and same id, what should be the result?

-> duplicate (the latter one on the list) can be removed. This is very rare possibility and thus it really doesn't make a huge difference whichever it is.

 

2) If there is more than one entry for an id with different dates, after removing the duplicate rows, what is the date you want to show in the date field of the result table? First date or last date?

-> Last date. Date has not got too much value after this step.

For the sake of easy understanding of the code, I have written the power query step by step. But you can combine several steps and make the query much denser.

 

let Source = Table.FromRecords(
        {
            [id = 1, date = #date(2018,2,3), b = 513, c = 423],
            [id = 2, date = #date(2018,5,1), b = 534, c = 432],
            [id = 2, date = #date(2019,1,18), b = 345, c = 5434],
            [id = 3, date = #date(2019,2,2), b = 354, c = 4534],
            [id = 4, date = #date(2019,4,6), b = 543, c = 544],
            [id = 5, date = #date(2019,5,5), b = 654, c = 432]
        }
    ),
    
    ChangeColumnTypes = Table.TransformColumnTypes(Source,
        {
            {"date", type date}, 
            {"id", Int64.Type}, 
            {"b", Int64.Type}, 
            {"c", Int64.Type}
        }
    ),

TempTable = ChangeColumnTypes,
ResultTable = ChangeColumnTypes, 

MinDate = (x as table, y as number ) as date => List.Min(
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

MaxDate = (x as table, y as number ) as date => List.Max(
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

AddIsFirstDateColumn = Table.AddColumn(ResultTable,
        "IsFirstDate",
        each Record.Field(_,"date") = MinDate(TempTable,Record.Field(_,"id"))
),

AddIsLastDateColumn =Table.AddColumn(AddIsFirstDateColumn,
        "IsLastDate",
        each Record.Field(_,"date") = MaxDate(TempTable,Record.Field(_,"id"))
),

getFirstValue = (x as table,y as number) as number => List.Min(
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsFirstDate") = true
    ),"b")
),

FirstValue = Table.AddColumn(
    AddIsLastDateColumn,"FirstValue",
    each getFirstValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

getLastValue = (x as table,y as number) as number => List.Min(
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsLastDate") = true
    ),"c")
),

LastValue = Table.AddColumn(
    FirstValue,"LastValue",
    each getLastValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

RemoveDuplicates = Table.SelectRows(LastValue,each [IsLastDate] = true),
RemoveColumns = Table.RemoveColumns(RemoveDuplicates,{"IsLastDate","IsFirstDate","b","c"})
in
    RemoveColumns

Thank you so much  @sreenathv , this is working well!

 

Two questions 

- If there are duplicate dates for certain ID, duplicate is not yet removed. Should I just remove duplicates in the end? As told before, there is no way to see what is the last change in these cases as there are no time stamps.

- After all my real ID:s contain a hyphen (-) and thus they are not number values. Do I just create an ID without the hyphen or is there a way to make this code work in a situation where ID if a text field?

Jenni 

1) There is an ambiguity if id-date combinations repeat. There is no way a system can determine which one to keep and which one to remove. Maybe you can use some functions like List.First() etc.. to randomly pick the first record and keep it.

 

2) IF id is text, change the data type of the following line as text. 

 

{"id", Int64.Type}, 

 

I think it should work as we are not using the id field for any arithmetic calculations. We are using it only for matching. So it really does not matter if id is number or text.

 

 

1) Yes this is the way to go, I agree.

 

2) I already did tchange the type you described this but keep receiving error in the "RemoveDuplicates" -step. 

Expression.Error: We cannot convert the value "9-432235" to type Number.
Details:
    Value= 9-432235
    Type=Type

 

Jenni

Refer to the code below...

 

I have modified certain codes and identified as //MODIFIED CODE

 

let Source = Table.FromRecords(
        {
            [id = 1, date = #date(2018,2,3), b = 513, c = 423],
            [id = 2, date = #date(2018,5,1), b = 534, c = 432],
            [id = 2, date = #date(2019,1,18), b = 345, c = 5434],
            [id = 3, date = #date(2019,2,2), b = 354, c = 4534],
            [id = 4, date = #date(2019,4,6), b = 543, c = 544],
            [id = 5, date = #date(2019,5,5), b = 654, c = 432]
        }
    ),
    
    ChangeColumnTypes = Table.TransformColumnTypes(Source,
        {
            {"date", type date}, 
            // {"id", Int64.Type}, 
            {"id", type text},  //MODIFIED CODE
            {"b", Int64.Type}, 
            {"c", Int64.Type}
        }
    ),

TempTable = ChangeColumnTypes,
ResultTable = ChangeColumnTypes, 

//MinDate = (x as table, y as number ) as date => List.Min(
MinDate = (x as table, y as text ) as date => List.Min( // MODIFIED CODE
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

//MaxDate = (x as table, y as number ) as date => List.Max(
MaxDate = (x as table, y as text ) as date => List.Max( //MODIFIED CODE
    Table.Column(
        Table.SelectRows(x,each Record.Field(_,"id")=y),
    "date")
),

AddIsFirstDateColumn = Table.AddColumn(ResultTable,
        "IsFirstDate",
        each Record.Field(_,"date") = MinDate(TempTable,Record.Field(_,"id"))
),

AddIsLastDateColumn =Table.AddColumn(AddIsFirstDateColumn,
        "IsLastDate",
        each Record.Field(_,"date") = MaxDate(TempTable,Record.Field(_,"id"))
),

//getFirstValue = (x as table,y as number) as number => List.Min(
getFirstValue = (x as table,y as text) as number => List.Min( //MODIFIED CODE
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsFirstDate") = true
    ),"b")
),

FirstValue = Table.AddColumn(
    AddIsLastDateColumn,"FirstValue",
    each getFirstValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

//getLastValue = (x as table,y as number) as number => List.Min(
getLastValue = (x as table,y as text) as number => List.Min( //MODIFIED CODE
    Table.Column(Table.SelectRows(
        x, each Record.Field(_,"id")=y and Record.Field(_,"IsLastDate") = true
    ),"c")
),

LastValue = Table.AddColumn(
    FirstValue,"LastValue",
    each getLastValue(AddIsLastDateColumn,Record.Field(_,"id"))
),

RemoveDuplicates = Table.SelectRows(LastValue,each [IsLastDate] = true),
RemoveColumns = Table.RemoveColumns(RemoveDuplicates,{"IsLastDate","IsFirstDate","b","c"})
in
    RemoveColumns

Great, thanks @sreenathv !

 

Jenni

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.