DemoFour
Responsive Resident

Removing Duplicates by Date logged on

Afternoon All.


I have a problem where people are adding duplicates into our referral system. Whilst this issue is being addressed, I still have to remove the duplicates that have already been added, in order to create some reports.
I have already removed duplicates entered on the same date in the Query Editor, by using the Group By function and then removing all rows with a count > 1 from the list.
I have then isolated the duplicates with different dates, leaving a table of just the client number and the date logged on, by using Group By on the client number and removing any with a count = 1.

I can share this query if needed; I do not really know M and it’s taken me a while to get to this stage!

 

The problem

 

I need to get a number of days between each client record left, so that I can then remove any that are duplicates within a certain time frame.

This will leave all the others as genuine re-referrals back into the system.

 

I have looked at this question and the related article and solution https://community.powerbi.com/t5/Power-Query/Grouping-Date-sets-within-a-month/m-p/1428838#M44645

 

But it does not quite fit the bill: the answer is specific to that problem set, so either it will not work for me, it needs a tweak, or this is where my M understanding lets me down.

 

When I tried to implement the suggested way, I got the dates to move down a row, but then they did not relate to the same client and so it was not a workable solution. 

 

Data Sample

 

Client Number    Logged On
15500            12/10/2020
15500            31/07/2020
10306            12/10/2020
10306            09/10/2020
13954            09/10/2020
13954            02/10/2020
15806            09/10/2020
15806            11/09/2020
13763            09/10/2020
13763            17/09/2020
16017            09/10/2020
16017            02/10/2020
15347            08/10/2020
15347            23/07/2020
15347            14/07/2020
15386            07/10/2020
15386            17/07/2020
15238            07/10/2020
15238            01/07/2020
15388            07/10/2020
15388            17/07/2020
9932             07/10/2020
9932             17/07/2020
15904            07/10/2020
15904            24/09/2020
15823            07/10/2020
15823            02/10/2020
15823            14/09/2020
15779            02/10/2020
15779            09/09/2020
15990            02/10/2020
15990            12/08/2020
12643            02/10/2020
12643            10/08/2020
15789            30/09/2020
15789            10/09/2020
15914            29/09/2020
15914            25/09/2020
14786            24/09/2020
14786            30/07/2020
12757            17/09/2020
12757            16/09/2020
10393            16/09/2020
10393            25/06/2020
7965             15/09/2020
7965             14/09/2020
15446            14/09/2020
15446            27/07/2020
14293            10/09/2020
14293            29/07/2020
15770            09/09/2020
15770            08/09/2020
10382            04/09/2020
10382            18/08/2020
15749            04/09/2020
15749            28/08/2020
15387            01/09/2020
15387            17/07/2020
9066             14/08/2020
9066             22/07/2020
15222            13/08/2020
15222            29/06/2020
15452            28/07/2020
15452            27/07/2020
15395            17/07/2020
15395            01/07/2020
15215            14/07/2020
15215            29/06/2020
15230            01/07/2020
15230            30/06/2020
15232            01/07/2020
15232            30/06/2020
15240            01/07/2020
15240            25/06/2020
15219            30/06/2020
15219            29/06/2020
10623            30/06/2020
10623            29/06/2020

 

 

Now some of the duplicates have 2 entries but others have 3, so I think this is creating a bigger problem!

 

What I need to achieve is listing the dates in separate columns so that I can work out the days between each entry, as some will be genuine re-referrals due to the time lapse between them. I can then remove the ones that have been added in error and keep the ones with a certain time frame between them, which are re-referrals, not duplicate entries.

I have found the Duration.Days function to do this, but I need to have the dates in columns to compare them. I am not sure how this will work for more than 2 dates!
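For reference, Duration.Days takes a duration, and subtracting two dates produces one, so a quick check of the idea (using two dates from the sample above) looks like:

```m
// Subtracting two dates yields a duration; Duration.Days converts it to whole days
Duration.Days(#date(2020, 10, 12) - #date(2020, 7, 31))  // 73
```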


Any help would be gratefully received, and I can share the source data if there is a better solution than the route I am taking here, or if that would make it easier to suggest one.

I have read lots of posts, fumbled my way to this point but am now a bit stuck! 

Thank you in advance 🙂 

 

1 ACCEPTED SOLUTION

10 REPLIES
Smauro
Solution Sage

@DemoFour 

A more general solution, using a function:

//ListDatesToDurations generates a list of durations between given dates, using a threshold
(datelist as list, threshold as number, optional noRecord as logical) as list =>
    let
        listItem = (x, l) => l{x},

        #"Sort and Distinct List" = List.Sort(List.Distinct(datelist)),
        listCount = List.Count(#"Sort and Distinct List"),
        durationsList = List.Generate(
            () => [d = null, k=1],
            each [k]<=listCount,
            each
                [d = Duration.Days(listItem([k], #"Sort and Distinct List") - listItem([k]-1, #"Sort and Distinct List")),
                k = [k] + 1],
            each [d] ),
        datedurations = List.Zip({#"Sort and Distinct List",durationsList}),
        thresholdCheck = 
            if List.Count(List.Select(durationsList, (item) => (item??(threshold+1)) <=threshold)) > 0
            then @ListDatesToDurations(List.Transform(List.Select(datedurations, (item) => ((item{1})??(threshold+1)) > threshold), each _{0}), threshold, true)
            else datedurations,
        singleDateCheck = if listCount = 1 then {{datelist{0}, null}} else thresholdCheck,
        toRecords = List.Transform(singleDateCheck, each Record.FromList(_, type [Date = date, #"Duration From Previous" = Int64.Type]))
    in
        if (noRecord??false) then singleDateCheck else toRecords

What it does is take a list of dates and a given threshold (meaning "don't count anything less than or equal to that") and return a list of records with each date and the duration from the previous date. By default, it will also remove duplicate dates.

Add it as a new query and call it ListDatesToDurations.
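To see the behaviour concretely (these dates are just made up for illustration):

```m
// threshold = 2: any gap of 2 days or fewer is treated as a duplicate entry
ListDatesToDurations({#date(2020, 10, 2), #date(2020, 10, 9)}, 2)
// -> [Date = 02/10/2020, Duration From Previous = null],
//    [Date = 09/10/2020, Duration From Previous = 7]

ListDatesToDurations({#date(2020, 10, 2), #date(2020, 10, 3)}, 2)
// -> [Date = 02/10/2020, Duration From Previous = null]   (03/10 is dropped)
```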

 

To see it in action:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fdRdrsQgCAXgvfR5kiLgD2uZzP63cavSTMUz9/XLCVLQvt9HypnoeB2cz0QnE9PxeT1YIaeKWTAzZEkn1QeTUAFpZ7KVxbL+w/HIBos4p6sTexapRVDtydfHL+lCqYL0zbET0cENMss6E+ekkdv4nLiGyb3BJc3SUNo5bbVhenKobSa8h123PowUVZ58XbVlrLmxoLTztuDBKRap1VDa2ULajFB68nUxr619mYuiTpwvW9K5tn6kUGxwcIpsacxka3ByXlnr2HucoHM/8rkGrrmCa3xzWZnE5Df3TsqXq5Xcw2t/t8bdqJbfzOHyKJuAQTn3QdV1wYQW7Nzix7RxixVyanGTaiDtzDEtrfors53jc6Jyj6RtyhyfNfP86a4HTu4TKctYM3t7FXB8qtd/FL3gydsvg9O9YMBbJywEiwzutzWkGacZphXXVtpva2/QYJHBsW8q4xcT087P9OcP", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Client Number" = _t, #"Logged On" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Logged On", type date}}),
    
    #"Group for Durations" = Table.Group(#"Changed Type", {"Client Number"},
        {
            {"DateDurations", each 
            ListDatesToDurations([Logged On], 2)
            , type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
        }),
    #"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
    #"Expanded DateDurations"

You see, we call it with a threshold of 2, meaning that any two dates less than 3 days apart will not be counted and the first one will remain.

To use it with your code:

...
    #"Group for Durations" = Table.Group(#"Previous Step", {"Client Number"},
        {
            {"DateDurations", each 
            ListDatesToDurations([Logged On], 2)
            , type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
        }),
    #"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
    #"Expanded DateDurations"

where you'll need to replace #"Previous Step" with the name of your last step.

Cheers,
Spyros




Feel free to connect with me:
LinkedIn

DemoFour
Responsive Resident

Hi again @Smauro as you can see I am still learning! 

Thank you for your reply, I have pasted your query but I get:

 

Expression.Error: The import ListDatesToDurations matches no exports. Did you miss a module reference? on the Group for Durations step 

 

and

 

Expression.Error: The import ListDatesToDurations matches no exports. Did you miss a module reference? on the Expanded DateDurations step. 

 

@DemoFour  You should name the first query (the function) you've just pasted ListDatesToDurations





DemoFour
Responsive Resident

@Smauro it now says: Expression.Error: A cyclic reference was encountered during evaluation.

 

I named the query ListDatesToDurations in the right hand Query Settings is that what you meant? 


Hey @DemoFour ,

I meant that:

On the left. The query's name

 

But you could try it all in a single query, like this:

let
    //ListDatesToDurations generates a list of durations between given dates, using a threshold
    ListDatesToDurations = (datelist as list, threshold as number, optional noRecord as logical) as list =>
        let
            listItem = (x, l) => l{x},

            #"Sort and Distinct List" = List.Sort(List.Distinct(datelist)),
            listCount = List.Count(#"Sort and Distinct List"),
            durationsList = List.Generate(
                () => [d = null, k=1],
                each [k]<=listCount,
                each
                    [d = Duration.Days(listItem([k], #"Sort and Distinct List") - listItem([k]-1, #"Sort and Distinct List")),
                    k = [k] + 1],
                each [d] ),
            datedurations = List.Zip({#"Sort and Distinct List",durationsList}),
            thresholdCheck = 
                if List.Count(List.Select(durationsList, (item) => (item??(threshold+1)) <=threshold)) > 0
                then @ListDatesToDurations(List.Transform(List.Select(datedurations, (item) => ((item{1})??(threshold+1)) > threshold), each _{0}), threshold, true)
                else datedurations,
            singleDateCheck = if listCount = 1 then {{datelist{0}, null}} else thresholdCheck,
            toRecords = List.Transform(singleDateCheck, each Record.FromList(_, type [Date = date, #"Duration From Previous" = Int64.Type]))
        in
            if (noRecord??false) then singleDateCheck else toRecords,

    
    
    
    Source = #"All Contact Records (build)",
    // This is the table we want to join the next steps to later in the process
    #"Table of Referrals (Select Columns)" = Table.SelectColumns(#"All Contact Records (build)",{"Logged On", "Client Number", "Client Name", "Classification 2", "Staff Member"}),
    // Filtering happens after removing columns to reduce data load
        // Added extra step to remove null dates just in case
    #"Filtered Rows" = Table.SelectRows(#"Table of Referrals (Select Columns)", each ([Classification 2] = ".REFERRAL ") and ([Logged On] <> null)),
    
    // Find real referrals
    // These two columns are chosen to reduce the data load, as we just need this information to remove the duplicates
    #"Group for Durations" = Table.Group(Table.SelectColumns(#"Filtered Rows", {"Logged On", "Client Number"} ), {"Client Number"},
        {
            {"DateDurations", each 
            ListDatesToDurations([Logged On], 2)
            , type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
        }),
    #"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
    #"Expanded DateDurations"

 





DemoFour
Responsive Resident

Thanks @Smauro and @Jimmy801 

 

I need to spend some time to have a look at what you have posted and try to understand it.

Thank you for your help, I may have more questions later...

Have a great day

Jimmy801
Community Champion

Hello @DemoFour 

 

I didn't see any duplicates in your table, so I've now tried grouping by client and identifying the duration between the log-ons. I think this is what you need:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("tZdbisMwDEW3EvJdiCy/11K6t1lLVjZxJ0xj2UTXHgZMoB+HK189rD6fq/GeaNm/rmd9rJdfx8fwZmhjYlpfD5CxZqN4YchSGNVBGMo1Y7N3/8JID9J4bAhjDt/yNbYY7PB9AMbEWieQiaM6ECN9s05n0jjDtq43hDFOMknPaZSx6UzxutJhm4Z1EMY09xnXARhxn5wtCwRQQqjWuUx6p8obAQy7uht8Yr3rpA7CNFNEZ4yMLcY8rIMwWejkrE75Rgdgjil/NPiH4eCGfUOYA6h0fEyqB5ak1wqznEK1cea+4Mrhxm0E8jXkojJ8lk5tQ1Ax4tp4HP39OH0bId4UDAo1RDbfp3YWOv0LHyzm4AXVV6ttB7G907rO6db/gWMxLh0DXnYqGOVKEcd6zNy3/3maUYNySaY9yTekz7kprjiT5BBx+kDoKKIcSz2b9Bbaf57+PMHJp5yCLLQeeZZoGgbfV2S5DTGaDCvMgclSqKFqKA8VTklHnOHk3nL8xZhIB8w1mx8bTE9uwCDX+MkWa+AmTowrz5HQwxq/1YO4Vs9N3g/jxMNU8gCNizZOneukj4Kyhi6/29EEdlV7fQM=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Client Number  " = _t, #"   Logged On" = _t]),
    #"Trimmed Text" = Table.TransformColumns(Source,{{"   Logged On", each Date.From(Text.Trim(_), "it-IT"), type date}}),
    #"Renamed Columns" = Table.RenameColumns(#"Trimmed Text",{{"   Logged On", "Logged on"}}),
    Sort = Table.Sort(#"Renamed Columns",{{"Client Number  ", Order.Ascending}, {"Logged on", Order.Ascending}}),
    AddIndex = Table.AddIndexColumn(Sort, "Index", 0, 1),
    GroupByClient = Table.Group(AddIndex, {"Client Number  "}, {{"AllRows", each _, type table [#"Client Number  "=text, #"Logged on"=date, Index=number]}}),

    AddDuration = Table.TransformColumns
    (
        GroupByClient,
        {
            {
                "AllRows",
                (tbl)=> Table.AddColumn(tbl, "duration", (add)=>try Duration.TotalDays(add[Logged on]-Table.SelectRows(tbl, each [Index]=add[Index]-1)[Logged on]{0}) otherwise null , type number)
            }
        }
    ),
    #"Expanded AllRows" = Table.ExpandTableColumn(AddDuration, "AllRows", {"Logged on", "duration"}, {"Logged on", "duration"})
in
    #"Expanded AllRows"

 

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudos are nice too

Have fun

Jimmy

AlB
Super User

Hi @DemoFour,

I have not completely understood what you want to do. Based on the sample you've provided, can you explain what the expected result would be and the logic behind it?

 

Please mark the question solved when done and consider giving kudos if posts are helpful.

Contact me privately for support with any larger-scale BI needs, tutoring, etc.

Cheers 

 


 

 

DemoFour
Responsive Resident

Thanks for your reply @AlB 

 

What I am trying to achieve would be as follows:

 

Client    Logged On 1    Logged On 2    Days Btw    Duplicate
1         01/01/20       03/01/20       2           True
2         01/01/20       01/03/20       60          False

This is so that I can remove the duplicate entries that staff have added without checking whether a colleague has already made an entry for the client within a timeframe.
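A rough sketch of that flag in M (the step and column names here are only illustrative, and the 14-day window is an assumed threshold, not a decided one):

```m
// Hypothetical steps, assuming the two dates already sit in separate columns
#"Added Days Btw" = Table.AddColumn(#"Previous Step", "Days Btw",
    each Duration.Days([Logged On 2] - [Logged On 1]), Int64.Type),
#"Added Duplicate" = Table.AddColumn(#"Added Days Btw", "Duplicate",
    each [Days Btw] <= 14, type logical)
```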

 

However, @Jimmy801 has achieved this by looking within the columns for each client and then finding the duration for each client entry (which I could do in DAX, but it needs to be done in the Query Editor in order to be effective).

Thank you for your query @Jimmy801 , I have just pasted that in and it gives the days between each referral date entry. 

As you pointed out, there were only duplicate clients left with different dates. I have already removed the duplicates on the same date by doing a similar group-by exercise with an index column and then removing duplicates in the index column.

I can use the function buttons in the ribbon as I get used to this work, and I am slowly learning the M related to these functions, but it is very slow as I am very new to this and have already spent 12 months teaching myself DAX!

@Jimmy801, could you please further explain the AddDuration step, as I used something similar previously to remove the duplicates on the same day? I assume you have written this out and not used the ribbon to achieve it (the notes in my query are so I can check my logic as I try different things).

let
    Source = #"All Contact Records (build)",
    #"Filtered Rows" = Table.SelectRows(Source, each ([Classification 2] = ".REFERRAL ")),
    // This is the table we want to join the next steps to later in the process
    #"Table of Referrals (Select Columns)" = Table.SelectColumns(#"Filtered Rows",{"Logged On", "Client Number", "Client Name", "Classification 2", "Staff Member"}),
    // These two columns are chosen to reduce the data load, as we just need this information to remove the duplicates
    #"Select Columns" = Table.SelectColumns(#"Table of Referrals (Select Columns)", {"Logged On", "Client Number"} ),
    // This counts the number of entries for each client made on the same date. Once this is completed, any record that is > 1 is a same-day duplicate.
    #"Grouped Rows" = Table.Group(#"Select Columns", {"Logged On", "Client Number"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    // By removing >1 this will remove all same date duplicate records
    #"Filtered Duplicate Entries" = Table.SelectRows(#"Grouped Rows", each ([Count] = 1)),
    // Removed the column as not needed once the duplicates entered on the same day have been removed 
    #"Removed Count" = Table.RemoveColumns(#"Filtered Duplicate Entries",{"Count"}),
    // By grouping by client number, this enables a count of the client numbers and then anything = 1 can be removed to leave just the final duplicates on different dates.
    #"Client Number Count (Grouped Rows)" = Table.Group(#"Removed Count", {"Client Number"}, {{"Client Count", each Table.RowCount(_), Int64.Type}, {"Table", each _, type table [Logged On=nullable date, Client Number=nullable number]}}),
    // Removed the non duplicated clients
    #"Kept Duplicate Clients (Filtered Rows)" = Table.SelectRows(#"Client Number Count (Grouped Rows)", each ([Client Count] <> 1)),
    // Brought back the dates to then find the time between each date, to then remove duplicates from definite repeat referrals.
    #"Expanded Table" = Table.ExpandTableColumn(#"Kept Duplicate Clients (Filtered Rows)", "Table", {"Logged On"}, {"Logged On"}),
    // This column is no longer needed
    // 
    #"Removed Count Column" = Table.RemoveColumns(#"Expanded Table",{"Client Count"})
in
    #"Removed Count Column"



 

This is part of the problem of cleaning up the source data before I can take it into Power BI to build reports. I also need to merge this query with the demographic data, so that if clients do come back twice, this query brings them in with a column to indicate a repeat referral.
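That merge could look roughly like this (the query and column names here are placeholders, not my real ones):

```m
// Hypothetical left join of demographics onto the cleaned referral list
#"Merged Demographics" = Table.NestedJoin(#"Cleaned Referrals", {"Client Number"},
    #"Demographic Data", {"Client Number"}, "Demographics", JoinKind.LeftOuter),
#"Expanded Demographics" = Table.ExpandTableColumn(#"Merged Demographics",
    "Demographics", {"Client Name"})
```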

At some point once I have removed the duplicates I will need to join this back to the source table so that all the duplicates are removed and I am left with a list of clients who have genuine re-referrals in the future. 

Thank you everyone for your help, it's genuinely appreciated as I learn how to use Power BI to its full capacity. 

Hello @DemoFour 

 

First I group by client, applying a function that keeps all the rows of each group.

The AddDuration step transforms each of these grouped tables, adding a new column that shows the days between the current row and the previous row.

Is this what you were looking for?

 

BR

 

Jimmy
