Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!
Afternoon All.
I have a problem where people are adding duplicates in to our referral system, whilst this issue is being addressed, I still have to remove the duplicates that have already been added, to create some reports.
I have already removed duplicates entered on the same date in the Query editor, by using the group by function and then removing all that are >1 from the list.
I have then isolated the duplicates with different dates, to leave a table of just the client number and date logged on. Using Group By on the client number and removing any that are = 1
I can share this query if needed, I do not really know M and it’s taken me a while to get to this stage!
The problem
I need to get a number of days between each client record left, so that I can then remove any that are duplicates within a certain time frame.
This will leave all the others as genuine re-referrals back into the system.
I have looked at this question and the related article and solution https://community.powerbi.com/t5/Power-Query/Grouping-Date-sets-within-a-month/m-p/1428838#M44645
But it does not quite fit the bill and as the answer is specific to the problem set, it will not work for me or it needs a tweak, or this is where my M understanding lets me down.
When I tried to implement the suggested way, I got the dates to move down a row, but then they did not relate to the same client and so it was not a workable solution.
Data Sample
Client Number Logged On
15500 12/10/2020
15500 31/07/2020
10306 12/10/2020
10306 09/10/2020
13954 09/10/2020
13954 02/10/2020
15806 09/10/2020
15806 11/09/2020
13763 09/10/2020
13763 17/09/2020
16017 09/10/2020
16017 02/10/2020
15347 08/10/2020
15347 23/07/2020
15347 14/07/2020
15386 07/10/2020
15386 17/07/2020
15238 07/10/2020
15238 01/07/2020
15388 07/10/2020
15388 17/07/2020
9932 07/10/2020
9932 17/07/2020
15904 07/10/2020
15904 24/09/2020
15823 07/10/2020
15823 02/10/2020
15823 14/09/2020
15779 02/10/2020
15779 09/09/2020
15990 02/10/2020
15990 12/08/2020
12643 02/10/2020
12643 10/08/2020
15789 30/09/2020
15789 10/09/2020
15914 29/09/2020
15914 25/09/2020
14786 24/09/2020
14786 30/07/2020
12757 17/09/2020
12757 16/09/2020
10393 16/09/2020
10393 25/06/2020
7965 15/09/2020
7965 14/09/2020
15446 14/09/2020
15446 27/07/2020
14293 10/09/2020
14293 29/07/2020
15770 09/09/2020
15770 08/09/2020
10382 04/09/2020
10382 18/08/2020
15749 04/09/2020
15749 28/08/2020
15387 01/09/2020
15387 17/07/2020
9066 14/08/2020
9066 22/07/2020
15222 13/08/2020
15222 29/06/2020
15452 28/07/2020
15452 27/07/2020
15395 17/07/2020
15395 01/07/2020
15215 14/07/2020
15215 29/06/2020
15230 01/07/2020
15230 30/06/2020
15232 01/07/2020
15232 30/06/2020
15240 01/07/2020
15240 25/06/2020
15219 30/06/2020
15219 29/06/2020
10623 30/06/2020
10623 29/06/2020
Now some of the duplicates have 2 entries but others have 3, so I think this is creating a bigger problem!
What I need to achieve is listing the dates in separate columns so that I can then find out the days between each entry, as some will be genuine re-referrals due the time lapse between them. Remove the ones that have been added in error and keep the ones that have a certain time frame between them and are re-referrals not duplicate entries.
I have found the Duration.Days function to do this, but I need to have the dates in columns to compare them to to do this. I am not sure how this will happen for more than 2 dates!
Any help would be gratefully received and I can share the source data if there is a better solution than the route I am taking here. Or it would be easier to suggest a solution.
I have read lots of posts, fumbled my way to this point but am now a bit stuck!
Thank you in advance 🙂
Solved! Go to Solution.
Hey @DemoFour ,
I meant that:
But you could try that just using that:
let
//ListDatesToDurations generates a list of durations between given dates, using a thresshold
ListDatesToDurations = (datelist as list, threshold as number, optional noRecord as logical) as list =>
let
listItem = (x, l) => l{x},
#"Sort and Distinct List" = List.Sort(List.Distinct(datelist)),
listCount = List.Count(#"Sort and Distinct List"),
durationsList = List.Generate(
() => [d = null, k=1],
each [k]<=listCount,
each
[d = Duration.Days(listItem([k], #"Sort and Distinct List") - listItem([k]-1, #"Sort and Distinct List")),
k = [k] + 1],
each [d] ),
datedurations = List.Zip({#"Sort and Distinct List",durationsList}),
thresholdCheck =
if List.Count(List.Select(durationsList, (item) => (item??(threshold+1)) <=threshold)) > 0
then @ListDatesToDurations(List.Transform(List.Select(datedurations, (item) => ((item{1})??(threshold+1)) > threshold), each _{0}), threshold, true)
else datedurations,
singleDateCheck = if listCount = 1 then {{datelist{0}, null}} else thresholdCheck,
toRecords = List.Transform(singleDateCheck, each Record.FromList(_, type [Date = date, #"Duration From Previous" = Int64.Type]))
in
if (noRecord??false) then singleDateCheck else toRecords,
Source = #"All Contact Records (build)",
// This is the table we want to join the next steps to later in the process
#"Table of Referrals (Select Columns)" = Table.SelectColumns(#"All Contact Records (build)",{"Logged On", "Client Number", "Client Name", "Classification 2", "Staff Member"}),
// Filtering happens after removing columns to reduce data load
// Added extra step to remove null dates just in case
#"Filtered Rows" = Table.SelectRows(#"Table of Referrals (Select Columns)", each ([Classification 2] = ".REFERRAL ") and ([Logged On] <> null)),
// Find real referals
// These two columns are chossen to reduce the data load as we just need to work with information to remove the duplicates
#"Group for Durations" = Table.Group(Table.SelectColumns(#"Filtered Rows", {"Logged On", "Client Number"} ), {"Client Number"},
{
{"DateDurations", each
ListDatesToDurations([Logged On], 2)
, type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
}),
#"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
#"Expanded DateDurations"
@DemoFour
A more general solution, using a function:
//ListDatesToDurations generates a list of durations between given dates, using a thresshold
(datelist as list, threshold as number, optional noRecord as logical) as list =>
let
listItem = (x, l) => l{x},
#"Sort and Distinct List" = List.Sort(List.Distinct(datelist)),
listCount = List.Count(#"Sort and Distinct List"),
durationsList = List.Generate(
() => [d = null, k=1],
each [k]<=listCount,
each
[d = Duration.Days(listItem([k], #"Sort and Distinct List") - listItem([k]-1, #"Sort and Distinct List")),
k = [k] + 1],
each [d] ),
datedurations = List.Zip({#"Sort and Distinct List",durationsList}),
thresholdCheck =
if List.Count(List.Select(durationsList, (item) => (item??(threshold+1)) <=threshold)) > 0
then @ListDatesToDurations(List.Transform(List.Select(datedurations, (item) => ((item{1})??(threshold+1)) > threshold), each _{0}), threshold, true)
else datedurations,
singleDateCheck = if listCount = 1 then {{datelist{0}, null}} else thresholdCheck,
toRecords = List.Transform(singleDateCheck, each Record.FromList(_, type [Date = date, #"Duration From Previous" = Int64.Type]))
in
if (noRecord??false) then singleDateCheck else toRecords
What it does is take a list of dates, a given threshold (which means "don't count anything less than or equal than that") and returns a list of records with the date and the duration from previous date. By default, it will also remove duplicate dates.
Add it as a new query and call it ListDatesToDurations.
To see it in action:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fdRdrsQgCAXgvfR5kiLgD2uZzP63cavSTMUz9/XLCVLQvt9HypnoeB2cz0QnE9PxeT1YIaeKWTAzZEkn1QeTUAFpZ7KVxbL+w/HIBos4p6sTexapRVDtydfHL+lCqYL0zbET0cENMss6E+ekkdv4nLiGyb3BJc3SUNo5bbVhenKobSa8h123PowUVZ58XbVlrLmxoLTztuDBKRap1VDa2ULajFB68nUxr619mYuiTpwvW9K5tn6kUGxwcIpsacxka3ByXlnr2HucoHM/8rkGrrmCa3xzWZnE5Df3TsqXq5Xcw2t/t8bdqJbfzOHyKJuAQTn3QdV1wYQW7Nzix7RxixVyanGTaiDtzDEtrfors53jc6Jyj6RtyhyfNfP86a4HTu4TKctYM3t7FXB8qtd/FL3gydsvg9O9YMBbJywEiwzutzWkGacZphXXVtpva2/QYJHBsW8q4xcT087P9OcP", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Client Number" = _t, #"Logged On" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Logged On", type date}}),
#"Group for Durations" = Table.Group(#"Changed Type", {"Client Number"},
{
{"DateDurations", each
ListDatesToDurations([Logged On], 2)
, type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
}),
#"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
#"Expanded DateDurations"
You see, we call it with a threshold of 2, meaning that any two dates less than 3 days apart will not be counted and the first one will remain.
To use it with your code:
...
#"Group for Durations" = Table.Group(#"Previous Step", {"Client Number"},
{
{"DateDurations", each
ListDatesToDurations([Logged On], 2)
, type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
}),
#"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
#"Expanded DateDurations"
where you'll need to change #"Previous Step" with your last step name.
Cheers,
Spyros
Hi again @Smauro as you can see I am still learning!
Thank you for your reply, I have pasted your query but I get:
Expression.Error: The import ListDatesToDurations matches no exports. Did you miss a module reference? on the Group for Durations step
and
Expression.Error: The import ListDatesToDurations matches no exports. Did you miss a module reference? on the Expanded DateDurations step.
@Smauroit now says Expression.Error: A cyclic reference was encountered during evaluation.
I named the query ListDatesToDurations in the right hand Query Settings is that what you meant?
Hey @DemoFour ,
I meant that:
But you could try that just using that:
let
//ListDatesToDurations generates a list of durations between given dates, using a thresshold
ListDatesToDurations = (datelist as list, threshold as number, optional noRecord as logical) as list =>
let
listItem = (x, l) => l{x},
#"Sort and Distinct List" = List.Sort(List.Distinct(datelist)),
listCount = List.Count(#"Sort and Distinct List"),
durationsList = List.Generate(
() => [d = null, k=1],
each [k]<=listCount,
each
[d = Duration.Days(listItem([k], #"Sort and Distinct List") - listItem([k]-1, #"Sort and Distinct List")),
k = [k] + 1],
each [d] ),
datedurations = List.Zip({#"Sort and Distinct List",durationsList}),
thresholdCheck =
if List.Count(List.Select(durationsList, (item) => (item??(threshold+1)) <=threshold)) > 0
then @ListDatesToDurations(List.Transform(List.Select(datedurations, (item) => ((item{1})??(threshold+1)) > threshold), each _{0}), threshold, true)
else datedurations,
singleDateCheck = if listCount = 1 then {{datelist{0}, null}} else thresholdCheck,
toRecords = List.Transform(singleDateCheck, each Record.FromList(_, type [Date = date, #"Duration From Previous" = Int64.Type]))
in
if (noRecord??false) then singleDateCheck else toRecords,
Source = #"All Contact Records (build)",
// This is the table we want to join the next steps to later in the process
#"Table of Referrals (Select Columns)" = Table.SelectColumns(#"All Contact Records (build)",{"Logged On", "Client Number", "Client Name", "Classification 2", "Staff Member"}),
// Filtering happens after removing columns to reduce data load
// Added extra step to remove null dates just in case
#"Filtered Rows" = Table.SelectRows(#"Table of Referrals (Select Columns)", each ([Classification 2] = ".REFERRAL ") and ([Logged On] <> null)),
// Find real referals
// These two columns are chossen to reduce the data load as we just need to work with information to remove the duplicates
#"Group for Durations" = Table.Group(Table.SelectColumns(#"Filtered Rows", {"Logged On", "Client Number"} ), {"Client Number"},
{
{"DateDurations", each
ListDatesToDurations([Logged On], 2)
, type {(type [Date = date, #"Duration From Previous" = Int64.Type])} }
}),
#"Expanded DateDurations" = Table.ExpandRecordColumn(Table.ExpandListColumn(#"Group for Durations", "DateDurations"), "DateDurations", {"Date", "Duration From Previous"})
in
#"Expanded DateDurations"
Hello @DemoFour
i didn't see any duplicates in your table. I tried now to group by client and identify the duration between the log-ons. I think this is what you need
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("tZdbisMwDEW3EvJdiCy/11K6t1lLVjZxJ0xj2UTXHgZMoB+HK189rD6fq/GeaNm/rmd9rJdfx8fwZmhjYlpfD5CxZqN4YchSGNVBGMo1Y7N3/8JID9J4bAhjDt/yNbYY7PB9AMbEWieQiaM6ECN9s05n0jjDtq43hDFOMknPaZSx6UzxutJhm4Z1EMY09xnXARhxn5wtCwRQQqjWuUx6p8obAQy7uht8Yr3rpA7CNFNEZ4yMLcY8rIMwWejkrE75Rgdgjil/NPiH4eCGfUOYA6h0fEyqB5ak1wqznEK1cea+4Mrhxm0E8jXkojJ8lk5tQ1Ax4tp4HP39OH0bId4UDAo1RDbfp3YWOv0LHyzm4AXVV6ttB7G907rO6db/gWMxLh0DXnYqGOVKEcd6zNy3/3maUYNySaY9yTekz7kprjiT5BBx+kDoKKIcSz2b9Bbaf57+PMHJp5yCLLQeeZZoGgbfV2S5DTGaDCvMgclSqKFqKA8VTklHnOHk3nL8xZhIB8w1mx8bTE9uwCDX+MkWa+AmTowrz5HQwxq/1YO4Vs9N3g/jxMNU8gCNizZOneukj4Kyhi6/29EEdlV7fQM=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Client Number " = _t, #" Logged On" = _t]),
#"Trimmed Text" = Table.TransformColumns(Source,{{" Logged On", each Date.From(Text.Trim(_), "it-IT"), type date}}),
#"Renamed Columns" = Table.RenameColumns(#"Trimmed Text",{{" Logged On", "Logged on"}}),
Sort = Table.Sort(#"Renamed Columns",{{"Client Number ", Order.Ascending}, {"Logged on", Order.Ascending}}),
AddIndex = Table.AddIndexColumn(Sort, "Index", 0, 1),
GroupByClient = Table.Group(AddIndex, {"Client Number "}, {{"AllRows", each _, type table [#"Client Number "=text, #" Logged On"=date]}}),
AddDuration = Table.TransformColumns
(
GroupByClient,
{
{
"AllRows",
(tbl)=> Table.AddColumn(tbl, "duration", (add)=>try Duration.TotalDays(add[Logged on]-Table.SelectRows(tbl, each [Index]=add[Index]-1)[Logged on]{0}) otherwise null , type number)
}
}
),
#"Expanded AllRows" = Table.ExpandTableColumn(AddDuration, "AllRows", {"Logged on", "duration"}, {"Logged on", "duration"})
in
#"Expanded AllRows"
Copy paste this code to the advanced editor in a new blank query to see how the solution works.
If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too
Have fun
Jimmy
Hi @DemoFour,
I have not completely understood what you want to do. Based on the sample you've provided, can you explain what the expected result would be and the logic behind it?
Please mark the question solved when done and consider giving kudos if posts are helpful.
Contact me privately for support with any larger-scale BI needs, tutoring, etc.
Cheers
Thanks for your reply @AlB
What I am trying to achieve would be as follow's
Client Logged On 1 Logged on 2 Days Btw Duplicate
1 01/01/20 03/01/20 2 True
2 01/01/20 01/03/20 60 False
this is so that I can remove the duplicate entries that staff have added without checking to see if a colleague has already done an entry for the client within a timeframe.
However @Jimmy801 has achieve this by looking within the columns for each client and then finding the duration (Which I could do in DAX, but it needs to be done in the query editor in order to be effective) for each client entry.
Thank you for your query @Jimmy801 , I have just pasted that in and it gives the days between each referral date entry.
As you pointed out, there were only duplicate clients left with different dates. I have already removed the duplicates on the same date. by doing a similar group by exercise with and index column and then removing duplicates in the index column.
I can use the function buttons in the ribbon as I get used to this work and I am slowly learning the M related to these functions, but it is very slow as am very new to this and have already spent 12 months teaching myself DAX!
@Jimmy801could you please further explain the AddDuration step, as I used this previously to remove the duplicates on the same day. But I assume you have written this out and not used the ribbon to achieve this (the notes in my query are so I can check my logic as I try different things)?
let
Source = #"All Contact Records (build)",
#"Filtered Rows" = Table.SelectRows(Source, each ([Classification 2] = ".REFERRAL ")),
// This is the table we want to join the next steps to later in the process
#"Table of Referrals (Select Columns)" = Table.SelectColumns(#"Filtered Rows",{"Logged On", "Client Number", "Client Name", "Classification 2", "Staff Member"}),
// These two columns are chossen to reduce the data load as we just need to work with information to remove the duplicates
#"Select Columns" = Table.SelectColumns(#"Table of Referrals (Select Columns)", {"Logged On", "Client Number"} ),
// This counts the number of enteries for each client made on the same date. Once this is complted any record that is > 1 is a same day duplicate.
#"Grouped Rows" = Table.Group(#"Select Columns", {"Logged On", "Client Number"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
// By removing >1 this will remove all same date duplicate records
#"Filtered Duplicate Entries" = Table.SelectRows(#"Grouped Rows", each ([Count] = 1)),
// Removed the column as not needed once the duplicates entered on the same day have been removed
#"Removed Count" = Table.RemoveColumns(#"Filtered Duplicate Entries",{"Count"}),
// By grouping by client number, this enables a count of the client numbers and then anything = 1 can be removed to leave just the final duplicates on different dates.
#"Client Number Count (Grouped Rows)" = Table.Group(#"Removed Count", {"Client Number"}, {{"Client Count", each Table.RowCount(_), Int64.Type}, {"Table", each _, type table [Logged On=nullable date, Client Number=nullable number]}}),
// Removed the non duplicated clients
#"Kept Duplicate Clients (Filtered Rows)" = Table.SelectRows(#"Client Number Count (Grouped Rows)", each ([Client Count] <> 1)),
// Brought back the dates to then find time between each date to then remove duplicates from definate repeat referrals.
#"Expanded Table" = Table.ExpandTableColumn(#"Kept Duplicate Clients (Filtered Rows)", "Table", {"Logged On"}, {"Logged On"}),
// This column is no longer needed
//
#"Removed Count Column" = Table.RemoveColumns(#"Expanded Table",{"Client Count"})
in
#"Removed Count Column"
This is part of the problem of cleaning up the source data before I can take it into power bi to build reports from. I also need to merge this query with the demographic data so that if clients do come back twice it brings them in this query with a column to indicate a repeat referral.
At some point once I have removed the duplicates I will need to join this back to the source table so that all the duplicates are removed and I am left with a list of clients who have genuine re-referrals in the future.
Thank you everyone for your help, it's genuinely appreciated as I learn how to use Power BI to its full capacity.
Hello @DemoFour
First i group by client applying a function that shows all rows grouped.
The add-duration step transform this table, created by the group-function. It adds a new column to this tables, where the days between the current row and the previous row is shown.
Is this what you were looking for?
BR
Jimmy