Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
JeffWeir
Advocate IV
Advocate IV

Multilple recursive Inner and Anti Joins for robust fuzzy date matching - Can List.Accumulate help?

I’m trying to reconcile data from two different entities regarding movement of shipping containers, by doing a bulletproof fuzzy match on dates.  One is a shipping port, the other is a railway.

 

Here's a simplified sample file illustrating what I'm trying to do:

Match to closest day_20190722 buffer Revised.xlsx

 

I work at the port, and I want to ensure that the railway are charging us for the correct amout of containers they are moving to/from the port. But the dates in the two systems don’t always match for the reasons outlined below:

  • Containers have a unique ID and a direction of travel common to both systems…either RAIL IN or RAIL OUT.
  • The Port system records when containers enter or leave the port, on their way to/from the Rail depot.
  • The Rail system records when the containers are actually put on a train and railed somewhere. (Either inwards, or outwards as containers are sent to customers who load or unload them as the case may be, and then they go to the port so they can be loaded onto ships (if full) or stored at the port (if empty).
  • The containers might sit at the rail depot for minutes, days, or weeks. And they might make multiple outward or inward trips in a matter of days. (e.g. out, then in, then out, then in again).

I need to do a "Fuzzy Date match" in order to reconcile one system to the other, that doesn’t inadvertently double count when containers are moved rapidly back and forth between the port and the customer multiple times. So I need to progressively increase the 'mismatch' tolerence on those dates, and remove matches at each pass leaving just the unmatched rows to do increasingly desperate matches on. (If I don't remove the matches at each subsequent pass, then inevitably some container movements get incorrectly matched to their next similar move).

 

I have a hard-coded query that does this works fine, but it is cumbersome. I wonder if it is possible to use something like List.Accumulate to make the query dynamic. But I have 2 seed tables, and need three output tables, so it might be a pipe dream.

 

Here's a screenshot of sample input and output. I've color coded the matching records from each table to the 'Matched Records' table below. (Edit: I've revised this as per reply to @ImkeF below)

 

Revised output.PNG

 

My manual query performs the following steps:

  1. Loads data from Rail and from Port, and does an inner join.
  2. Effectively removes these matches from the original Rail and Port tables via an anti-join against each table.
  3. Does an inner join on the remaining records, but this time offsets the dates in the Port table by +/- 1 day. This finds some more matches.
  4. Performs steps 2 and 3 over and over, increasing the date offset by 1 each time until I have captured all date mismatches up to a week.

Here’s how that looks in the query dependency view:

Dependencies.PNG

 

And here's how I do the 'Fuzzy Date Match": At each 'pass', my Day X Matched step takes the unmatched records from the two tables at the end of the previous Day X Matched step, and creates a 'Date Offset' column using  this general approach:

= Table.AddColumn(Source, "Table2.Date", each List.Dates(Date.AddDays([Date],-X),2,#duration(2X,0,0,0)))

 

That X bit successively pads out the date in one table so that it will match dates 1, 2, ..., X days before or after.

 

All this requires a lot of different Queries. Here's the amount of queries required just to handle 5 days' worth of date offsets:

Queries.PNG

 

What I would like to do is use List.Accumulate or similar to make this dynamic. Here’s how that would look:

Recursion.PNG

 

Uwe from the data-insights blog has a fantastic 3 part series on recursion in PQ, including great references to the usual suspects (Chris Webb, @ImkeF , and others). This shows some great uses of List.Accumulate and List.Generate. But working out whether these functions are flexible enough to do the above is beyond me.

 

Anybody feeling brave enough to take a stab, or even give me their thoughts as to whether this is possible?

Match to closest day_20190722 buffer Revised.xlsx

 

9 REPLIES 9
ImkeF
Super User
Super User

I think a "simpler" group and sort would do the job here as well: 

1) Group by "ID" and "Direction"

2) Sort by "Date" and add Index column

3) Merge on ID, Direction and new Index column in FullOuter-mode: Matches and non-matches will be shown

 

See attached file

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hi @ImkeF 

 

Unfortunately that approach isn’t robust under all circumstances, as it will always match an entry in one table with a previous entry in the other table even if a closer match is available. And this is a very real possibility in this dataset.

 

Revised sample file and screenshot: Match to closest day_20190722 buffer Imke.xlsx

 

Capture.PNG

Hi @JeffWeir ,

sorry, I lost track of this thread a bit. 

Is this still an issue?

 

Cheers, Imke

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

My hard-coded query works fine, but I'm trying to find the time to see if I can learn how to write a dynamic function where I can specify the number of days to check either side.  At the moment I progressively  match the tables on larger values of a 'tolerance' factor X using this condition:

Table1.Date = Table2.Date +/- X . 

 

And I'm effectiviely running this in a loop for values of X between 1 to 5, and removing successful matches on each pass leaving just unsuccessful ones. This lets me progressively increase the 'mismatch' tolerence on those dates, and remove matches at each pass leaving just the unmatched rows to do increasingly desperate matches on.

 

My  query performs the following steps:

  1. Loads data from Table1 and Table2, and does an inner join on unique ID and Date
  2. Effectively removes these matches from Table1 and Table2 via an anti-join against each referenced table.
  3. Does an inner join on the remaining records, but this time offsets the dates in the Port table by +/- 1 day. This finds some more matches.
  4. Performs steps 2 and 3 over and over, increasing the date offset by 1 each time until I have captured all date mismatches up to a week.

 

The query dependency tree that results looks like so:

 

Dependencies.PNG

 

 

All that takes a lot of setup. I really need to work out if this can be turned into a function that is a) dynamic and b) efficient.

 

I still mean to take a crack at it, but it might be beyond my beginning/intermediate M.

 

For what it's worth, here's my current manual approach.

 

Match to closest day_20190722 buffer3 Revised.xlsx

Hi @JeffWeir

although a dynamic recursive approach (using List.Accumulate or List.Generate) would work here, that would still be a considerable amount of code (and work for me) and I'm not sure if it would perform faster than the following approach:

 

let
    Source = Table1,
    AllowedRange = Table.AddColumn(Source, "Days", each {-5..5}),
    ExpandAllowedRange = Table.ExpandListColumn(AllowedRange, "Days"),
    DaysAbsoluteFigures = Table.AddColumn(ExpandAllowedRange, "AbsDays", each Number.Abs([Days])),
    AllowedDates = Table.AddColumn(DaysAbsoluteFigures, "Dates", each Date.AddDays([Date], [Days])),
    #"Changed Type" = Table.TransformColumnTypes(AllowedDates,{{"Dates", type date}}),
    MergeTable2 = Table.NestedJoin(#"Changed Type", {"Dates", "ID", "Direction"}, Table2, {"Date", "ID", "Direction"}, "Table2", JoinKind.Inner),
    Cleanup = Table.RemoveColumns(MergeTable2,{"Table2", "Dates"}),
    #"Sorted Rows" = Table.Buffer(Table.Sort(Cleanup,{{"AbsDays", Order.Ascending}})),
    #"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"Date", "ID", "Direction"})
in
    #"Removed Duplicates"

This produces the "Matched Records"-table. Just do some Anti-Joins to retrieve the unmatched tables from there.

If performance is too bad, please come back.

 

 

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

@ImkeFThat Sorted Rows > Remove Duplicates step is clever.

 

I'll tweak my code to use this approach and see if there is a performance improvement. I still need to do lots of different passes, because i want this to be a 'progressive' match, so it may not simplify the query chain overly. But it will be interesting to see if there's any performance difference. Thanks 🙂

Hi @JeffWeir 

this apprach was supposed to replace your whole chain. I've tweaked to code a little bit to come up with the correct Table2, but now it delivers all you need. No need at all to go through all these iterations. (results are shown in row 42 -57 of  Sheet1)

 

Please check the results in the attached file. 

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hi @JeffWeir 

have you had a chance to test this out on your real data?

I'd be interested to hear how this scales on a large dataset.

 

Please mark my answer as solution if it solved your case - thanks.

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Sorry @ImkeF I've been away. I'll take a look later this week and post back.

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors