Solved: Re: Help needed: Remove duplicates - keep newest d...

IPGeorgiev · ‎04-03-2020

Hi All,

I have the following issue:

I have a folder where excel files with raw data are being uploaded. Each file has data which is included also the other files - Here how it works:

FileNumberOne29.03.2020 - has data for the period 24.03 - 28.03

FileNumberOne30.03.2020 - has data for the period 25.03 - 29.03

and so on...

Here a snapshot with dummy data to illustrate how it looks like:

First file has:

Second file has:

When , in PowerQuery, I select the columns in Yellow(all at once) and then click Remove duplicates and click close and apply I will get the following result:

All good here.. Positive feedback value update to 2 from 1 for the first 2 rows from 28.03. HOWEVER - the row in orange was a part of the file for 28.3, but not in the file for 29.3. Therefore this row is unique and is being added , however what I need is to make sure that if a row has a duplicate based on the columns in yellow it should be removed and keep only the newest data.

If a row is present in a previous version, but not in the newest one it should be removed (in this case the orange row) - please help me!!!!

Thanks!

Best regards,
Ivan

ImkeF · ‎04-07-2020

Hi @IPGeorgiev

please check the enclosed file.

I've faked the "from-folder"-experience so that I could use the combine-binaries-technique.

I've extracted the content from the newest file first then filtered out the first file from the from-folder-table before expanding its "Content"-column out. That generated all the other queries that you see in the file.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

View solution in original post

Greg_Deckler · ‎04-03-2020

So it sounds like you need to Sort the data first and then remove duplicates. I seem to recall that a Table.Buffer step is needed in order to make the sorting "stick". @ImkeF and @edhans can probably be more specific.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

ImkeF · ‎04-03-2020

Hi @IPGeorgiev

not sure I fullly understand your request, but please de-select the Date-column from the yellow columns and check if that delivers what you want.

If not, please exactly explain what you mean with your last sentence. "If a row is present in a previous version, but not in the newest one it should be removed (in this case the orange row)... " If that's the case, you just have to keep the newest date.

If you have to sort your data before removing dups, make sure to use a buffer to be on the safe side: https://community.powerbi.com/t5/Community-Blog/Bug-warning-for-Table-Sort-and-removing-duplicates-i...

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-04-2020

Hi @ImkeF ,

I really appreciate your support on this one! thanks a lot!

1. I have tried with time date - removing it makes thing worse actually.

"If a row is present in a previous version, but not in the newest one it should be removed (in this case the orange row)... " - I mean the following:

As a data source I have selected a "FOLDER" in this Folder files with raw data are uploaded daily - each file has 7 days rolling data - meaning the data for the previous 6 days should be overwritten and the completely new data is for the previous day only. HOWEVER there might be changes for the last 6 days as well thats why when a new file is uploaded to the Folder we need to get the data from the new file and completely overwrite the data for the last 6 days prior to the previous day:

For example:

In file called - RAW1 - we have data for 25.03 - 31.03 - file uploaded in the folder on 01.04

In file called - RAW2 - we have data fore 26.03 -01.04 - file uplaoded in the folder on 02.04

In file called - RAW3 - we have data for 27.03 - 02.04 - file uploaded in the folder on 03.04

So in this case the data for 25.03 will remain from RAW1

The data for 26.03 will remain from the file RAW2.

The data for the period 27.03-02.04 will remain from RAW3

and so on..

No in my case I have one row in one of the files for a specific date which , when receiving the newer file is no longer there and this way it remain since it has no duplicate (the orange row in my first example).

Is there a way to achieve that what I am aiming for?

Thanks!

Best regards,
Ivan

ImkeF · ‎04-04-2020

Sorry @IPGeorgiev

but this is now clear to me.

Please post sample data with before and after that match exactly what you're after and that contain all relevant cases.

Make sure to follow these guidelines: https://community.powerbi.com/t5/Community-Blog/How-to-provide-sample-data-in-the-Power-BI-Forum/ba-...

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-04-2020

Hi @ImkeF ,

thanks a lot for you time!

The data below is an example - not real data as the real is confidential, however this data fully represents the issue:

So the data is being uplaoded in .xlsx format in a folder called Test/DailyQuality

the Power BI report is getting data from this folder.

On 31.03.2020 the file with data up to 30.03.2020 is uploaded. Here we will pretend like we have data for just 1 employee, where in reallity there are hundreds of employees. The sheet is called 31.03.2020

Then on the next day (01.04.2020) we upload also the newest file with data up to 31.03.2020 - Sheet called 01.04.2020.

When marking some of the columns (Those in Yellow in the previous post) and I click on remove duplicates I end up with the result in the sheet "Current Output"

Where the desired output is in the sheet DesiredOutput

Link to the file: https://drive.google.com/file/d/1413E47UNahdpLaPLa7fCdPvopp-zB7Cb/view?usp=sharing

Many thanks in advance!!

Best regards,

Ivan

ImkeF · ‎04-05-2020

Thanks, think I understand now.

Before expanding the file contents, I'd "split up the From-folder-table":

1) Deselect latest file and select only those rows where the Date.From([Upload Date]) - Date.From([Time Date]) = #duration(7,0,0,0).

Alternatively you could determine the earliest date from each file and keep just those rows.

2) Append 1 to the complete content of the latest file. With this method, you shouldn't have to remove duplicates at all, if my understanding is correct.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-05-2020

Hi @ImkeF ,

Many many thanks for the support! Solving this issue would really mean a lot to me!!

I know that this might be a stupid question, but would it be possible to be more specific on how to execute those two steps, I am afraid that my knowledge doesnt let me do it 😞

Before expanding the file content I have the following columns - Content, Source.Name, Extension, Date accessed, Date Modified, Date Created, Attributes, Folder path, Transform File, Transform File (2)

Many thanks in advance!

P.S.

Alternatively you could determine the earliest date from each file and keep just those rows.

Is it possible to - keep only the earliest date from all files, but the last one (last created)? This will also surely solve this problem.

Thanks a lot , I really appreciate your time and support!

Best regards,

Ivan

ImkeF · ‎04-07-2020

Hi @IPGeorgiev

please check the enclosed file.

I've faked the "from-folder"-experience so that I could use the combine-binaries-technique.

I've extracted the content from the newest file first then filtered out the first file from the from-folder-table before expanding its "Content"-column out. That generated all the other queries that you see in the file.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-07-2020

Hi @ImkeF ,

many thank for the response and the sample file! Appreciate it!

I have tried to reproduce it with the following code:

However then I get this:

I am definately doing something wrong, however I cant figure it out 😞

Thanks a lot!

Best regards,

Ivan

ImkeF · ‎04-07-2020

Hi @IPGeorgiev ,

looks like you have a non-csv-file in your folder.

Maybe you have to filter on filtype = "csv" just at the beginning. Then make sure that this is the reference step instead of "FakeFromFolder".

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-08-2020

Hi @ImkeF ,

thanks a lot for the fast reply!

Yes - all of the files (the raw data) is in xlsx format. I have converted the files to .csv and the preview is correct, however I have new issues 😞 :

1. OLE DB or ODBC error: [Expression.Error] The key didn't match any rows in the table.. - I did some research on the new and found several reasons why this error might occur, however

2.OLE DB or ODBC error: [DataFormat.Error] External table is not in the expected format.. - here I have tried to fix the format of each column to meet the actual data under the columnd + checked the source files and didnt found any unsual values.

In general - is it possible to use the same approach but use the xlsx files instead and not to convert them to csv?

Many thanks!

Best regards,
Ivan

ImkeF · ‎04-08-2020

Hi @IPGeorgiev

I see no reason at all why this shouldn't work with xlsx-file.

I'd suggest to start new from scratch: Import your xlsx-files from folder and check if all data comes in correctly before applying any further transformations.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-08-2020

Hi @ImkeF ,

I will try it - however are you able to advise how I should modify the code so that it will work with xlsx files?

And especially this row:

GetContentFromFirstFile = Csv.Document(GetFirstFile,[Delimiter=",", Columns=20, Encoding=1252, QuoteStyle=QuoteStyle.None]),

Many thanks!!

Best regards,
Ivan

ImkeF · ‎04-08-2020

Hi @IPGeorgiev ,

if you start from scratch and use the combine binaries, the matching formulas for the xlsx should automatically be applied.

If I find the time, I will create a video at the weekend that describes and explains all the steps for your example.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

IPGeorgiev · ‎04-08-2020

Hi @ImkeF ,

Many thanks! A video would be just awesome!

I have tried to reproduce the whole process and I got this code:

and also another version:

However I guess that something is wrong here since this doesnt give the desired result - it basically removes the newest file and keeps the rest with duplicate values 😞

I think that my error is in the first row and apparently doing this:

Date.From([Upload Date]) - Date.From([Time Date]) = #duration(7,0,0,0).

Thanks in advance!!!

Best regards,
Ivan