Hello Power Query community,
I am trying to create a dashboard based on some files that are emailed to us by another department. There are two issues I am unable to solve, which possibly require M coding. I am keen to explore what is possible with Power Query. I am using Excel 365 and the free version of Power BI Desktop.
Let's suppose there are three input files.
file #1 - filename = "Report_20200429.xlsx"
representation table
start_date | variable1 | variable2 | variable3
26-Apr-20  |           |           |
27-Apr-20  |           |           |
28-Apr-20  |           |           |
29-Apr-20  |           |           |
file #2- filename = "Report_20200430.xlsx"
representation table
start_date | variable1 | variable2 | variable3
27-Apr-20  |           |           |
28-Apr-20  |           |           |
29-Apr-20  |           |           |
30-Apr-20  |           |           |
file #3- filename = "Report_20200501.xlsx"
representation table
start_date | variable1 | variable4 | variable3 | variable2
28-Apr-20  |           |           |           |
29-Apr-20  |           |           |           |
30-Apr-20  |           |           |           |
01-May-20  |           |           |           |
expected output after appending 3 files
start_date | variable1 | variable4 | variable3 | variable2
26-Apr-20  |           |           |           |
27-Apr-20  |           |           |           |
28-Apr-20  |           |           |           |
29-Apr-20  |           |           |           |
30-Apr-20  |           |           |           |
01-May-20  |           |           |           |
As you may have noticed, the input files contain rolling data for the last 4 days, since the full dataset is huge. Each date here actually has close to 100,000+ rows of data.
As of now, I manually maintain a separate sheet that consists of the unique dates shown in the output table. Whenever the most recent file is received, I delete the previous 3 days' data and replace it with the current 4 days' data. One of the main reasons to consider the last 4 days is that data may have been overwritten, so the most current version must be used. Another big problem is that a new column (variable4) was added from 1st May onwards and must be included in the appended table.
To summarise my questions:
1) How do we append new data to an existing query and delete the older dates' data? (One solution I am considering is generating a unique index key for each file and, where dates are duplicated, deleting the rows with the smaller keys?) This is being done manually right now, but I would rather have Power Query do it.
2) While appending two tables with different structures, how does one ensure additional columns are also added?
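On question 2, it may help to see how Table.Combine behaves with mismatched structures: it takes the union of column names and fills missing cells with null. A minimal sketch, with illustrative column names modelled on the files above:

```m
// Two tiny tables with different column sets; Table.Combine unions the
// columns and pads missing values with null.
let
    Old = #table(type table [start_date = date, variable1 = number],
        {{#date(2020, 4, 29), 10}}),
    New = #table(type table [start_date = date, variable1 = number, variable4 = number],
        {{#date(2020, 4, 30), 20, 99}}),
    Combined = Table.Combine({Old, New})
    // Combined has columns start_date, variable1, variable4;
    // the 29-Apr row has null under variable4
in
    Combined
```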
Hi @Anonymous ,
I think something like this should work - you may need to adjust it for your needs:
The idea in brief:
1. Start from the LATEST file and work backwards through the stream (this ensures new data overwrites old data where it has changed)
2. Do an anti join/merge against each next file/input to get only the rows missing from the combined table
3. Combine the join output with the combined table
4. Repeat with the next file.
On large files this may take quite a while to run, so I would suggest considering some form of "incremental" refresh to speed it up (so that only the "new" file is processed).
The combiner function:
(existing as table, new as table)=>
let
    Source = existing,
    // right anti join: keep only the rows of `new` whose Date is not already in `existing`
    #"Merged Queries" = Table.NestedJoin(Source, {"Date"}, new, {"Date"}, "join", JoinKind.RightAnti),
    #"Removed Other Columns" = Table.SelectColumns(#"Merged Queries", {"join"}),
    // expand using the column names of `new`, so any extra columns come through
    #"Expanded join" = Table.ExpandTableColumn(#"Removed Other Columns", "join", Table.ColumnNames(new), Table.ColumnNames(new)),
    // append the genuinely new rows to the rows we already have
    #"Appended Query" = Table.Combine({#"Expanded join", Source})
in
    #"Appended Query"
This is how it is called in the main query:
let
    // t3 is the most recent file; it is processed first, so its rows win
    Source = {t3, t2, t1},
    Output = List.Accumulate(Source, #table(type table [Date = date, Value = number, FileName = text], {}), (a, n) => fCombineEx(a, n))
in
    Output
These are the t1-t3 sample tables; each has the columns Date, Value and FileName.
New columns are added automatically.
Kind regards,
John
@Anonymous
I think the main problem with your report is that each day's file repeats historical data from the past, and you want to delete the duplicates and keep only the incremental data. Can you upload several report files with the sensitive data removed? I'll see if I can solve it and provide an optimisation.
You've asked several questions so let me see if I can tackle them one at a time:
Question 1) how do we append new data to existing query and delete older dates data?
Without having seen your data, I would take this approach: you are importing files, so just select the most recent 3-4 days of files.
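One way to sketch "select the most recent files" in M, assuming the files sit in a folder and the date is embedded in the name as yyyymmdd (as in "Report_20200501.xlsx"); the folder path is a placeholder:

```m
// Keep only the 4 newest report files in the folder, judged by the
// date embedded in the file name.
let
    Source = Folder.Files("C:\Reports"),
    // "Report_" is 7 characters, then 8 date digits
    WithDate = Table.AddColumn(Source, "FileDate",
        each Date.FromText(Text.Range([Name], 7, 8), [Format = "yyyyMMdd"])),
    Sorted = Table.Sort(WithDate, {{"FileDate", Order.Descending}}),
    Latest = Table.FirstN(Sorted, 4)
in
    Latest
```

The rest of the import (combining the workbook contents) then runs only over `Latest`.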
Question 2) while appending two tables of different structures, how does one ensure additional columns are also added?
This should happen automatically when you import using the Combine operation above. To validate this, I went into one of the files and added a [Marketing Junk] column, saved it, then went back into Power Query and simply told it to refresh. This was the result:
For the Bike file, the new column is there with the 'x' character I put in for testing. For the Clothing and other files, the column is there, but null.
At this point, if you want to further transform by removing columns, I strongly recommend you select the columns you want and then use the Remove Other Columns functionality. The reason is: if you have a column today called [ID] and you remove it explicitly, then next week, if the [ID] column is gone from your last 3 files, Power Query will break because there is no [ID] column to remove. But if you select every column except [ID] and use Remove Other Columns, Power Query generates a Table.SelectColumns() function that selects only what you want. It doesn't care whether [ID] is there or gone.
The flip side of this is that if a new [Valid Data] column arrives in a new file, it will not show up either: the column reaches Power Query, but Table.SelectColumns() will discard it.
In that case, it gets tricky. If your new column names follow a predictable pattern (variable1, variable2, variable3...), then we can wildcard the Table.SelectColumns().
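A minimal sketch of that wildcarding, using a tiny inline table in place of your real previous step: build the column list dynamically with List.Select, so any new variableN column is picked up automatically while junk columns are dropped.

```m
// Keep start_date plus every column whose name starts with "variable",
// however many such columns turn up in future files.
let
    Source = #table({"start_date", "variable1", "variable4", "Marketing Junk"},
        {{#date(2020, 5, 1), 1, 2, "x"}}),
    Keep = List.Select(Table.ColumnNames(Source),
        each _ = "start_date" or Text.StartsWith(_, "variable")),
    Selected = Table.SelectColumns(Source, Keep)
    // Selected has start_date, variable1, variable4; the junk column is gone
in
    Selected
```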
But that is getting more complex. See how this works for you right now, and let me know if I understood your questions properly.
DAX is for Analysis. Power Query is for Data Modeling
Proud to be a Super User!
MCSA: BI Reporting
Thank you so much for the detailed response. The 2nd question is answered, but I would like to explain my 1st question again.
The problem is not simply taking the most recent files. In my current situation, each file contains data for the last 4 days, so every day I get a rolling 4-day report. For instance, today's report (5th May) consists of data from 1st May to 4th May. To maintain a continuously updated dataset (file name = Manual_dataset), I have to manually copy the rows for 1st May to 4th May into a separate sheet. This separate file holds the latest data up to today. So tomorrow (6th May), when I receive the next report, I will have updated data from 2nd to 5th May. I then have to open Manual_dataset and perform these steps:
i) delete the existing data from 2nd to 4th May
ii) copy the data from 2nd to 5th May from the latest report
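Those two manual steps amount to "for each date, keep only the rows from the newest file that contains it". A sketch of that logic in M, assuming a FileName column was kept from the folder import (the data rows below are illustrative):

```m
// When the same start_date appears in several files, keep only the rows
// coming from the newest file. File names sort correctly because the
// date is yyyymmdd.
let
    Source = #table({"start_date", "FileName", "variable1"},
        {
            {#date(2020, 5, 1), "Report_20200501.xlsx", 5},
            {#date(2020, 5, 1), "Report_20200502.xlsx", 7}
        }),
    // for each date, find the newest file name it appears in
    Latest = Table.Group(Source, {"start_date"},
        {{"MaxFile", each List.Max([FileName]), type text}}),
    // keep only the rows that came from that newest file
    Merged = Table.NestedJoin(Source, {"start_date", "FileName"},
        Latest, {"start_date", "MaxFile"}, "keep", JoinKind.Inner),
    Result = Table.RemoveColumns(Merged, {"keep"})
    // only the Report_20200502.xlsx row for 1-May survives
in
    Result
```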
I followed your instructions and encountered the following error.
As you can see, for a single date (1st May), three files are being referenced. I want to remove the data from the older files and keep only the data from the 4th May file.
I hope I have explained my question better. Do let me know if my query is within the realm of what can be achieved with the free version of Power BI Desktop.
Hi @Anonymous
I'm still not sure I fully understand your problem, but here is the approach I would take.
If I'm still not understanding, let me know.
If you need help with implementing the above logic, I'll need sample files to mess with. See links below to facilitate that.
How to get good help fast. Help us help you.
How to Get Your Question Answered Quickly
How to provide sample data in the Power BI Forum
Thank you for the response. However, I do not think this will work, since this process has to be done on a daily basis. I am trying to find ways to make it as non-manual as possible.
The method I proposed is 100% automatic once you know how much data you need.
If you could be really clear with some sample data, as has been requested, we can more easily help. Again, links below for supplying sample data.
Well, the general answer to this is to use a Folder query. Depending on what transformations you do, you may be able to get away with having additional columns in some files, but it is not ideal. Power Query is amazing, but it is not magic; it can't simply *poof* away all manner of awful data. @ImkeF and @edhans might have some ideas.
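For reference, a basic folder query looks something like this; the path is a placeholder, and it assumes each workbook's first sheet holds the data table:

```m
// List every .xlsx file in a folder, pull out the data from each one,
// and append the results into a single table.
let
    Source = Folder.Files("C:\Reports"),
    ExcelOnly = Table.SelectRows(Source, each Text.EndsWith([Name], ".xlsx")),
    // Excel.Workbook parses each binary; true promotes the header row
    WithData = Table.AddColumn(ExcelOnly, "Data",
        each Excel.Workbook([Content], true){0}[Data]),
    Combined = Table.Combine(WithData[Data])
in
    Combined
```

Any file dropped into the folder is then picked up automatically on refresh.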
Thanks for the response Greg. Would you be able to shed some light on Folder Query?