Solved: Handle directories of CSVs where columns change sl...

huntingtonog · ‎01-05-2022

Hello everyone,

I have a Sharepoint site as a Source in a Dataflow. Getting refresh errors due to the typical:

Error: Expression.Error: The column 'XXXXXX' of the table wasn't found. Request ID: <>

Main query:

let
  Source = SharePoint.Files("https://XXX.sharepoint.com/sites/Data", [ApiVersion = 15]),
  #"Filtered rows" = Table.SelectRows(Source, each Text.Contains([Name], "_Daily_20")),
  #"Sorted rows" = Table.Sort(#"Filtered rows", {{"Name", Order.Ascending}}),
  #"Filtered hidden files" = Table.SelectRows(#"Sorted rows", each [Attributes]?[Hidden]? <> true),
  #"Invoke custom function" = Table.AddColumn(#"Filtered hidden files", "Transform file", each #"Transform file"([Content])),
  #"Renamed columns" = Table.RenameColumns(#"Invoke custom function", {{"Name", "Source.Name"}}),
  #"Removed other columns" = Table.SelectColumns(#"Renamed columns", {"Source.Name", "Transform file"}),
  #"Expanded table column" = Table.ExpandTableColumn(#"Removed other columns", "Transform file", Table.ColumnNames(#"Transform file"(#"Sample file"))),
  #"Changed column type" = Table.TransformColumnTypes(#"Expanded table column", {{"Source.Name", type text}, {"OperatorID", Int64.Type}, {"Operator", type text}, {"Choke", Int64.Type}, {" Reason", type text}, {"WI", type number}}),
  #"Duplicated column" = Table.DuplicateColumn(#"Removed columns", "Source.Name", "Source.Name - Copy"),
  #"Renamed columns 1" = Table.RenameColumns(#"Duplicated column", {{"Source.Name - Copy", "Report Date"}}),
  #"Split column by positions" = Table.SplitColumn(#"Renamed columns 1", "Report Date", Splitter.SplitTextByPositions({39}), {"Report Date.1"}),
  #"Split column by position" = Table.SplitColumn(#"Split column by positions", "Report Date.1", Splitter.SplitTextByPositions({0, 8}, false), {"Report Date.2", "Report Date.3"}),
  #"Removed columns 1" = Table.RemoveColumns(#"Split column by position", {"Report Date.3"}),
  #"Changed column type 2" = Table.TransformColumnTypes(#"Removed columns 1", {{"Report Date.2", type date}}),
  #"Renamed columns 2" = Table.RenameColumns(#"Changed column type 2", {{"Report Date.2", "Report Date"}}),
  #"Renamed columns 3" = Table.RenameColumns(#"Renamed columns 2", {{" Reason", "Reason"}})
in
  #"Renamed columns 3"

Edit to include "Transform file" M query:

let
  Source = (Parameter as binary) => let
  Source = Csv.Document(Parameter, [Delimiter = ",", Encoding = 1252]),
  #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"OperatorID", Int64.Type}, {"Operator", type text}, {"Choke", Int64.Type}, {" Reason", type text}, {"WI", type number}})
in
  #"Changed column type"
in
  Source

Edit to include "Transform Sample File" M query:

let
  Source = Csv.Document(Parameter, [Delimiter = ",", Encoding = 1252]),
  #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"OperatorID", Int64.Type}, {"Operator", type text}, {"Operator Lease ID", Int64.Type}, {"Choke", Int64.Type}, {" Reason", type text}, {"WI", type number}})
in
  #"Changed column type"

I connect to the Sharepoint site then filter using a prefix that gets me to all of the CSV files within a certain directory. Issue is caused by the column schema occasionally changing. Sometimes the WI column drops from the daily csv report we get. Sometimes it's there.

What's the best tactic for handling this? Do I need to have two queries that merge... one that handles filtering down to one column schema realization and one that filters down to the other, then merge them? How do I filter down to a certain column realization when the query seems to break at Source->Expand Binary?

jennratten · ‎01-05-2022

You should be able to go to duplicate the main query, right-click the step after the invoke custom function step and choose delete until end. That will remove all of the steps that follow. Then select all columns of the table and Keep Rows > Rows with Errors. That should then give you a shorter list of the files that have errors, which are likely missing a column that has been referenced. You can use this as a list of errors to debug and also on the primary query, go to the invoke custom function step and insert a step to remove rows with errors. This essentially allows your primary query to continue to function while problematic files and funneled out for debugging and tracked.

View solution in original post

jennratten · ‎01-05-2022

Hello! It looks like you have included the script for the query that combines the files. Can you please post the script for the transformation function? It is being invoked on line 4 of the script (snip below). You can also check the script for the transformation function, specifically looking at the line for Csv.Document and see if the column number argument is specified. If it is, you can remove it. It is optional and it may be the reason your WI column is sometimes not present as it appears to be the last column in the source file.

 #"Invoke custom function" = Table.AddColumn(#"Filtered hidden files", "Transform file", each #"Transform file"([Content])),

Aside from what could be found in the transformation function, I find it helpful to avoid specifying column names in the script when possible.

huntingtonog · ‎01-05-2022

I edited the original post to include the queries you requested.

jennratten · ‎01-05-2022

It looks like the number of columns argument is not included, so that wouldn't be the issue. Can you use Excel or a text editor to open one of the CSV files in which column WI was dropped, and verify that it is actually included in the source file?

huntingtonog · ‎01-05-2022

There's hundreds of CSV files, I'm guessing 90% of them have the WI column and 10% don't. Too many to manually inspect though. Trying to figure out a way to handle without having to do that.

jennratten · ‎01-05-2022

You should be able to go to duplicate the main query, right-click the step after the invoke custom function step and choose delete until end. That will remove all of the steps that follow. Then select all columns of the table and Keep Rows > Rows with Errors. That should then give you a shorter list of the files that have errors, which are likely missing a column that has been referenced. You can use this as a list of errors to debug and also on the primary query, go to the invoke custom function step and insert a step to remove rows with errors. This essentially allows your primary query to continue to function while problematic files and funneled out for debugging and tracked.

Handle directories of CSVs where columns change slightly

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024