Solved: Re: Import Entire Worksheet Rather Than UsedRange

JimChisholm · ‎07-20-2022

I have a need when connecting to an external Excel file (via Excel.Workbook) to import the entire worksheet including any leading blank rows or columns.
The default behavior is to return only the UsedRange (save to CSV behaves the same way).
For example, if rows 1-6 and columns A-D are blank in the source worksheet the UsedRange will start at cell E7 which will be loaded to row1:Column1 in the PQ editor. The blank rows and columns - areas outside the UsedRange - are ignored.
I need cell E7 data to appear at row7:Column5.
My workaround is to set the Style of A1 in the source worksheet to something other than Normal. That will define the top left corner of the UsedRange to be A1 and the blank rows and columns will be imported.
But, I'd rather not have to do that.
Thanks for any insight.

jbwtp · ‎07-25-2022

Hi @JimChisholm,

Try this one. I've optimised the performance by truncating unnecessary checks after the first header match is found, also limiting the number of rows to check for the header (in this case top 100 rows, which seems reasonably high). If you sure this number can be less than 100, change it to what you think is most appropriate. The smaller the number the quicker the code.

let
    Source = Excel.Workbook(File.Contents("C:\...\Downloads\Test.xlsx"), null, true),
    Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],

    fTrimTable = (tbl as table, header as text) =>
        let 
            t = Table.Buffer(tbl),
            columns = List.Buffer(Table.ColumnNames(t)),
            rowsToCheck = 100,
            Column = List.Select(columns, each List.PositionOf(List.FirstN(Table.Column(t, _),rowsToCheck), header)>0){0},
            Row = List.PositionOf(Table.Column(t, Column), header),
            ScrollRows = Table.RemoveFirstN (t, Row),
            ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(columns, List.PositionOf(columns, Column))),
            #"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
        in
            #"Promoted Headers",
    
    Trimmed = fTrimTable(Sheet1_Sheet, "Header100")
in
   Trimmed

I've tested it on a file with 600 columns and 2000 rows (about 12Mb), on my fairly old laptop it takes about 10-15 sec to refresh, much better then the older version.

For records: the problem with the old code was that (a) it was loading the entire Excel file each time for each column checked for the position of the header, and (b) checking position of the header in the list of thousands of itemsis not quick, when it is not on hte list. Both not a big problem if the colunm is found quickly (ideal case it is in the A1 cell), but very quickly become one as the number of columns to check overgrow 3-5, in my testing for searhing of "Header100" - the 100th column - it took forever and I had to break the cyle. Optimised version does not have any of the above problems.

Kind regards,

John

View solution in original post

jbwtp · ‎07-24-2022

Hi @JimChisholm,

Would this be a help in resolving this problem?

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("bU67DsIwDPyXzh2wE8f2zsDOWHUIohIzgv/HF9QoqEiR5dzLtyzTNB/eOu/wZav37cn/2C9FsV3ft8fxw12T+pb7JnvMub4qfGQgNVvMYh4zUYnJgQ9SboSASLAZwplOSCAalSCMESoOfW4ub4h6GaWATDATaz/gjBZZf6SNQEdXVLHWVGGyKDEI0Z0UrAiKOiqqAGBcXz8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"(blank)" = _t, #"(blank).1" = _t, #"(blank).2" = _t, #"(blank).3" = _t, #"(blank).4" = _t, #"(blank).5" = _t, #"(blank).6" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"(blank)", type text}, {"(blank).1", type text}, {"(blank).2", type text}, {"(blank).3", type text}, {"(blank).4", type text}, {"(blank).5", type text}, {"(blank).6", type text}}),
    
    fTrimTable = (t as table, header as text)=>
        let 
            FindFirstCell = List.Accumulate(Table.ColumnNames(t), null, (a, n)=> if List.PositionOf(Table.Column(t, n), header) > 0  then [Row= List.PositionOf(Table.Column(t, n), "Header1"), Column = n] else a),
            ScrollRows = Table.RemoveFirstN (t, FindFirstCell[Row]),
            ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(Table.ColumnNames(ScrollRows), List.PositionOf(Table.ColumnNames(ScrollRows), FindFirstCell[Column]))),
            #"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
        in
            #"Promoted Headers",
    
    Output = fTrimTable(#"Changed Type", "Header1")
in Output

The code searches for the left top corner of the data regardless of the spreadsheet structure (as long as the column name is unique, but can be re-written to take just a first appearance).

Kind regards,

John

JimChisholm · ‎07-25-2022

@jbwtp

Hi John,

Using this approach would require I get header names from the user - that of the top left data column and the index column. Easy enough to do.

However, it takes nearly 3 minutes to complete this query on a 450 column by 1,450 row worksheet.

Directly deleting rows and columns based on user-provided input is nearly instantaneous so I'll have to stick with that approach for now.

Thanks for providing this approach. I'll certainly add it to my toolbox.

Jim.

jbwtp · ‎07-25-2022

Hi @JimChisholm,

Try this one. I've optimised the performance by truncating unnecessary checks after the first header match is found, also limiting the number of rows to check for the header (in this case top 100 rows, which seems reasonably high). If you sure this number can be less than 100, change it to what you think is most appropriate. The smaller the number the quicker the code.

let
    Source = Excel.Workbook(File.Contents("C:\...\Downloads\Test.xlsx"), null, true),
    Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],

    fTrimTable = (tbl as table, header as text) =>
        let 
            t = Table.Buffer(tbl),
            columns = List.Buffer(Table.ColumnNames(t)),
            rowsToCheck = 100,
            Column = List.Select(columns, each List.PositionOf(List.FirstN(Table.Column(t, _),rowsToCheck), header)>0){0},
            Row = List.PositionOf(Table.Column(t, Column), header),
            ScrollRows = Table.RemoveFirstN (t, Row),
            ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(columns, List.PositionOf(columns, Column))),
            #"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
        in
            #"Promoted Headers",
    
    Trimmed = fTrimTable(Sheet1_Sheet, "Header100")
in
   Trimmed

I've tested it on a file with 600 columns and 2000 rows (about 12Mb), on my fairly old laptop it takes about 10-15 sec to refresh, much better then the older version.

For records: the problem with the old code was that (a) it was loading the entire Excel file each time for each column checked for the position of the header, and (b) checking position of the header in the list of thousands of itemsis not quick, when it is not on hte list. Both not a big problem if the colunm is found quickly (ideal case it is in the A1 cell), but very quickly become one as the number of columns to check overgrow 3-5, in my testing for searhing of "Header100" - the 100th column - it took forever and I had to break the cyle. Optimised version does not have any of the above problems.

Kind regards,

John

JimChisholm · ‎07-26-2022

@jbwtp

Success!
I tried it on my notoriously badly-structured dataset and performance is no longer an issue.
The user will now have to supply only one piece of information - the name of the top left header.
I'll put the max row check number in a variable on a configuration sheet.
Thanks for your solution John. I think this is a good general purpose approach that should find wide use.

Jim.

jbwtp · ‎07-20-2022

Hi @JimChisholm,

I could not find a way to re-configure the settings to take the entire spreadsheet inside PQ, but you can use a workaround by placing a character in the most right-bottom cell of the desired range. In my testing I just put it in the most right-bottom cell [which is XDF1048576] to fetch the entire spreadsheet and it worked.

Kind regrds,

John

JimChisholm · ‎07-21-2022

Hi @jbwtp ,

The UsedRange is defined by the top left and the bottom right cells.

If you start with a UsedRange of $E$7:$G$13 (columns A-D are empty; rows 1-6 are empty) then placing a character (or changing the Style) in the bottom right-most cell expands the UsedRange to $E$7:$XFD$1048576.
The top left corner is not affected.
In the case of a UsedRange of $E$7:$G$13, the bottom right corner ($G$13) of the UsedRange is correct - it's the empty top rows and left columns I need.

So, other than manually manipulating the UsedRange to define the area to be imported - I don't see a way to import the entire worksheet.
Jim.

jbwtp · ‎07-21-2022

Hi @JimChisholm,

I think you are right, but maybe it is better to start from the other end of the story: why do you need those empty lines/cells? What are you trying to achieve?

Thanks,

John

JimChisholm · ‎07-22-2022

Hi @jbwtp ,

I have a generic file translation process that takes the following file structure parameters from the user:

Header row number
Index column number

This allows the process to handle all the messy worksheets that users create (leading blank rows and columns, multi-row headers, odd-spellings and line feeds) that I have no control over.

If they can at least look at the target worksheet and tell me two simple things about the block of data I'll have all the information I need to process the file - if PQ actually ignored the UsedRange and imported the entire sheet.

Since PQ won't do that I'll ask the user for two more parameters -

Number of leading blank rows
Number of leading blank columns

Then I'll calculate the following to get the as-imported parameters needed for processing -

Number of Rows to Delete =
Header Row Number - Number of Blank Rows - 1
Index Column =
Index Column Number - Number of Blank Columns

Import Entire Worksheet Rather Than UsedRange

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024