JimChisholm
Frequent Visitor

Import Entire Worksheet Rather Than UsedRange

I have a need when connecting to an external Excel file (via Excel.Workbook) to import the entire worksheet including any leading blank rows or columns.
The default behavior is to return only the UsedRange (save to CSV behaves the same way).   
For example, if rows 1-6 and columns A-D are blank in the source worksheet, the UsedRange will start at cell E7, which will be loaded to Row 1, Column 1 in the PQ editor. The blank rows and columns - the areas outside the UsedRange - are ignored.
I need the cell E7 data to appear at Row 7, Column 5.
My workaround is to set the Style of A1 in the source worksheet to something other than Normal. That will define the top left corner of the UsedRange to be A1 and the blank rows and columns will be imported.
But, I'd rather not have to do that. 
Thanks for any insight.
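
For reference, a minimal query showing the default behaviour (the path and sheet name are placeholders):

```
let
    // Excel.Workbook returns each sheet's UsedRange only;
    // anything outside the UsedRange is dropped on import
    Source = Excel.Workbook(File.Contents("C:\Temp\Test.xlsx"), null, true),
    Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data]
in
    // If the UsedRange starts at E7, cell E7 lands in Row 1, Column 1 here
    Sheet1
```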

1 ACCEPTED SOLUTION

Hi @JimChisholm,

 

Try this one. I've optimised the performance by cutting off unnecessary checks after the first header match is found, and by limiting the number of rows checked for the header (here the top 100 rows, which seems reasonably high). If you are sure the header appears within fewer rows, lower the number - the smaller it is, the quicker the code.

 

 

 

let
    Source = Excel.Workbook(File.Contents("C:\...\Downloads\Test.xlsx"), null, true),
    Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],

    fTrimTable = (tbl as table, header as text) =>
        let 
            // Buffer the table and column names once so the scans below
            // read from memory instead of re-reading the workbook
            t = Table.Buffer(tbl),
            columns = List.Buffer(Table.ColumnNames(t)),
            rowsToCheck = 100,
            // First column whose top rowsToCheck values contain the header
            // (>= 0 so a header in the very first row is also found)
            Column = List.Select(columns, each List.PositionOf(List.FirstN(Table.Column(t, _), rowsToCheck), header) >= 0){0},
            Row = List.PositionOf(Table.Column(t, Column), header),
            // Drop everything above and to the left of the header cell
            ScrollRows = Table.RemoveFirstN(t, Row),
            ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(columns, List.PositionOf(columns, Column))),
            #"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
        in
            #"Promoted Headers",
    
    Trimmed = fTrimTable(Sheet1_Sheet, "Header100")
in
    Trimmed

 

 

I've tested it on a file with 600 columns and 2,000 rows (about 12 MB); on my fairly old laptop it takes about 10-15 seconds to refresh, much better than the older version.

 

For the record: the problem with the old code was that (a) it loaded the entire Excel file again for each column checked for the position of the header, and (b) checking the position of the header in a list of thousands of items is not quick when it is not in the list. Neither is a big problem if the column is found quickly (ideally in cell A1), but they very quickly become one as the number of columns to check grows beyond 3-5. In my testing, searching for "Header100" - the 100th column - took forever and I had to break the cycle. The optimised version has neither of these problems.
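
As an illustration of point (a), buffering the table pins it in memory so repeated column scans do not go back to the file. This is a sketch only (path, sheet and header names are placeholders, and the timings will depend on the workbook):

```
let
    Source = Excel.Workbook(File.Contents("C:\Temp\Test.xlsx"), null, true),
    Raw = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],

    // Unbuffered: each Table.Column(Raw, _) call below can trigger
    // a fresh read of the workbook - once per column checked
    Slow = List.Select(Table.ColumnNames(Raw),
        each List.PositionOf(Table.Column(Raw, _), "Header100") >= 0),

    // Buffered: the same scan touches the file only once,
    // regardless of how many columns are checked
    Buffered = Table.Buffer(Raw),
    Fast = List.Select(Table.ColumnNames(Buffered),
        each List.PositionOf(Table.Column(Buffered, _), "Header100") >= 0)
in
    Fast
```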

 

Kind regards,

John

 


8 REPLIES
jbwtp
Memorable Member

Hi @JimChisholm,

 

Would this be of help in resolving the problem?

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("bU67DsIwDPyXzh2wE8f2zsDOWHUIohIzgv/HF9QoqEiR5dzLtyzTNB/eOu/wZav37cn/2C9FsV3ft8fxw12T+pb7JnvMub4qfGQgNVvMYh4zUYnJgQ9SboSASLAZwplOSCAalSCMESoOfW4ub4h6GaWATDATaz/gjBZZf6SNQEdXVLHWVGGyKDEI0Z0UrAiKOiqqAGBcXz8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"(blank)" = _t, #"(blank).1" = _t, #"(blank).2" = _t, #"(blank).3" = _t, #"(blank).4" = _t, #"(blank).5" = _t, #"(blank).6" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"(blank)", type text}, {"(blank).1", type text}, {"(blank).2", type text}, {"(blank).3", type text}, {"(blank).4", type text}, {"(blank).5", type text}, {"(blank).6", type text}}),
    
    fTrimTable = (t as table, header as text) =>
        let 
            // Scan every column for the header value and remember the
            // matching row/column pair (>= 0 so a match in the first
            // row is also found; the original "Header1" literal is
            // replaced by the header parameter)
            FindFirstCell = List.Accumulate(Table.ColumnNames(t), null, (a, n) => if List.PositionOf(Table.Column(t, n), header) >= 0 then [Row = List.PositionOf(Table.Column(t, n), header), Column = n] else a),
            // Drop the rows above and the columns left of the header cell
            ScrollRows = Table.RemoveFirstN(t, FindFirstCell[Row]),
            ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(Table.ColumnNames(ScrollRows), List.PositionOf(Table.ColumnNames(ScrollRows), FindFirstCell[Column]))),
            #"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
        in
            #"Promoted Headers",
    
    Output = fTrimTable(#"Changed Type", "Header1")
in
    Output

 

The code searches for the top-left corner of the data regardless of the spreadsheet structure (as long as the header name is unique; it can be re-written to take just the first appearance).
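
To apply this to an actual workbook rather than the embedded sample, the call would look something like this (the path, sheet name and header name are placeholders, with fTrimTable copied in as defined above):

```
let
    Source = Excel.Workbook(File.Contents("C:\Temp\Test.xlsx"), null, true),
    Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
    // "Header1" is the known header of the top-left data column
    Output = fTrimTable(Sheet1, "Header1")
in
    Output
```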

 

Kind regards,

John

@jbwtp 

Hi John,

Using this approach would require that I get header names from the user - that of the top-left data column and of the index column. Easy enough to do.

However, it takes nearly 3 minutes to complete this query on a 450 column by 1,450 row worksheet.

Directly deleting rows and columns based on user-provided input is nearly instantaneous so I'll have to stick with that approach for now.
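
A sketch of that direct-deletion approach, assuming the user supplies the row and column counts (the path, sheet name and parameter values are illustrative):

```
let
    Source = Excel.Workbook(File.Contents("C:\Temp\Test.xlsx"), null, true),
    Raw = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],

    // User-supplied, as-imported structure parameters
    RowsToDelete = 0,     // rows above the header after PQ drops leading blanks
    ColsToDelete = 0,     // columns left of the data after PQ drops leading blanks

    // Plain deletions - no scanning, so this is nearly instantaneous
    TrimmedRows = Table.RemoveFirstN(Raw, RowsToDelete),
    TrimmedCols = Table.SelectColumns(TrimmedRows,
        List.Skip(Table.ColumnNames(TrimmedRows), ColsToDelete)),
    Promoted = Table.PromoteHeaders(TrimmedCols, [PromoteAllScalars = true])
in
    Promoted
```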

Thanks for providing this approach. I'll certainly add it to my toolbox.

Jim.


 

@jbwtp 

Success!
I tried it on my notoriously badly-structured dataset and performance is no longer an issue.
The user will now have to supply only one piece of information - the name of the top left header.
I'll put the max row check number in a variable on a configuration sheet.
Thanks for your solution John. I think this is a good general purpose approach that should find wide use.

Jim.

jbwtp
Memorable Member

Hi @JimChisholm,

 

I could not find a way to re-configure the settings to take the entire spreadsheet inside PQ, but you can work around it by placing a character in the bottom-right cell of the desired range. In my testing I put it in the bottom-right cell of the sheet [which is XFD1048576] to fetch the entire spreadsheet, and it worked.

 

Kind regards,

John

Hi @jbwtp ,

The UsedRange is defined by the top left and the bottom right cells.

If you start with a UsedRange of $E$7:$G$13 (columns A-D are empty; rows 1-6 are empty) then placing a character (or changing the Style) in the bottom right-most cell expands the UsedRange to $E$7:$XFD$1048576.
The top left corner is not affected.
In the case of a UsedRange of $E$7:$G$13, the bottom right corner ($G$13) of the UsedRange is correct - it's the empty top rows and left columns I need.

So, other than manually manipulating the UsedRange to define the area to be imported - I don't see a way to import the entire worksheet.
Jim.

Hi @JimChisholm,

 

I think you are right, but maybe it is better to start from the other end of the story: why do you need those empty lines/cells? What are you trying to achieve?

 

Thanks,

John

Hi @jbwtp ,

I have a generic file translation process that takes the following file structure parameters from the user:

  1. Header row number
  2. Index column number 

This allows the process to handle all the messy worksheets that users create (leading blank rows and columns, multi-row headers, odd-spellings and line feeds) that I have no control over.

If they can at least look at the target worksheet and tell me those two simple things about the block of data, I'll have all the information I need to process the file - provided PQ ignores the UsedRange and imports the entire sheet.

Since PQ won't do that I'll ask the user for two more parameters - 

  1. Number of leading blank rows
  2. Number of leading blank columns

Then I'll calculate the following to get the as-imported parameters needed for processing -

  1. Number of Rows to Delete =
    Header Row Number - Number of Blank Rows - 1
  2. Index Column =
    Index Column Number - Number of Blank Columns
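
A worked example of those calculations, assuming a header in row 7 of the sheet, an index column in column E (the 5th column), six leading blank rows and four leading blank columns:

```
let
    HeaderRowNumber = 7,    // header row in the source sheet
    IndexColumnNumber = 5,  // column E in the source sheet
    BlankRows = 6,          // leading blank rows PQ will drop
    BlankColumns = 4,       // leading blank columns PQ will drop

    // As-imported parameters for processing
    RowsToDelete = HeaderRowNumber - BlankRows - 1,   // 7 - 6 - 1 = 0
    IndexColumn = IndexColumnNumber - BlankColumns    // 5 - 4 = 1
in
    [RowsToDelete = RowsToDelete, IndexColumn = IndexColumn]
```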

 
