Nart
Frequent Visitor

Segmenting a query with so many columns (data from a long questionnaire)

I have a 700x800 dataset coming from a long questionnaire (700 questions x 800 submissions).

To work with the data in Power BI, I need to unpivot each group of columns that represents one question together.
I thought it would be a bad idea to do all that unpivoting in one query, so I started referencing the main cleaned query and using "Choose Columns" to keep only a few related columns to unpivot in a separate query.

It worked great for a while, but as the number of referencing queries increased, I started noticing some performance issues, which raised several questions in my head:

 

1- Is Power BI really copying the whole table and then deleting the columns I exclude in every sub-query? And if so, is there a way to copy/reference only a selected group of columns (maybe something like DAX's SELECTCOLUMNS) without so much wasted processing time?

2- Am I mistaken to think that this segmentation is my only option here?

1 ACCEPTED SOLUTION
CNENFRNL
Super User III

Hi, @Nart , you're only one inch away from the goal. After unpivoting the dataset, extract the question's serial number, i.e. "Q1" from "Q1-A1", "Q1-A2", "Q1-A3" ... The transformed dataset looks like this, and it's ready to slice and dice across quite a few dimensions, such as index, gender, age, question, etc.

 

let
    // Sample data (compressed); in practice, replace this with your cleaned questionnaire query
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUfIFYkMzIOEM4hmAeFAamW1IBBvBj9WJVjICstxAPAuQ2UYEzMNlrwGGepDZxlCzjYxgZmMzC7vbsPMRZpvAwsQSZLYxHrOwmYnbj7GxAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Index = _t, gender = _t, age = _t, camp = _t, #"Q1-A1" = _t, #"Q1-A2" = _t, #"Q1-A3" = _t, #"Q1-A4" = _t, #"Q1-A5" = _t, #"Q2-A1" = _t, #"Q2-A2" = _t, #"Q2-A3" = _t, #"Q2-A4" = _t, #"Q2-A5" = _t, #"Q2-A6" = _t, #"Q3-A1" = _t, #"Q3-A2" = _t, #"Q3-A3" = _t, #"Q4-A1" = _t, #"Q4-A2" = _t, #"Q4-A3" = _t, #"Q4-A4" = _t]),
    // Keep the demographic columns and turn every "Qx-Ay" column into rows
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(Source, {"Index", "gender", "age", "camp"}, "Opt", "Answer"),
    // Extract the question id ("Q1" from "Q1-A1") into its own column
    #"Inserted Text Before Delimiter" = Table.AddColumn(#"Unpivoted Other Columns", "Question", each Text.BeforeDelimiter([Opt], "-"), type text),
    #"Changed Type" = Table.TransformColumnTypes(#"Inserted Text Before Delimiter",{{"Answer", Int64.Type}, {"Index", Int64.Type}})
in
    #"Changed Type"

 

Screenshot 2020-12-08 100259.png


9 REPLIES
Icey
Community Support

Hi @Nart ,

 

Please let us know if the replies above are helpful.

 

If they are, please accept the reply that answers your question as the solution, so that people who have the same question can find the solution directly.

 

If not, please let us know.

 

 

Best Regards,

Icey

Nart
Frequent Visitor

Nevertheless, I'm still interested in knowing whether there's a way to pick a group of columns in a new query without pulling the whole original query first. 

CNENFRNL
Super User III

Hi, @Nart , sure thing, you might want to use Table.SelectColumns() to extract specific columns from the original dataset.
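A minimal sketch of that approach (assuming your cleaned main query is named MainQuery and the question columns follow the "Q1-A1" naming from the sample; adjust the names to your model):

let
    // Reference the cleaned query; only the listed columns are kept below
    Source = MainQuery,
    // MissingField.Ignore avoids an error if a listed column doesn't exist
    Q1Only = Table.SelectColumns(
        Source,
        {"Index", "Q1-A1", "Q1-A2", "Q1-A3", "Q1-A4", "Q1-A5"},
        MissingField.Ignore
    )
in
    Q1Only

Note that Power Query folds and streams steps where it can, so selecting columns as the first step after the reference keeps the rest of the query working on the narrower table.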

CNENFRNL
Super User III

Hi, @Nart 

How do you define correlations among questions? I think another table is necessary regardless of the dataset's form (a one-dimensional table after unpivoting, or the original two-dimensional table).

 

As you know, DAX runs on the VertiPaq engine, which is an in-memory columnar database. Here's a brief introduction for your reference:

https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=2

 

To my superficial understanding, VertiPaq favors one-dimensional (long) tables.

 

"...
Single-column access is very fast, because it reads a single block of memory, and then it computes whatever aggregation you need on that memory block.

If an expression uses many columns, the algorithm is more complex because it requires the engine to access different memory areas at different times, keeping track of the progress in some temporary area.

The more columns you need to compute an expression, the harder it becomes to produce a final value, up to a point where it is easier to rebuild the row storage out of the column store to compute the expression.
..."
(quoted from the aforementioned article)

 

 

Nart
Frequent Visitor

@CNENFRNL Thanks a lot! That was right on point.

I guess I'll just create a long table and use a duplicate with questions that require cross-question filtering.

I'll let you know how that works.

Nart
Frequent Visitor

I am using the Reference feature, as I said, not duplicating. But the referencing step still seems to pull the entire table into the new query, while I just need a few columns.

 

The reason I went with segmenting is that I need to unpivot groups of columns together. If I did this while all columns were in the same query, I'd be creating thousands of new rows every time.

Here's what my dataset looks like.

questionnaire data sample.png

Each question is represented across several columns.
To visualize and analyze the answers, I'm unpivoting the columns for each question separately to get all of its answers in one column.
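That per-question unpivot can be done without manually listing columns (a sketch, assuming the cleaned query is named MainQuery and the column names carry a "Q1-", "Q2-", ... prefix as in the screenshot):

let
    Source = MainQuery,
    // Pick every column belonging to question Q1 by its name prefix
    Q1Names = List.Select(Table.ColumnNames(Source), each Text.StartsWith(_, "Q1-")),
    // Keep the key column plus just that question's columns
    Q1Cols = Table.SelectColumns(Source, {"Index"} & Q1Names),
    // Turn that question's answer columns into rows
    Unpivoted = Table.UnpivotOtherColumns(Q1Cols, {"Index"}, "Opt", "Answer")
in
    Unpivoted

Changing the prefix in Text.StartsWith is then the only edit needed to produce the same shape for another question.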


Nart
Frequent Visitor

Thanks @CNENFRNL , I did try that earlier, but what I lost is the ability to filter questions by other questions to see how they correlate. Plus, wouldn't the whole (very long) table have to be loaded and filtered in every visual in that case? Wouldn't that make report pages and interactive filters load much more slowly?

Do you think it would be a reasonable performance tradeoff?

mahoneypat
Super User IV

Power Query can handle much more than 700 columns.  Why did you initially decide to segment?  In any case, are you using the Reference feature between your queries?

Reference vs Duplicate in Power BI; Power Query Back to Basics - RADACAD

Regards,

Pat




