Super User III

Column values change when loading into model

Hi all,

See attached file for support.

We build a simple table in the query editor (M code below). Note we are creating a custom column "Value" with randomly generated numbers. In the query editor, the final result shows a different value for each row:

[screenshot: Value column showing a different random number in each row]

 

However, after loading the table into the model, the Value column shows one and the same value in every row:

[screenshot: Value column showing the same value in every row]

Any idea what is going on?

Many thanks

 

 

let
    Source = List.Numbers(1, NumCustomers),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "CustomerId"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"CustomerId", Int64.Type}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Value", each Number.RandomBetween(0,1000), type number)
in
    #"Added Custom"

 

 

2 ACCEPTED SOLUTIONS
Super User I

Hi @AlB 

 

I'll try to explain a little bit.

No matter which queries we write, in the end the PQ engine does the work for us. To make the engine fast enough, a whole lot of optimisations have been built into it: it will combine some steps, drop others altogether because they are deemed unnecessary, and so on. This is possible because M is a functional, lazily evaluated language: the engine starts from the result and works its way back, evaluating only the parts it actually needs. And this is probably why "DataMashup" has Mashup in its name.

Some of these optimisations give rise to "bugs" like the one you came across.
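As a minimal illustration of that laziness (my own sketch, not from the original post): a step that is never referenced by the result is simply never evaluated, so the error below is never raised.

    let
        Source = {1, 2, 3},
        // Never referenced by the result, so under lazy evaluation
        // this step is never computed and the error never occurs.
        Unused = error "never evaluated",
        Result = List.Count(Source)
    in
        Result  // 3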

 

Using:

 

    #"Added Custom" = Table.AddColumn(#"Changed Type", "Value", each Number.RandomBetween(0,1000), type number),

 

The PQ engine will "understand" that the row values are not needed: we ask for a number, make no use of any row value, and return it. So, when mashing this up, it will actually compute Number.RandomBetween(0,1000) once and add that single result to every row.

 

It is a bit smarter than that, of course: we cannot "trick" it into using the row values if they are not needed:

 

    #"Added Custom" = Table.AddColumn(#"Changed Type", "Value", each Number.RandomBetween(0,1000) + [CustomerId] - [CustomerId], type number),

 

will also never use the rows' values.

 

To actually see this happening in the PQ Editor, one can use a transform function:

 

Table.TransformColumns(#"Added Custom",{{"Value", each Number.RandomBetween(0,1000), type number}})

 

after which every row should be filled with the same number.

 

OK, now we need to actually force it to calculate this thing for every row. And this is why Index or Buffer are commonly used:

i) By adding an index after the random-number generation, the PQ engine actually has to go through every row, look at its values, record their positions and values, and only then move on.

ii) By buffering, it loads the whole table into memory, so it computes a random number for every row and keeps all of those results.
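Sketched in M, continuing from the #"Added Custom" step of the original query (the two options are alternatives; step names are illustrative):

    // (i) Index: forces the engine to visit every row, so each row
    // keeps its own random number. The index can be removed afterwards.
    #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1, Int64.Type),
    #"Removed Index" = Table.RemoveColumns(#"Added Index", {"Index"})

    // (ii) Buffer: materialises the table in memory, pinning the
    // computed random values so they are not collapsed later.
    #"Buffered" = Table.Buffer(#"Added Custom")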

 

There's another solution that isn't talked about much, and for me it is the fastest one I know of: replacing. Table.ReplaceValue actually reads every row and discards it, which is a bit faster than indexing. These two steps should work as expected:

 

    #"Added Custom" = Table.AddColumn(#"Changed Type", "Value", each null, type number),
    #"Replaced Value" = Table.ReplaceValue(#"Added Custom",null,each Number.RandomBetween(0, 1000),Replacer.ReplaceValue,{"Value"})

 

 

Lastly, you can avoid all that by using List.Random, which is actually made to compute different random numbers. Afterwards, however, you'll need to zip it with your data and expand it, or add an index and map the index to the corresponding value in the list, all of which takes some valuable calculation time. But take these two steps:

 

    #"Zip with List Random" = List.Zip({#"Changed Type"[CustomerId], List.Transform(List.Random(List.Count(#"Changed Type"[CustomerId])), each Value.Multiply(_, 1000))}),
    #"List to Table" = Table.FromList(#"Zip with List Random", (x) => x, {"CustomerId", "Value"})
in
    #"List to Table"

 

 

Hope it makes some sense now.

Best,

Spyros





Hi @AlB 

 

I'm glad you found this helpful. I've read a lot about how the mechanics work, but this documentation is found mostly on forums, in responses by people actually involved in the engine's development (I am not one of them 🙂 ). There's a good series by Ben Gribaudo about M, and it touches a little on these topics (lazy evaluation), but not much.

 

OK, now to explain this last bit. The query preview actually computes some things and has its own buffer, which is what helps us see whether what we're doing actually works. When we preview the results, they are buffered in the DataMashup so we can see and work with them, and we can see how each step is computed by clicking on it. (By the way, this is why disabling background preview is a good optimisation when working with large data models.)

 

In a way, the preview is a bit less lazy (fewer optimisations, in our case) than the actual query plan: it tries to keep the steps in order, lets you inspect each step, and so on. So, Number.RandomBetween(0,1000) gets evaluated and shown.

 

However, when the queries are applied, they are not evaluated individually but are re-evaluated all together, compressed, and loaded into the Tabular model's storage engine (VertiPaq); the preview's results are not used. These optimisations are much more aggressive: with very complicated queries we sometimes see the actual evaluation take about the same time as the 1000-row preview. One such optimisation is assuming idempotence for every function: given the same arguments, a function will return the same result. With this, Number.RandomBetween(0,1000) is evaluated once and the result is subsequently added to every row.
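Conceptually (a sketch, not actual engine code), the rewrite enabled by that assumption looks like this:

    // What we wrote: one call per row (what we intended)
    Table.AddColumn(#"Changed Type", "Value",
        each Number.RandomBetween(0, 1000), type number)

    // What the optimised evaluation may effectively do: hoist the
    // call out and reuse its single result for every row.
    let r = Number.RandomBetween(0, 1000) in
        Table.AddColumn(#"Changed Type", "Value", each r, type number)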

 

Of course, with static data we see no difference, but with random data we do.





9 REPLIES
Super User III

Hi @Smauro 

First of all, thanks very much for taking the time to put together such a fantastic, detailed and extensive explanation. I really appreciate it.

I understand the overall idea, but something still bugs me. In the example I provided, the engine already seems to calculate a different random value for each row, since it shows them. It is only when loading into the model that it seems to calculate again, and that time it produces one and the same value for all rows. This doesn't seem to make much sense. Why is it apparently calculating the values twice? Following your explanation, I would expect it to optimise in the last PQ step by calculating a single random value, showing that one value in all rows, and then loading that same value into the model. Showing different values first and then loading only one seems incoherent.

What do you think?

By the way, are you or have you been involved with the development of the PQ engine?

Many thanks    

   



Super User III

@v-alq-msft 

Many thanks for the link. Very interesting

Any idea why there's a need to include an index column? Why does it work differently with or without the index, since we are not doing anything with it?

Thanks

 

Community Support

Hi, @AlB 

 

To create multiple columns containing random numbers, add an index column each time before using Number.Random(). You can delete the index columns later.

 

You can also wrap the step before the Number.Random() step in Table.Buffer() to generate a different number on each row. For example, if you added a custom column in the step just before adding the random-number column, that step would look like Table.Buffer(Table.AddColumn(...)). In this case you will not need to add an index column.
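As a sketch of that suggestion applied to the query from the original post (the "Doubled" column is a hypothetical example of a custom column added before the random one):

    // Hypothetical preceding step, wrapped in Table.Buffer:
    #"Added Custom" = Table.Buffer(
        Table.AddColumn(#"Changed Type", "Doubled", each [CustomerId] * 2)),
    // The random column is then added on top of the buffered table:
    #"Added Random" = Table.AddColumn(#"Added Custom", "Value",
        each Number.RandomBetween(0, 1000), type number)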

 

 

For further information, I'd like to suggest you refer to the following article.

Random Number Generation in Power BI 

 

Best Regards

Allan

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

 

 

Super User III

@Payeras_BI 

Thanks. Yes, I'd seen that with the extra steps. I'd like to know the reason for that behaviour, though. It seems quite strange (and inconsistent) that the values shown in the column depend on whether you have more steps afterwards, or on what kind of steps they are.

 

Super User I

Hi @AlB ,

 

It appears that adding the index column after adding the custom column is what forces PQ to evaluate each row separately from that point onwards (including the load into PBI).

 

Without that Index column you would get the same behaviour in the PQ editor if you continued working there. See below:

[screenshot: without the index, every row shows the same value]

Adding any extra step:

[screenshot: after adding an extra step]

 

Regards,

 


Super User III

@Payeras_BI 

Thanks. Any idea why?

Super User I

Hi @AlB ,

 

It seems that creating an Index column is needed for it to work.

 

let
    Source = List.Numbers(1,NumCustomers),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "CustomerId"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"CustomerId", Int64.Type}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Value", each Number.RandomBetween(0,1000), type number),
    #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1, Int64.Type),
    #"Removed Columns" = Table.RemoveColumns(#"Added Index",{"Index"})
in
    #"Removed Columns"

 

[screenshot: with the index step, each row keeps a different random value]

 

Regards,


