cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
markus_zhang
Advocate II
Advocate II

How can I further optimize this DAX formula?

Hi experts,

I have a table with 12 million rows and two columns:

 

email    |    logins 

 

Basically it's a table of login records for each account (email) between 2018-10-17 and 2019-01-16.

 

I set up a simple Matrix as the following:

 

Date
10/17/2018 0:00
10/24/2018 0:00
10/31/2018 0:00
11/7/2018 0:00
11/14/2018 0:00
11/21/2018 0:00
11/28/2018 0:00
12/5/2018 0:00
12/12/2018 0:00
12/19/2018 0:00
12/26/2018 0:00
1/2/2019 0:00
1/9/2019 0:00
1/16/2019 0:00

 

So that I can put measures in the columns. One of the measures is the Active Mmeber Counts, basically I have a slicer to let the user play with different values of Active Days, and this measure will show different numbers of Active Members in the Matrix (e.g. if Active day is defined as 28 days, it will definitely show more Active members than the one with 21 days). Here is the formula:

 

A few things to clarify:

1. WeekMatrix[Date] is the Date column at the top

2. [ActiveThreshold] is the Active Days user can play with the slicer, could be from 14 days to 30 days

3. Basically the formula creates a helper column to decide if that particular account (email) is considered active by looking at the most recent date he/she has a login and comparing with the Date in the Matrix, and then do a DISTINCTCOUNT as emails have duplicates.

 

IF(
    ISFILTERED(WeekMatrix[Date]),
    CALCULATE(
        DISTINCTCOUNT(NTKLogin[email]),
        FILTER(
            ADDCOLUMNS(
                FILTER(
                    ALL(NTKLogin),
                    NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date])
                ),
                "StillActive",
                IF(
                    DATEDIFF(
                        MAXX(
                            FILTER(
                                FILTER(
                                    ALL(NTKLogin),
                                    NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date])
                                ),
                                AND(
                                    NTKLogin[email] = EARLIER(NTKLogin[email]),
                                    NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date])
                                )
                            ),
                            NTKLogin[logins]
                        ),
                        SELECTEDVALUE(WeekMatrix[Date]),
                        DAY
                    ) <= [ActiveThreshold], //  Still active if within threshold
                    1,
                    0
                )
            ),
            AND(
                [StillActive] = 1,
                NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date])
            )
        )
    ),
    ERROR("This measure can only be evaluated under the filter context of WeekMatrix[Date]")
)

Basically I already took a few optimization, and I ran just one query in DAX Studio for the date 2018-12-15 (i.e. instead of using SELECTEDVALUE(WeekMatrix[Date]), give me the number of active members at 2018-12-15), results are (Cache cleared):

 

7,553 ms for 3,219,824 rows

7,760 ms for 3,219,824 rows

 

Now because I have 14 rows in the Date column, it still means a LOT of time (I never measured in real-time but it feels like 5 minutes at least). This is already better than the old version as that one would give me out of memory error.

 

My question is:

Is there way to further cut down the time running this formula?

One thing that I can think about is: Say the threshold is 28 days, and I'm checking the number of Active Members, say, for 2019-01-03, now there is no reason to check the dates between 2019-01-03 and 2018-12-06 (they are all in range of 28 days and all users that have activity in this range should be automatically considered as Active), but I'm not sure where to insert this check. Maybe this formula is too complicated for me already...

 

I already made sure the following thing:

1. I cannot remove more rows/columns

2. As few visualization as possible, just the slicer and matrix

3. I tried to filter as much as possible (a lot of FILTER() in the formula)

 

Thank you in advance!

1 ACCEPTED SOLUTION

Hi @markus_zhang

 

 Unless I'm missing something, what your code does in a slightly convoluted manner is count how many users (emails) have logged in on the date selected in the slicer or any of the previous [ActiveThreshold] days. If so, you can simplify the code a whole lot:

 

Measure =
IF (
    ISFILTERED ( WeekMatrix[Date] ),
    VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
    RETURN
        CALCULATE (
            DISTINCTCOUNT ( NTKLogin[email] ),
            NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] >= _SelectedDate - [ActiveThreshold]
        ),
    ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)

 

View solution in original post

17 REPLIES 17
Greg_Deckler
Super User
Super User

Why not start with something like:

 

Measure = 
VAR __max = MAX('Slicer'[Days])
VAR __email = MAX('Table'[email])
VAR __table = SUMMARIZE('Table',[email],"__date",MAX([Date]))
VAR __table1 = ADDCOLUMNS(__table,"__days",DATEDIFF([__date],TODAY(),DAYS))
VAR __count = COUNTX(FILTER(__table1,[__days]<=__max && [email] = __email),[email])
RETURN
IF(ISBLANK(__count),0,1)

Something like that, you could then filter on this measure. (you want the 1's)


@ me in replies or I'll lose your thread!!!
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Learn Power BI 2nd Edition

Hi @Greg_Deckler thanks for the help!

I'll try it out whence the VPN recovers, in the mean time is it possible for you to elaborate a bit on this? What I'm trying to figure out is:

 

Did you mean that some of the formulas I wrote are inefficient (and I should use yours as substitution), or I should put as many as possible into VARs which will speed up execution? I think it's the former but need to make sure. Thank you!

I was originally just going to suggest that you use SUMMARIZE to create a temp table but then I ended up coding an alternative, *I think*. I can't really speak to which is more efficient but one of them certainly has less lines of code! 🙂


@ me in replies or I'll lose your thread!!!
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Learn Power BI 2nd Edition

Hi @markus_zhang

 

 Unless I'm missing something, what your code does in a slightly convoluted manner is count how many users (emails) have logged in on the date selected in the slicer or any of the previous [ActiveThreshold] days. If so, you can simplify the code a whole lot:

 

Measure =
IF (
    ISFILTERED ( WeekMatrix[Date] ),
    VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
    RETURN
        CALCULATE (
            DISTINCTCOUNT ( NTKLogin[email] ),
            NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] >= _SelectedDate - [ActiveThreshold]
        ),
    ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)

 

 

Now I'm trying to figure out how to calculate the number of "At risk members", which is defined as who signed-up 29-85 days before and never signed-up 0-28 days before. If they signed up 85+ days before and never signed up since then, they are "Dormant".

 

A sample diagram:

-200 day: d

-100 day: b

-98 day: c

-54 day: a

-54 day: c

-40 day: c

-13 day: a

-10 day: b

 

So I have a total of 4 members, a and b are Active, c is At-risk and d is Dormant.

 

I don't think this can be achieved as easily as the Active Member Count measure because it involves basically two passes:

 

Pass 1 - Grab all members who signed-up in (-85, -29]

Pass 2 - Grab all members who signed-up in (-29, 0]

Then use maybe an INTERSECT to find members who signed up in both periods, and these should be excluded.

So basically it is:

 

DISTINCTCOUNT of all NTKLogin[email] between (-85, -29] relative to SELECTEDVALUE(WeekMatrix[Date])

    - DISTINCTCOUNT of all NTKLogin[email] of the INTERSECT

 

I think this is slightly better than the method I used before (very similar to the one I pasted, with ADDCOLUMNS), I'll try it out, see how fast it goes and get back to you guys.

 

#Edit: Can optimize the formula - If SELECTEDVALUE(WeekMatrix[Date]) is less than 85 days from the earliest date in NTKLogin, can substitute with a simple [All Members] - [Active members]

Why can't you just use MIN of the number of days? If you did something like:

 

Table = SUMMARIZE('Table',[User],"__days",MIN([Days]))

 

Your __days column would dictact what category they fell into.


@ me in replies or I'll lose your thread!!!
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Learn Power BI 2nd Edition

Hi @Greg_Deckler thanks, it's a bit embarassing but I'm not familiar with SUMMARIZE(), guess it's time to learn it for real!

It's one of my favorites.

 

https://docs.microsoft.com/en-us/dax/summarize-function-dax

 

However, my understanding is that SUMMARIZE is sort of being phased out with the replacement being SUMMARIZECOLUMNS:

 

https://docs.microsoft.com/en-us/dax/summarizecolumns-function-dax

 

If you know SQL, think GROUP BY


@ me in replies or I'll lose your thread!!!
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Learn Power BI 2nd Edition

@AlB Thanks! There is a small modification I made to your code and then it works perfectly! It doesn't speed up a lot but I think it reduces the memory print.

 

Measure =
IF (
    ISFILTERED ( WeekMatrix[Date] ),
    VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
    RETURN
        CALCULATE (
            DISTINCTCOUNT ( NTKLogin[email] ),
FILTER(
ALL(NTKLogin),
AND(NTKLogin[logins] <= _SelectedDate,
NTKLogin[logins] >= _SelectedDate - [ActiveThreshold])
)
        ),
    ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)

TBH now I look back and have no idea how and why I came up with that monstrous query... 


@markus_zhang wrote:

@AlB Thanks! There is a small modification I made to your code and then it works perfectly! It doesn't speed up a lot but I think it reduces the memory print.

 

@markus_zhang   True, the simplified syntax for the CALCULATE filter arguments does not allow invoking measures.

 

Are you sure it doesn't speed it up considerably? I find that very hard to believe. Have you checked with a large enough table to be able to appreciate the difference or are you running the comparison on a small fragment of the original table?

 

I don't think the memory print would change at all, since we just modified the code in a measure.

 

   

@AlB

Weird, it does speed up considerably today when I'm working over VPN, actually it is almost instant. Makes sense though as it should speed up. Thanks!

 

For the memory print, I think I didn't clarify, it's the memory Power BI uses when it's evaluating the measures. For the original ones (I have three similar measures for Active, At Risk and Dormant members) evaluating all of them will usually gives me a out of memory error, but the new ones only pushes memory usage to about 48%.

@markus_zhang

 

Ah ok. I see what you mean. Yeah the memory used should be different as your measure I believe would use the Formula Engine heavily and therefore it needs to take up larger chunks of memory when it is executing 

AlB
Super User
Super User

@markus_zhang

 

You piqued my curiosity so I ran a quick test with a table with about half a million rows. Smaller than yours but well, I didn't want to wait much.  I measured the performance with DAX Studio.

The results are telling; see the screen caps below. The modified measure is more than 50 times faster Smiley Surprised You can also see the massive difference in the percentage of time spent in the formula engine versus in the storage engine.

 

Initial measureInitial measureModified measureModified measure

 

AlB
Super User
Super User

@markus_zhang

 

Perhaps a new thread should be opened for this but anyway. The number of active, dormant and at risk users could be calculated as follows, taking into account that the three segments are mutually exclusive.

 

Dormant users are all those who have not logged in during the last 85 days:

 

DormantUsers =
VAR _DormantOffset = 85 //Change this if necessary
VAR _DormantCutOffDate =
    SELECTEDVALUE ( WeekMatrix[Date] ) - _DormantOffset
VAR _AllUsers =
    DISTINCTCOUNT ( NTKLogin[email] )
VAR _UsersAfterDormantCutOff =
    CALCULATE (
        DISTINCTCOUNT ( NTKLogin[email] ),
        NTKLogin[logins] >= _DormantCutOffDate
    )
RETURN
    _AllUsers - _UsersAfterDormantCutOff

 

It could also be done as you suggested with set functions 

TableA = all emails

TableB = all emails with login in the last 85 days

COUNTROWS(EXCEPT(A, B))

 

but I think the approach above is simpler and probably faster (although probably not by a lot)

Thanks @AlB I already got those measures, I think we are using the same method but your code is more readable, I'm going to create some VARs too.

Really appreciate you and @Greg_Deckler, can't do without you guys' help.

 

Here is my version (I use a slightly different def of At Risk, Inactive and Dormant in my code but the gist is the same)

I also have to use a filter inside CALCULATE to grab ALL(NTKLogin) and filter it further.

 

In Risk Member Count = 
VAR _SelectedDate = SELECTEDVALUE (WeekMatrix[Date])
RETURN
IF (
    ISFILTERED ( WeekMatrix[Date] ),
        CALCULATE (
            DISTINCTCOUNT ( NTKLogin[email] ),
            FILTER(
                ALL(NTKLogin),
                AND(
                    NTKLogin[logins] <= _SelectedDate,
                    NTKLogin[logins] > _SelectedDate - [_AtRiskThreshold]
                )
            )
        ) - [Active Member Count],
    ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)
Inactive Member Count = 
VAR _SelectedDate = SELECTEDVALUE (WeekMatrix[Date])
RETURN
IF (
    ISFILTERED ( WeekMatrix[Date] ),
        CALCULATE (
            DISTINCTCOUNT ( NTKLogin[email] ),
            FILTER(
                ALL(NTKLogin),
                AND(
                    NTKLogin[logins] <= _SelectedDate,
                    NTKLogin[logins] > _SelectedDate - [_InactiveThreshold]
                )
            )
        ) - [Active Member Count] - [In Risk Member Count],
    ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)

 

 

AlB
Super User
Super User

@markus_zhang

 

Then, active users are those who have logged in over the last 28 days: 

 

ActiveUsers =
VAR _ActiveOffset = 28 //Change this if necessary
VAR _ActiveCutOffDate =
    SELECTEDVALUE ( WeekMatrix[Date] ) - _ActiveOffset
RETURN
    CALCULATE (
        DISTINCTCOUNT ( NTKLogin[email] ),
        NTKLogin[logins] >= _ActiveCutOffDate
    )

 

and finally, at risk users are all the rest so we can use the previous two measures:

 

AtRiskUsers =
VAR _AllUsers =
    DISTINCTCOUNT ( NTKLogin[email] )
RETURN
    _AllUsers - ( [ActiveUsers] + [DormantUsers] )

 

AlB
Super User
Super User

@markus_zhang

 

Take a look at this video by master Ferrari. Interesting stuff in it and a good way to ignite your interest in optimization. 

Helpful resources

Announcements
Microsoft Build 768x460.png

Microsoft Build is May 24-26. Have you registered yet?

Come together to explore latest innovations in code and application development—and gain insights from experts from around the world.

May 23 2022 epsiode 5 without aka link.jpg

The Power BI Community Show

Welcome to the Power BI Community Show! Jeroen ter Heerdt talks about the importance of Data Modeling.

Power BI Dev Camp Session 22 with aka link 768x460.jpg

Check it out!

Mark your calendars and join us on Thursday, May 26 at 11a PDT for a great session with Ted Pattison!

charticulator_carousel_with_text (1).png

Charticulator Design Challenge

Put your data visualization and design skills to the test! This exciting challenge is happening now through May 31st!