Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!
Hi experts,
I have a table with 12 million rows and two columns:
email | logins
Basically it's a table of login records for each account (email) between 2018-10-17 and 2019-01-16.
I set up a simple Matrix as the following:
Date |
10/17/2018 0:00 |
10/24/2018 0:00 |
10/31/2018 0:00 |
11/7/2018 0:00 |
11/14/2018 0:00 |
11/21/2018 0:00 |
11/28/2018 0:00 |
12/5/2018 0:00 |
12/12/2018 0:00 |
12/19/2018 0:00 |
12/26/2018 0:00 |
1/2/2019 0:00 |
1/9/2019 0:00 |
1/16/2019 0:00 |
So that I can put measures in the columns. One of the measures is the Active Mmeber Counts, basically I have a slicer to let the user play with different values of Active Days, and this measure will show different numbers of Active Members in the Matrix (e.g. if Active day is defined as 28 days, it will definitely show more Active members than the one with 21 days). Here is the formula:
A few things to clarify:
1. WeekMatrix[Date] is the Date column at the top
2. [ActiveThreshold] is the Active Days user can play with the slicer, could be from 14 days to 30 days
3. Basically the formula creates a helper column to decide if that particular account (email) is considered active by looking at the most recent date he/she has a login and comparing with the Date in the Matrix, and then do a DISTINCTCOUNT as emails have duplicates.
IF( ISFILTERED(WeekMatrix[Date]), CALCULATE( DISTINCTCOUNT(NTKLogin[email]), FILTER( ADDCOLUMNS( FILTER( ALL(NTKLogin), NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date]) ), "StillActive", IF( DATEDIFF( MAXX( FILTER( FILTER( ALL(NTKLogin), NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date]) ), AND( NTKLogin[email] = EARLIER(NTKLogin[email]), NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date]) ) ), NTKLogin[logins] ), SELECTEDVALUE(WeekMatrix[Date]), DAY ) <= [ActiveThreshold], // Still active if within threshold 1, 0 ) ), AND( [StillActive] = 1, NTKLogin[logins] <= SELECTEDVALUE(WeekMatrix[Date]) ) ) ), ERROR("This measure can only be evaluated under the filter context of WeekMatrix[Date]") )
Basically I already took a few optimization, and I ran just one query in DAX Studio for the date 2018-12-15 (i.e. instead of using SELECTEDVALUE(WeekMatrix[Date]), give me the number of active members at 2018-12-15), results are (Cache cleared):
7,553 ms for 3,219,824 rows
7,760 ms for 3,219,824 rows
Now because I have 14 rows in the Date column, it still means a LOT of time (I never measured in real-time but it feels like 5 minutes at least). This is already better than the old version as that one would give me out of memory error.
My question is:
Is there way to further cut down the time running this formula?
One thing that I can think about is: Say the threshold is 28 days, and I'm checking the number of Active Members, say, for 2019-01-03, now there is no reason to check the dates between 2019-01-03 and 2018-12-06 (they are all in range of 28 days and all users that have activity in this range should be automatically considered as Active), but I'm not sure where to insert this check. Maybe this formula is too complicated for me already...
I already made sure the following thing:
1. I cannot remove more rows/columns
2. As few visualization as possible, just the slicer and matrix
3. I tried to filter as much as possible (a lot of FILTER() in the formula)
Thank you in advance!
Solved! Go to Solution.
Unless I'm missing something, what your code does in a slightly convoluted manner is count how many users (emails) have logged in on the date selected in the slicer or any of the previous [ActiveThreshold] days. If so, you can simplify the code a whole lot:
Measure =
IF (
ISFILTERED ( WeekMatrix[Date] ),
VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
RETURN
CALCULATE (
DISTINCTCOUNT ( NTKLogin[email] ),
NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] >= _SelectedDate - [ActiveThreshold]
),
ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)
Why not start with something like:
Measure = VAR __max = MAX('Slicer'[Days]) VAR __email = MAX('Table'[email]) VAR __table = SUMMARIZE('Table',[email],"__date",MAX([Date])) VAR __table1 = ADDCOLUMNS(__table,"__days",DATEDIFF([__date],TODAY(),DAYS)) VAR __count = COUNTX(FILTER(__table1,[__days]<=__max && [email] = __email),[email]) RETURN IF(ISBLANK(__count),0,1)
Something like that, you could then filter on this measure. (you want the 1's)
Hi @Greg_Deckler thanks for the help!
I'll try it out whence the VPN recovers, in the mean time is it possible for you to elaborate a bit on this? What I'm trying to figure out is:
Did you mean that some of the formulas I wrote are inefficient (and I should use yours as substitution), or I should put as many as possible into VARs which will speed up execution? I think it's the former but need to make sure. Thank you!
I was originally just going to suggest that you use SUMMARIZE to create a temp table but then I ended up coding an alternative, *I think*. I can't really speak to which is more efficient but one of them certainly has less lines of code! 🙂
Unless I'm missing something, what your code does in a slightly convoluted manner is count how many users (emails) have logged in on the date selected in the slicer or any of the previous [ActiveThreshold] days. If so, you can simplify the code a whole lot:
Measure =
IF (
ISFILTERED ( WeekMatrix[Date] ),
VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
RETURN
CALCULATE (
DISTINCTCOUNT ( NTKLogin[email] ),
NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] >= _SelectedDate - [ActiveThreshold]
),
ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)
Now I'm trying to figure out how to calculate the number of "At risk members", which is defined as who signed-up 29-85 days before and never signed-up 0-28 days before. If they signed up 85+ days before and never signed up since then, they are "Dormant".
A sample diagram:
-200 day: d
-100 day: b
-98 day: c
-54 day: a
-54 day: c
-40 day: c
-13 day: a
-10 day: b
So I have a total of 4 members, a and b are Active, c is At-risk and d is Dormant.
I don't think this can be achieved as easily as the Active Member Count measure because it involves basically two passes:
Pass 1 - Grab all members who signed-up in (-85, -29]
Pass 2 - Grab all members who signed-up in (-29, 0]
Then use maybe an INTERSECT to find members who signed up in both periods, and these should be excluded.
So basically it is:
DISTINCTCOUNT of all NTKLogin[email] between (-85, -29] relative to SELECTEDVALUE(WeekMatrix[Date])
- DISTINCTCOUNT of all NTKLogin[email] of the INTERSECT
I think this is slightly better than the method I used before (very similar to the one I pasted, with ADDCOLUMNS), I'll try it out, see how fast it goes and get back to you guys.
#Edit: Can optimize the formula - If SELECTEDVALUE(WeekMatrix[Date]) is less than 85 days from the earliest date in NTKLogin, can substitute with a simple [All Members] - [Active members]
Why can't you just use MIN of the number of days? If you did something like:
Table = SUMMARIZE('Table',[User],"__days",MIN([Days]))
Your __days column would dictact what category they fell into.
Hi @Greg_Deckler thanks, it's a bit embarassing but I'm not familiar with SUMMARIZE(), guess it's time to learn it for real!
It's one of my favorites.
https://docs.microsoft.com/en-us/dax/summarize-function-dax
However, my understanding is that SUMMARIZE is sort of being phased out with the replacement being SUMMARIZECOLUMNS:
https://docs.microsoft.com/en-us/dax/summarizecolumns-function-dax
If you know SQL, think GROUP BY
@AlB Thanks! There is a small modification I made to your code and then it works perfectly! It doesn't speed up a lot but I think it reduces the memory print.
Measure =
IF (
ISFILTERED ( WeekMatrix[Date] ),
VAR _SelectedDate = SELECTEDVALUE ( WeekMatrix[Date] )
RETURN
CALCULATE (
DISTINCTCOUNT ( NTKLogin[email] ),
FILTER(
ALL(NTKLogin),
AND(NTKLogin[logins] <= _SelectedDate,
NTKLogin[logins] >= _SelectedDate - [ActiveThreshold])
)
),
ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" )
)
TBH now I look back and have no idea how and why I came up with that monstrous query...
@markus_zhang wrote:@AlB Thanks! There is a small modification I made to your code and then it works perfectly! It doesn't speed up a lot but I think it reduces the memory print.
@markus_zhang True, the simplified syntax for the CALCULATE filter arguments does not allow invoking measures.
Are you sure it doesn't speed it up considerably? I find that very hard to believe. Have you checked with a large enough table to be able to appreciate the difference or are you running the comparison on a small fragment of the original table?
I don't think the memory print would change at all, since we just modified the code in a measure.
Weird, it does speed up considerably today when I'm working over VPN, actually it is almost instant. Makes sense though as it should speed up. Thanks!
For the memory print, I think I didn't clarify, it's the memory Power BI uses when it's evaluating the measures. For the original ones (I have three similar measures for Active, At Risk and Dormant members) evaluating all of them will usually gives me a out of memory error, but the new ones only pushes memory usage to about 48%.
Ah ok. I see what you mean. Yeah the memory used should be different as your measure I believe would use the Formula Engine heavily and therefore it needs to take up larger chunks of memory when it is executing
You piqued my curiosity so I ran a quick test with a table with about half a million rows. Smaller than yours but well, I didn't want to wait much. I measured the performance with DAX Studio.
The results are telling; see the screen caps below. The modified measure is more than 50 times faster You can also see the massive difference in the percentage of time spent in the formula engine versus in the storage engine.
Perhaps a new thread should be opened for this but anyway. The number of active, dormant and at risk users could be calculated as follows, taking into account that the three segments are mutually exclusive.
Dormant users are all those who have not logged in during the last 85 days:
DormantUsers = VAR _DormantOffset = 85 //Change this if necessary VAR _DormantCutOffDate = SELECTEDVALUE ( WeekMatrix[Date] ) - _DormantOffset VAR _AllUsers = DISTINCTCOUNT ( NTKLogin[email] ) VAR _UsersAfterDormantCutOff = CALCULATE ( DISTINCTCOUNT ( NTKLogin[email] ), NTKLogin[logins] >= _DormantCutOffDate ) RETURN _AllUsers - _UsersAfterDormantCutOff
It could also be done as you suggested with set functions
TableA = all emails
TableB = all emails with login in the last 85 days
COUNTROWS(EXCEPT(A, B))
but I think the approach above is simpler and probably faster (although probably not by a lot)
Thanks @AlB I already got those measures, I think we are using the same method but your code is more readable, I'm going to create some VARs too.
Really appreciate you and @Greg_Deckler, can't do without you guys' help.
Here is my version (I use a slightly different def of At Risk, Inactive and Dormant in my code but the gist is the same)
I also have to use a filter inside CALCULATE to grab ALL(NTKLogin) and filter it further.
In Risk Member Count = VAR _SelectedDate = SELECTEDVALUE (WeekMatrix[Date]) RETURN IF ( ISFILTERED ( WeekMatrix[Date] ), CALCULATE ( DISTINCTCOUNT ( NTKLogin[email] ), FILTER( ALL(NTKLogin), AND( NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] > _SelectedDate - [_AtRiskThreshold] ) ) ) - [Active Member Count], ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" ) )
Inactive Member Count = VAR _SelectedDate = SELECTEDVALUE (WeekMatrix[Date]) RETURN IF ( ISFILTERED ( WeekMatrix[Date] ), CALCULATE ( DISTINCTCOUNT ( NTKLogin[email] ), FILTER( ALL(NTKLogin), AND( NTKLogin[logins] <= _SelectedDate, NTKLogin[logins] > _SelectedDate - [_InactiveThreshold] ) ) ) - [Active Member Count] - [In Risk Member Count], ERROR ( "This measure can only be evaluated under the filter context of WeekMatrix[Date]" ) )
Then, active users are those who have logged in over the last 28 days:
ActiveUsers = VAR _ActiveOffset = 28 //Change this if necessary VAR _ActiveCutOffDate = SELECTEDVALUE ( WeekMatrix[Date] ) - _ActiveOffset RETURN CALCULATE ( DISTINCTCOUNT ( NTKLogin[email] ), NTKLogin[logins] >= _ActiveCutOffDate )
and finally, at risk users are all the rest so we can use the previous two measures:
AtRiskUsers = VAR _AllUsers = DISTINCTCOUNT ( NTKLogin[email] ) RETURN _AllUsers - ( [ActiveUsers] + [DormantUsers] )
Take a look at this video by master Ferrari. Interesting stuff in it and a good way to ignite your interest in optimization.
User | Count |
---|---|
140 | |
113 | |
104 | |
77 | |
64 |
User | Count |
---|---|
135 | |
122 | |
101 | |
71 | |
61 |