Solved: Struggling with TOPN

masplin · ‎05-16-2021

I am trying to classify companies based on the frequency of specific words in their description which each work being associated with a differnet sector.

Here is a description "We provide services and expertise in the field of X-ray protein crystallography to pharmaceutical and biotechnological companies. Our services include: We strive to integrate the work we provide seamlessly and efficiently into the processes and workflow of our customers."

If I just have 2 keywords I have this table "Testing Keywords"

Sector Keyword Weight

Manufacturing	process	3
Bioanalytical	protein	2

My column measure is, but just returns not found when I woudl expect "Manufacturing" as the sector that has the highest score of the keyword weights added up.

=
VAR
keywordinstance=SUMX(Testing_Keyword,
    
        Testing_Keyword[Weight]*
        (LEN(LOWER(CRO_Company[Description]))-LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),"")))/
        LEN(Testing_Keyword[Keyword])
    )

RETURN
IFERROR(CALCULATE(DISTINCT(Testing_Keyword[Sector]),TOPN(1,VALUES(Testing_Keyword[Sector]),keywordinstance)),"Not Found")

If I change the distinct bit to keywordinstance its 5. so this formula is giving the totla across all sectors not the TopN?

Any help appreciated as surei'm doing something very silly

d_gosbell · ‎05-17-2021

Yes, it was a bit hard to tell if it was working with just the one row of test data. I think the following is closer, but potentially you could still end up with multiple sectors with the same score as I'm not sure how you would want to break the ties.

Column = CONCATENATEX(
    var _topn =
    TOPN(1, ALL(Testing_Keyword[Sector]),
        var _currentSector = Testing_Keyword[Sector]
        var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
        var _score = SUMX(
            _sectorKeywords,
            Testing_Keyword[Weight]  * 
            (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
            LEN(Testing_Keyword[Keyword])
            )
        return _score
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(all(Testing_Keyword[Sector])) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

While I was testing this I also built the following column so I could see the weighted score per sector, this may be helpful if you want to do further debugging yourself.

Column 2 = 
CONCATENATEX( ALL(Testing_Keyword[Sector]),
var _currentSector = Testing_Keyword[Sector]
var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
var _score = SUMX(
        _sectorKeywords,
        Testing_Keyword[Weight]  * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
        )
        return _currentSector & " (" & _score & ") "
)

View solution in original post

d_gosbell · ‎05-17-2021

Yes, it was a bit hard to tell if it was working with just the one row of test data. I think the following is closer, but potentially you could still end up with multiple sectors with the same score as I'm not sure how you would want to break the ties.

Column = CONCATENATEX(
    var _topn =
    TOPN(1, ALL(Testing_Keyword[Sector]),
        var _currentSector = Testing_Keyword[Sector]
        var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
        var _score = SUMX(
            _sectorKeywords,
            Testing_Keyword[Weight]  * 
            (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
            LEN(Testing_Keyword[Keyword])
            )
        return _score
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(all(Testing_Keyword[Sector])) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

While I was testing this I also built the following column so I could see the weighted score per sector, this may be helpful if you want to do further debugging yourself.

Column 2 = 
CONCATENATEX( ALL(Testing_Keyword[Sector]),
var _currentSector = Testing_Keyword[Sector]
var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
var _score = SUMX(
        _sectorKeywords,
        Testing_Keyword[Weight]  * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
        )
        return _currentSector & " (" & _score & ") "
)

masplin · ‎05-17-2021

sadly blew my excel up!!!!! i have 2,500 companies so must be too much for it despite 64GB of RAM

d_gosbell · ‎05-19-2021

@masplin wrote:

sadly blew my excel up!!!!! i have 2,500 companies so must be too much for it despite 64GB of RAM

Hmm, that should not blow up with such a small data set. How many keywords do you have?

Another approach would be to do it in Power Query although the keyword lookup is a bit tricky. I coded a function by hand in the attached pbix file, I don't think there is an easy way of doing it with the User Interface. (this file also has the DAX based approach in it)

masplin · ‎05-17-2021

Brilliant. Really appreciate your help. Will spend some time understand the syntax of your solution

d_gosbell · ‎05-16-2021

Your problem here is that you are calcuating the keyword instance once at the start of your expression and then storing the value in a variable. So when you call TOPN that same value is being used for every keyword meaning that all keywords are being returned by TOPN since they all have the same score.

So in the code below I've simply moved the expression you had in your variable inside the TOPN so it is re-calculated for each row in Testing_Keyword. Then I've wrapped the whole thing in a CONCATENATEX to produce a single scalar value and I've added an extra check to filter out results where all keywords are returned (this will happen if no keywords are found because they will all get the same score of 0)

Column = CONCATENATEX(
    var _topn =
    TOPN(1, Testing_Keyword,
        Testing_Keyword[Weight] * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(Testing_Keyword) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

masplin · ‎05-17-2021

Hi. That very elegant. The only isssue is if it finds more than 1 keyword it is outputing a string of sectors such as this, which must mean its not summing up the weights to find the sector with the higest total score. I think it needs a SUMX somewhere?

Bioanalytical,Bioanalytical,Bioanalytical

The above is the description

"Bioassay GmbH offers efficacy and safety testing services with the use of in vivo rodent and in vitro models." Where keywords are in vivo (10) , in vitro (10), assay (5) and testing services (1 for CRO). so score for bioanlytical is 25 with a single output.

Originally I just wanted one sector output, the one with the higest score, but sometimes you get ties. I'm not sure how your code is handling that as does seem to be not using alphabetical when more than 1 sector so must be using the weight.

Actually can't work out any pattern here.

"DKFZ German Cancer Research Center provides antibody, proteomic, genomic, and sequencing services."

Cancer 3 Biomedical

genom 3 bionalytical

sequencing 5 bioanalytical

So bioanalytical scores 8 so is top sector but output is in reverse order. in this should be just bioanalytical as no tie.

Biomedical,Bioanalytical

Sure you are very close, but I don't really understnad the code.

Much appreciated Mike

Struggling with TOPN

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly