Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
masplin
Impactful Individual
Impactful Individual

Struggling with TOPN

I am trying to classify companies based on the frequency of specific words in their description which each work being associated with a differnet sector.   

 

Here is a description "We provide services and expertise in the field of X-ray protein crystallography to pharmaceutical and biotechnological companies. Our services include: We strive to integrate the work we provide seamlessly and efficiently into the processes and workflow of our customers."

 

If I just have 2 keywords I have this table "Testing Keywords"

 

Sector                                     Keyword           Weight

Manufacturingprocess                3
Bioanalyticalprotein                2

 

My column measure is, but just returns not found when I woudl expect "Manufacturing" as the sector that has the highest score of the keyword weights added up. 

 

 

=
VAR
keywordinstance=SUMX(Testing_Keyword,
    
        Testing_Keyword[Weight]*
        (LEN(LOWER(CRO_Company[Description]))-LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),"")))/
        LEN(Testing_Keyword[Keyword])
    )

RETURN
IFERROR(CALCULATE(DISTINCT(Testing_Keyword[Sector]),TOPN(1,VALUES(Testing_Keyword[Sector]),keywordinstance)),"Not Found")

 

  

If I change the distinct bit to keywordinstance its 5. so this formula is giving the totla across all sectors not the TopN? 

 

Any help appreciated as surei'm doing something very silly

1 ACCEPTED SOLUTION
d_gosbell
Super User
Super User

Yes, it was a bit hard to tell if it was working with just the one row of test data. I think the following is closer, but potentially you could still end up with multiple sectors with the same score as I'm not sure how you would want to break the ties.

 

Column = CONCATENATEX(
    var _topn =
    TOPN(1, ALL(Testing_Keyword[Sector]),
        var _currentSector = Testing_Keyword[Sector]
        var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
        var _score = SUMX(
            _sectorKeywords,
            Testing_Keyword[Weight]  * 
            (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
            LEN(Testing_Keyword[Keyword])
            )
        return _score
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(all(Testing_Keyword[Sector])) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

 

While I was testing this I also built the following column so I could see the weighted score per sector, this may be helpful if you want to do further debugging yourself.

 

Column 2 = 
CONCATENATEX( ALL(Testing_Keyword[Sector]),
var _currentSector = Testing_Keyword[Sector]
var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
var _score = SUMX(
        _sectorKeywords,
        Testing_Keyword[Weight]  * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
        )
        return _currentSector & " (" & _score & ") "
)

 

View solution in original post

6 REPLIES 6
d_gosbell
Super User
Super User

Yes, it was a bit hard to tell if it was working with just the one row of test data. I think the following is closer, but potentially you could still end up with multiple sectors with the same score as I'm not sure how you would want to break the ties.

 

Column = CONCATENATEX(
    var _topn =
    TOPN(1, ALL(Testing_Keyword[Sector]),
        var _currentSector = Testing_Keyword[Sector]
        var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
        var _score = SUMX(
            _sectorKeywords,
            Testing_Keyword[Weight]  * 
            (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
            LEN(Testing_Keyword[Keyword])
            )
        return _score
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(all(Testing_Keyword[Sector])) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

 

While I was testing this I also built the following column so I could see the weighted score per sector, this may be helpful if you want to do further debugging yourself.

 

Column 2 = 
CONCATENATEX( ALL(Testing_Keyword[Sector]),
var _currentSector = Testing_Keyword[Sector]
var _sectorKeywords = CALCULATETABLE(GROUPBY(Testing_Keyword, Testing_Keyword[Weight], Testing_Keyword[Keyword]), TREATAS( {_currentSector}, Testing_Keyword[Sector] ))
var _score = SUMX(
        _sectorKeywords,
        Testing_Keyword[Weight]  * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
        )
        return _currentSector & " (" & _score & ") "
)

 

masplin
Impactful Individual
Impactful Individual

sadly blew my excel up!!!!!  i have 2,500 companies so must be too much for it despite 64GB of RAM


@masplin wrote:

sadly blew my excel up!!!!!  i have 2,500 companies so must be too much for it despite 64GB of RAM


Hmm, that should not blow up with such a small data set. How many keywords do you have?

 

Another approach would be to do it in Power Query although the keyword lookup is a bit tricky. I coded a function by hand in the attached pbix file, I don't think there is an easy way of doing it with the User Interface. (this file also has the DAX based approach in it)

 

 

masplin
Impactful Individual
Impactful Individual

Brilliant. Really appreciate your help. Will spend some time understand the syntax of your solution

d_gosbell
Super User
Super User

Your problem here is that you are calcuating the keyword instance once at the start of your expression and then storing the value in a variable. So when you call TOPN that same value is being used for every keyword meaning that all keywords are being returned by TOPN since they all have the same score.

 

So in the code below I've simply moved the expression you had in your variable inside the TOPN so it is re-calculated for each row in Testing_Keyword. Then I've wrapped the whole thing in a CONCATENATEX to produce a single scalar value and I've added an extra check to filter out results where all keywords are returned (this will happen if no keywords are found because they will all get the same score of 0)

 

Column = CONCATENATEX(
    var _topn =
    TOPN(1, Testing_Keyword,
        Testing_Keyword[Weight] * 
        (LEN(LOWER(CRO_Company[Description])) - LEN(SUBSTITUTE(LOWER(CRO_Company[Description]),LOWER(Testing_Keyword[Keyword]),""))) /
        LEN(Testing_Keyword[Keyword])
    )
    // if the _topn variable contains all the rows from the keyword table this probably
    // means that none of the keywords matched so they all scored 0 so we should 
    // filter them all out
    var _condition = if(COUNTROWS(Testing_Keyword) = COUNTROWS(_topn),FALSE(), True())
    return filter( _topn, _condition)
,[Sector]
,",")

 

masplin
Impactful Individual
Impactful Individual

Hi. That very elegant.  The only isssue is if it finds more than 1 keyword it is outputing a string of sectors such as this, which must mean its not summing up the weights to find the sector with the higest total score.  I think it needs  a SUMX somewhere?

Bioanalytical,Bioanalytical,Bioanalytical

 

The above is the description 

"Bioassay GmbH offers efficacy and safety testing services with the use of in vivo rodent and in vitro models."    Where keywords are in vivo (10) , in vitro (10), assay (5) and testing services (1 for CRO). so score for bioanlytical is 25 with a single output.

Originally I just wanted one sector output, the one with the higest score, but sometimes you get ties.  I'm not sure how your code is handling that as does seem to be not using alphabetical when more than 1 sector so must be using the weight. 

 

Actually can't work out any pattern here.

"DKFZ German Cancer Research Center provides antibody, proteomic, genomic, and sequencing services."

 

Cancer 3 Biomedical

genom 3 bionalytical

sequencing 5 bioanalytical

 

So bioanalytical scores 8 so is top sector but output is in reverse order. in this should be just bioanalytical as no tie. 

Biomedical,Bioanalytical

 

Sure you are very close, but I don't really understnad the code.

 

Much appreciated Mike

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.