cancel
Showing results for 
Search instead for 
Did you mean: 

Power BI implements the KNN algorithm

Scenario: 

K-nearest neighbors (KNN) algorithm is a type of supervised machine learning algorithm which can be used for both classification as well as regression predictive problems. However, it is mainly used for classification predictive problems in industry. The following two properties would define KNN well. 

(1) Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all the data for training while classification. 

(2) Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t assume anything about the underlying data. 

Please refer to k-nearest neighbors algorithm for more information if you are not familiar with machine learning algorithms. 

Normally it can be implemented in programming languages like Python, R or MATLAB, but I want to try to implement the KNN algorithm with DAX in Power BI. 

Sample data: 

Training data table 

3.JPG

Testing data table 

4.JPG
Operations:
 

  1.  Calculate the distance between each test data point and the training set data point and take the 5 values with the smallest distance as the neighboring points of the test point.  

Since each record in the test set is compared with that in the training set and the difference is figured out, we can then calculate the difference of Cartesian product between two tables and make K=5 

The value of K involves parameter tuning issue, which will have a great influence on the result.  

If the value is set too small, it will increase the complexity of the model and prone to overfitting. If it is set too large, the classification will be blurred. 

Generally, the value of K shall be set less than 20 based on the cross-validation. 

In this practice, I chose the value of k as k = sqrt(N)=5, where N represents the number of samples in the training data set.  

Please refer to A Simple Introduction to K-Nearest Neighbors Algorithm for details.  

Table =  

VAR T1 = 

ADDCOLUMNS( 

    GENERATEALL( 'Test data', 'Training data' ), 

    "Distance", 

    CEILING( 

        SQRT(  

            ([Quantity] - [__Quantity])^2 + ([Price]-[__Price])^2 

        ), 

        0.01 

    ) 

) 

VAR T2 =  

ADDCOLUMNS( 

    T1, 

    "IsKNN", 

    IF( 

        RANKX( 

            FILTER( T1, [__Product] = EARLIER([__Product]) ), 

            [Distance], , ASC, Skip 

        ) <= 5, 

        1, 0 

    ) 

) 

RETURN 

FILTER( 

    T2, 

    [IsKNN] = 1 

) 

  1.  Count the number of adjacent points under each category to see which category has more adjacent points. And each record will be divided into the category with more adjacent points. 

The category with the smallest distance will be selected if the number is the same.  

number of nearby points =  

COUNTX( 

    FILTER( ALL('Table'), [__Product] = EARLIER([__Product]) && [Category] = EARLIER([Category]) ), 

    [Distance] 

) 

  

Column =  

VAR __min =  

MINX(  

    FILTER( 'Table', [__Product] = EARLIER([__Product]) ), 

    [number of nearby points] 

) 

VAR __max =  

MAXX(  

    FILTER( 'Table', [__Product] = EARLIER([__Product]) ), 

    [number of nearby points] 

) 

VAR x =  

MINX( 

    FILTER( 'Table', [__Product] = EARLIER([__Product] ) ), 

    [Distance] 

) 

RETURN 

IF( 

    __max = __min, 

    CALCULATE( 

        MAX([Category]), 

        FILTER( 

            'Table', 

            [Distance] = x && [__Product] = EARLIER( [__Product] )  

        ) 

    ), 

    [Category] 

) 

 

Cartesian product of the training data table and the test data table. 

  1.  Complete the classification of the test data and return the result table. 

Result =  

SUMMARIZE( 

    'Table', 

    [__Product], 

    "Maximum number of nearby points", MAX('Table'[number of nearby points]) 

) 

  

Category =  

LOOKUPVALUE( 

    'Table'[Column], 

    'Table'[__Product], 'Result'[__Product], 

    'Table'[number of nearby points], 'Result'[Maximum number of nearby points] 

 

 

Result

GetImage (10).pngGetImage (9).pngGetImage (8).png

 

Summary: 

According to the above operations, we can see that KNN algorithm can actually be accomplished by Power BI. But considering the limitations on query performance and the data set size, it is recommended to use Python or R. 

  

Author: Lionel Chen 

Reviewer: Zoey Huan