Use Correlation Algorithms to Avoid AutoML Picking Inappropriate Features

AutoML is generally advertised as designed for "business analysts to build machine learning models". In practice, however, a user without a data science background can run into real problems.

Here is the case:

  1. Use the Power BI sample project (Supplier Quality Analysis Sample).
  2. Extend the Metrics query by creating a HasDefect column, which we would like to use as the label for future predictions. In my case: = Table.AddColumn(#"Added Custom", "HasDefect", each if [Defect Type ID]<=1 then 0 else 1)
  3. Create a dataflow and an entity based on the Metrics query.
  4. Proceed to creating an ML model and follow the wizard.
  5. For the historical outcome field, select "HasDefect".
  6. When you reach the "Customize inputs" step you will get something like this (screenshot: 1.png).
  7. Now the problem: HasDefect is derived from the defect type, so it is very closely correlated with the DefectType column, and a user without a data science background will "successfully" train a model with 100% accuracy.
  8. Here is what a simple Python visual with Pearson correlation shows: there is an evident correlation between the 2nd and 4th columns (screenshot: 2.png).
  9. Below is the code to generate this. As you can see, I tried to handle the "Defect" string column by encoding it as integers so that it could take part in the correlation: the defect text is a 1-to-1 match with the defect ID, which determines the defect type, which in turn determines the "HasDefect" label. However, I did not manage to make the visual pick up this correlation, because the encoded integers come out in a shuffled order relative to the defect IDs.

    # Paste or type your script code here:
    # `dataset` is the DataFrame that the Power BI Python visual provides.
    import matplotlib.pyplot as plt
    from sklearn import preprocessing

    # Encode the "Defect" strings as integers so the column can enter the correlation.
    le = preprocessing.LabelEncoder()
    dataset['Defect'] = le.fit_transform(dataset['Defect'])

    # Pearson correlation over the numeric columns only.
    corr = dataset.select_dtypes('number').corr(method='pearson')
    plt.matshow(corr)
    print(dataset)
    plt.show()

My suggestion: please run a correlation check (or improve the existing one) so that features that are highly correlated with the label are not suggested as inputs. Otherwise, "business analysts" will create models that are not useful, which would degrade the value of the excellent work done here.
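
To illustrate the kind of check meant here, below is a minimal sketch (not what AutoML actually does) of an association measure that does not depend on how categories are ordered. It swaps Pearson for Cramér's V, which is my assumption of a suitable alternative for categorical columns. The function names `cramers_v` and `pearson` and the toy values are hypothetical and are not taken from the Supplier Quality sample.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences.

    0 means no association; 1 means the column with more categories
    fully determines the other one.
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0  # a constant column carries no information
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical toy data: each defect name maps to exactly one label value.
defect = ["A", "A", "B", "B", "C", "C", "D", "D"]
has_defect = [0, 0, 1, 1, 1, 1, 0, 0]

# Integer codes in sorted order of the strings, similar to a label encoder.
classes = sorted(set(defect))
codes = [classes.index(d) for d in defect]

print("Pearson on encoded Defect:", pearson(codes, has_defect))
print("Cramér's V on raw Defect: ", cramers_v(defect, has_defect))
```

On this toy data Pearson on the integer codes comes out as 0 while Cramér's V is 1, which matches the situation described above: the label is fully determined by the defect, yet the linear correlation does not see it. In the Python visual, something like `cramers_v(dataset['Defect'], dataset['HasDefect'])` should work on the raw string column, assuming those column names.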
Status: New
Comments
v-qiuyu-msft
Community Support

Hi @ivelin_andreev

 

Thank you for your feedback. I would suggest you post an idea here: https://ideas.powerbi.com/forums/265200-power-bi-ideas

 

Best Regards,
Qiuyun Yu 

ivo_andreev
Regular Visitor

Hi @v-qiuyu-msft,

 

That was in fact my initial plan, but unfortunately the formatting there is very limited and I could not upload pictures. Also, this could be considered both an improvement and an issue, and the fact that it needs images further supports treating it as an issue. Do you know whether uploading images to an idea is possible somehow?

 

Regards

v-qiuyu-msft
Community Support

Hi @ivo_andreev

 

I'm afraid it is not possible to add images to an idea. You can include a link to this thread in the idea, since you have already shared the details here.

 

You can share feedback about the Ideas forum here: https://community.powerbi.com/t5/Community-Feedback/bd-p/community-feedback

 

Best Regards,
Qiuyun Yu