AutoML is advertised as designed for "business analysts to build machine learning models". In practice, however, a user without a data science background can run into real problems. Here is one way to reproduce such a problem:
- Use the Power BI sample projects (Supplier Quality Analysis Sample)
- Extend the metrics query by creating a HasDefect column, which we would like to have as a label for future prediction. In my case: = Table.AddColumn(#"Added Custom", "HasDefect", each if [Defect Type ID]<=1 then 0 else 1)
- Create a dataflow and an entity based on the Metrics query
- Proceed to creating an ML model and follow the wizard
- For historical outcome field select "HasDefect"
- When you reach the "Customize inputs" step, you will get something like:
- Now the problem is that HasDefect is derived directly from the defect type, so it is very closely correlated with the DefectType column. A user without a data science background will "successfully" train a model with 100% accuracy that is of no use for real predictions.
- Here is what a simple Python visual with Pearson correlation shows: there is an evident correlation between the 2nd and 4th columns.
- Below is the code to generate this. As you can see, I also tried to bring the "Defect" string column into the correlation by encoding it as integers: the defect text is a 1-to-1 match with the defect ID, the defect ID determines the defect type, and the defect type determines the "HasDefect" label. However, I did not manage to make Pearson identify this correlation, because the encoder assigns the integer codes in sorted (alphabetical) order, which has no relation to the label.
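# Runs inside a Power BI Python visual, where `dataset` is the DataFrame supplied by Power BI
import matplotlib.pyplot as pyplot
from sklearn import preprocessing
# Encode the "Defect" text column as integers so it can take part in the correlation
le = preprocessing.LabelEncoder()
dataset['Defect'] = le.fit_transform(dataset['Defect'])
# Pearson correlation over the numeric columns only
corr = dataset.select_dtypes('number').corr(method='pearson')
# Assumed plotting step (the original snippet stops at corr): draw the matrix as a heatmap
pyplot.matshow(corr)
pyplot.show()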
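Here is a minimal sketch of why the integer encoding hides the relationship. It uses made-up toy data rather than the Supplier Quality sample. The defect names, the `cramers_v` helper, and the choice of Cramér's V as an order-independent measure are my own assumptions, not something AutoML is known to use. Pandas category codes stand in for `LabelEncoder` here, since both assign integers in sorted order.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# Hypothetical toy data: each defect name fully determines the label,
# but the alphabetical order of the names is unrelated to the label.
toy = pd.DataFrame({
    "Defect": ["Alpha", "Bravo", "Charlie", "Delta"] * 5,
    "HasDefect": [0, 1, 1, 0] * 5,
})

# Sorted integer codes, the same ordering LabelEncoder would produce
codes = toy["Defect"].astype("category").cat.codes

pearson = codes.corr(toy["HasDefect"])          # depends on the arbitrary code order
v = cramers_v(toy["Defect"], toy["HasDefect"])  # ignores the code order

print("Pearson on encoded codes:", pearson)
print("Cramér's V:", v)
```

On this toy data the Pearson value is essentially zero even though the defect name fully determines the label, while Cramér's V comes out as 1. A measure that works on the categories themselves does not depend on how they happen to be numbered.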
My suggestion: please run a correlation check (or improve the existing one) so that features so strongly correlated with the label that they effectively reveal it are not suggested as inputs. Otherwise, "business analysts" will create models that are not useful, which would degrade the value of the excellent work done here.
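To make the suggestion concrete, here is a rough sketch of one possible check, not a description of how AutoML works. The function name `flag_leaky_features`, both thresholds, and the toy columns are hypothetical. For each feature it measures how well "predict the most common label for each feature value" would do, and flags features where that simple rule is nearly perfect.

```python
import pandas as pd


def flag_leaky_features(df, label, threshold=0.99, max_unique_ratio=0.5):
    """Return {column: purity} for columns that almost fully determine the label."""
    flagged = {}
    for col in df.columns:
        if col == label:
            continue
        # Skip ID-like columns: nearly unique values trivially "determine" anything
        if df[col].nunique() > max_unique_ratio * len(df):
            continue
        # Count rows per (feature value, label value) pair
        counts = df.groupby([col, label]).size()
        # Rows that the "most common label per feature value" rule gets right
        correct = counts.groupby(level=0).max().sum()
        purity = correct / len(df)
        if purity >= threshold:
            flagged[col] = float(purity)
    return flagged


# Hypothetical toy data; HasDefect follows the rule from the post (ID <= 1 -> 0, else 1)
toy = pd.DataFrame({
    "Defect Type ID": [1, 2, 3, 1, 2, 3, 1, 2],
    "Plant": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "HasDefect": [0, 1, 1, 0, 1, 1, 0, 1],
})

flagged = flag_leaky_features(toy, "HasDefect")
print(flagged)
```

On the toy frame only "Defect Type ID" is flagged, because it determines the label completely, while "Plant" is left alone. A real implementation would also need to handle continuous columns (for example by binning) and missing values, which this sketch ignores.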