How to validate a classification model
Hello, this is my first blog, and it is about how to validate a classification model.
First, let us understand what validation is: in machine learning, model validation is the process of evaluating a trained model on a testing data set. This is done to check the generalization ability of the trained model.
For this, I am using the Wisconsin Breast Cancer Database to predict whether a breast tumor is benign (not harmful) or malignant (cancerous and dangerous).
I have taken this dataset from the UCI repository; the breast cancer database was compiled by Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison.
The data set consists of 699 instances with 11 attributes, one of which is the class attribute. Based on the other 10 attributes, we are going to predict the harmfulness of the disease.
I have chosen 'Class' as the label to predict whether the effects are harmful or not. In the dataset, 2 indicates that a tumor is benign, and 4 indicates that it is malignant.
Here is what my data set looks like:-
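As a sketch, loading and inspecting the data might look like the following. The column names are my own labels taken from the UCI dataset description (they are not from the original post), and the three inline rows are just a small illustrative sample in the file's layout:

```python
import io

import pandas as pd

# A tiny sample in the UCI file's layout: an id column, 9 cell-measurement
# attributes, and the class column (2 = benign, 4 = malignant).
sample = io.StringIO(
    "id,clump_thickness,cell_size,cell_shape,adhesion,epithelial_size,"
    "bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class\n"
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,8,10,10,8,7,10,9,7,1,4\n"
)
df = pd.read_csv(sample)

print(df.shape)              # (3, 11) -- 11 attributes per instance
print(df["class"].unique())  # [2 4] -- the two class values
```

In the full file there are 699 such rows; `df["class"].value_counts()` is a quick way to see how many fall into each class.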
Balancing the dataset
In this blog, I have balanced the dataset to make sure that the model doesn't favor a particular class. Of the 699 instances, 458 are benign and 241 are malignant, roughly 65% to 35%, so the classes are not balanced. Medical datasets are generally unbalanced: in real life it is not possible to find 50% of people suffering from a disease, so most cases are benign, and we can also work with unbalanced data in the case of medical industry datasets.
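One common way to balance the classes is to down-sample the majority class so both classes have the same number of instances. This is a minimal sketch using `sklearn.utils.resample`, with hypothetical labels matching the class counts in the text (458 benign, 241 malignant) and placeholder features:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical labels matching the counts in the text:
# 458 benign (class 2) and 241 malignant (class 4).
y = np.array([2] * 458 + [4] * 241)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Split the data by class, then down-sample the majority (benign) class
# to the size of the minority (malignant) class.
X_major, y_major = X[y == 2], y[y == 2]
X_minor, y_minor = X[y == 4], y[y == 4]
X_down, y_down = resample(X_major, y_major, replace=False,
                          n_samples=len(y_minor), random_state=42)

X_bal = np.vstack([X_down, X_minor])
y_bal = np.concatenate([y_down, y_minor])
print((y_bal == 2).sum(), (y_bal == 4).sum())  # 241 241
```

Down-sampling discards data, so on small datasets people often prefer up-sampling the minority class or using class weights instead.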
Random forest classifier
I have chosen to build a model using a Random Forest Classifier.
A random forest is a machine-learning classifier that fits several decision-tree classifiers on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.
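A minimal training sketch is shown below. Note that it uses scikit-learn's bundled breast-cancer data (569 instances, 30 features) as a stand-in, not the 699-row UCI file used in this post, so the numbers will differ:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's built-in breast-cancer set,
# not the UCI Wisconsin file described in the post.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# n_estimators = number of trees; max_depth limits each tree's depth.
clf = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # mean accuracy on the held-out test set
```

`stratify=y` keeps the class proportions the same in the train and test splits, which matters for imbalanced data like this.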
Performance measures for the classification model are:-
- Accuracy:- one of the most commonly used measures; it is the proportion of all predictions that are correct.
Accuracy = (TP + TN)/(TP + TN + FP + FN)
- Precision:- the proportion of predicted positives that are actually positive. In other words, it is the share of true positives among all positive predictions.
Precision is also known as the positive predictive value.
Precision = TP/(TP + FP)
- Recall:- the proportion of actual positives that the model correctly predicts as positive.
Recall is also known as the true positive rate or sensitivity.
Recall = TP/(TP + FN)
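The three formulas above can be checked with a quick worked example. The confusion-matrix counts here (TP, FP, FN, TN) are made up purely for illustration:

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate
# the formulas: TP = true positives, FP = false positives,
# FN = false negatives, TN = true negatives.
tp, fp, fn, tn = 80, 5, 10, 120

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# 0.93 0.941 0.889
```

Note how precision and recall pull in different directions: lowering FP raises precision, while lowering FN raises recall.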
Observation Table:-

| Number of estimators | Max depth | Accuracy score | Precision score | Recall score |
| --- | --- | --- | --- | --- |
### Snapshots of the results, for n_estimators = 100 and max_depth = 7
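The three scores in the observation table can be computed with scikit-learn's metric functions. As before, this sketch uses scikit-learn's bundled breast-cancer data as a stand-in for the UCI file, so the exact numbers will not match the post's snapshots:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled breast-cancer set (not the UCI file).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# The three columns of the observation table for this setting.
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
print(acc, prec, rec)
```

Re-running this loop over several (n_estimators, max_depth) pairs is how the rows of an observation table like the one above would be filled in.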
- The validation found the model to be stable.
- All three models trained here (with different hyperparameter settings) predict correctly whether a case is benign or malignant.
- In this post we worked with a balanced dataset, but in real life that is rarely possible in the medical industry, so unbalanced datasets are used there.