Today I will be looking at the Wisconsin Breast Cancer Data Set, a pre-processed, clean data set available in SKLearn. The purpose is to classify whether a tumour is ‘benign’ or ‘malignant’, and I will be comparing and contrasting a number of different techniques as a means of understanding the data. As this is a relatively small data set, I don’t foresee any meaningful difference in the models’ power of prediction. However, this should provide an insight into some of the different tools and metrics available.
A brief exploration of the data:
As mentioned in the description, this is a data set of 569 samples with 30 numerical features. These features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and they describe characteristics of the cell nuclei present in the image.
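A minimal sketch of loading the data, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_breast_cancer

# Load the pre-processed dataset bundled with scikit-learn
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)  # (569, 30): 569 samples, 30 numeric features
print(data.feature_names[:3])
```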
Here is confirmation that we have a binary target (0, 1), where 0 corresponds to ‘malignant’ and 1 to ‘benign’.
That shows a slight skew towards ‘benign’ (1).
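The class balance can be checked directly from the target array:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# target_names maps 0 -> 'malignant', 1 -> 'benign'
counts = dict(zip(data.target_names, np.bincount(y)))
print(counts)    # {'malignant': 212, 'benign': 357}
print(y.mean())  # fraction of 'benign' instances, roughly 0.63
```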
Now let’s split the data, fit a Logistic Regression to it, and evaluate its appropriateness. The .score() method returns the mean accuracy on the given test data and labels.
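A sketch of the split-and-fit step. The random_state is an arbitrary choice for reproducibility, and max_iter is raised because the default solver can fail to converge on the unscaled features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(max_iter=10000)  # raised to ensure convergence
log_reg.fit(X_train, y_train)

# .score() returns mean accuracy on the given data and labels
print(log_reg.score(X_test, y_test))
```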
We also run a cross_val_score() on the training data with 3 folds. This splits the data into 3 folds and, for each fold, trains a mini-model on the remaining folds and evaluates its predictions on the held-out fold. As this is a binary classification problem (i.e. ‘benign’ or ‘malignant’), accuracy fits the bill, though in cases where you are trying to solve a multiclass classification problem, or where there is significant skew amongst classes, it may not be appropriate. In this case, there is a slight skew, with ‘benign’ instances making up 63% of the data. Imbalanced classification is fairly common in disease detection, where typically most of the data points are negative. In such cases, accuracy is not a good measure of model performance.
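The cross-validation step can be sketched as follows (again with an arbitrary random_state for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(max_iter=10000)

# Three accuracy scores, one per held-out fold
scores = cross_val_score(log_reg, X_train, y_train, cv=3)
print(scores, scores.mean())
```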
The cross_val_predict() function does the same as cross_val_score(), but returns the predictions made on each fold rather than the evaluation scores.
This is necessary if we want to evaluate the model using a confusion matrix. The rows of the confusion matrix represent the actual classes, while the columns represent the predictions. Correct predictions run along the main diagonal, from top left to bottom right. In this case, the top row is the negative class (i.e. 0, or ‘malignant’). So, the value on the top left is the count of true negatives, or correct predictions of the negative class. The top right holds the false positives: ‘malignant’ instances incorrectly predicted as ‘benign’. On the bottom row, the false negatives, meaning ‘benign’ instances incorrectly predicted as ‘malignant’, are on the left, while the value on the bottom right represents the true positives, or correct ‘benign’ predictions.
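A sketch of computing the matrix from out-of-fold predictions (random_state is an arbitrary reproducibility choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(max_iter=10000)

# An out-of-fold prediction for every training instance
y_train_pred = cross_val_predict(log_reg, X_train, y_train, cv=3)

# Rows are actual classes, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_train, y_train_pred)
print(cm)
```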
These counts can be combined into ratios in the form of Precision and Recall. Both of these metrics can be derived from the confusion matrix but emphasise different factors.
Recall is all True Positives divided by all True Positives plus all False Negatives. In this case, it is the number of True Positives (correctly predicted instances of ‘benign’) divided by True Positives plus False Negatives (‘benign’ instances incorrectly predicted as ‘malignant’).
Precision, meanwhile, is True Positives (correctly predicted instances of ‘benign’) divided by True Positives plus False Positives (‘malignant’ instances incorrectly predicted as ‘benign’).
As can be intuited, there is generally a trade-off between Precision and Recall. If we labelled all results as ‘benign’, we would have perfect Recall, but a Precision no better than the share of ‘benign’ instances in the data. Recall can be thought of as a model’s ability to find all relevant instances in a dataset, whereas Precision can be thought of as the proportion of data points the model flags as relevant that actually are.
In this case, the difference is marginal and the scores are quite high, suggesting the model is appropriate. The two can be combined into the F1 score, their harmonic mean.
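All three metrics are available in sklearn.metrics; a sketch using the same out-of-fold predictions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(max_iter=10000)
y_train_pred = cross_val_predict(log_reg, X_train, y_train, cv=3)

precision = precision_score(y_train, y_train_pred)  # TP / (TP + FP)
recall = recall_score(y_train, y_train_pred)        # TP / (TP + FN)
f1 = f1_score(y_train, y_train_pred)                # harmonic mean of the two
print(precision, recall, f1)
```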
Another way of evaluating the model is to plot the True Positive Rate (recall) vs. the False Positive Rate (the ratio of negative instances incorrectly classified as positive) at every classification threshold. The ROC curve plots this function. A perfect classifier would reach a TPR of 1 at an FPR of 0, giving an area under the curve (AUC) of 1, while a purely random model would produce an AUC of 0.5.
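Sweeping the threshold requires decision scores rather than class labels, which cross_val_predict() can return; a sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(max_iter=10000)

# Confidence scores for the positive class, one per training instance
y_scores = cross_val_predict(log_reg, X_train, y_train, cv=3,
                             method="decision_function")

fpr, tpr, thresholds = roc_curve(y_train, y_scores)
auc = roc_auc_score(y_train, y_scores)  # 1.0 is perfect, 0.5 is random
print(auc)
```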
Next, we do the same for a Random Forest Classifier. This is a more flexible model, so it is perhaps understandable that it achieves a higher accuracy on the training data.
The danger here, of course, is that the model overfits to the training data, learning patterns unique to the training set that fail to generalise to the test set. This is where cross_val_score() becomes more important, as it allows us to train and evaluate a number of models on a limited training set. With a total data set of only 569 instances, overfitting with a model of this type is, in reality, very probable. Unsurprisingly, this model produces a better score on the training data.
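The Random Forest step can be sketched the same way (random_state again an arbitrary choice for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)

# Training accuracy is typically 1.0 here -- a warning sign for overfitting
print(forest.score(X_train, y_train))
print(forest.score(X_test, y_test))
print(cross_val_score(forest, X_train, y_train, cv=3).mean())
```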
We can see little meaningful difference in the evaluation metrics between the models.
With such a small data set, any optimisation would appear meaningless, as there is little defence against overfitting. Indeed, the Random Forest Classifier achieved 100% accuracy on the training data.
It is conceivable that in a medical setting, it would be in the interest of doctors to identify the features that had the highest predictive power. If a test could be conducted at a lower price with only a slightly worse accuracy, it may be preferable for a hospital to use. Of course, in the real world there would be a whole host of ethical questions to consider, but the principle could be applied to any number of problems. One would hope that more data was available as a precaution against overfitting on the training data. For our purposes, though, it is interesting enough to play with some of the tools.
Helpfully, the SKLearn Random Forest Classifier class exposes a feature_importances_ attribute.
After which it is fairly simple to look at.
And to graph.
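A sketch of inspecting and plotting the importances; matplotlib is assumed available, and the Agg backend is selected here only so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(random_state=42)
forest.fit(data.data, data.target)

importances = forest.feature_importances_  # one weight per feature, summing to 1
order = np.argsort(importances)[::-1]      # most important first

for name, weight in zip(data.feature_names[order][:5], importances[order][:5]):
    print(f"{name}: {weight:.3f}")

# Horizontal bar chart of the ten most important features
plt.barh(data.feature_names[order][:10][::-1], importances[order][:10][::-1])
plt.xlabel("feature importance")
plt.tight_layout()
plt.savefig("feature_importances.png")
```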
If we then pass the five most important features through box plots split by class, they show a clear skew towards ‘malignant’.
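The box-plot step can be sketched with pandas (assumed available), again using the Agg backend so no display is needed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(random_state=42)
forest.fit(data.data, data.target)

# Five most important features by the forest's ranking
top5 = data.feature_names[np.argsort(forest.feature_importances_)[::-1][:5]]

df = pd.DataFrame(data.data, columns=data.feature_names)
df["class"] = np.where(data.target == 0, "malignant", "benign")

# One box plot per feature, grouped by class
fig, axes = plt.subplots(1, 5, figsize=(16, 3))
for ax, feat in zip(axes, top5):
    df.boxplot(column=feat, by="class", ax=ax)
fig.suptitle("")
plt.tight_layout()
plt.savefig("top_features_boxplots.png")
```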