Confusion Matrix - Is "accuracy(metric)" flawed? (post #2)

Hello friends, this is my second post related to the confusion matrix measurement metric. 

In my previous post, I explained how the accuracy metric can be misleading when judging the performance of a classification model from its confusion matrix. We also saw how the Recall metric (true positives over actual positives) in Example2 was able to catch the poor quality of the positive class prediction.

We left the earlier post at this thought:

    Does this mean we don't need precision? 

The answer is No. Let's see another example for the same use case.

Example3: Consider the same class of 500 school students. You are trying to predict how many are affected by covid so that you can recommend them for isolation/quarantine.

This time you wanted to make sure that no student is declared negative if he/she "may" be positive, so you added some additional checks during testing. The following was the result:


 

 

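|                    | Actual Positive | Actual Negative | Total |
|--------------------|-----------------|-----------------|-------|
| Predicted Positive | 10              | 10              | 20    |
| Predicted Negative | 0               | 480             | 480   |
| Total              | 10              | 490             | 500   |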

 

Total sample size: 500

  • Ground Truth:
    • Actual positive: 10, Actual negative: 490
  • Prediction:
    • Predicted positive: 20, Predicted negative: 480
Accuracy: (True +ve + True -ve) / Total samples = (10 + 480)/500 = 490/500 = 98%
Recall: True +ve / Total +ve (actual) = 10/10 = 100%
Precision: True +ve / Total +ve (predicted) = 10/(10 + 10) = 50%
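
As a quick check of the arithmetic, here is a minimal Python sketch that recomputes the three metrics from the four cells of the Example3 confusion matrix. The variable names (tp, fp, fn, tn) are my own shorthand for the cells and are not part of the original example.

```python
# Confusion-matrix cells for Example3 (taken from the table in this post).
tp = 10   # actual positive, predicted positive
fp = 10   # actual negative, predicted positive
fn = 0    # actual positive, predicted negative
tn = 480  # actual negative, predicted negative

total = tp + fp + fn + tn           # 500 students

accuracy = (tp + tn) / total        # correct predictions over all samples
recall = tp / (tp + fn)             # true positives over actual positives
precision = tp / (tp + fp)          # true positives over predicted positives

print(f"Accuracy:  {accuracy:.0%}")   # 98%
print(f"Recall:    {recall:.0%}")     # 100%
print(f"Precision: {precision:.0%}")  # 50%
```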

Notice that Recall was not able to catch the 10 additional samples that were negative but classified as positive. This is where Precision helps.

Let's summarize the results for all three examples (the first two examples are in the previous post):


 

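|          | True +ve | Total +ve (actual) | Total +ve (predicted) | Accuracy | Precision = True +ve / Total +ve (predicted) | Recall = True +ve / Total +ve (actual) |
|----------|----------|--------------------|-----------------------|----------|----------------------------------------------|----------------------------------------|
| Example1 | 250      | 250                |                       | 90%      |                                              | 100%                                   |
| Example2 | 5        | 10                 |                       | 99%      |                                              | 50%                                    |
| Example3 | 10       | 10                 | 20                    | 98%      | 50%                                          | 100%                                   |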


Final summary for all the 3 examples

Based on all the 3 examples we can infer the following: 

1) Accuracy may not be the right metric, since real-world data is often skewed toward one class.

2) Recall is able to catch cases where the model predicts fewer positives than there actually are.
For non-mission-critical use cases, like how many students brushed their teeth in the morning, a lower Recall may be OK, but for use cases like covid infection we would like Recall to be close to 100%. A Recall of 100% means that every actual +ve sample was identified as +ve (with some tradeoff for the negative class).

3) Precision: A Precision value lower than 100% indicates that some -ve class samples were incorrectly identified as +ve. This may be OK for mission-critical use cases (like covid prediction or cancer detection), since we can advise a retest. However, if a positive case is misclassified as negative, it may prove to be fatal.
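
To make points 1 and 2 concrete, here is a small Python sketch. It compares Example3 with a hypothetical model that simply labels all 500 students as negative. That "always negative" model is my own illustrative assumption and is not one of the examples from these posts, and the helper names metrics and show are made up for this sketch.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) from confusion-matrix counts.

    A ratio whose denominator is zero is returned as None (undefined).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    return accuracy, recall, precision


def show(name, scores):
    # Format each score as a percentage, or "n/a" when it is undefined.
    acc, rec, prec = (f"{s:.0%}" if s is not None else "n/a" for s in scores)
    print(f"{name}: accuracy={acc}, recall={rec}, precision={prec}")


# Hypothetical baseline (an assumption, not from the post): predicts
# "negative" for all 500 students, 10 of whom are actually positive.
always_negative = metrics(tp=0, fp=0, fn=10, tn=490)

# Example3 from this post.
example3 = metrics(tp=10, fp=10, fn=0, tn=480)

show("Always negative", always_negative)
show("Example3", example3)
```

Both models reach 98% accuracy on this skewed data, yet the hypothetical baseline has a Recall of 0% because it misses every positive student. Accuracy alone cannot tell the two apart.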

Now, this leads to another question: what if different models produce different Recall and Precision scores for the same data set?
If the results are aligned (one model is better on both Recall and Precision), we are good. However, if the models' Precision and Recall are not aligned (as in the example table below), it will not be an easy decision.

 

|                       | Precision | Recall |
|-----------------------|-----------|--------|
| Classification Model1 | Good      | OK     |
| Classification Model2 | OK        | Good   |
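
The "aligned vs. not aligned" idea can be sketched in Python as follows. The numeric scores are hypothetical stand-ins for "Good" and "OK" (the table gives no numbers), and the names Scores and clear_winner are invented for this illustration.

```python
from typing import NamedTuple, Optional


class Scores(NamedTuple):
    precision: float
    recall: float


def clear_winner(first: Scores, second: Scores) -> Optional[str]:
    """Return which model is at least as good on BOTH metrics.

    Returns "tie" if the scores are identical, and None when the two
    metrics point in different directions (no clear winner).
    """
    if first == second:
        return "tie"
    if first.precision >= second.precision and first.recall >= second.recall:
        return "first"
    if second.precision >= first.precision and second.recall >= first.recall:
        return "second"
    return None


# Hypothetical numbers standing in for "Good" (0.90) and "OK" (0.70).
model1 = Scores(precision=0.90, recall=0.70)
model2 = Scores(precision=0.70, recall=0.90)

print(clear_winner(model1, model2))  # None: Precision and Recall disagree
```

When the function returns None, comparing the two metrics side by side is not enough to pick a model, which is exactly the situation in the table.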



Don't worry, there is another metric that can help us in this situation. Let's see it in the next post 😉
