R2 or Adjusted R2 (Where is the adjustment?)

Hello friends, this post is about "adjusted R2", and it starts with a small, embarrassing story.

My professor explained R2 followed by its limitations and introduced us to "adjusted r2". I understood R2 very well but adjusted R2 was hazy and I could not gather the courage to raise my hand 🙋‍♂️ and ask additional questions to clarify my understanding. 

Consequence 🤦‍♂️..... I had to spend extra time going through "innumerable" online posts and videos to understand the concept. Finally, I got some insights and am happy to share them 🙂

First things first:

What is R2? It is a measure of the proportion of the variability in the data that your model is able to explain. Let's take some examples:

1. Say you bought 10 notebooks, recorded the # of pages against the weight of each one, and fitted a linear model.

In my manufactured example data, you can observe how poor R2 is: only 7.2%. That is because the data is spread everywhere and I still chose to apply a linear regression model (how cruel 😁).



Let's see it one more time (with some sensible data):




In this case, the model is able to explain 98% of the variability in your data.
Check out this video for a deeper dive 

Now let's go to the main topic: what is the problem with R2?
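
To make the definition concrete, here is a minimal sketch that computes R2 by hand as 1 minus (unexplained variation / total variation). The page counts and weights are made-up numbers for illustration (not the data from my plots), and predicting weight from pages is an assumption about the direction of the model.

```python
def r2_score_simple(x, y):
    """R2 of a one-feature least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form slope and intercept for simple linear regression
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line vs. total variation around the mean
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical notebooks: page counts and weights in grams (made-up numbers)
pages = [50, 80, 100, 120, 150, 180, 200, 240, 280, 300]
sensible_weight = [105, 160, 205, 238, 300, 365, 398, 485, 560, 597]
scattered_weight = [310, 120, 450, 90, 380, 150, 500, 200, 130, 420]

r2_sensible = r2_score_simple(pages, sensible_weight)
r2_scattered = r2_score_simple(pages, scattered_weight)
print(f"sensible data:  R2 = {r2_sensible:.3f}")
print(f"scattered data: R2 = {r2_scattered:.3f}")
```

The sensible weights sit close to a straight line, so almost all of the variation is explained; the scattered weights do not, so R2 stays near zero.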


You can find the jupyter notebook here.

1. R2 score with what I will refer to as "Sensible data"


2. R2 score after I added another feature, avg rainfall (on the day you bought the book 😁), which I refer to as "Sensible_with_Invalid_feature"



No major change in the R2 score: 99.35388587596515 vs 99.35388587596515

3. R2 score after I added yet another feature, height of the buyer 😁 😁, which I refer to as "Sensible_with_two_Invalid_feature"


Did you see what happened now? The R2 score went up from ~99.35 to ~99.49.



This is the problem with R2: when you add a feature, the score on the data the model was fitted to will never go down, but it "may" increase if the model catches some "noisy" correlation in the features.
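
You can reproduce this behaviour with a small sketch, assuming NumPy is available. The data is randomly generated for illustration (it is not the data from my notebook), and the variable names `rainfall` and `buyer_height` are just stand-ins for the two meaningless features above.

```python
import numpy as np


def r2_of_least_squares_fit(X, y):
    """Fit ordinary least squares with an intercept and return the training R2."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 30
pages = rng.uniform(50, 300, size=n)
weight = 2.0 * pages + rng.normal(0, 15, size=n)  # weight really depends on pages
rainfall = rng.uniform(0, 50, size=n)             # unrelated to weight
buyer_height = rng.normal(170, 10, size=n)        # unrelated to weight

r2_one = r2_of_least_squares_fit(pages.reshape(-1, 1), weight)
r2_two = r2_of_least_squares_fit(np.column_stack([pages, rainfall]), weight)
r2_three = r2_of_least_squares_fit(
    np.column_stack([pages, rainfall, buyer_height]), weight
)

print(f"pages only:                 {r2_one:.6f}")
print(f"+ rainfall:                 {r2_two:.6f}")
print(f"+ rainfall + buyer height:  {r2_three:.6f}")
```

Each larger model can always fall back to a zero coefficient for the new feature, which is why the training R2 cannot get worse.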

Adjusted R2 helps to balance this by penalizing every added feature: a feature that does not add "enough" meaningful contribution to the prediction pulls the score down, while a meaningful one can still push it up.
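
For reference, the standard textbook formula is given below, where n is the number of samples and p is the number of features. Reading the labels used in the rest of this post as "1" = (1 - R2), "2" = (n - p - 1) and "3" = the ratio (n - 1)/(n - p - 1) is an assumption, chosen because it fits the behaviour described next.

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```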



Let's see how it is a fight between "1" and "2" 

1) If p is high 🔼, then "2" is low 🔽, "3" is high 🔼, and Adjusted R2 is low.

Would this mean Adjusted R2 will always be lower? No. If "1" can offset the penalty from "2", the overall Adjusted R2 score will improve.
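
Here is a minimal sketch of that tug of war, using the standard adjusted R2 formula. The R2 values in the first comparison are rounded from the scores above; n = 10 is an assumption for illustration (the real sample size is in the notebook), and the second comparison uses invented numbers.

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted R2 for n samples and p features (r2 as a fraction)."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 10  # assumed sample size, for illustration only

# Case 1: two junk features nudge R2 up a little (0.9935 -> 0.9949),
# but the penalty for going from p=1 to p=3 is bigger than the gain.
before_junk = adjusted_r2(0.9935, n, p=1)
after_junk = adjusted_r2(0.9949, n, p=3)
print(f"junk features:   {before_junk:.5f} -> {after_junk:.5f}")

# Case 2 (invented numbers): one useful feature lifts R2 a lot (0.90 -> 0.95),
# enough to outweigh the penalty for going from p=1 to p=2.
before_useful = adjusted_r2(0.90, n, p=1)
after_useful = adjusted_r2(0.95, n, p=2)
print(f"useful feature:  {before_useful:.5f} -> {after_useful:.5f}")
```

In the first case the adjusted score drops even though plain R2 went up; in the second, the gain in R2 wins and the adjusted score rises.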

The post is long already, so I will create a new post with adjusted R2 metrics against the same data.

Trying this out helped me understand the concept better. You can use the Jupyter notebook and improve on it further, and if you discover something, do share...

Have a good day!

References:

Jupyter Notebook

Other ML related blog post

