Logistic Regression: Classification if you use Python, Regression if you use R!
Every now and then, one stumbles upon the statement “Logistic Regression is not regression, it is a classification algorithm”.
On hearing this, many statisticians cringe (and rightfully so).
Is Logistic Regression really a classification algorithm and not regression?
Absolutely not.
Machine learning has usurped and renamed many statistical techniques, often to the extent that its practitioners now disbelieve and reject the statistical origins of those techniques.
So how do we drive home the point that Logistic Regression is indeed Regression?
Through a meme of course.
We live in a meme culture. If a meme can become a cryptocurrency, why not use one to drive home the point: Logistic Regression is indeed Regression.
Logistic Regression is indeed Regression
Now this meme should have really cleared things up.
The next obvious question is: why do people develop the misconception that “Logistic Regression is not Regression”?
The usual suspects are modern ML books (often 100 pages or fewer), blogs, private courses, pop data scientists on YouTube, etc.
But the next question is: when people implement logistic regression, why are they not aware of the various types of link functions, GLM families, etc.? Why doesn’t the intuition of Logistic Regression being Regression set in?
Anecdotally, I have noticed two things.
1. People who transition to Data Science from a statistics or related background rarely make this mistake.
2. People who learn R first and then move to Python do not make this mistake either.
With regard to point 1, their statistical training teaches them Logistic Regression the right way. This brings us to point 2.
Around 4–5 years ago, I predominantly used R. However, in the last few years I moved to Python as part of my NLP and ML work.
But during my time using R, I vividly remember the syntax for implementing Logistic Regression being intuitive. It kind of gave a peek into what the algorithm is all about.
If you notice in the above image, the R syntax clearly tells the user that the model is a Generalized Linear Model. Additionally, it tells you that we are using Binomial regression with a logit link. People who use this syntax will never say “Logistic regression is not regression”.
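For readers who have not seen it, the R syntax in question is a call of the form glm(..., family = binomial(link = "logit")). To show what those pieces mean, here is a minimal sketch in plain Python that spells them out by hand: a linear predictor, a logit link that maps it to a mean, and the binomial variance that drives the fit. The toy data and the function names (inv_logit, fit_binomial_logit) are made up for illustration, and the loop is a bare-bones Newton (IRLS-style) iteration for one predictor, not the code that R or any library actually runs.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_binomial_logit(x, y, n_iter=25):
    """Fit y ~ b0 + b1 * x with a binomial family and logit link (Newton steps)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Mean of the response under the current coefficients.
        mu = [inv_logit(b0 + b1 * xi) for xi in x]
        # Binomial variance function mu * (1 - mu) supplies the weights.
        w = [m * (1.0 - m) for m in mu]
        # Score: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        # 2x2 Fisher information matrix.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton update: coefficients += inverse(information) @ score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Made-up, non-separable toy data: one predictor, binary response.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_binomial_logit(x, y)
# Fitted values are estimated means (probabilities), not class labels.
fitted = [inv_logit(b0 + b1 * xi) for xi in x]
print(b0, b1)
print(fitted)
```

Nothing in this sketch produces a class label. What gets estimated is a pair of regression coefficients and a fitted mean for each observation.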
On the other hand, let us take the Python syntax (specifically sklearn, since many data scientists use it).
As you can see, the syntax does not tell you much. It is just the instantiation of a class. Of course, one needs to take a peek inside this class. But many don’t, because they have been sold on ‘Implement machine learning in 2 lines of code’.
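As a rough illustration of what taking a peek inside looks like, the sketch below fits sklearn's LogisticRegression on a tiny made-up dataset and then inspects the fitted object. The data are invented for this example. Note that sklearn applies L2 regularization by default, so the coefficients will generally differ from those of an unpenalized GLM fit.

```python
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, binary outcome (illustrative only).
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

# The "two lines": instantiate the class, then fit it.
model = LogisticRegression()
model.fit(X, y)

# Peeking inside: an intercept and a slope for a linear predictor,
# which is exactly what a regression model estimates.
print(model.intercept_, model.coef_)

# Continuous output: the estimated P(y = 1 | x) for each row.
probs = model.predict_proba(X)[:, 1]
print(probs)

# Hard class labels, which is the only output many users ever look at.
labels = model.predict(X)
print(labels)
```

The instantiation itself says nothing about families or links, but the fitted attributes and predict_proba show the regression underneath.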
Anyone without statistical training who implements logistic regression in Python (sklearn) will not even get a hint that it is indeed regression! Again, they have been educated on ‘Supervised’ and ‘Unsupervised’ algorithms. To them, logistic regression is a ‘supervised classification algorithm’ because they have only ever heard of and seen it being used for *classification* purposes.
But calling logistic regression a classification algorithm does not make it one. The model itself estimates a probability; a class label only appears once you apply a cut-off to that estimate. At best, using logistic regression for ‘classification’ is a clever hack.
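To make the hack explicit, here is a minimal sketch in plain Python that keeps the two steps apart: the regression output (a probability) and the decision rule bolted on afterwards. The coefficients b0 and b1 are hypothetical stand-ins for any fitted logistic regression, and the function names and the 0.8 threshold are assumptions chosen only for illustration.

```python
import math

def predict_probability(x, b0, b1):
    """Regression output: estimated P(y = 1 | x) from a logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def classify(prob, threshold=0.5):
    """Decision rule applied after the regression; not part of the model."""
    return 1 if prob >= threshold else 0

# Hypothetical coefficients standing in for a fitted model.
b0, b1 = -2.0, 1.0

xs = [0.5, 1.5, 3.0, 4.5]
probs = [predict_probability(x, b0, b1) for x in xs]

# Same model, same probabilities, two different cut-offs.
labels_default = [classify(p) for p in probs]       # threshold 0.5
labels_strict = [classify(p, 0.8) for p in probs]   # threshold 0.8

print(labels_default)  # [0, 0, 1, 1]
print(labels_strict)   # [0, 0, 0, 1]
```

The labels change with the threshold while the model does not, which is why the classification step is better seen as a choice made on top of the regression than as the algorithm itself.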
A hack should not be mistaken for the original. This is not about semantics but about learning algorithms for what they are.
P.S. The goal here is not to trigger an R vs Python debate. The merits of each far outweigh their minuscule demerits.