The phrase “All models are wrong, some are useful” is used quite loosely. Some take it literally to imply that “Modeling is a futile exercise”.
This is a terrible misunderstanding and we shall see why shortly.
“All models are wrong, some are useful” is an aphorism (meaning it is a concise expression of a general truth). But in this case the aphorism invites misinterpretation.
Firstly, it is important to understand what modeling is.
The purpose of modeling is to provide an abstraction of a real process: basically, a good approximation of reality.
Anybody who mistakes the abstraction…
We often come across YouTube videos, posts, blogs and private courses that say “We accept the Null Hypothesis” instead of “We fail to reject the Null Hypothesis”.
If you correct them, they would say, what’s the big difference? “The opposite of ‘Rejecting the Null’ is ‘Accepting’, isn’t it?”
Well, it is not as simple as it is made out to be. We need to rise above antonyms and understand one crucial concept: ‘Popperian falsification’.
This concept, or philosophy, also holds the key to why we use the language “Fail to reject the Null”.
Abstraction — some succinct definitions.
“Abstraction is the technique of hiding implementation by providing a layer over the functionality.”
“Abstraction, as a process, denotes the extracting of the essential details about an item, or a group of items, while ignoring the inessential details”
“Abstraction — Its main goal is to handle complexity by hiding unnecessary details from the user”
Abstraction, as a concept and as an implementation in software engineering, is good. But when it is extended to Data Science and overdone, it becomes dangerous.
Recently, the issue of sklearn’s default L2 penalty in its logistic regression algorithm came up again.
This issue was first…
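To see what a default L2 penalty does in practice, here is a minimal NumPy sketch (not sklearn's implementation) that fits the same logistic regression with and without the penalty. The `fit_logistic` helper, the simulated data, and the gradient-descent settings are all assumptions made for illustration. The `l2` argument plays the role of 1/C in sklearn's parametrization, so `l2=1.0` mirrors the default `C=1.0`.

```python
import numpy as np

def fit_logistic(X, y, l2=0.0, lr=0.1, n_iter=5000):
    """Gradient descent on mean log-loss + (l2 / 2n) * ||w||^2.

    The intercept b is left unpenalized. Hypothetical helper for illustration.
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (prob - y) / n + l2 * w / n
        grad_b = np.mean(prob - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Simulated data (an assumption): two features with known true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

w_plain, _ = fit_logistic(X, y, l2=0.0)  # plain maximum likelihood
w_l2, _ = fit_logistic(X, y, l2=1.0)     # with an L2 penalty

print("unpenalized:", w_plain, "norm:", np.linalg.norm(w_plain))
print("L2-penalized:", w_l2, "norm:", np.linalg.norm(w_l2))
```

The penalized coefficients come out smaller in norm than the unpenalized ones, which matters to anyone who reads the coefficients as estimates rather than only as inputs to a prediction.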
Every now and then, one stumbles upon the statement “Logistic Regression is not regression, it is a classification algorithm”.
On hearing this, many statisticians cringe (and rightfully so).
Is Logistic Regression really a classification algorithm and not regression?
Machine learning has usurped and renamed many statistical techniques, often to the extent that its practitioners now disbelieve and reject their statistical origins.
So how do we drive home the point that Logistic Regression is indeed Regression?
Through a meme of course.
We live in a meme culture. If a meme can become a cryptocurrency, why not use it to…
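The same point can be made in a few lines of code. The sketch below, using made-up coefficients, shows that the model itself returns a continuous estimate of P(y = 1 | x) by modeling the log-odds as a linear function of x. A class label only appears once a separate threshold is applied on top. The function names and the numbers are assumptions for illustration.

```python
import math

def predict_proba(x, intercept, slope):
    """Model the log-odds as linear in x and return P(y = 1 | x)."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(x, intercept, slope, threshold=0.5):
    """A separate decision rule layered on top of the regression estimate."""
    return int(predict_proba(x, intercept, slope) >= threshold)

# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -1.0, 0.5
for x in (0.0, 2.0, 6.0):
    print(x, round(predict_proba(x, intercept, slope), 3), classify(x, intercept, slope))
```

Changing `threshold` changes the labels but not the fitted probabilities, which is the sense in which the regression and the classification are two different steps.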
I was recently invited to judge a Data Science competition. The students were given the ‘heart disease prediction’ dataset, perhaps an improvised version of the one available on Kaggle. I had seen this dataset before and have often come across various self-proclaimed data science gurus teaching naïve people how to predict heart disease through machine learning.
I believe “Predicting Heart Disease using Machine Learning” is a classic example of how not to apply machine learning to a problem, especially one where a lot of domain experience is required.
Let me unpack the various problems in applying machine learning to this data…
Dear Aspiring Data Scientist,
Before you start using ‘low code’ or ‘drag & drop’ data science tools, please learn the fundamentals.
Why aspire to be a ‘Citizen Data Scientist’ when you can truly become a ‘Data Scientist’?
Don’t get swayed by fancy titles like ‘Citizen Data Scientist.’ It is funny that so much hard selling is happening in data science.
I mean, just because we know how to use a thermometer or operate a BP machine, should we start calling ourselves ‘Citizen Doctors’?
Wikipedia defines a cheat sheet as a concise set of notes used for quick reference. Now the phrase that needs to be emphasized here is ‘quick reference’.
In programming, cheat sheets are OK because no one can remember all the syntax of a programming language. Especially if the programming language constantly evolves (like Python) or if the programmer finds himself/herself transitioning in and out of different programming languages.
A quick reference like a cheat sheet helps the programmer save time and focus on the larger problem.
A Full-stack Data Scientist is not a 10x Data Scientist but a 1/10th Data Scientist.
Data science does involve coding but it shouldn’t be viewed through the prism of software programming. Words such as ‘Full Stack’ loaned from the software world should not be applied to Data Science.
Why Full-stack does not apply to Data Science
‘Full Stack’ in the software world means a person can do both front end and backend development. Often in both cases, the objective, design, and functionality are known. …
I have been a regular user of spaCy and I have used it to solve interesting problems in the past. One of the basic problems in NLP is finding similarities between words or phrases. Many NLP libraries provide a feature to check how similar two words or phrases are through a cosine similarity score.
Finding the similarity between 2 words is easy. What if you had, say, 200k words and wanted to check the similarity of each word against, say, a table containing 10k words? …
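One common way to handle a many-against-many comparison like this is to normalize every vector once and compute all cosine similarities in a single matrix multiplication, instead of looping over word pairs. The sketch below is an illustration of that idea, not necessarily the approach the article goes on to describe. The random arrays are stand-ins for real word vectors, the sizes are scaled down, and `cosine_similarity_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Return a (len(A), len(B)) matrix of cosine similarities between rows.

    Assumes no row is an all-zero vector (that would divide by zero).
    """
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Random stand-ins for word vectors (an assumption), scaled down in size.
rng = np.random.default_rng(42)
queries = rng.normal(size=(2000, 50))  # stand-in for the 200k words
table = rng.normal(size=(100, 50))     # stand-in for the 10k-word table

sims = cosine_similarity_matrix(queries, table)
best_match = sims.argmax(axis=1)  # index of the most similar table row per query
print(sims.shape, best_match[:5])
```

At the full 200k by 10k scale the result has 2 billion entries (about 16 GB in float64), so in practice the queries would need to be processed in chunks, keeping only the best matches from each chunk.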
Many aspiring Data Scientists lament that despite doing many massive open online courses (MOOCs) they are not getting a break.
Why is that?
Because only contextual learning sticks.
There are so many interesting MOOCs out there that an aspiring data scientist feels like doing them all. Sometimes this never-ending list of courses gives an aspiring data scientist the feeling of being a hamster on a wheel.
Data scientist/Statistician with business acumen. Hoping to amass knowledge and share it throughout my life. Rafa Nadal fan, love to read non-fiction books.