The Big XY Problem in Data Science

“Do you think the model accuracy metrics are good enough? What should we do to improve them?”
This question is very typical. You have probably been asked it at work, in Kaggle competitions, or at hackathons.
It is also a typical XY problem. So let’s first unpack what an XY problem is.
An XY problem is asking about your attempted solution rather than about your actual problem.
You are trying to solve problem X and you think solution Y will work, but when you run into trouble you ask about Y instead of asking about X.
In the case above, “Y” is the question about accuracy metrics and how to improve them, while the real problem “X” is whether the chosen ML model is the right one in the first place.
Accuracy metrics only tell you how well your selected model is performing. They do not tell you whether that model was the right choice.
This is why a fixation on accuracy metrics can often be misleading.
Likewise, when you choose an ML model from a cohort of candidates based on accuracy metrics alone (as is common with low-code ML libraries), you are falling into a typical XY problem.
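To make this concrete, here is a minimal sketch in plain Python. The data and both "models" are invented for illustration (a 95/5 class split, a majority-class predictor called Model A, and a noisier Model B); none of it comes from a real project. It shows how ranking candidates by accuracy alone can prefer a model that never detects the class you care about.

```python
# Hypothetical toy labels: 95 negatives (0) followed by 5 positives (1).
y_true = [0] * 95 + [1] * 5

# "Model A" (hypothetical): always predicts the majority class.
pred_a = [0] * 100

# "Model B" (hypothetical): raises 8 false alarms on the negatives,
# but catches 4 of the 5 positives.
pred_b = [1] * 8 + [0] * 87 + [1] * 4 + [0] * 1


def accuracy(y, p):
    """Share of predictions that match the true label."""
    return sum(t == q for t, q in zip(y, p)) / len(y)


def recall(y, p):
    """Share of true positives that the model actually finds."""
    true_pos = sum(1 for t, q in zip(y, p) if t == 1 and q == 1)
    return true_pos / sum(y)


acc_a, rec_a = accuracy(y_true, pred_a), recall(y_true, pred_a)
acc_b, rec_b = accuracy(y_true, pred_b), recall(y_true, pred_b)

print(f"Model A: accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"Model B: accuracy={acc_b:.2f}, recall={rec_b:.2f}")
# Model A: accuracy=0.95, recall=0.00
# Model B: accuracy=0.91, recall=0.80
```

Picking the "winner" by accuracy selects Model A, which is useless if the goal is to find the positives. The question to ask first is what the model is for, not which score is highest.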
The XY problem shows up in many places in statistics and machine learning.
Illustration 1:
“I applied a Box-Cox transformation to my dependent variable because the dependent variable was not normal.”
Solution Y: Transform the dependent variable with a Box-Cox transformation.
Problem X: Actually none. In linear regression, only the errors (checked in practice through the residuals) are assumed to follow a normal distribution, not the raw data itself.
Please refer to the References section for elaboration.
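The distinction between the raw dependent variable and the residuals can be checked with a small simulation. This is only a sketch under stated assumptions: the data are synthetic (an exponentially distributed predictor, normal errors, and made-up coefficients 2.0 and 3.0), the model is a simple one-predictor least-squares fit, and skewness is used as a rough stand-in for "not normal".

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 5000
# Assumed setup: a skewed predictor and normally distributed errors.
x = [random.expovariate(1.0) for _ in range(n)]
errors = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, errors)]


def skewness(values):
    """Sample skewness: roughly 0 for symmetric data such as a normal sample."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Ordinary least squares for one predictor, using the closed-form solution.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)
skew_res = skewness(residuals)
print(f"skewness of raw y:     {skew_y:.2f}")
print(f"skewness of residuals: {skew_res:.2f}")
```

In this setup the raw dependent variable is clearly right-skewed, yet the residuals are close to symmetric, so the normality assumption is satisfied without any transformation. Transforming y here would address a problem that does not exist.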
Illustration 2:
Applying K-means
Solution Y: “I ran K-means and my clusters are not distinct enough” (a poor silhouette score or no discernible elbow in the elbow plot).
Problem X: K-means will cluster almost anything. Are we sure K-means was the right choice?
Please refer to the References section for elaboration.
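The claim that K-means will cluster almost anything is easy to check. The sketch below is a bare-bones Lloyd-style K-means written in plain Python for illustration, not a library implementation. The data are an assumption: 300 points drawn uniformly from the unit square, which by construction contain no cluster structure at all.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Assumed data: uniform noise in the unit square, i.e. no real clusters.
points = [(random.random(), random.random()) for _ in range(300)]


def assign(points, centers):
    """Put each point into the group of its nearest center."""
    groups = [[] for _ in centers]
    for px, py in points:
        nearest = min(
            range(len(centers)),
            key=lambda c: (px - centers[c][0]) ** 2 + (py - centers[c][1]) ** 2,
        )
        groups[nearest].append((px, py))
    return groups


def kmeans(points, k, iters=50):
    """Minimal Lloyd-style K-means: alternate assignment and centroid update."""
    centers = random.sample(points, k)  # start from k random data points
    for _ in range(iters):
        groups = assign(points, centers)
        new_centers = []
        for old, g in zip(centers, groups):
            if g:
                new_centers.append(
                    (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                )
            else:
                new_centers.append(old)  # keep the old center if a group is empty
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assign(points, centers)


centers, groups = kmeans(points, k=3)
sizes = [len(g) for g in groups]
print("cluster sizes:", sizes)
print("centers:", [(round(cx, 2), round(cy, 2)) for cx, cy in centers])
```

The algorithm hands back three centers and a full partition of the points even though the data are pure noise. A poor silhouette score in such a case does not mean the clustering needs tuning; it may mean there was nothing to cluster, or that K-means was the wrong tool for the question.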
In Data Science, it is important to ask the right question first, before providing the ‘right’ solution.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
References
- Box-Cox transformation: https://qr.ae/pGPhu5
- K-means: https://lnkd.in/g_KHtYN