
Basic Concepts needed for Data Science work — Part 1

Whether you are already working as a Data Scientist or aspiring to become one, there is a set of fundamental concepts you need to be clear about. In this article I will explain five concepts that are important to understand conceptually, so that you can develop a clear thought process while approaching any data science problem. The objective of this article is not to go into the mathematical details of these topics but to focus on the conceptual understanding. The following topics will be covered (you can also watch the YouTube video on the same topics via the link provided at the end of this article):

  • Probability vs Likelihood
  • Hypothesis Testing
  • p-value
  • Model Overfit vs Underfit
  • Regularization

Probability vs Likelihood:

Probability corresponds to the chance of occurrence of some event given a distribution of data. For example, consider a distribution of the ages of a group of students: the data will have characteristics such as a mean and a standard deviation. If I want to understand the probability of finding a student with age > 10 years, that probability is defined as the chance of finding a student older than 10 years given the mean and standard deviation, which are fixed for the data set and cannot change. So when calculating probability, the characteristics of the distribution cannot be changed.

Likelihood, on the other hand, is about finding the characteristics of the data distribution that maximize the chance of a certain observed outcome. For example, what is the likelihood of age > 10 years? In this case the characteristics of the distribution, i.e. the mean and standard deviation, are varied to identify the values that maximize the chance of observing age > 10.

(Image courtesy: StatQuest YouTube channel)
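To make the distinction concrete, here is a minimal Python sketch using scipy.stats and a normal distribution for student ages; the mean (9), standard deviation (1.5) and the sample ages are made-up numbers purely for illustration.

```python
# Probability (fixed parameters, varying event) vs likelihood (fixed data, varying parameters).
import numpy as np
from scipy import stats

# --- Probability: the distribution's characteristics are FIXED ---
mean, std = 9.0, 1.5                                # assumed, fixed characteristics
p_over_10 = 1 - stats.norm.cdf(10, loc=mean, scale=std)
print(f"P(age > 10 | mean=9, std=1.5) = {p_over_10:.3f}")

# --- Likelihood: the observed DATA are fixed, the parameters are varied ---
ages = np.array([8.5, 9.2, 10.1, 11.0, 9.8])        # observed ages (fixed)

def log_likelihood(mu, sigma, data):
    """Log-likelihood of the parameters (mu, sigma) given the observed data."""
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Evaluate the likelihood for two candidate parameter settings
print(log_likelihood(9.0, 1.5, ages))
print(log_likelihood(9.7, 0.9, ages))

# Maximum-likelihood estimates: the parameters that maximise the likelihood of the data
mu_hat, sigma_hat = stats.norm.fit(ages)
print(f"MLE: mean={mu_hat:.2f}, std={sigma_hat:.2f}")
```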

Hypothesis Testing:

To explain in layman's terms, Hypothesis Testing is a statistical technique to test whether the outcome of an experiment is due to mere chance/randomness, or due to something done as part of the experiment. To do that, a hypothesis is formed and then tested to decide whether it should be accepted or rejected. The following terminology needs to be understood before diving into hypothesis testing:

  1. Null hypothesis: This implies a state of no change. It assumes that the results of the experiment are purely due to chance or randomness.
  2. Alternate hypothesis: This implies that the result of the experiment is influenced by some non-random cause.
  3. Significance level: The error tolerance limit. But what is "error" in this case? It is the scenario where an outcome is considered to have happened due to an influencing factor in the experiment, whereas the outcome actually happened by mere chance and not due to any influencing factor.
  4. Type I Error: The error of rejecting the Null Hypothesis when it is actually true. The significance level is the probability of making this error.
  5. Type II Error: The error of failing to reject the Null Hypothesis when it is actually false.
  6. p-value: The chance or probability that the result is due to randomness (explained in detail below).
  7. Critical value: A point (for a left- or right-tailed test) or points (two-tailed test) on the test distribution which define the rejection region. If the test statistic is beyond the critical value(s), the Null Hypothesis is rejected.
  8. Right-tailed test: A test where the alternate hypothesis contains a "greater than" comparison.
  9. Left-tailed test: A test where the alternate hypothesis contains a "less than" comparison.
  10. Two-tailed test: A test where the alternate hypothesis contains a "not equal to" comparison.
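For concreteness, here is a quick sketch of the critical values for the three kinds of tests, assuming a z-test at the 5% significance level; it uses scipy's inverse CDF (ppf) to find the cut-off points.

```python
# Critical values for a z-test at alpha = 0.05 (illustrative sketch).
from scipy import stats

alpha = 0.05

right_tail_cv = stats.norm.ppf(1 - alpha)                    # reject H0 if z >  1.645
left_tail_cv = stats.norm.ppf(alpha)                         # reject H0 if z < -1.645
two_tail_cvs = stats.norm.ppf([alpha / 2, 1 - alpha / 2])    # reject H0 if |z| > 1.96

print(right_tail_cv, left_tail_cv, two_tail_cvs)
```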

There are different ways to decide whether to accept or reject the Null Hypothesis. Brief descriptions of these methods follow, with a small worked example after the list:

  • p-value method: If the p-value is less than the significance level, the Null Hypothesis is rejected (and the alternate hypothesis holds).
  • Critical value method: The test statistic is calculated. If the test statistic is beyond the critical value, the Null Hypothesis is rejected.
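Below is a small worked example of both decision routes on the same made-up data: a one-sample, two-tailed t-test of H0: population mean = 10, at a 5% significance level (the sample values are invented purely for illustration).

```python
# Both decision methods applied to the same one-sample, two-tailed t-test.
import numpy as np
from scipy import stats

data = np.array([10.8, 11.2, 9.9, 10.5, 11.0, 10.7, 11.4, 10.3])  # made-up sample
alpha = 0.05

# Test statistic and p-value (H0: population mean == 10)
t_stat, p_value = stats.ttest_1samp(data, popmean=10)

# p-value method: reject H0 if p-value < significance level
reject_by_pvalue = p_value < alpha

# Critical value method: reject H0 if the test statistic is beyond the critical value
critical_value = stats.t.ppf(1 - alpha / 2, df=len(data) - 1)
reject_by_critical_value = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, critical value = {critical_value:.2f}")
print(reject_by_pvalue, reject_by_critical_value)  # the two methods always agree
```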

Types of Hypothesis Testing: There are different types of hypothesis tests, which are performed based on the objective and the type of data. The table below defines the popular test methods and the scenarios in which each can be applied.

p-value:

For simple understanding, the p-value can be considered as the chance or probability that the result is due to randomness (more formally, it is the probability of observing a result at least as extreme as the one obtained, assuming the Null Hypothesis is true). For example, a p-value of 0.01 means there is a 1% chance that the outcome of the experiment is random (i.e. it happened by chance). If the p-value is high, say 0.7, it means there is a 70% probability that the result happened by randomness and not due to anything done as part of the experiment. Hence a lower p-value suggests that the result is statistically more significant.

To put it in more layman language, if alpha (the significance level) is 5% (0.05), it means that 5% error can be tolerated (if the error becomes more than 5%, the outcome is not statistically significant and it can be assumed that the result is based on random chance rather than the goodness of the experiment).

Hence, in a hypothesis test, if the p-value is less than alpha, the Null Hypothesis is rejected, meaning that the error is within the tolerance limit of 5%. This signifies that the result of the experiment is statistically significant and not due to mere chance. The Null Hypothesis is therefore rejected and the alternate hypothesis (which means that the result is statistically significant) is accepted.

On the other hand, if the p-value is more than 0.05, the error is beyond the acceptable tolerance limit. This signifies that it is highly likely that the result is just due to random chance and not due to the goodness of the experiment.

Model Overfit vs Underfit:

To understand Overfit vs Underfit, it is first important to understand the difference between Bias and Variance.

Bias: It is a measure of how much the model's predictions differ from the actual target values. A low bias means that the model is able to correctly capture the relationship between X and Y.

Variance: It is a measure of how much the model's predictions vary across different datasets. A high variance means that the output predicted by the model can be drastically different from dataset to dataset.

Overfit: This happens when a model has Low Bias & High Variance. The model has correctly captured the relationship between X and Y on the training data, but when it is applied to new data for prediction, its outputs vary drastically from dataset to dataset compared to how it performed on the training data.

This may happen if:

  • Training data is not clean and contains garbage/noise
  • Volume of training data is inadequate
  • Model is highly complex

Underfit: This happens when a model has High Bias & Low Variance. The model has not been able to correctly capture the relationship between X and Y even on the training data, so the goodness of fit of the model itself is low. Thus there is a large difference between the predicted and actual values.

This may happen if (see the sketch after this list, which contrasts underfitting with overfitting):

  • Training data is not clean and contains garbage/noise
  • Volume of training data is inadequate
  • Model is oversimplified
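As a rough illustration of both failure modes, the sketch below fits polynomials of increasing degree to noisy synthetic data and compares train vs test error; the degrees (1, 4 and 15) and the data-generating function are arbitrary choices made only to make underfitting and overfitting visible.

```python
# Underfit vs overfit: polynomial regression of varying degree on noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:>2}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")

# degree 1:  high train AND test error            -> high bias     (underfit)
# degree 15: very low train error, high test error -> high variance (overfit)
```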

Regularization:

Regularization is a technique used to prevent overfitting of models by calibrating linear regression models: a small amount of Bias is introduced, resulting in a significant reduction in Variance. There are primarily three different types of regularization techniques:

  • Lasso (L1) Regularization
  • Ridge (L2) Regularization
  • Elastic Net Regression

Ridge Regularization: While plain linear regression tries to minimize the sum of the squared residuals, Ridge Regularization tries to minimize (sum of squared residuals + (lambda * slope²)), or more generally (sum of squared residuals + (lambda * coefficient²)). Lambda controls the intensity of the penalty and can vary from 0 to positive infinity. The above is the cost function in Ridge Regression.

In general, when the slope is small, the predicted target value is less sensitive to the independent variable. In Ridge Regression, since the objective is to minimize the cost function, the higher the value of the hyperparameter lambda, the more the coefficients shrink to minimize the overall cost. The point to note here is that even though the penalty can shrink the parameters to near 0, it won't make them exactly 0; this means it will not eliminate any parameter. The shrunken coefficients do not impact the output as severely as before (previously, for a small change in X there used to be a significant change in Y due to the higher slope). With the Ridge penalty, the slope reduces, and hence for a small change in X, the change in Y is less significant.

The optimum value of lambda is generally determined using 10-fold cross validation, taking the value that results in the lowest variance.
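A minimal Ridge sketch with scikit-learn is shown below; note that scikit-learn calls the lambda penalty strength alpha, and RidgeCV picks it by cross-validation over a grid of candidate values (the synthetic data and the grid are arbitrary choices for illustration).

```python
# Ridge regression vs plain linear regression on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV

X, y = make_regression(n_samples=50, n_features=10, noise=15.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=10).fit(X, y)  # 10-fold CV over lambda

print("chosen lambda (alpha):", ridge.alpha_)
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk towards, but not exactly, zero
```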

NOTE: Ridge Regression can be used when the input variables are continuous as well as when they are discrete (with a continuous target). Ridge regularization can also be applied to logistic regression; in that case it tries to minimize (the sum of the likelihoods + lambda * slope²).

Lasso Regression: Lasso Regularization tries to minimize (sum of squared residuals + (lambda * |slope|)), i.e. the penalty term contains lambda multiplied by the absolute value of the slope.

In the case of Lasso, while it is shrinking the value of the coefficients, it can reduce them all the way to zero. This means it can eliminate useless variables from the model.

NOTE: Lasso Regression can be used when the input variables are continuous as well as when they are discrete (with a continuous target). Lasso regularization can also be applied to logistic regression.
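The sketch below illustrates this shrinkage-to-zero with scikit-learn's LassoCV on synthetic data where only a few of the features are actually informative; several coefficients come out exactly zero.

```python
# Lasso regression: uninformative features get coefficients of exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Only 3 of the 10 features actually influence y
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = LassoCV(cv=10, random_state=0).fit(X, y)   # lambda chosen by 10-fold CV

print("chosen lambda (alpha):", lasso.alpha_)
print("coefficients:", np.round(lasso.coef_, 2))   # several are exactly 0.0
```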

Elastic Net Regression: When a huge number of parameters is being used in a model and it is not possible to know which are useful and which are useless, it may be confusing whether to choose Ridge or Lasso Regression. In such a scenario we can use Elastic Net Regression, a kind of hybrid between Ridge and Lasso that tries to utilize the strengths of both. Elastic Net Regression tries to minimize:

sum of squared residuals + (lambda1 * slope²) + (lambda2 * |slope|)

Cross validation is used to choose the best values of lambda1 and lambda2.
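A minimal Elastic Net sketch with scikit-learn is shown below; scikit-learn expresses the two penalties through an overall strength (alpha) and an l1_ratio that mixes the L1 and L2 terms, and ElasticNetCV chooses both by cross-validation (the data are synthetic and the candidate ratios are arbitrary).

```python
# Elastic Net: a blend of the Ridge (L2) and Lasso (L1) penalties.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Cross-validation over both the penalty strength and the L1/L2 mix
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10, random_state=0).fit(X, y)

print("chosen l1_ratio:", enet.l1_ratio_)
print("chosen lambda (alpha):", enet.alpha_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```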

Visit my Facebook Page @ facebook.com/FBTrainBrain/

I hope this article helps you strengthen your conceptual understanding of the topics above. Also read Part 2 of this article here: Basic Concepts needed for Data Science work — Part 2
