Linear Regression Algorithm

Table Of Contents:

  1. What Is Linear Regression?
  2. Equation For Linear Regression.
  3. Types Of Linear Regression.
  4. Linear Regression Line.
  5. How To Find The Best Fit Line?
  6. Cost Function For Linear Regression.
  7. Assumptions In Linear Regression.

(1) What Is A Linear Regression Model?

  • It’s a Supervised Learning algorithm whose goal is to predict continuous, numerical values based on the given input data.
  • If you want to mathematically model the behavior of a continuous variable, you can use a Linear Regression model.
  • First, as a Data Scientist, you need to find out which factors affect the continuous variable.
  • Then you can use a Linear Regression model to combine all those factors, or variables, to predict your target variable.
  • Linear Regression tries to find the parameters of the linear function such that the distance between all the points and the line is as small as possible.
  • The algorithm used to update the parameters is called Gradient Descent (a minimal sketch follows this list).
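A minimal sketch of Gradient Descent fitting a simple linear regression, using made-up data purely for illustration (the learning rate and iteration count are arbitrary choices, not values from this article):

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 3 + 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(0, 1, size=100)

a0, a1 = 0.0, 0.0        # intercept and slope, initialized to zero
learning_rate = 0.01

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of the Mean Squared Error cost with respect to a0 and a1.
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(a0, a1)  # should end up close to the true intercept (3) and slope (2)
```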

(2) Equation For Linear Regression Model.
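
  • For Simple Linear Regression the equation is: y = a0 + a1*x
  • For Multiple Linear Regression the equation is: y = a0 + a1*x1 + a2*x2 + ... + an*xn
  • Here y is the predicted (dependent) variable, x (or x1 ... xn) are the independent variables, a0 is the intercept, and a1 ... an are the coefficients or weights that the model learns.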

(3) Types Of Linear Regression Model.

Simple Linear Regression:

  • If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear Regression:

  • If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
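
A minimal sketch of both variants using scikit-learn's LinearRegression; the data is randomly generated just to illustrate the fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple Linear Regression: one independent variable.
x = rng.uniform(0, 10, size=(50, 1))
y = 4 + 1.5 * x[:, 0] + rng.normal(0, 0.5, size=50)
simple = LinearRegression().fit(x, y)
print(simple.intercept_, simple.coef_)      # a0 and a1

# Multiple Linear Regression: three independent variables.
X = rng.uniform(0, 10, size=(50, 3))
y2 = 1 + X @ np.array([2.0, -0.5, 0.7]) + rng.normal(0, 0.5, size=50)
multiple = LinearRegression().fit(X, y2)
print(multiple.intercept_, multiple.coef_)  # a0 and a1, a2, a3
```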

(4) Linear Regression Line.

  • A regression line indicates a linear relationship between the dependent variables on the y-axis and the independent variables on the x-axis.
  • The correlation is established by analyzing the data pattern formed by the variables.

Positive Linear Relationship:

  • If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

Negative Linear Relationship:

  • If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.

(5) How To Find The Best Fit Line?

  • When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized.
  • The best fit line will have the least error.
  • Different values of the weights, or coefficients of the line (a0, a1), give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To do this we use a cost function (a concrete example follows this list).
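
As a concrete example, if the Mean Squared Error (described in the next section) is used as the cost function for a simple linear regression, the quantity being minimized is:

  J(a0, a1) = (1/n) * Σ (yi - (a0 + a1*xi))²

where n is the number of data points, yi are the actual values, and a0 + a1*xi are the predicted values.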

(6) Cost Function For Linear Regression.

  • Regression tasks deal with continuous data. The cost functions available for regression are:
  1. Mean Absolute Error
  2. Mean Squared Error
  3. Root Mean Squared Error
  4. Root Mean Squared Logarithmic Error

Mean Absolute Error:

  • Mean Absolute Error (MAE) is the mean of the absolute differences between the actual values and the predicted values.
  • MAE is more robust to outliers. This insensitivity comes from the fact that it does not heavily penalize the high errors caused by outliers.
  • It is robust to outliers because the difference between the actual and predicted values for an outlier is a simple subtraction; we are not squaring the difference.
  • Hence our Mean Absolute Error will not increase that much, and our best fit line won’t be affected much by the presence of outlier values.
  • The drawback of MAE is that it isn’t differentiable at zero, and many loss function optimization algorithms involve differentiation to find optimal values for the parameters.
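
A minimal NumPy sketch of MAE; the example arrays are placeholders:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Mean of the absolute differences between actual and predicted values.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

print(mean_absolute_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.666...
```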

Mean Squared Error:

  • Mean Squared Error (MSE) is the mean of the squared differences between the actual and predicted values.
  • MSE penalizes the high errors caused by outliers by squaring the errors.
  • Optimization algorithms benefit from this penalization, as it helps them find the optimal values for the parameters.
  • The drawback of MSE is that it is very sensitive to outliers.
  • When high errors (which are caused by outliers in the target) are squared, they become even larger errors.
  • MSE can be used in situations where high errors are undesirable: because outliers receive a much higher error value under MSE, and our main goal is to decrease the error, the regression line will be moved toward fitting the outlier values.
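
A minimal NumPy sketch of MSE, with the same placeholder arrays:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Mean of the squared differences; large errors are penalized quadratically.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.8333...
```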

Root Mean Squared Error:

  • Root Mean Squared Error (RMSE) is the square root of the mean squared difference between the actual and predicted values.
  • RMSE can be used in situations where we want to penalize high errors but not as much as MSE does.
  • RMSE is highly sensitive to outliers as well.
  • The square root in RMSE makes sure that the error term is penalized but not as much as MSE.
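
A minimal NumPy sketch of RMSE (just the square root of the MSE above):

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # Square root of the mean squared difference between actual and predicted values.
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

print(root_mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.913
```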

Root Mean Squared Logarithmic Error:

  • Root Mean Squared Logarithmic Error (RMSLE) is very similar to RMSE, but the log is applied to the actual and predicted values before calculating their difference.
  • Because of the log, large and small errors are treated more evenly. RMSLE can be used in situations where the target is not normalized or scaled.
  • RMSLE is less sensitive to outliers than RMSE; the log relaxes the penalization of high errors.
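
A minimal NumPy sketch of RMSLE; it assumes the values are non-negative, since the log is not defined otherwise:

```python
import numpy as np

def root_mean_squared_log_error(y_true, y_pred):
    # log1p (log of 1 + value) is applied before taking the difference,
    # so large targets no longer dominate the error.
    log_diff = np.log1p(np.asarray(y_true)) - np.log1p(np.asarray(y_pred))
    return np.sqrt(np.mean(log_diff ** 2))

print(root_mean_squared_log_error([3.0, 50.0, 2.5], [2.5, 60.0, 4.0]))
```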

(7) Assumptions In Linear Regression.

  • Regression is a parametric approach.
  • ‘Parametric’ means it makes assumptions about the data for the purpose of analysis.
  • Due to this parametric nature, regression is restrictive.
  • It fails to deliver good results with data sets that don’t fulfill its assumptions.
  • Therefore, for a successful regression analysis, it’s essential to validate these assumptions.
  1. Linear Relationship Between Input and Output.
  2. No Multicollinearity.
  3. No Autocorrelation Of Error Terms.
  4. Homoscedasticity.
  5. Normal Distribution Of Error Terms.
  6. No Endogeneity.

(1) Linear & Additive Relationship Between Input and Output.

  • According to this assumption, the relationship between the independent and dependent variables should be linear.
  • The reason for this assumption is that if the relationship is non-linear, which is often the case with real-world data, the predictions made by our linear regression model will not be accurate and will deviate from the actual observations a lot.
  • If you fit a linear model to a non-linear, non-additive data set, the regression algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model.
  • Also, this will result in erroneous predictions on an unseen data set.

How To Check:

  • Look for residual vs fitted value plots (explained below). Also, you can include polynomial terms (X, X², X³) in your model to capture the non-linear effect.

Residual vs Fitted Plot:

  • This scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values).
  • It is one of the most important plots that everyone must learn.
  • It reveals various useful insights, including outliers.
  • The outliers in this plot are labeled by their observation number which makes them easy to detect.
  • There are two major things that you should learn:

    1. If there exists any pattern (maybe, a parabolic shape) in this plot, consider it a sign of non-linearity in the data. It means that the model doesn’t capture non-linear effects.

    2. If the residuals spread out as the fitted values increase (a funnel shape), consider it a sign of non-constant variance, i.e. heteroscedasticity, which violates the homoscedasticity assumption listed above.
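
A minimal sketch of such a plot using scikit-learn and matplotlib; the data here is deliberately non-linear and made up, so the parabolic pattern described in point 1 shows up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 1 + 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, size=200)  # non-linear target

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual vs Fitted plot")
plt.show()
# A clear curved (parabolic) pattern here signals unmodelled non-linearity.
```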

Solution:

  • To overcome the issue of non-linearity, you can apply a non-linear transformation to the predictors, such as log(X), √X, or X², or transform the dependent variable.
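
For instance, a sketch of fitting on a log-transformed predictor (this assumes X is strictly positive, otherwise the log is undefined):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(200, 1))            # strictly positive, so log(X) is defined
y = 2 + 3 * np.log(x[:, 0]) + rng.normal(0, 0.2, size=200)

# Fit on the transformed predictor instead of the raw one.
model = LinearRegression().fit(np.log(x), y)     # alternatives: np.sqrt(x) or x ** 2
```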

Super Note:

  • By looking at the shape of the residuals you can guess what the problem in the model is.
  • If you see a parabolic shape, you can tell that your model is missing a non-linear effect it should capture.
  • If the errors show autocorrelation, then you are likely missing an important variable.
