Linear Regression – Evaluation Metrics

Table Of Contents:

  1. Mean Absolute Error.
  2. Mean Squared Error.
  3. Root Mean Squared Error.
  4. R – Squared Error.
  5. Adjusted R – Squared Error.

(1) Mean Absolute Error.

  • Mean Absolute Error (MAE) calculates the average absolute difference between the predicted values and the actual values.
  • It is also known as scale-dependent accuracy as it calculates error in observations taken on the same scale.
  • MAE provides a straightforward measure of the model’s accuracy, as it represents the average magnitude of errors without considering their direction.

Formula:

  MAE = (1/n) · Σ |yᵢ − ŷᵢ|

  where n is the number of observations, yᵢ is the actual value, and ŷᵢ is the predicted value.

Example:

  • Suppose a model predicts the prices (in thousands of dollars) of five houses. The actual prices are 250, 300, 180, 400, and 350, and the predicted prices are 240, 320, 200, 380, and 360.
  • To calculate the MAE, we follow these steps:

    1. Calculate the absolute differences between the predicted and true prices for each house:

    House 1: |240 – 250| = 10
    House 2: |320 – 300| = 20
    House 3: |200 – 180| = 20
    House 4: |380 – 400| = 20
    House 5: |360 – 350| = 10

    2. Sum up the absolute differences:

    10 + 20 + 20 + 20 + 10 = 80

    3. Divide the sum by the total number of instances (in this case, 5) to get the mean:

    MAE = 80 / 5 = 16
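
The steps above can be verified with a short Python sketch (the house prices are the illustrative figures from the example, in thousands of dollars):

```python
# Worked MAE example: five house prices (in thousands of dollars).
y_true = [250, 300, 180, 400, 350]
y_pred = [240, 320, 200, 380, 360]

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |prediction - truth|."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

print(mae(y_true, y_pred))  # 16.0
```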

Advantages:

  • Easy Interpretation: MAE is expressed in the same units as the target variable, making it easy to interpret and understand. For example, if the target variable represents the price of a house in dollars, the MAE will also be in dollars, which provides a clear understanding of the average prediction error in terms of the target variable.

  • Robustness to Outliers: MAE is less sensitive to outliers compared to other error metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Since MAE only considers the absolute differences between predicted and true values, it does not magnify the impact of large errors as much as squared error-based metrics.

  • Error Magnitude Representation: MAE provides an intuitive representation of the average magnitude of errors made by the model. It gives equal weight to all errors, regardless of their direction (overestimation or underestimation). This can be valuable in situations where the direction of errors is not critical, and the focus is on the overall accuracy of predictions.

  • Optimality in Some Cases: In certain scenarios, minimizing the MAE can lead to optimal predictions. For example, if the underlying error distribution is Laplacian (double-exponential), the median corresponds to the minimum MAE estimator. This property makes MAE suitable for specific cases and can guide model selection and training.

  • Computational Efficiency: MAE calculations involve absolute differences, which do not require squaring or square root operations. As a result, MAE computations are computationally efficient compared to metrics that involve squared errors, such as MSE or RMSE. This efficiency is particularly relevant when dealing with large datasets or training complex models.

  • Focus on Prediction Accuracy: MAE directly measures the accuracy of predictions by providing the average absolute difference between predicted and true values. It avoids the inclusion of additional complexities introduced by squared errors, such as the influence of outliers on the model’s performance.

Disadvantages:

  • Ignoring Error Direction: MAE treats all prediction errors equally, regardless of their direction (positive or negative). This characteristic can be a disadvantage in situations where the direction or sign of errors is important. For example, in some applications, overestimating the target variable may have different consequences and implications compared to underestimating it. In such cases, metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) that consider the squared differences and penalize larger errors more heavily might provide more appropriate insights.

  • Lack of Sensitivity to Individual Errors: MAE calculates the average absolute difference across all predicted and true values in the dataset. While this provides a measure of overall prediction accuracy, it does not capture the impact of individual errors. A model with a low MAE may still have a significant number of instances with large errors that affect specific predictions or subsets of the data. For detailed analysis or decision-making, it can be important to consider the distribution and characteristics of individual errors, which MAE does not explicitly capture.

  • Scale Dependency: MAE is sensitive to the scale of the target variable. If the target variable has a wide range of values or is on a different scale than the input features, the MAE values can vary accordingly. This scale dependency can make it challenging to compare the performance of models trained on different datasets or with different units of measurement. In such cases, it may be necessary to normalize or standardize the data to mitigate the impact of scale differences.

  • Lack of Differentiability: MAE is not differentiable at zero, which can be an issue when using gradient-based optimization algorithms for training models. The non-differentiability at zero can lead to difficulties in optimization processes that rely on gradient information, such as gradient descent. In such cases, alternative error metrics like MSE or smooth approximations of MAE, such as Huber loss, can be used to address this problem.

  • Limited Information about Error Magnitude: While MAE provides a measure of the average magnitude of errors, it does not offer detailed information about the distribution or spread of errors. Other metrics like standard deviation or quantiles can provide insights into the variability or uncertainty associated with predictions. Supplementing MAE with additional error analysis techniques can provide a more comprehensive understanding of the model’s performance.
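
As a side note on the Huber loss mentioned above, a minimal sketch might look like the following (delta is a tunable threshold, defaulted here to 1.0 purely for illustration):

```python
def huber(error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond.
    Differentiable everywhere, unlike the absolute error at zero."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

print(huber(0.5))  # 0.125  (quadratic region)
print(huber(3.0))  # 2.5    (linear region: 1.0 * (3.0 - 0.5))
```

Small errors are treated like MSE, large errors like MAE, which is why it is often used as a smooth, outlier-robust compromise between the two.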

(2) Mean Squared Error:

  • The Mean Squared Error (MSE), or Mean Squared Deviation (MSD), of an estimator measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the true values.
  • It is a risk function, corresponding to the expected value of the squared error loss. It is always non-negative, and values closer to zero are better.
  • The MSE is the second moment of the error (about the origin) and thus incorporates both the variance of the estimator and its bias.

Formula:

  MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
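
A minimal sketch of MSE, reusing the illustrative house-price figures from the MAE example:

```python
# MSE on the same five illustrative house prices (in thousands of dollars).
y_true = [250, 300, 180, 400, 350]
y_pred = [240, 320, 200, 380, 360]

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

print(mse(y_true, y_pred))  # 280.0
```

Note how the squared units make the value (280.0) harder to relate directly to the dollar scale than the MAE of 16.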

Advantages:

  1. Sensitivity to Large Errors: MSE considers the squared differences between predicted and true values. Squaring the errors amplifies the impact of larger errors, which can be beneficial in scenarios where accurately capturing and penalizing large errors is important. By heavily penalizing outliers or extreme errors, MSE provides a more robust measure of prediction quality.

  2. Differentiability: MSE is a differentiable metric, meaning it has a derivative with respect to the model’s parameters. This property is advantageous when using gradient-based optimization algorithms for training models. The differentiability allows efficient computation of gradients, enabling optimization algorithms to iteratively update the model parameters and converge towards better solutions.

  3. Efficient Optimization: The squared error term in MSE provides a smoother and convex optimization landscape compared to the absolute error term in MAE. This convexity often leads to easier optimization and faster convergence. Many optimization algorithms, such as gradient descent, are specifically designed for minimizing squared error objectives, making MSE a convenient choice in optimization processes.

  4. Statistical Properties: MSE has statistical properties that can be advantageous in certain contexts. For instance, when the errors are normally distributed and the model is unbiased, minimizing MSE results in finding the maximum likelihood estimator for the model’s parameters. This statistical interpretation can provide theoretical justification and facilitate inference in certain statistical analyses.

  5. Weighted Losses: MSE allows for the incorporation of different weights for individual instances or subsets of the data. By assigning different weights to specific instances, the model can be trained to prioritize certain regions or patterns in the data. This flexibility is valuable when dealing with imbalanced datasets or when specific instances require more attention or emphasis.

  6. Compatibility with Linear Regression: MSE is closely related to the method of least squares, which is commonly used in linear regression. Minimizing MSE is equivalent to finding the best-fit line that minimizes the sum of squared residuals. This compatibility with linear regression provides a strong theoretical foundation and allows for meaningful comparisons between models trained using MSE and linear regression techniques.

Disadvantages:

  1. Magnification of Large Errors: MSE squares the differences between predicted and true values, which amplifies the impact of larger errors. While this can be advantageous in certain scenarios, it can also be a disadvantage when dealing with outliers or extreme errors. MSE heavily penalizes these large errors, which may result in the model focusing excessively on minimizing them at the expense of other important aspects. In situations where outliers have different implications or when the goal is to prioritize robustness against extreme errors, MSE may not be the most suitable choice.

  2. Scale Dependency: Similar to MAE, MSE is sensitive to the scale of the target variable. If the target variable has a wide range of values or is on a different scale than the input features, the MSE values can vary accordingly. This scale dependency can make it challenging to compare the performance of models trained on different datasets or with different units of measurement. Normalizing or standardizing the data can help mitigate this issue.

  3. Lack of Intuitive Interpretability: MSE is expressed in the squared units of the target variable, which can make it less intuitive to interpret compared to MAE. For example, if the target variable represents prices in dollars, the MSE will be in squared dollars. This squared nature can make it difficult to relate the error metric directly to the original variable or understand the practical implications of the errors.

  4. Outlier Sensitivity: MSE is highly sensitive to outliers due to the squaring operation. Large errors have a disproportionate influence on the overall MSE value, potentially skewing the assessment of the model’s performance. If the dataset contains outliers or instances with substantially larger errors, MSE may not provide an accurate representation of the model’s general predictive performance.

  5. Lack of Robustness to Non-Normal Distributions: MSE assumes that the errors follow a normal distribution. If the error distribution deviates from normality, such as having heavy tails or skewness, MSE may not be the most appropriate choice. In such cases, alternative error metrics that are more robust to non-normality, such as Mean Absolute Error (MAE) or quantile loss functions, may be more suitable.

  6. Optimization Challenges: While MSE’s differentiability can be an advantage, it can also pose challenges in certain cases. The squared term in MSE can lead to sharper optimization landscapes with more local minima, making it harder to find the global minimum. This issue can be exacerbated when dealing with complex models or non-convex optimization problems. In such cases, alternative loss functions or optimization strategies may be necessary.

(3) Root Mean Squared Error:

  • Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.
  • RMSE is considered an excellent general-purpose error metric for numerical predictions.
  • RMSE is a good measure of accuracy, but only for comparing the prediction errors of different models or model configurations for a particular variable, not between variables, as it is scale-dependent.
  • It measures how well a regression line fits the data points.

Formula:

  RMSE = √[ (1/n) · Σ (yᵢ − ŷᵢ)² ] = √MSE
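
A minimal RMSE sketch, again on the illustrative house-price figures from the MAE example:

```python
import math

# RMSE on the same five illustrative house prices (in thousands of dollars).
y_true = [250, 300, 180, 400, 350]
y_pred = [240, 320, 200, 380, 360]

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

print(round(rmse(y_true, y_pred), 3))  # 16.733
```

Taking the square root brings the value back to the target's original units, which is why 16.733 is directly comparable to the MAE of 16.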

Advantages:

  • Sensitivity to Large Errors: Similar to MSE, RMSE considers the squared differences between predicted and true values. By taking the square root of MSE, RMSE retains the sensitivity to large errors. It provides a balanced measure that penalizes both small and large errors, allowing for a more comprehensive evaluation of the model’s performance. RMSE is particularly effective in scenarios where accurately capturing and penalizing large errors is important.

  • Interpretability and Unit Consistency: Unlike MSE, which is expressed in squared units, RMSE is expressed in the same units as the target variable. This unit consistency makes the RMSE more interpretable and easier to relate to the original problem domain. For example, if the target variable represents prices in dollars, the RMSE will also be in dollars. This characteristic allows for a more intuitive understanding of the average prediction error in the original measurement units.

  • Robustness to Outliers: RMSE, like MSE, is sensitive to outliers due to the squared differences. However, by taking the square root, RMSE reduces the impact of extreme errors compared to MSE. This reduction in the influence of outliers can be desirable in situations where the focus is on overall prediction accuracy rather than disproportionately penalizing a few extreme errors. RMSE provides a more balanced assessment of the model’s performance, considering both average errors and the impact of outliers.

  • Statistical Properties: Similar to MSE, RMSE has statistical properties that can be advantageous in certain contexts. When the errors are normally distributed, minimizing the RMSE is equivalent to finding the maximum likelihood estimator for the model’s parameters. This statistical interpretation facilitates inference and model selection based on principles of statistical estimation theory.

  • Compatibility with Linear Regression: RMSE is closely related to the method of least squares, which is commonly used in linear regression. Minimizing RMSE is equivalent to finding the best-fit line that minimizes the sum of squared residuals. This compatibility with linear regression provides a strong theoretical foundation and allows for meaningful comparisons between models trained using RMSE and linear regression techniques.

  • Comparison across Models: Since RMSE is expressed in the same units as the target variable, it provides a consistent, interpretable measure for comparing models or configurations that predict the same variable. Note, however, that because RMSE is scale-dependent, comparing values across datasets with different scales or units requires normalizing the data or the metric first.

Disadvantages:

  • Sensitivity to Outliers: Although RMSE reduces the impact of outliers compared to MSE, it is still sensitive to extreme errors. Outliers can have a significant influence on the RMSE value since it involves squaring the errors before taking the square root. If the dataset contains a substantial number of outliers or instances with large errors, RMSE may not provide an accurate representation of the model’s overall predictive performance.

  • Scale Dependency: Similar to MSE, RMSE is sensitive to the scale of the target variable. If the target variable has a wide range of values or is on a different scale than the input features, the RMSE values can vary accordingly. This scale dependency can make it challenging to compare the performance of models trained on different datasets or with different units of measurement. Normalizing or standardizing the data can help mitigate this issue.

  • Lack of Intuitive Interpretability: While RMSE is expressed in the same units as the target variable, making it more interpretable than MSE, it can still be challenging to interpret the practical implications of RMSE values. The square root operation can result in RMSE values that are not directly relatable to the original variable. Understanding the magnitude of RMSE and its practical significance may require additional context or domain knowledge.

  • Optimization Challenges: Similar to MSE, RMSE’s differentiability can be both an advantage and a disadvantage. While the differentiability allows for efficient computation of gradients and facilitates optimization, it can also pose challenges in certain cases. The squared term in RMSE can lead to sharper optimization landscapes with more local minima, making it harder to find the global minimum. This issue can be exacerbated when dealing with complex models or non-convex optimization problems.

  • Statistical Assumptions: RMSE assumes that the errors follow a normal distribution. However, in practice, this assumption may not hold for all datasets or regression problems. If the error distribution deviates from normality, RMSE may not be the most appropriate choice. In such cases, alternative error metrics that are more robust to non-normality, such as Mean Absolute Error (MAE) or quantile loss functions, may be more suitable.

  • Lack of Robustness to Skewed Distributions: RMSE treats positive and negative errors equally since it squares the errors. However, in some cases, the practical implications of positive and negative errors may not be symmetric. For example, in financial applications, overestimating a prediction may have different consequences than underestimating it. In such situations, alternative error metrics like asymmetric loss functions or tailored evaluation measures may provide more meaningful insights.

(4) R – Squared Error

  • R-squared is a statistical measure that represents the goodness of fit of a regression model.
  • The value of R-squared typically lies between 0 and 1 (it can be negative for a model that fits worse than simply predicting the mean).
  • R-squared equals 1 when the model fits the data perfectly and there is no difference between the predicted and actual values.
  • R-squared equals 0 when the model explains none of the variability in the dependent variable, i.e., it has learned no relationship between the dependent and independent variables.

Formula:

  R² = 1 − (SSres / SStot) = 1 − [ Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)² ]

  where ȳ is the mean of the actual values, SSres is the residual sum of squares, and SStot is the total sum of squares.
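
A minimal sketch of R², computed as 1 − SS_res/SS_tot on the illustrative house-price figures from the MAE example:

```python
# R-squared on the same five illustrative house prices.
y_true = [250, 300, 180, 400, 350]
y_pred = [240, 320, 200, 380, 360]

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(round(r_squared(y_true, y_pred), 4))  # 0.9523
```

Here the model explains roughly 95% of the variation in the house prices around their mean.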

Advantages:

  • Measure of Goodness of Fit: R² provides a measure of how well the regression model fits the observed data. It represents the proportion of the total variation in the dependent variable that is explained by the independent variables included in the model. R² ranges from 0 to 1, where 0 indicates that the model explains none of the variation, and 1 indicates that the model explains all of the variation.

  • Intuitive Interpretation: R² has an intuitive interpretation as the percentage of variation in the dependent variable that can be explained by the independent variables in the model. It is often expressed as a percentage, making it easy to understand and communicate the model’s explanatory power. For example, an R² value of 0.80 means that 80% of the variation in the dependent variable is explained by the model.

  • Comparison of Models: R² enables the direct comparison of different models fitted to the same dataset. By comparing the R² values of different models, you can assess which model provides a better fit to the observed data. This comparison allows for model selection and helps in determining which variables or features contribute most to the variation in the dependent variable.

  • Model Evaluation and Validation: R² serves as a useful tool for evaluating and validating regression models. A high R² value indicates a good fit, suggesting that the model captures a significant portion of the variation in the dependent variable. On the other hand, a low R² value may indicate that the model is not adequately capturing the relationships in the data, signaling the need for model improvement or reconsideration of the variables included.

  • Basis for Hypothesis Testing: R² forms the basis for hypothesis testing in regression analysis. It allows for the assessment of the statistical significance of the model and its individual predictors. By comparing the observed R² with the expected R² under the null hypothesis of no relationship, statistical tests such as F-tests can determine whether the model as a whole is significant.

Disadvantages:

  • Lack of Information About Prediction Accuracy: R² measures the proportion of variation in the dependent variable that is explained by the independent variables in the model. However, it does not provide information about the accuracy or precision of individual predictions. A high R² value does not necessarily imply that the model will make accurate predictions for new, unseen data points. Therefore, it is important to consider other evaluation metrics, such as Mean Squared Error (MSE) or Mean Absolute Error (MAE), to assess the model’s predictive performance.

  • Sensitivity to Outliers: R² is sensitive to outliers since it focuses on explaining the total variation in the dependent variable. Outliers can have a significant impact on the R² value, potentially inflating or deflating its magnitude. If the dataset contains outliers or instances with extreme values, R² may not provide an accurate representation of the model’s performance and may overestimate or underestimate the model’s explanatory power.

  • Dependence on the Number of Predictors: R² increases with the addition of more predictors to the model, even if those predictors are not truly relevant. This characteristic makes R² less reliable when comparing models with different numbers of predictors. To address this limitation, adjusted R² is often used, as it adjusts for the degrees of freedom and penalizes the inclusion of irrelevant variables. Adjusted R² provides a more accurate measure of the model’s explanatory power when comparing models with different numbers of predictors.

  • Assumptions of Linearity and Homoscedasticity: R² assumes that the relationship between the dependent variable and the independent variables is linear, and that the error terms have constant variance (homoscedasticity). If these assumptions are violated, R² may be misleading or provide inaccurate assessments of the model’s performance. It is important to validate these assumptions through diagnostic checks and consider alternative modeling techniques, such as nonlinear regression or generalized linear models, when appropriate.

  • Inadequate for Non-Linear Models: R² is not suitable for evaluating models that are inherently non-linear, such as polynomial regression or models with interaction terms. In these cases, the R² value may underestimate the true explanatory power of the model since it only captures linear relationships. Alternative methods, such as non-linear regression or model-specific evaluation metrics, should be used to assess the performance of non-linear models accurately.

  • Context-Dependent Interpretation: The interpretation of R² depends on the specific context and the nature of the data. A high or low R² value in one context may not hold the same meaning in another context. Additionally, a high R² value does not imply that the model is practically useful or that the relationships captured by the model are meaningful. It is crucial to interpret R² alongside other domain-specific knowledge, consider the goals of the analysis, and assess the practical implications of the model’s performance.

(5) Adjusted R – Squared Error

  • The problem with R-squared is that its value always increases (or stays the same) as new variables (attributes) are added to the model, whether or not the newly added attributes actually improve the model. This can also lead to overfitting when there is a large number of variables.
  • Adjusted R-squared is a modified form of R-squared whose value increases if a new predictor improves the model's performance and decreases if it does not improve performance as expected.

Formula:

  Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ]

  where n is the number of observations and k is the number of independent variables (predictors).

How Adjusted R – Squared Solves the Problem:

Case-1: Adding an Irrelevant Variable.

  • If you add a new variable, k increases, so the denominator (n − k − 1) decreases.
  • The R-squared value will either stay constant or increase.
  • Suppose the R-squared value stays constant; then the whole numerator term (1 − R²)(n − 1) stays constant.
  • With a constant numerator and a decreased denominator, the entire fraction increases.
  • Now 1 − (increased fraction) = a decreased value.
  • So the adjusted R-squared value has decreased because an irrelevant variable was added.

Case-2: Adding a Relevant Variable.

  • If you add a new variable, k increases, so the denominator (n − k − 1) decreases.
  • Because the variable is relevant, the R-squared value increases.
  • Then 1 − R² decreases, so the numerator term (1 − R²)(n − 1) also decreases.
  • With both the numerator and denominator decreasing, the entire fraction decreases, provided the gain in R-squared outweighs the lost degree of freedom.
  • Now 1 − (decreased fraction) = an increased value.
  • So the adjusted R-squared value has increased.
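
The two cases can be checked numerically. The sample values below (20 observations, and hypothetical R² values of 0.900 before and 0.903 after adding a second predictor) are assumptions for illustration only:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1),
    where n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A marginal (likely irrelevant) second predictor nudges R^2 from
# 0.900 to 0.903, yet adjusted R^2 goes DOWN:
print(round(adjusted_r_squared(0.900, n=20, k=1), 4))  # 0.8944
print(round(adjusted_r_squared(0.903, n=20, k=2), 4))  # 0.8916
```

This is Case-1 in action: the small R² gain does not outweigh the lost degree of freedom, so the adjusted value penalizes the extra predictor.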

Advantages:

  • Accounting for model complexity: The adjusted R-squared value adjusts the regular R-squared value for the number of predictors in the model. It penalizes the inclusion of irrelevant or redundant predictors, which helps prevent overfitting. By considering model complexity, the adjusted R-squared provides a more accurate assessment of the model’s predictive power.

  • Model comparison: The adjusted R-squared value enables the comparison of different regression models with varying numbers of predictors. When comparing models, it is important to consider both the goodness-of-fit and the complexity of the model. The adjusted R-squared allows you to make fairer comparisons by taking into account the trade-off between model fit and complexity.

  • Interpretability: The adjusted R-squared value is more easily interpretable than the regular R-squared value when comparing models with different numbers of predictors. A higher adjusted R-squared indicates that a greater proportion of the variation in the dependent variable is explained by the predictors included in the model while considering the complexity of the model.

  • Avoiding spurious relationships: The adjusted R-squared value helps guard against including irrelevant predictors in the model. Including such variables can lead to spurious relationships and an inflated regular R-squared value. The adjusted R-squared value adjusts for the number of predictors and is thus less prone to this issue, providing a more reliable measure of the model’s explanatory power.

  • Model simplicity: The adjusted R-squared value encourages model simplicity by penalizing the addition of unnecessary predictors. This promotes parsimony, which is a desirable characteristic in regression modelling. By discouraging the inclusion of excessive predictors, the adjusted R-squared helps prevent overfitting and improves the model’s generalizability.

Disadvantages:

  • Over-Penalization: The adjusted R-squared value penalizes the model for including additional predictors, even if they are meaningful and relevant to the outcome variable. This can lead to underestimating the true explanatory power of the model, particularly in situations where including more predictors genuinely improves the model’s performance.

  • Ambiguity In Interpretation: While the adjusted R-squared value is intended to provide a more accurate measure of the model’s goodness of fit, its interpretation can still be challenging. Unlike the regular R-squared, which represents the proportion of variance explained, the adjusted R-squared does not have a direct interpretation. Its value can vary depending on the dataset and the number of predictors, making it less intuitive to interpret in isolation.

  • Model Assumptions: The adjusted R-squared assumes that the model satisfies the underlying assumptions of linear regression, such as linearity, independence, and homoscedasticity. If these assumptions are violated, the adjusted R-squared value may not accurately reflect the model’s performance or predictive ability.

  • Model Selection Bias: The use of adjusted R-squared for model comparison and selection can be problematic. It is possible to obtain a higher adjusted R-squared value by including irrelevant predictors or overfitting the model to the training data. Therefore, relying solely on the adjusted R-squared may lead to the selection of overly complex models that do not generalize well to new data.

  • Context Dependence: The usefulness of the adjusted R-squared depends on the specific context and research question. In some cases, a lower adjusted R-squared value may still be acceptable if the model provides meaningful insights or if the predictors are theoretically important, even if they don’t improve the overall fit of the model significantly.
