Praudyog

Extreme Gradient Boosting – Regression Algorithm

admin

November 7, 2023

Data Science Interview Questions

Extreme Gradient Boosting

Extreme Gradient Boosting – Regression Algorithm

Table Of Contents:

Example Of Extreme Gradient Boosting Regression.

Problem Statement:

Predict the package of the students based on CGPA value.

Step-1: Build First Model

In the case of the Boosting algorithm, the first model will be a simple one.
For the regression case, we have considered the mean value to be our first model.

Mean = (4.5+11+6+8)/4 = 29.5/4 = 7.375
Model1 output will always be 7.373 for all the records.

Step-3: Calculate Error Made By First Model

To calculate the error we will do the simple subtraction operation.
We will subtract the actual minus predicted values.
7.375 – 4.5 = -2.875

Step-4: Building Second Model

The second model for the XGBoost algorithm will be a Decision Tree.
The input for the decision tree will be the CGPA and MODEL1-RESIDUAL1.
The building of Decision Tree will be different in XGBoost algorithm.

Step-5: Tree Building Process.

First we will make a leaf node and keep all the residual points on it.

Similarity Score:

Second we need to calculate the Similarity Score for each leaf node.
More the Similarity Score better the node.

λ = Regularization Parameter.
Considering λ = 0.

Similarity Score = Square(-2.875 + 3.625 – 1.375 + 0.625)/4 = 0.02
Here Similarity Score is smaller means elements inside the leaf node is not similar.
Next we will try to increase this Similarity Score.

Create A Full Decision Tree:

With a single leaf node, our similarity score is quite low.
Hence we need to do more splitting on the single leaf node to increase the similarity score.
As we have here only one feature CGPA we will split it based on CGPA.
To choose a splitting criteria we will sort the CGPA value.

Steps To Create Decision Tree:

Step-1: Sort The CGPA In Ascending Order.

To find the perfect splitting criteria we need to first sort the CGPA value in ascending order and find the Average between numbers.
We will use these average values to split our original single leaf node.
We will choose one of the values where we will get a better similarity score and will make that a root node.

Step-2: Use Splitting Criteria-1: (5.85)

Let’s check for the first splitting criteria which is 5.85.
After that we will calculate the Similarity Score for each leaf node and find out the Gain.

Now we will find out the Similarity Score for each leaf node using below formula.

Step-3: Calculate Similarity Score For Each Leaf Node

Here you can notice that residuals which are more similar to each other have the highest similarity score.

Step-3: Calculate Total Gain For The Tree

To find out the Gain we will use below formula.

Here 0.52 is the Gain or in other words, we have increased our Similarity of residuals when we split it into two parts.

Step-4: Use Splitting Criteria-2: (7.1)

Let’s use the second splitting criteria and find out the Gain for the split.
We will use the above steps to find out our gain.

Here you can notice that the Gain for the similarity score is higher than the first crieteria.

Step-5: Use Splitting Criteria-3: (8.25)

Let’s use the third splitting criteria and find out the Gain for the split.
We will use the above steps to find out our gain.

Here you can notice that the Gain for the similarity score is higher than the second criterion.

Step-6: Conclusion

Criteria-1(5.85) Gain = 0.52
Criteria-2(7.1) Gain = 5.06
Criteria-3(8.25) Gain = 17.51
Here we can conclude that the Gain for the third criterion is highest.
Hence we will make the third criterion our root node.

Step-7: Again Split The Left Leaf Node.

Here we will do one more split to increase the Similarity score and in terms Gain of the split.
Now we need to find out the splitting criterion to split the Leaf Node.
We have already used one splitting criterion which is 8.25.
We have two more splitting criteria left to use which are 5.85 and 7.1
We will use both splitting criteria to find out the Gain.

Step-8: Splitting Criteria 5.85

Now we will split the Leaf Node using the criteria CGPA < 5.85.

The similarity score for the second root node is as follows.

Now we will find out the similarity score for the two leaf nodes and find out the Gain.

Step-9: Splitting Criteria 7.1

Now we will split the Leaf Node using the criteria CGPA < 7.1.

The similarity score for the second root node is as follows.

Now we will find out the similarity score for the two leaf nodes and find out the Gain.

Step-10: Conclusion

Criteria-1(5.85) Gain = 5.04
Criteria-2(7.1) Gain = 0.04
Here we can conclude that the Gain for the first criterion is highest.
Hence we will make the first criterion our root node.
Our final Decision Tree will look like this.

We can further split the tree but it will overfit. Hence we will stop here.
The default MaxDepth = 6 for XGBost.
As we have a smaller dataset we will keep MaxDepth= 2.

Step-11: Calculate Output Values For All Leaf Node.

As this is a Regression task we need to calculate the output values for each leaf node.

Keep, λ = 0

If, λ = 0 the output will be the average of residuals.

Step-12: Calculate Stage2 Combined Model Output

Now we need to calculate the combined output, which means Model1 + Model2 output.

The default value for Learning Rate = 0.3.
Now we will calculate the combined output for all the records.

For CGPA = 6.7,
Output = 7.3 + 0.3 * (-2.05) = 6.69
Here -2.05 is the output from the second decision tree.
Like this we will calculate for all the records.

Step-13: Calculate Residuals For Model-2

Residuals for Stage 2 will be calculated as = Package – Model2 Output.
For CGPA = 6.7, Residual = 4.5 – 6.69 = -2.19.
Like this, we will do for all the records.

Here you can notice that the residual values are getting reduced e.g. from -2.8 to -2.19.
Residuals/Error must go towards Zero, which means you are in the right direction.

Step-14: Building Third Decision Tree.

You can build the third Decision Tree, by following the same steps.
The input for the 3rd Decision Tree will be CGPA and Residual2 column.

Step-15: Final Output

The final output will be the combination of Model1 + Model2 + Model3 output.

Using this formula you can predict the Final Output value for new records.

Step-16: Conclusion

Here the goal is to reduce the Residual/Error value.
Residuals with more closure to zero better the model.

Leave a Reply Cancel reply