Saturday 16 January 2021

Understanding the XGBoost Algorithm and Common Questions

 

1| Is XGBoost faster than random forest?

Solution: XGBoost is usually used to train gradient-boosted decision trees (GBDT) and other gradient-boosted models. Random forests use the same model representation and inference as gradient-boosted decision trees, but they use a different training algorithm. XGBoost can also be used to train a standalone random forest, and a random forest can be used as a base model for gradient boosting techniques.

Further, random forest is an improvement over bagging that helps reduce variance. Random forest builds its trees in parallel, while in boosting the trees are built sequentially: each tree is grown using information from the previously grown trees. Bagging, by contrast, creates multiple bootstrap copies of the original training data and fits a separate decision tree on each. This sequential error correction is the reason why XGBoost generally performs better than random forest.
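To make the distinction concrete, here is a minimal sketch showing both model types trained through the xgboost package; the synthetic dataset and hyper-parameter values are arbitrary placeholders.

# Minimal sketch: XGBoost can fit both a standalone random forest and a
# gradient-boosted ensemble. Dataset and hyper-parameters here are arbitrary.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier, XGBRFClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Random forest trained by XGBoost: all trees are built in a single round,
# independently of one another.
rf = XGBRFClassifier(n_estimators=100, max_depth=4, random_state=42).fit(X, y)

# Gradient-boosted trees: trees are built sequentially, each one correcting
# the errors of the trees before it.
gbdt = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.3,
                     random_state=42).fit(X, y)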


2| What are the advantages and disadvantages of XGBoost?

Advantages:

  • XGB exposes a large number of hyper-parameters that can be tuned, which is a primary advantage over plain gradient boosting machines.
  • XGBoost has an in-built capability to handle missing values.
  • It provides various intuitive features, such as parallelisation, distributed computing, cache optimisation, and more. 

Disadvantages:

  • Like any other boosting method, XGB is sensitive to outliers.
  • Unlike LightGBM, in XGB one has to manually create dummy variables or label encodings for categorical features before feeding them into the models.

Features of XGBoost:

  1. Can be run on both single and distributed systems (Hadoop, Spark).
  2. Used in supervised learning (regression and classification problems).
  3. Supports parallel processing.
  4. Cache optimization.
  5. Efficient memory management for large datasets that exceed RAM.
  6. Offers a variety of regularizations, which helps in reducing overfitting.
  7. Auto tree pruning: a decision tree stops growing once the gain of a split falls below an internal threshold.
  8. Can handle missing values.
  9. Has inbuilt cross-validation (see the sketch after this list).
  10. Takes care of outliers to some extent.
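
As referenced in point 9, here is a minimal sketch of the inbuilt cross-validation (xgb.cv) on a small synthetic regression dataset; the data and parameter values are placeholders chosen only for illustration.

# Minimal sketch of XGBoost's built-in cross-validation (xgb.cv).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.3,
          "lambda": 1.0,   # L2 regularisation
          "gamma": 0.0}    # minimum gain required to make a split

# 5-fold cross-validation; returns train/test RMSE per boosting round.
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                    metrics="rmse", seed=42)
print(cv_results.tail())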

3| How does XGBoost work?

Solution: When using gradient boosting for regression, where the weak learners are regression trees, each regression tree maps an input data point to one of its leaves, which contains a continuous score. XGBoost minimises a regularised objective function that combines a convex loss function (based on the difference between the predicted and target outputs) with a penalty term for model complexity. Training proceeds iteratively: new trees that predict the residuals or errors of the prior trees are added and then combined with the previous trees to make the final prediction.
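
For reference, the regularised objective described above, as given in the original XGBoost paper by Chen and Guestrin, can be written as:

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

where l is the convex loss between the prediction and the target, T is the number of leaves of a tree and w are its leaf weights; γ and λ are the pruning and regularisation parameters that appear again in the worked example below.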


XGBoost Algorithm

Let’s look at how XGBoost works with an example. Here I’ll try to predict a child’s IQ based on age. As a simple baseline for such data, we can take the average IQ as the first prediction and measure how much variance (loss) is present. The residuals, i.e. the losses incurred, are recalculated after each model’s predictions.

CHILD’S AGE | CHILD’S IQ | RESIDUAL
10          | 20         | -10
15          | 34         | 4
16          | 38         | 8

The average of 20, 34 and 38 is 30.67; for simplicity, let’s take it as 30. If we plot a graph with IQ on the y-axis and age on the x-axis, we can see how far each point deviates from this average mark.

At first, our base model (M0) will predict 30. From the graph, we know this model suffers a loss, which will be reduced by the next model (M1). Model M1 will take age as its input (independent feature) and the residuals (losses) suffered by M0 as its target. Up to this point, it is the same as the standard gradient boosting technique.

For XGBoost, some new terms are introduced:

λ (lambda) -> the regularisation parameter
γ (gamma) -> the threshold used for auto tree pruning
eta -> the learning rate, which controls how quickly the model converges

Now we calculate the similarity score:

Similarity Score (S.S.) = (S.R.)^2 / (N + λ)

Here, S.R. is the sum of the residuals and N is the number of residuals.

At first, let’s put λ = 0. Then Similarity Score = (-10 + 4 + 8)^2 / (3 + 0) = 2^2 / 3 = 4/3 ≈ 1.33.

Let’s make the decision tree using these residuals and similarity scores. I’ve set the tree splitting criterion as Age > 10.

Again, for the two resulting leaves we calculate the similarity scores: the left leaf (residual -10) gives (-10)^2 / (1 + 0) = 100, and the right leaf (residuals 4 and 8) gives (4 + 8)^2 / (2 + 0) = 72.

After this, we calculate the gain:

Gain = S.S. of the leaves after the split - S.S. of the node before the split

Gain = (100 + 72) - 1.33 ≈ 170.67

Now we set our γ, a value provided to the model at the start and used during splitting. If Gain > γ, the split happens; otherwise it does not. Let’s assume that γ for this problem is 130; since the gain (≈ 170.67) is greater than γ, the split occurs. By this method, auto tree pruning is achieved: the greater the γ value, the more pruning is done.

For regularisation and to prevent overfitting, we can increase λ, which was initially set to 0. This should be done carefully, because the greater the λ value, the lower the similarity scores, the lower the gain and the more pruning is done.
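
The similarity scores, the gain and the pruning decision above can be reproduced with a few lines of Python; this is only a toy sketch that mirrors the worked example, with the γ value assumed above.

# Toy sketch reproducing the similarity-score and gain calculation above.
def similarity_score(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [-10, 4, 8]           # residuals before the split
left, right = [-10], [4, 8]  # split on Age > 10

ss_root = similarity_score(root)                                     # 4/3, ~1.33
ss_left, ss_right = similarity_score(left), similarity_score(right)  # 100, 72

gain = ss_left + ss_right - ss_root    # ~170.67
gamma = 130                            # pruning threshold assumed above
print(gain, gain > gamma)              # the split is kept because gain > gamma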

New prediction = Previous Prediction + Learning rate * Output

XGBoost calls the learning rate eta, and its default value is 0.3.

For the 2nd reading (Age = 15), the new prediction = 30 + (0.3 * 6) = 31.8.

The leaf output of 6 comes from the residuals 4 and 8 in that leaf: with λ = 0, the leaf output is their sum divided by their count, (4 + 8) / 2 = 6.

New residual = 34 - 31.8 = 2.2

AGE | IQ | NEW RESIDUAL
10  | 20 | -7
15  | 34 | 2.2
16  | 38 | 6.2

In this way, model M1 is trained and the residuals keep decreasing, which means the loss is optimised further in the subsequent models.
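
The update step can be sketched the same way (λ = 0 and eta = 0.3, as in the example; the leaf outputs follow from the Age > 10 split above).

# Toy sketch of the prediction update for model M1 (lambda = 0, eta = 0.3).
ages = [10, 15, 16]
iq = [20, 34, 38]

base = 30     # prediction of the base model M0
eta = 0.3

# Leaf outputs with lambda = 0: sum of residuals / number of residuals per leaf.
# Age 10 sits in its own leaf; ages 15 and 16 share a leaf.
leaf_output = {10: -10.0, 15: 6.0, 16: 6.0}

new_pred = [base + eta * leaf_output[a] for a in ages]   # [27.0, 31.8, 31.8]
new_residuals = [t - p for t, p in zip(iq, new_pred)]    # [-7.0, 2.2, 6.2]
print(new_pred, new_residuals)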

Conclusion

XGBoost has proven to be a highly efficient and scalable tree boosting method. It has shown strong results across different use cases such as motion detection, stock sales prediction, malware classification, customer behaviour analysis and many more. Thanks to efficient data and memory handling, it runs considerably faster on a single machine than many other machine learning techniques, and its optimisation techniques improve performance while using fewer resources.


4| What does the weight of XGB leaf nodes mean? How is it calculated?

Solution: The “leaf weight” is the model’s predicted output associated with each leaf (terminal) node. Here is an example of how the leaf weights of a tree determine a prediction in XGB:

Consider a test data point where age = 10 and gender = female. To get the prediction for this data point, the tree is traversed from top to bottom, performing a series of tests. At each intermediate node, a feature is compared against a threshold.

Depending on the result of the comparison, we proceed to either the left or the right child node. For (10, female), the test “age < 15” is performed first, and we proceed to the left branch because “age < 15” is true. Then the second test, “gender = male?”, is performed, which evaluates to false, so we proceed to the right branch. We end up at Leaf 2, whose output (leaf weight) is 0.1.
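
As an illustration only, the traversal can be written as a simple function. The tree below is hypothetical: it reproduces the two tests and the 0.1 weight of Leaf 2 from the example, while the other leaf weights are made-up placeholders.

# Hypothetical tree from the example: the structure and the other leaf
# weights are illustrative, not output from an actual trained model.
def predict_single_tree(age, gender):
    if age < 15:                 # first test
        if gender == "male":     # second test
            return 0.5           # Leaf 1 (illustrative weight)
        return 0.1               # Leaf 2: the leaf weight from the example
    return -0.3                  # Leaf 3 (illustrative weight)

print(predict_single_tree(10, "female"))   # -> 0.1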


5| What are the data pre-processing steps for XGB?

Solution: The data pre-processing steps for XGB include the following-

  • Load the data
  • Explore the data and remove unneeded attributes
  • Transform textual values to numeric
  • Find and replace missing values if needed
  • Encode the categorical data
  • Split the dataset into a training set and a test set
  • Perform feature scaling or data normalisation
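
A minimal sketch of these steps, assuming a hypothetical CSV file and column names (data.csv, id, category and target are placeholders):

# Minimal sketch of the pre-processing steps (file name and columns are
# hypothetical placeholders).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")              # load the data
df = df.drop(columns=["id"])              # remove unneeded attributes
df["category"] = df["category"].astype("category").cat.codes   # encode text/categories
df = df.fillna(df.median(numeric_only=True))   # replace missing values (optional:
                                               # XGBoost can also handle NaNs itself)
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)   # split
# Feature scaling is generally optional for tree models, but can be added with
# sklearn's StandardScaler if other models share the same pipeline.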

6| How does XGB calculate feature importance?

Solution: XGB automatically provides estimates of feature importance from a trained predictive model. After a boosted tree is constructed, it retrieves an importance score for each attribute. The feature importance score indicates how valuable each feature was in the construction of the boosted decision trees within the model.

Also, in terms of accuracy, XGB models show better performance for the training phase and comparable performance for the testing phase when compared to SVM models. Besides accuracy, XGB has higher computation speed than SVM.
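
A minimal sketch of retrieving the importance scores from a trained model (the synthetic dataset and hyper-parameters are placeholders):

# Minimal sketch: feature importance scores from a trained XGBoost model.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

print(model.feature_importances_)          # one score per feature
scores = model.get_booster().get_score(importance_type="gain")
print(scores)                              # e.g. {'f0': ..., 'f3': ...}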


7| Why does XGBoost perform better than SVM?

Solution: XGB is internally designed to handle missing values. Missing values are interpreted in such a way that if there is any trend in them, it is captured by the model. Users can also supply a value different from other observations and pass it as the missing parameter.

XGBoost tries both branch directions when it encounters a missing value at a node and learns which path to take for missing values in the future. On the other hand, a Support Vector Machine (SVM) does not cope well with missing data, and it is usually better to impute the missing values before running SVM.



8| What are the differences between XGBoost and LightGBM?

Solution: XGBoost and LightGBM are packages belonging to the family of gradient-boosted decision trees (GBDTs).

  • Traditionally, XGBoost is slower than LightGBM, which achieves faster training through its histogram binning process; XGBoost has since added a histogram-based tree method that narrows this gap (see the sketch after this list).
  • LightGBM is a newer tool than XGBoost, so it has a smaller user base and less documentation.
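
As referenced above, the histogram-based method can be enabled in XGBoost through the tree_method parameter; the data in this sketch is a random placeholder.

# Sketch: enabling XGBoost's histogram-based tree construction.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "tree_method": "hist", "max_depth": 6}
booster = xgb.train(params, dtrain, num_boost_round=100)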

9| How does XGB handle missing values?

Solution: XGBoost supports missing values by default. In tree algorithms, the branch direction for missing values is learned during training (note that the gblinear booster treats missing values as zeros). During training, XGB decides for each split whether missing values should fall into the right node or the left node, choosing the direction that minimises the loss. If a feature has no missing values at training time, the tree uses a default decision to send any new missing values to the right node.
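
A brief sketch of this behaviour on synthetic data with injected NaN values (the dataset and parameters are placeholders):

# Sketch: XGBoost learns default branch directions for NaN values.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan     # inject ~10% missing values

dtrain = xgb.DMatrix(X, label=y)          # NaN is treated as missing by default
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)
preds = booster.predict(xgb.DMatrix(X))   # prediction works with NaNs present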


10| What is the difference between AdaBoost and XGBoost?

Solution: XGBoost is more flexible than AdaBoost: XGB is a generic algorithm for finding approximate solutions to the additive modelling problem, while AdaBoost can be seen as a special case with a particular loss function (the exponential loss).

  • Unlike XGB, AdaBoost can be implemented without reference to gradients, by reweighting the training samples based on the classifications from previous learners.


AdaBoost vs Gradient Boosting

Both AdaBoost and Gradient Boosting build weak learners in a sequential fashion. Originally, AdaBoost was designed in such a way that at every step the sample distribution was adapted to put more weight on misclassified samples and less weight on correctly classified samples. The final prediction is a weighted average of all the weak learners, where more weight is placed on stronger learners.

Later, it was discovered that AdaBoost can also be expressed in terms of the more general framework of additive models with a particular loss function (the exponential loss). See, e.g., Chapter 10 of Hastie, Tibshirani and Friedman, The Elements of Statistical Learning.

Additive modeling tries to solve the following problem for a given loss function L:

\min_{\{\alpha_n, \beta_n\}_{n=1}^{N}} \; L\!\left(y, \sum_{n=1}^{N} \alpha_n f(x, \beta_n)\right)

where f could be a decision tree stump. Since the sum inside the loss function makes the joint optimisation difficult, it can be approximated in a forward stagewise (additive) fashion, iteratively minimising one subproblem at a time:

\min_{\alpha_n, \beta_n} \; L\!\left(y, f_{n-1}(x) + \alpha_n f_n(x, \beta_n)\right)

For arbitrary loss functions this is still a difficult problem, so we can further approximate it by applying steepest descent with a line search, i.e. we update f_n by taking a step in the direction of the negative gradient.

In order to avoid overfitting on the gradient, the gradient is approximated with a new weak learner. This gives you the gradient boosting algorithm:

  1. Start with a constant model f_0.
  2. Fit a weak learner h_n to the negative gradient of the loss function with respect to f_{n-1}.
  3. Take a step size γ_n such that f_n = f_{n-1} + γ_n h_n minimises the loss L(y, f_n(x)).
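
A minimal sketch of this three-step procedure for the squared loss, where the negative gradient is simply the residual, using decision stumps as weak learners (the synthetic data and the fixed step size are illustrative choices):

# Minimal gradient boosting sketch for squared loss: the negative gradient of
# 0.5 * (y - f)^2 with respect to f is the residual (y - f).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

f = np.full_like(y, y.mean())    # step 1: start with a constant model f_0
gamma = 0.1                      # fixed step size (a line search could be used instead)
learners = []

for n in range(100):
    residual = y - f                                          # step 2: negative gradient w.r.t. f_{n-1}
    h = DecisionTreeRegressor(max_depth=1).fit(X, residual)   # fit a weak learner (stump)
    f = f + gamma * h.predict(X)                              # step 3: f_n = f_{n-1} + gamma * h_n
    learners.append(h)

print("training MSE:", np.mean((y - f) ** 2))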

The main differences therefore are that Gradient Boosting is a generic algorithm to find approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function. Hence, gradient boosting is much more flexible.

Second, AdaBoost can be interpreted from a much more intuitive perspective and can be implemented without reference to gradients, by reweighting the training samples based on the classifications from previous learners.

