VOL.01 // ISS.04

How Gradient-Boosted Trees Turn Weak Predictions into Strong Ones

In 2014, Tianqi Chen released XGBoost — a system that proved thousands of shallow decision trees, each learning from the mistakes of the last, could outperform almost anything on structured data.


The Single Decision Tree

Begin with the simplest model: a decision tree. It partitions the feature space by asking binary questions — "Is x > 0.5?" — and assigns a prediction to each region.

A single tree captures broad patterns but overfits if grown deep, and underfits if kept shallow. The bias-variance tradeoff seems inescapable.
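The tradeoff is easy to see with a few lines of scikit-learn on toy data (the data, seed, and depths here are illustrative, not from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, size=200)

# A shallow tree underfits: four leaves cannot trace a sine curve.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
# An unrestricted tree overfits: it memorizes the noise exactly.
deep = DecisionTreeRegressor(max_depth=None).fit(X, y)

mse_shallow = np.mean((shallow.predict(X) - y) ** 2)
mse_deep = np.mean((deep.predict(X) - y) ** 2)
```

The deep tree drives training error to zero while the shallow one cannot; neither depth is right on its own.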

The Residuals

After the first tree makes its predictions, compute the residuals: the gap between each true value and the tree's prediction. These residuals are what the model got wrong.

In gradient boosting, the residual is the negative gradient of the loss function with respect to the current prediction. For squared error L = ½(y − ŷ)², it is exactly y − ŷ.
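Computing the residuals is one line once a first tree is fit (a sketch with scikit-learn on illustrative data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0])

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
# Negative gradient of 1/2 * (y - yhat)^2 with respect to yhat:
residuals = y - tree1.predict(X)
```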

The Second Tree

Now train a second shallow tree — not on the original targets, but on the residuals. This tree learns to predict where the first tree went wrong.

Add its predictions to the first tree's, scaled by a learning rate η. The ensemble is now: F(x) = h₁(x) + η·h₂(x).
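The two-tree ensemble can be sketched directly (toy data and η are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0])
eta = 0.5

h1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - h1.predict(X)
# The second tree is trained on residuals, not on y.
h2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# F(x) = h1(x) + eta * h2(x)
F = h1.predict(X) + eta * h2.predict(X)
mse_one = np.mean((y - h1.predict(X)) ** 2)
mse_two = np.mean((y - F) ** 2)
```

On the training data the second tree is guaranteed to help: it removes a fixed fraction of whatever residual error a single depth-2 tree can explain.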

Sequential Correction

Repeat: compute new residuals, fit another tree, add it to the ensemble. Each tree corrects the errors of all previous trees. Early trees capture dominant patterns. Later trees capture subtle signals.

This is gradient boosting: function-space gradient descent using decision trees as the base learner.
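The whole loop fits in a dozen lines; this is a minimal sketch of the idea (data, seed, and hyperparameters are illustrative), not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, n_rounds = 0.1, 100
F = np.zeros(len(y))          # start from the zero model
trees = []
for _ in range(n_rounds):
    r = y - F                                      # residuals = negative gradient
    t = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += eta * t.predict(X)                        # shrunken correction
    trees.append(t)

def predict(trees, X, eta=0.1):
    # The ensemble prediction is the shrunken sum of all trees.
    return eta * sum(t.predict(X) for t in trees)

train_mse = np.mean((y - F) ** 2)
```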

The Ensemble Grows

Watch the MSE plummet as trees are added. Each tree is weak individually — a shallow partition of the space. But their sequential combination produces a model of extraordinary accuracy.

The learning rate η controls how aggressively each tree contributes. Smaller η requires more trees but generalizes better.
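The "smaller η needs more trees" half of the tradeoff is easy to measure: count the boosting rounds needed to reach a fixed training error at two different rates (a sketch on illustrative data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0])

def rounds_to_reach(eta, target_mse=0.01, cap=2000):
    """Boost until training MSE drops below target_mse; return the round count."""
    F = np.zeros(len(y))
    for n in range(1, cap + 1):
        t = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
        F += eta * t.predict(X)
        if np.mean((y - F) ** 2) < target_mse:
            return n
    return cap

fast = rounds_to_reach(0.5)
slow = rounds_to_reach(0.05)
```

The smaller rate needs many more rounds to reach the same training error; the generalization benefit only shows up on held-out data.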

Regularization

XGBoost adds structural regularization: a penalty γ for each leaf and L2 regularization λ on leaf weights. The split gain must exceed γ to justify a new branch.

The optimal leaf weight becomes: w* = −G / (H + λ), where G and H are sums of gradients and Hessians. Complexity is penalized, not just training error.
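For squared error the per-example gradient is g = ŷ − y and the Hessian is h = 1, so w* is just a shrunk mean residual. A tiny worked example (the numbers are toy values):

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0])   # targets landing in one leaf
yhat = np.zeros(3)              # current ensemble prediction

g = yhat - y                    # gradients of 1/2 * (y - yhat)^2
h = np.ones_like(y)             # Hessians (all 1 for squared error)
lam = 1.0                       # L2 penalty on leaf weights

G, H = g.sum(), h.sum()
w_star = -G / (H + lam)         # optimal leaf weight: 12 / 4 = 3.0
# With lam = 0 this reduces to the plain mean residual, 12 / 3 = 4.0;
# the penalty shrinks the leaf toward zero.
```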

The Overfitting Frontier

Without regularization, adding trees eventually increases test error even as training error drops to zero. The model memorizes noise.

Early stopping monitors validation loss and halts when it begins to rise. Combined with shrinkage and subsampling, it defines the boundary between learning and memorizing.
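A minimal early-stopping loop looks like this (the data split, patience, and cap are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, size=300)
X_tr, X_val = X[:200], X[200:]
y_tr, y_val = y[:200], y[200:]

eta, patience = 0.1, 10
F_tr, F_val = np.zeros(200), np.zeros(100)
best_mse, best_round, rounds = np.inf, 0, 0

# Stop after `patience` rounds without a new validation-loss minimum.
while rounds - best_round < patience and rounds < 500:
    t = DecisionTreeRegressor(max_depth=2).fit(X_tr, y_tr - F_tr)
    F_tr += eta * t.predict(X_tr)
    F_val += eta * t.predict(X_val)
    rounds += 1
    val_mse = np.mean((y_val - F_val) ** 2)
    if val_mse < best_mse:
        best_mse, best_round = val_mse, rounds
```

The model to keep is the ensemble truncated at `best_round`, not the one at the final round.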

XGBoost: The Full System

Chen's system combines it all: second-order optimization (using Hessians, not just gradients), column subsampling (borrowed from Random Forests), sparsity-aware splits for missing data, and weighted quantile sketches for approximate split-finding.

The result: the algorithm that dominated structured-data ML for a decade.

The Kaggle Decade

Between 2015 and 2020, XGBoost (and its successors LightGBM and CatBoost) won the majority of Kaggle competitions involving tabular data. The pattern was so consistent that "just XGBoost it" became a meme in the competitive ML community.

The victories were not narrow. From credit scoring to particle physics, from click-through prediction to property valuation, gradient-boosted trees proved that a well-tuned ensemble of weak learners could match or exceed the performance of deep neural networks — on a laptop, in minutes, with a fraction of the data.

Error vs. Boosting Rounds

As more trees are added, training error drops monotonically. Test error initially drops too, then rises as overfitting sets in. The optimal number of trees sits at the minimum of the test error curve.

Green: training error. Red: test error. The gap between them is the generalization gap — the signature of overfitting.

From Stumps to Giants

A decision stump is a tree with a single split — one question, two answers. It is the weakest possible learner: barely better than guessing. Yet Schapire proved in 1990 that any algorithm that can do slightly better than chance can be boosted to arbitrary accuracy.

The mathematical insight is that combining many slightly-better-than-random predictions, each weighted and sequentially corrected, produces a predictor whose margin grows with every round. Weakness is not a flaw — it is a raw material.
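Boosted stumps make the point concretely: each learner asks one question, yet the ensemble's training error collapses (a sketch on illustrative one-dimensional data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0])

F = np.zeros(200)
for _ in range(300):
    # max_depth=1 is a stump: one split, two leaves.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, y - F)
    F += 0.1 * stump.predict(X)

stump_mse = np.mean((y - F) ** 2)
single_stump_mse = np.mean(
    (y - DecisionTreeRegressor(max_depth=1).fit(X, y).predict(X)) ** 2
)
```

One stump barely dents the error; three hundred of them, sequentially corrected, nearly eliminate it.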

The Bias-Variance Decomposition

Bagging reduces variance by averaging independent trees. Boosting reduces bias by sequentially correcting errors. XGBoost controls both through regularization, learning rate, and early stopping.

Left: a single deep tree (low bias, high variance). Center: a random forest (lower variance). Right: a boosted ensemble (low bias and variance). The boosted model achieves the best total error.

Why Not Neural Networks?

On images, language, and audio, deep neural networks dominate. But on structured tabular data — spreadsheets, databases, feature tables — gradient-boosted trees consistently match or outperform them. Trees handle heterogeneous features naturally, detect interactions efficiently, and ignore irrelevant inputs automatically.

The practical gap is even wider: XGBoost trains in minutes on a CPU; comparable neural networks require GPUs and hours. For the vast majority of real-world prediction tasks, gradient-boosted trees remain the pragmatic default.

Build Your Own Gradient Booster

Choose a target function, set the learning rate and tree depth, then watch the boosted ensemble converge. The left panel shows the ensemble's prediction (green) against the true function (dashed). The right panel tracks error over boosting rounds.

[Interactive demo: learning-rate control · Prediction vs. Target · MSE Over Rounds]