VOL.01 // ISS.04

How Gradient-Boosted Trees Turn Weak Predictions into Strong Ones

In 2014, Tianqi Chen released XGBoost — a system that proved thousands of shallow decision trees, each learning from the mistakes of the last, could outperform almost anything on structured data.


The Single Decision Tree

Begin with the simplest model: a decision tree. It partitions the feature space by asking binary questions — "Is x > 0.5?" — and assigns a prediction to each region.

A single tree captures broad patterns but overfits if grown deep, and underfits if kept shallow. The bias-variance tradeoff seems inescapable.
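The tradeoff is easy to see with a few lines of scikit-learn on toy data (the data, seed, and depths here are illustrative, not from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, size=200)

# A shallow tree underfits: four leaves cannot trace a sine curve.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
# An unrestricted tree overfits: it memorizes the noise exactly.
deep = DecisionTreeRegressor(max_depth=None).fit(X, y)

mse_shallow = np.mean((shallow.predict(X) - y) ** 2)
mse_deep = np.mean((deep.predict(X) - y) ** 2)
```

The deep tree drives training error to zero while the shallow one cannot; neither depth is right on its own.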

The Residuals

After the first tree makes its predictions, compute the residuals: the gap between each true value and the tree's prediction. These residuals are what the model got wrong.

In gradient boosting, the residual is the negative gradient of the loss function with respect to the current prediction. For squared error L = ½(y − ŷ)², it is exactly y − ŷ.
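Computing the residuals is one line once a first tree is fit (a sketch with scikit-learn on illustrative data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0])

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
# Negative gradient of 1/2 * (y - yhat)^2 with respect to yhat:
residuals = y - tree1.predict(X)
```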

The Second Tree

Now train a second shallow tree — not on the original targets, but on the residuals. This tree learns to predict where the first tree went wrong.

Add its predictions to the first tree's, scaled by a learning rate η. The ensemble is now: F(x) = h₁(x) + η·h₂(x).
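The two-tree ensemble can be sketched directly (toy data and η are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0])
eta = 0.5

h1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - h1.predict(X)
# The second tree is trained on residuals, not on y.
h2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# F(x) = h1(x) + eta * h2(x)
F = h1.predict(X) + eta * h2.predict(X)
mse_one = np.mean((y - h1.predict(X)) ** 2)
mse_two = np.mean((y - F) ** 2)
```

On the training data the second tree is guaranteed to help: it removes a fixed fraction of whatever residual error a single depth-2 tree can explain.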

Sequential Correction

Repeat: compute new residuals, fit another tree, add it to the ensemble. Each tree corrects the errors of all previous trees. Early trees capture dominant patterns. Later trees capture subtle signals.

This is gradient boosting: function-space gradient descent using decision trees as the base learner.
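The whole loop fits in a dozen lines; this is a minimal sketch of the idea (data, seed, and hyperparameters are illustrative), not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, n_rounds = 0.1, 100
F = np.zeros(len(y))          # start from the zero model
trees = []
for _ in range(n_rounds):
    r = y - F                                      # residuals = negative gradient
    t = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += eta * t.predict(X)                        # shrunken correction
    trees.append(t)

def predict(trees, X, eta=0.1):
    # The ensemble prediction is the shrunken sum of all trees.
    return eta * sum(t.predict(X) for t in trees)

train_mse = np.mean((y - F) ** 2)
```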

The Ensemble Grows

Watch the MSE plummet as trees are added. Each tree is weak individually — a shallow partition of the space. But their sequential combination produces a model of extraordinary accuracy.

The learning rate η controls how aggressively each tree contributes. Smaller η requires more trees but generalizes better.
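The "smaller η needs more trees" half of the tradeoff is easy to measure: count the boosting rounds needed to reach a fixed training error at two different rates (a sketch on illustrative data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0])

def rounds_to_reach(eta, target_mse=0.01, cap=2000):
    """Boost until training MSE drops below target_mse; return the round count."""
    F = np.zeros(len(y))
    for n in range(1, cap + 1):
        t = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
        F += eta * t.predict(X)
        if np.mean((y - F) ** 2) < target_mse:
            return n
    return cap

fast = rounds_to_reach(0.5)
slow = rounds_to_reach(0.05)
```

The smaller rate needs many more rounds to reach the same training error; the generalization benefit only shows up on held-out data.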

Regularization

XGBoost adds structural regularization: a penalty γ for each leaf and L2 regularization λ on leaf weights. The split gain must exceed γ to justify a new branch.

The optimal leaf weight becomes: w* = −G / (H + λ), where G and H are sums of gradients and Hessians. Complexity is penalized, not just training error.
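For squared error the per-example gradient is g = ŷ − y and the Hessian is h = 1, so w* is just a shrunk mean residual. A tiny worked example (the numbers are toy values):

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0])   # targets landing in one leaf
yhat = np.zeros(3)              # current ensemble prediction

g = yhat - y                    # gradients of 1/2 * (y - yhat)^2
h = np.ones_like(y)             # Hessians (all 1 for squared error)
lam = 1.0                       # L2 penalty on leaf weights

G, H = g.sum(), h.sum()
w_star = -G / (H + lam)         # optimal leaf weight: 12 / 4 = 3.0
# With lam = 0 this reduces to the plain mean residual, 12 / 3 = 4.0;
# the penalty shrinks the leaf toward zero.
```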

The Overfitting Frontier

Without regularization, adding trees eventually increases test error even as training error drops to zero. The model memorizes noise.

Early stopping monitors validation loss and halts when it begins to rise. Combined with shrinkage and subsampling, it defines the boundary between learning and memorizing.
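A minimal early-stopping loop looks like this (the data split, patience, and cap are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, size=300)
X_tr, X_val = X[:200], X[200:]
y_tr, y_val = y[:200], y[200:]

eta, patience = 0.1, 10
F_tr, F_val = np.zeros(200), np.zeros(100)
best_mse, best_round, rounds = np.inf, 0, 0

# Stop after `patience` rounds without a new validation-loss minimum.
while rounds - best_round < patience and rounds < 500:
    t = DecisionTreeRegressor(max_depth=2).fit(X_tr, y_tr - F_tr)
    F_tr += eta * t.predict(X_tr)
    F_val += eta * t.predict(X_val)
    rounds += 1
    val_mse = np.mean((y_val - F_val) ** 2)
    if val_mse < best_mse:
        best_mse, best_round = val_mse, rounds
```

The model to keep is the ensemble truncated at `best_round`, not the one at the final round.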

XGBoost: The Full System

Chen's system combines it all: second-order optimization (using Hessians, not just gradients), column subsampling (borrowed from Random Forests), sparsity-aware splits for missing data, and weighted quantile sketches for approximate split-finding.

The result: the algorithm that dominated structured-data ML for a decade.

The Kaggle Decade

Between 2015 and 2020, XGBoost (and its successors LightGBM and CatBoost) won the majority of Kaggle competitions involving tabular data. The pattern was so consistent that "just XGBoost it" became a meme in the competitive ML community.

The victories were not narrow. From credit scoring to particle physics, from click-through prediction to property valuation, gradient-boosted trees proved that a well-tuned ensemble of weak learners could match or exceed the performance of deep neural networks — on a laptop, in minutes, with a fraction of the data.

Error vs. Boosting Rounds

As more trees are added, training error drops monotonically. Test error initially drops too, then rises as overfitting sets in. The optimal number of trees sits at the minimum of the test error curve.

Green: training error. Red: test error. The gap between them is the generalization gap — the signature of overfitting.

From Stumps to Giants

A decision stump is a tree with a single split — one question, two answers. It is the weakest possible learner: barely better than guessing. Yet Schapire proved in 1990 that any algorithm that can do slightly better than chance can be boosted to arbitrary accuracy.

The mathematical insight is that combining many slightly-better-than-random predictions, each weighted and sequentially corrected, produces a predictor whose margin grows with every round. Weakness is not a flaw — it is a raw material.
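Boosted stumps make the point concretely: each learner asks one question, yet the ensemble's training error collapses (a sketch on illustrative one-dimensional data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0])

F = np.zeros(200)
for _ in range(300):
    # max_depth=1 is a stump: one split, two leaves.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, y - F)
    F += 0.1 * stump.predict(X)

stump_mse = np.mean((y - F) ** 2)
single_stump_mse = np.mean(
    (y - DecisionTreeRegressor(max_depth=1).fit(X, y).predict(X)) ** 2
)
```

One stump barely dents the error; three hundred of them, sequentially corrected, nearly eliminate it.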

The Bias-Variance Decomposition

Bagging reduces variance by averaging independent trees. Boosting reduces bias by sequentially correcting errors. XGBoost controls both through regularization, learning rate, and early stopping.

Left: a single deep tree (low bias, high variance). Center: a random forest (lower variance). Right: a boosted ensemble (low bias and variance). The boosted model achieves the best total error.

Why Not Neural Networks?

On images, language, and audio, deep neural networks dominate. But on structured tabular data — spreadsheets, databases, feature tables — gradient-boosted trees consistently match or outperform them. Trees handle heterogeneous features naturally, detect interactions efficiently, and ignore irrelevant inputs automatically.

The practical gap is even wider: XGBoost trains in minutes on a CPU; comparable neural networks require GPUs and hours. For the vast majority of real-world prediction tasks, gradient-boosted trees remain the pragmatic default.

Build Your Own Gradient Booster

Choose a target function, set the learning rate and tree depth, then watch the boosted ensemble converge. The left panel shows the ensemble's prediction (green) against the true function (dashed). The right panel tracks error over boosting rounds.

[Interactive demo: learning-rate control · Prediction vs. Target · MSE Over Rounds]