# Multi-output Gradient Boosting for probability distributions

*by Sarem Seitz*

Gradient Boosting is arguably one of the most popular Machine Learning algorithms nowadays. Combining multiple weak learners in order to generate a strong one seems almost too good to be true.

Nevertheless, respective packages like xgboost regularly shine in many online competitions. The ability to ‘just work’ makes boosting models a favourable tool in industry applications, too.

Popular libraries like lightGBM or the aforementioned xgboost provide many tools for a variety of different use-cases. One particular feature, however, namely arbitrary multi-output boosting, doesn’t seem to be available in these packages yet.

A quick Google search will provide some explanations on how to use sklearn’s MultiOutputRegressor in such cases.

This solution can work for suitable loss functions. Still, there are cases where the MultiOutputRegressor-approach would fail. In these situations, we need to dive deeper into the background of Gradient Boosting and implement some elements manually.

Hence, today, I want to give you some idea on how Multi-Output Gradient Boosting can be done ‘manually’. The focus will be on boosting multiple parameters of a target variable’s probability distribution. If you are not interested in the theory, feel free to skip the following section.

## Theory (skip this section at your liking)

The general idea behind Gradient Boosting can, roughly, be summarized in three steps:

1. Use an initial shallow learner to minimize a given loss
2. Let a second shallow learner learn to predict the loss’ derivative with respect to the first shallow learner’s prediction
3. Add the second learner’s prediction times a constant to the first learner’s prediction

The constant in 3) is chosen such that the combined prediction minimizes the target loss. Afterwards, we repeat steps 2) and 3) respectively until a maximum number of iterations is reached.
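As a minimal sketch, the three steps translate into a few lines of Python. All names, the MSE loss, the stump learners, and the fixed zero initialization are arbitrary choices of mine for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

def mse(y_true, y_pred):
    return np.mean(0.5 * (y_true - y_pred) ** 2)

# Step 1: initial prediction, here simply a fixed zero constant
prediction = np.zeros(len(y))
learners, gammas = [], []

for _ in range(50):
    # Step 2: fit a stump to the loss derivative (for MSE: the residuals)
    gradient = y - prediction
    tree = DecisionTreeRegressor(max_depth=1).fit(X, gradient)
    learners.append(tree)

    # Step 3: scale the stump's prediction by a loss-minimizing constant
    update = tree.predict(X)
    gamma = minimize_scalar(lambda g: mse(y, prediction + g * update)).x
    gammas.append(gamma)
    prediction = prediction + gamma * update
```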

NOTE: Most explanations of Gradient Boosting consider the negative derivative in 2). As in standard gradient descent, this is necessary for minimization. However, since the learned constant in 3) can take either sign, the sign of the derivative can be ignored here.

For simplicity, let us therefore work with the positive derivatives.

#### In-depth explanation of step 1) — Initialization

\bar{y}=f_0(x)

This initial prediction can be an optimized constant such that

\bar{y}=\underset{a}{{argmin}}\,L(y,a)

for an arbitrary loss. For simplicity, we will define a fixed constant without optimizing it first.

For a continuous target, the loss function is usually the mean squared error (MSE) although that is not the only possibility. As you can see in the xgboost documentation, there are many more options available.

The only requirement for our loss is that it has to be differentiable with respect to the predictions.

#### In-depth explanation of step 2) — Gradient estimation

Obviously, we need to calculate the loss function’s derivative in this step. It’s easiest to start with the general formulation and then look at a concrete example. For an arbitrary, properly differentiable loss we have:

\frac{\partial}{\partial \hat{y}}L(y,\hat{y})

Next, let us use the MSE for an actual example. The MSE is simply

L(y,\hat{y})=0.5(y-\hat{y})^2

Plugging this into the above derivative formula, we get

\frac{\partial}{\partial \hat{y}}L(y,\hat{y})=\frac{\partial}{\partial \hat{y}}0.5(y-\hat{y})^2=y-\hat{y}=\epsilon.

Hence, Gradient Boosting on an MSE loss is equivalent to using the estimation residuals. Notice that this is not the case for Gradient Boosting in general.

The next-iteration learner now has to estimate the residual:

f_1(x)\approx\epsilon

Or, for the general case:

f_1(x)\approx\frac{\partial}{\partial \hat{y}}L(y,\hat{y})

This is done per observation in your dataset. For clarity, let us introduce a subscript index i to denote a single instance:

f_1(x_i)\approx\frac{\partial}{\partial \hat{y_i}}L(y_i,\hat{y_i})

#### In-depth explanation of step 3) — Model update

Finally, we need to update the full model as follows:

F(x)=f_0(x)+\gamma_1 f_1(x)

This is done through a simple optimization problem over the full dataset of size N:

\gamma_1=\underset{\gamma}{{argmin}}\,\sum_{i=1}^N L(y_i,f_0(x_i)+\gamma f_1(x_i))

We can either provide gradient information to an optimizer or use e.g. scipy.optimize.minimize for black-box (without explicit derivatives) optimization.
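A minimal sketch of such a black-box optimization of gamma, with placeholder values of my own for the current model output and the new learner's prediction:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=100)  # targets
f0 = np.zeros(100)                  # current model output (placeholder)
f1 = np.ones(100)                   # new learner's prediction (placeholder)

def loss(gamma):
    # summed squared-error loss of the combined prediction
    return np.sum(0.5 * (y - (f0 + gamma[0] * f1)) ** 2)

# no gradient information provided, scipy falls back to numeric derivatives
gamma_1 = minimize(loss, x0=np.array([0.0])).x[0]
```

Since f1 is constant one here, the optimal gamma is just the mean residual, which makes the result easy to verify by hand.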

From here, we only need to re-iterate the above steps until we have a model with K+1 shallow learners in total:

F(x)=f_0(x)+\sum_{k=1}^K \gamma_k f_k(x)

Personally, I find it pretty amazing that such a simple algorithm produces such great results in practice.

Finally, we can switch to the actual topic of this article. Our most important element for this case is a proper multi-output loss function. That loss has to be able to condense outputs from multiple models into a single quantity.

Let us introduce some notation for the multi-output case with M outputs:

y=\begin{pmatrix}y^{(1)} \\ \vdots \\ y^{(M)}\end{pmatrix},\quad \hat{y}=\begin{pmatrix}\hat{y}^{(1)} \\ \vdots \\ \hat{y}^{(M)}\end{pmatrix}

#### Multi-Output MSE

The easiest extension for multi-output, continuous regression is the sum of individual MSEs:

L(y,\hat{y})=\sum_{m=1}^M 0.5(y^{(m)}-\hat{y}^{(m)})^2

Now, we need to calculate the derivative for each output which yields

\frac{\partial}{\partial \hat{y}^{(m)}}L(y,\hat{y})=y^{(m)}-\hat{y}^{(m)}

This tells us, essentially, that we can run a separate Gradient Boosting instance for each output. In such cases, the MultiOutputRegressor will work without further ado.
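As a quick sketch of that separable case (target functions and sizes chosen arbitrarily by me):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
# two continuous targets; the summed MSE decomposes, one booster per output
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])

# MultiOutputRegressor simply fits one independent model per output column
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50))
model.fit(X, Y)
preds = model.predict(X)
```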

To demonstrate why such separation is not necessarily possible, consider a multi-class problem with M target classes.

We use a multi-class crossentropy loss and softmax-transform the outputs to obtain valid class probabilities. This yields

\begin{gather*}
L(y,\hat{y})=-\sum_{m=1}^M y^{(m)} \log\frac{\exp(\hat{y}^{(m)})}{\sum_{\tilde{m}=1}^M \exp(\hat{y}^{(\tilde{m})})}\\
=-\sum_{m=1}^M y^{(m)} (\hat{y}^{(m)}-\log S),\quad S=\sum_{\tilde{m}=1}^M \exp(\hat{y}^{(\tilde{m})})
\end{gather*}

Using one-hot encoding for the targets, only one element in the above sum is non-zero. We denote its index as m^* and get rid of the sum:

L(y,\hat{y})=-(\hat{y}^{(m^*)}-\log S)

Now, for the derivatives, we obtain

\frac{\partial}{\partial \hat{y}^{(m)}} L(y,\hat{y})=\begin{cases}\frac{\exp(\hat{y}^{(m)})}{\sum_{\tilde{m}=1}^M \exp(\hat{y}^{(\tilde{m})})}-1 & \text{if } m=m^* \\ \frac{\exp(\hat{y}^{(m)})}{\sum_{\tilde{m}=1}^M \exp(\hat{y}^{(\tilde{m})})} & \text{else} \end{cases}

As you can see, the derivative for each output depends on the outputs of all other Gradient Boosting instances. Hence, it is not possible anymore to just treat this multi-output problem as M separate problems.

This is different from the MSE example above.
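To build some confidence in the derivative formula, we can check it numerically against central finite differences; a small sketch with names of my own:

```python
import numpy as np

def softmax_xent(y_hat, m_star):
    # crossentropy with a one-hot target at index m_star
    log_s = np.log(np.sum(np.exp(y_hat)))
    return -(y_hat[m_star] - log_s)

def softmax_grad(y_hat, m_star):
    # analytic derivative from above: softmax probabilities, minus one at m_star
    g = np.exp(y_hat) / np.sum(np.exp(y_hat))
    g[m_star] -= 1.0
    return g

rng = np.random.default_rng(0)
y_hat = rng.normal(size=5)
m_star = 2

# central finite differences as an independent check
eps = 1e-6
fd = np.array([
    (softmax_xent(y_hat + eps * np.eye(5)[m], m_star)
     - softmax_xent(y_hat - eps * np.eye(5)[m], m_star)) / (2 * eps)
    for m in range(5)
])
```

Notice that every entry of the gradient involves the full softmax denominator, which is exactly why the outputs cannot be boosted independently.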

The broader implication is that the MultiOutputRegressor-approach from before only works if the loss function decomposes over the individual outputs. Once a per-output derivative depends on more than its own output, the resulting Gradient Boosting problem becomes non-trivial.

For multi-class problems, we luckily have the necessary algorithms ready in the standard libraries. More fancy things, however, will likely require a manual implementation at the moment.

## Gradient Boosting for probability distributions

One of those more-fancy-things is predicting the parameters of a conditional probability distribution. Consider a general probabilistic regression setup:

p(y|x)=f(x)

Our goal here is to predict a probability distribution for target y given input x. In plain linear regression, this typically looks as follows:

p(y|x)=\mathcal{N}(y|x\beta^T,\sigma^2)

This conditional probability is simply a Gaussian whose mean depends linearly on the input. The variance is presumed to be constant. Respective parameters can easily be estimated with maximum likelihood.

Now, let us replace the linear mean and constant variance terms each with a Gradient Boosting model:

p(y|x)=\mathcal{N}(y|\hat{y}^{(1)},\hat{y}^{(2)2})

In order to optimize our model, we use the Gaussian log-likelihood as our objective. Notice that, unlike a typical loss, the log-likelihood is to be maximized:

L(y,\hat{y})=-0.5\log(2\pi) - 0.5 \log(\hat{y}^{(2)2}) - \frac{1}{2\hat{y}^{(2)2}}(y-\hat{y}^{(1)})^2

Notice that this is indeed a multi-output problem for two distributional parameters. The target variable itself however is still one-dimensional.

Finally, we can calculate the necessary derivatives. For the Gradient Boosting model of the mean, we have:

\frac{\partial}{\partial \hat{y}^{(1)}} L(y,\hat{y})=\frac{1}{\hat{y}^{(2)2}}(y-\hat{y}^{(1)})

For the Gradient Boosting model of the standard deviation, we have

\frac{\partial}{\partial \hat{y}^{(2)}} L(y,\hat{y})=-\frac{1}{\hat{y}^{(2)}}+\frac{1}{\hat{y}^{(2)3}}(y-\hat{y}^{(1)})^2

And that’s it for the theory section. We are now ready to build a POC level implementation.

## A quick demonstration

Numpy, sklearn and scipy offer everything we need for our proof of concept.

#### Loss and derivative functions

First, we define the loglikelihood function for a plain normal distribution:
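A minimal version could look as follows (the function name is my own):

```python
import numpy as np

def loglike(y, mu, sigma):
    # Gaussian log-density, summed over all observations
    return np.sum(
        -0.5 * np.log(2 * np.pi)
        - 0.5 * np.log(sigma ** 2)
        - (y - mu) ** 2 / (2 * sigma ** 2)
    )
```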

In order to use scipy.optimize.minimize, we create a wrapper function. That wrapper can then be plugged into the optimizer as a lambda function.

Notice that the log-likelihood now enters with a negative sign: scipy performs minimization, so minimizing the negative log-likelihood is how we maximize the log-likelihood itself.
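A sketch of such a wrapper, together with a quick sanity check that it recovers the maximum likelihood estimates of a constant mean and standard deviation (names, seed, and sample size are my own):

```python
import numpy as np
from scipy.optimize import minimize

def loglike(y, mu, sigma):
    # Gaussian log-likelihood, summed over all observations
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

def neg_loglike(params, y):
    # scipy minimizes, so we flip the sign of the log-likelihood
    mu, sigma = params
    return -loglike(y, mu, sigma)

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=2000)
res = minimize(lambda p: neg_loglike(p, y), x0=np.array([0.0, 1.0]))
mu_hat, sigma_hat = res.x
```

The estimates should land very close to the sample mean and the (population) sample standard deviation.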

Finally, we need the derivatives of the loss function:
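A sketch of the two derivative functions derived in the theory section (function names are my own):

```python
import numpy as np

def grad_mu(y, mu, sigma):
    # derivative of the Gaussian log-likelihood w.r.t. the mean output
    return (y - mu) / sigma ** 2

def grad_sigma(y, mu, sigma):
    # derivative w.r.t. the standard deviation output
    return -1.0 / sigma + (y - mu) ** 2 / sigma ** 3
```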

#### Sample data

Our data generating process should be non-linear and have non-constant variance. For visualization, we also keep the input data one-dimensional.

A simple process that fulfills these requirements is

\begin{gather*}
x\sim\mathcal{U}(-3,3)\\
y\sim\mathcal{N}(\sin(x),(0.25+0.5\cdot x)^2)
\end{gather*}
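A sketch of this data generating process in numpy; the seed and sample size are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(123)
N = 1000
x = rng.uniform(-3, 3, size=N)
# the sampler expects a standard deviation, i.e. |0.25 + 0.5*x|,
# whose square is exactly the variance from the formula above
y = rng.normal(np.sin(x), np.abs(0.25 + 0.5 * x))
X = x.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
```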

The choice was completely arbitrary. We only want to see if our model works at this point.

*Figure: Sample data (blue dots) from the data generating process with non-linear mean (green line) and non-constant variance (green area denotes two standard deviations from the mean).*

#### Running the model

To keep things as simple as possible, we won’t use any Python classes. In case you want to build something on top of this bare-bones implementation, feel free to build wrapper classes.

Our base learners will be simple Decision Tree stumps; we use 100 of them.

In order to keep track of the training predictions and gammas, we use 4 numpy arrays altogether. That implies that we use separate gammas for the mean and variance models. On the one hand, this introduces more risk of overfitting.

On the other hand, the increased flexibility could improve results. This is a trade-off to be considered, and the choice to use separate gammas was arbitrary as well.

We set the initial mean prediction to zero; the initial standard deviation prediction is set to one.

Additionally, we store all training predictions in an N x n_trees matrix and all gammas in a 1 x n_trees matrix. This allows us to simply multiply and sum the respective prediction and gamma columns to obtain the aggregated boosting output.

The base learners will be stored in lists for later usage.

Now we can run the training loop. As stated earlier, we will use scipy.optimize.minimize here. In order to not bloat this article too much, we use it as a black-box optimizer. That means we don’t provide any gradient or hessian information to the function.

In a production-ready implementation we might want to do so, as results could possibly improve.
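For reference, the whole training loop can be condensed into the following self-contained sketch. It keeps running sums instead of full prediction matrices, uses Nelder-Mead as the black-box optimizer, and all names are my own:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.tree import DecisionTreeRegressor

def loglike(y, mu, sigma):
    # Gaussian log-likelihood, summed over all observations
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(123)
N, n_trees = 500, 100
x = rng.uniform(-3, 3, size=N)
y = rng.normal(np.sin(x), np.abs(0.25 + 0.5 * x))
X = x.reshape(-1, 1)

# initial predictions: mean zero, standard deviation one
mu, sigma = np.zeros(N), np.ones(N)
mu_learners, sigma_learners = [], []
mu_gammas, sigma_gammas = [], []

ll_start = loglike(y, mu, sigma)

for _ in range(n_trees):
    # gradients of the log-likelihood w.r.t. both outputs
    grad_mu = (y - mu) / sigma ** 2
    grad_sigma = -1.0 / sigma + (y - mu) ** 2 / sigma ** 3

    # one stump per distributional parameter
    tree_mu = DecisionTreeRegressor(max_depth=1).fit(X, grad_mu)
    tree_sigma = DecisionTreeRegressor(max_depth=1).fit(X, grad_sigma)
    h_mu, h_sigma = tree_mu.predict(X), tree_sigma.predict(X)

    # separate gammas for both models, found by black-box minimization
    def neg_ll(g):
        s = sigma + g[1] * h_sigma
        if np.any(s <= 0):
            return np.inf  # keep the standard deviation positive
        return -loglike(y, mu + g[0] * h_mu, s)

    g = minimize(neg_ll, x0=np.zeros(2), method="Nelder-Mead").x
    mu = mu + g[0] * h_mu
    sigma = sigma + g[1] * h_sigma

    mu_learners.append(tree_mu)
    sigma_learners.append(tree_sigma)
    mu_gammas.append(g[0])
    sigma_gammas.append(g[1])

ll_end = loglike(y, mu, sigma)
```

Since each Nelder-Mead run starts at gamma = 0, the log-likelihood can only improve from one iteration to the next.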

The evaluation process is similar to the training process.

*Figure: Predicted mean function (red line) and corresponding variance (red area denotes two standard deviations from the mean).*

The results for both mean and standard deviation look reasonable. We could likely improve results with more fine-tuning — here are some ideas:

- **Use a more robust optimization procedure.** As mentioned earlier, we should ideally provide gradient and Hessian functions to the optimizer. Autograd packages like tensorflow and PyTorch could do that in an automated manner.
- **Fine-tune the model hyperparameters.** The number of trees and their respective depth would be the obvious place to start.
- **Use a more sophisticated boosting algorithm.** Our boosting algorithm is pretty bare-bones. There exist many variations and advancements that could easily outperform this implementation.

These considerations would likely help to further improve our results.

## Conclusion — what else can we do?

The normal distribution was probably the most obvious option, however there are many more interesting choices.

Consider, for example, a skewed version of the Gaussian distribution. Most data in practice are far from symmetric around the mean. Accounting for such behaviour could turn out to be fairly advantageous.

However, we definitely increase the risk of overfitting with an increasing number of distributional parameters. Each new parameter in our target distribution means another Gradient Boosting model after all.

To mitigate that risk, regularization will be necessary. The simplest form of regularization might be a reduction of the number of tree stumps (n_trees) in our algorithm. This could already suffice before we even have to consider more sophisticated regularization approaches.

All in all, Gradient Boosting with probabilistic outputs can be fairly helpful in case you need to assess the noise in your target variable. I wrote two other pieces here and here about why I believe that this is generally a good idea.

Let me know in the comments if you have any questions or feedback in general.
