Forecasting with Decision Trees and Random Forests

Introduction

Today, Deep Learning dominates many areas of modern machine learning. On the other hand, Decision Tree based models still shine particularly for tabular data. If you look up the winning solutions of respective Kaggle challenges, chances are high that a tree model is among them.

A key advantage of tree approaches is that they typically don’t require too much fine-tuning for reasonable results. This is in stark contrast to Deep Learning. Here, different topologies and architectures can result in dramatical differences in model performance.

For time-series forecasting, decision trees are not as straightforward as for tabular data, though:

Challenges with trees and forests for forecasting

As you probably know, fitting any decision tree based methods requires both input and output variables. In a univariate time-series problem, however, we usually only have our time-series as a target.

To work around this issue, we need to augment the time-series to become suitable for tree models. Let us discuss two intuitive, yet false approaches and why they fail first. Obviously, the issues generalize to all Decision Tree ensemble methods.

Decision Tree forecasting as regression against time

Probably the most intuitive approach is to consider the observed time-series as a function of time itself, i.e. \[ y_t=f(t)+\epsilon \] With some i.i.d. stochastic additive error term. In an earlier article, I have already made some remarks on why regression against time itself is problematic. For tree based models, there is another problem:

Decision Trees for regression against time cannot extrapolate into the future.

By construction, Decision Tree predictions are averages of subsets of the training dataset. These subsets are formed by splitting the space of input data into axis-parallel hyper rectangles. Then, for each hyper rectangle, we take the average of all observation outputs inside those rectangles as a prediction.

For regression against time, those hyper rectangles are simply splits of time intervals. More exactly, those intervals are mutually exclusive and completely exhaustive.

Predictions are then the arithmetic means of the time-series observations inside those intervals. Mathematically, this roughly translates to \[ \hat{f}(t)=\sum_{i=1}^N \frac{1}{\left|\left\{y_s, s \in\{1 ; \ldots ; T\} \cap I_i\right\}\right|} \sum_{s=1}^T y_s \cdot \mathbb{I}_{s \in I_i} \mathbb{I}_{t \in I_i} \] where \(y_1, \ldots, y_T\) : training observations \[ \begin{gathered} I_i=\left(l_i, u_i\right] \\ l_i, u_i \in \mathbb{R}_0^{+} ; l_i<u_i ; l_1=0, u_N=\infty \\ I_i \cap I_{j \neq i}=\emptyset ; \bigcup_{i=1}^N I_i=\mathbb{R}_0^{+} \\ \mathbb{I}_{s \in I_i}= \begin{cases}1 & s \in I_i \\ 0 & \text { else }\end{cases} \end{gathered} \] Consider now using this model to predict the time-series at some time in the future. This reduces the above formula to the following: \[ \begin{gathered} \hat{f}(\tilde{t})=\frac{1}{\left|\left\{y_s, s \in\{1 ; \ldots ; T\} \cap I_T\right\}\right|} \sum_{s=1}^T y_s \cdot \mathbb{I}_{s \in I_T} \\ \text { for all } \tilde{t} \geq T \end{gathered} \] In words: For any forecast, our model always predicts the average of the final training interval. Which is clearly useless…

Let us visualize this issue on a quick toy example:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

#create data with linear trend
np.random.seed(123)
t = np.arange(100)
y = t + 2 * np.random.normal(size = 100)#linear trend

t_train = t[:50].reshape(-1,1)
t_test = t[50:].reshape(-1,1)

y_train = y[:50]
y_test = y[50:]

tree = DecisionTreeRegressor(max_depth = 2)
tree.fit(t_train, y_train)

y_pred_train = tree.predict(t_train)
y_pred_test = tree.predict(t_test)

plt.figure(figsize = (16,8))
plt.plot(t_train.reshape(-1), y_train, label = "Training data", color="blue", lw=2)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_train[-1]],y_test]), label = "Test data", 
         color="blue", ls = "dotted", lw=2)


plt.plot(t_train.reshape(-1), y_pred_train, label = "Decision Tree insample predictions", 
         color="red", lw = 3)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_pred_train[-1]],y_pred_test]), label = "Decision Tree out-of-sample predictions", 
         color="purple", lw=3)

plt.grid(alpha = 0.5)
plt.axvline(t_train[-1], color="black", lw=2, ls="dashed")
plt.legend(fontsize=13)
plt.title("Decision Tree VS. Time-Series with linear trend", fontsize=15)

plt.margins(x=0)

The same issues obviously arise for seasonal patterns as well:

#create data with seasonality 
np.random.seed(123)
t = np.arange(100)
y = np.sin(0.5 * t) + 0.5 * np.random.normal(size = 100)#sine seasonality

t_train = t[:50].reshape(-1,1)
t_test = t[50:].reshape(-1,1)

y_train = y[:50]
y_test = y[50:]

tree = DecisionTreeRegressor(max_depth = 4)
tree.fit(t_train, y_train)

y_pred_train = tree.predict(t_train)
y_pred_test = tree.predict(t_test)

plt.figure(figsize = (16,8))
plt.plot(t_train.reshape(-1), y_train, label = "Training data", color="blue", lw=2)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_train[-1]],y_test]), label = "Test data", 
         color="blue", ls = "dotted", lw=2)


plt.plot(t_train.reshape(-1), y_pred_train, label = "Decision Tree insample predictions", 
         color="red", lw = 3)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_pred_train[-1]],y_pred_test]), label = "Decision Tree out-of-sample predictions", 
         color="purple", lw=3)

plt.grid(alpha = 0.5)
plt.axvline(t_train[-1], color="black", lw=2, ls="dashed")
plt.legend(fontsize=13)
plt.title("Decision Tree VS. Time-Series with seasonality", fontsize=15)

plt.margins(x=0)

To generalize the above in a single sentence:

Decision Trees fail for out-of-distribution data but in regression against time, every future point in time is out-of-distribution.

Thus, we need to find a different approach.

Decision Trees for auto-regressive forecasting

A far more promising approach is the auto-regressive one. Here, we simply view the future of a random variable as dependent on its past realizations. \[ y_t=f\left(y_{t-1}, \ldots, y_{t-k}\right)+\epsilon \] While this approach is easier to handle than regression on time, it doesn’t come without a cost:

The time-series must be observed at equi-distant timestamps: If your time-series is measured at random times, you cannot use this approach without further adjustments.
The time-series should not contain missing values: For many time-series models, this requirement is not mandatory. Our Decision Tree/Random Forest forecaster, however, will require a fully observed time-series.

As these caveats are common for most popular time-series approaches, they aren’t too much of an issue.

Now, before jumping into an example, we need to take a another look at a previously discussed issue: Tree based models can only predict within the range of training data. This implies that we cannot just fit a Decision Tree or Random Forest to model auto-regressive dependencies.

To exemplify this issue, let’s do another example:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import pandas as pd

#create data with linear trend
np.random.seed(123)
t = np.arange(100)
y = t + 2 * np.random.normal(size = 100)#linear trend

t_train = t[:50].reshape(-1,1)
t_test = t[50:].reshape(-1,1)


y_train = y[:50]
X_train_shift = np.concatenate([pd.Series(y_train).shift(t).values.reshape(-1,1) for t in range(1,6)],1)[5:,:]
y_train_shift = y_train[5:]
y_test = y[50:]

tree = DecisionTreeRegressor(max_depth = 2)
tree.fit(X_train_shift, y_train_shift)

y_pred_train = tree.predict(X_train_shift).reshape(-1)

Xt = np.concatenate([X_train_shift[-1,1:].reshape(1,-1),np.array(y_train_shift[-1]).reshape(1,1)],1)
predictions_test = []

for t in range(len(y_test)):
    pred = tree.predict(Xt)
    predictions_test.append(pred[0])
    Xt = np.concatenate([Xt[-1,1:].reshape(1,-1),np.array(pred).reshape(1,1)],1)
    
y_pred_test = np.array(predictions_test)


plt.figure(figsize = (16,8))
plt.plot(t_train.reshape(-1), y_train, label = "Training data", color="blue", lw=2)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_train[-1]],y_test]), label = "Test data", 
         color="blue", ls = "dotted", lw=2)


plt.plot(t_train.reshape(-1)[5:], y_pred_train, label = "Decision Tree insample predictions", 
         color="red", lw = 3)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_pred_train[-1]],y_pred_test]), label = "Decision Tree out-of-sample predictions", 
         color="purple", lw=3)

plt.grid(alpha = 0.5)
plt.axvline(t_train[-1], color="black", lw=2, ls="dashed")
plt.legend(fontsize=13)
plt.title("Decision Tree VS. Time-Series with linear trend", fontsize=15)

plt.margins(x=0)

Again, not useful at all. To fix this last issue, we need to first remove the trend. Then we can fit the model, forecast the time-series and ‘re-trend’ the forecast.

For de-trending, we basically have two options:

Fit a linear trend model - here we regress the time-series against time in a linear regression model. Its predictions are then subtracted from the training data to create a stationary time-series. This removes a constant, deterministic trend.
Use first-differences - in this approach, we transform the time-series via first order differencing. In addition to the deterministic trend, this approach can also remove stochastic trends.

As most time-series are driven by randomness, the second approach appears more reasonable. Thus, we now aim to forecast the transformed time-series \[ \Delta y_t=y_t-y_{t-1} \] by an autoregressive model, i.e. \[ \Delta y_t=f\left(\Delta y_{t-1}, \ldots, \Delta y_{t-k}\right)+\epsilon \] Obviously, differencing and lagging remove some observations from our training data. Some care should be taken to not remove too much information that way. I.e. don’t use too many lagged variables if your dataset is small.

To obtain a forecast for the original time-series we need to retransform the differenced forecast via \[ \hat{y}_{t+1}=\hat{\Delta} y_{t+1}+y_t \] and, recursively for further ahead forecasts: \[ \hat{y}_{t+h}=\hat{\Delta} y_{t+h}+\hat{y}_{t+h-1} \] For our running example this finally leads to a reasonable solution:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

#create data with linear trend
np.random.seed(123)
t = np.arange(100)
y = t + 2* np.random.normal(size = 100)#linear trend

t_train = t[:50].reshape(-1,1)
t_test = t[50:].reshape(-1,1)

n_lags = 10

y_train = y[:50]
X_train_shift = pd.concat([pd.DataFrame(y_train).shift(t) for t in range(1,n_lags)],1).diff().values[n_lags:,:]
y_train_shift = np.diff(y_train)[n_lags-1:]
y_test = y[50:]

tree = DecisionTreeRegressor(max_depth = 1)
tree.fit(X_train_shift, y_train_shift)

y_pred_train = tree.predict(X_train_shift).reshape(-1)

Xt = np.concatenate([X_train_shift[-1,1:].reshape(1,-1),np.array(y_train_shift[-1]).reshape(1,1)],1)
predictions_test = []

for t in range(len(y_test)):
    pred = tree.predict(Xt)
    predictions_test.append(pred[0])
    Xt = np.concatenate([np.array(pred).reshape(1,1),Xt[-1,1:].reshape(1,-1)],1)
    
y_pred_test = np.array(predictions_test)

y_pred_train = y_train[n_lags-2]+np.cumsum(y_pred_train)
y_pred_test = y_train[-1]+np.cumsum(y_pred_test)



plt.figure(figsize = (16,8))
plt.plot(t_train.reshape(-1), y_train, label = "Training data", color="blue", lw=2)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_train[-1]],y_test]), label = "Test data", 
         color="blue", ls = "dotted", lw=2)


plt.plot(t_train.reshape(-1)[n_lags:], y_pred_train, label = "Decision Tree insample predictions", 
         color="red", lw = 3)
plt.plot(np.concatenate([np.array(t_train[-1]),t_test.reshape(-1)]), 
         np.concatenate([[y_pred_train[-1]],y_pred_test]), label = "Decision Tree out-of-sample predictions", 
         color="purple", lw=3)

plt.grid(alpha = 0.5)
plt.axvline(t_train[-1], color="black", lw=2, ls="dashed")
plt.legend(fontsize=13)
plt.title("Decision Tree VS. Time-Series with linear trend", fontsize=15)

plt.margins(x=0)

/var/folders/2d/hl2cr85d2pb2kfbmsng3267c0000gn/T/ipykernel_67956/2778805950.py:16: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only.
  X_train_shift = pd.concat([pd.DataFrame(y_train).shift(t) for t in range(1,n_lags)],1).diff().values[n_lags:,:]

From trees to Random Forecast forecasts

Let us now apply the above approach to a real-world dataset. We use the alcohol sales data from the St. Louis Fed database. For evaluation, we use the last four years as a holdout set:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("../data/Alcohol_Sales.csv")
df.columns = ["date", "sales"]
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

df_train = df.iloc[:-48]
df_test = df.iloc[-48:]

plt.figure(figsize = (18,7))
plt.plot(df, label="Training data")
plt.plot(df_test, label = "Test data")
plt.grid(alpha=0.5)
plt.margins(x=0)
plt.title("Alcohol Sales")

Text(0.5, 1.0, 'Alcohol Sales')

Growing an autoregressive Random Forest for forecasting

Since a single Decision Tree would be boring at best and inaccurate at worst, we’ll use a Random Forest instead. Besides the typical performance improvements, Random Forests allow us to generate forecast intervals.

To create Random Forest forecast intervals, we proceed as follows:

Train an autoregressive Random Forest: This step is equivalent to fitting the Decision Tree as before
Use a randomly drawn Decision Tree at each forecast step: Instead of just forest.predict(), we let a randomly drawn, single Decision Tree perform the forecast. By repeating this step multiple times, we create a sample of Decision Tree forecasts.
Calculate quantities of interest from the Decision Tree sample: This could range from median to standard deviation or more complex targets. We are primarily interested in a mean forecast and the 90% predictive interval.

The following Python class does everything we need:

from sklearn.ensemble import RandomForestRegressor
from copy import deepcopy


class RandomForestARModel():
    """
    Autoregressive forecasting with Random Forests
    """
    
    def __init__(self, n_lags=1, max_depth = 3, n_estimators=1000, random_state = 123,
                 log_transform = False, first_differences = False, seasonal_differences = None):
        """
        Args:
            n_lags: Number of lagged features to consider in autoregressive model
            max_depth: Max depth for the forest's regression trees
            random_state: Random state to pass to random forest
            
            log_transform: Whether the input should be log-transformed
            first_differences: Whether the input should be singly differenced
            seasonal_differences: Seasonality to consider, if 'None' then no seasonality is presumed
        """
        
        self.n_lags = n_lags
        self.model = RandomForestRegressor(max_depth = max_depth, n_estimators = n_estimators, random_state = random_state)
        
        self.log_transform = log_transform
        self.first_differences = first_differences
        self.seasonal_differences = seasonal_differences
        
        
        
    def fit(self, y):
        """
        Args:
            y: training data (numpy array or pandas series/dataframe)
        """
        #enable pandas functions via dataframes
        y_df = pd.DataFrame(y)
        self.y_df = deepcopy(y_df)
        
        #apply transformations and store results for retransformations
        if self.log_transform:
            y_df = np.log(y_df)
            self.y_logged = deepcopy(y_df)
        
        if self.first_differences:
            y_df = y_df.diff().dropna()
            self.y_diffed = deepcopy(y_df)
        
        if self.seasonal_differences is not None:
            y_df = y_df.diff(self.seasonal_differences).dropna()
            self.y_diffed_seasonal = deepcopy(y_df)
        
        
        #get lagged features
        Xtrain = pd.concat([y_df.shift(t) for t in range(1,self.n_lags+1)],axis=1).dropna()
        self.Xtrain = Xtrain
        
        ytrain = y_df.loc[Xtrain.index,:]
        self.ytrain = ytrain

        self.model.fit(Xtrain.values,ytrain.values.reshape(-1))

    
    
    def sample_forecast(self, n_periods = 1, n_samples = 10000, random_seed =123):
        """
        Draw forecasting samples by randomly drawing from all trees in the forest per forecast period
        Args:
            n_periods: Ammount of periods to forecast
            n_samples: Number of samples to draw
            random_seed: Random seed for numpy
        """
        samples = self._perform_forecast(n_periods, n_samples, random_seed)
        output = self._retransform_forecast(samples, n_periods)
        
        return output
    
    
    
    def _perform_forecast(self, n_periods, n_samples, random_seed):
        """
        Forecast transformed observations
        Args:
            n_periods: Ammount of periods to forecast
            n_samples: Number of samples to draw
            random_seed: Random seed for numpy
        """
        samples = []
        
        np.random.seed(random_seed)
        for i in range(n_samples):
            #store lagged features for each period
            Xf = np.concatenate([self.Xtrain.iloc[-1,1:].values.reshape(1,-1),
                                 self.ytrain.iloc[-1].values.reshape(1,1)],1)

            forecasts = []

            for t in range(n_periods):
                tree = self.model.estimators_[np.random.randint(len(self.model.estimators_))]
                pred = tree.predict(Xf)[0]
                forecasts.append(pred)
                
                #update lagged features for next period
                Xf = np.concatenate([Xf[:,1:],np.array([[pred]])],1)
            
            samples.append(forecasts)
        
        return samples
    
    
    
    def _retransform_forecast(self, samples, n_periods):
        """
        Retransform forecast (re-difference and exponentiate)
        Args:
            samples: Forecast samples for retransformation
            n_periods: Ammount of periods to forecast
        """
        
        full_sample_tree = []

        for samp in samples:
            draw = np.array(samp)
            
            #retransform seasonal differencing
            if self.seasonal_differences is not None:
                result = list(self.y_diffed.iloc[-self.seasonal_differences:].values)
                for t in range(n_periods):
                    result.append(result[t]+draw[t])
                result = result[self.seasonal_differences:]
            else:
                result = []
                for t in range(n_periods):
                    result.append(draw[t])
            
            #retransform first differences
            y_for_add = self.y_logged.values[-1] if self.log_transform else self.y_df.values[-1]
            
            if self.first_differences:
                result = y_for_add + np.cumsum(result)
            
            #retransform log transformation
            if self.log_transform:
                result = np.exp(result)
            
            full_sample_tree.append(result.reshape(-1,1))

        return np.concatenate(full_sample_tree,1)

As our data is strictly positive, has a trend and yearly seasonality, we apply the following transformations:

Logarithm transformation: Our forecasts then need to be re-transformed via an exponential transform. Thus, the exponentiated results will be strictly positive as well
First differences: As mentioned above, this removes the linear trend in the data.
Seasonal differences: Seasonal differencing works like first differences with higher lag orders. Also, it allows us to remove both deterministic and stochastic seasonality. The main challenge with all these transformations is to correctly apply their inverse on our predictions. Luckily, the above model has these steps implemented already.

Evaluating the Random Forest forecast

Using the data and the model, we get the following result for our test period:

model = RandomForestARModel(n_lags = 2, log_transform = True, first_differences = True, seasonal_differences = 12)
model.fit(df_train)

predictions_forest = model.sample_forecast(n_periods=len(df_test), n_samples=10000)


means_forest = np.mean(predictions_forest,1)
lowers_forest = np.quantile(predictions_forest,0.05,1)
uppers_forest = np.quantile(predictions_forest,0.95,1)

plt.figure(figsize = (18,7))

plt.grid(alpha=0.5)

plt.plot(df.iloc[-120:], label = "Training observations (truncated)")
plt.plot(df_test, color = "blue", label = "Out-of-sample observations", ls="dashed")

plt.plot(df_test.index,means_forest,color="purple", label = "RF mean forecast")

plt.fill_between(df_test.index, lowers_forest, uppers_forest, color="purple", alpha=0.5, label = "RF 90% forecast inverval")

plt.legend(fontsize=13)
plt.margins(x=0)

This looks quite good. To verify that we were not just lucky, we use a simple benchmark for comparison:

from scipy.stats import gaussian_kde

df_train_diffed = np.log(df_train["sales"]).diff().dropna()
df_train_trans = df_train_diffed.diff(12).dropna()

kde = gaussian_kde(df_train_trans.values)

target_range = np.linspace(np.min(df_train_trans.values)-0.5,np.max(df_train_trans.values)+0.5,num=100)

full_sample_toy = [] 
np.random.seed(123)
for i in range(10000):
    draw = kde.resample(len(df_test)).reshape(-1)
    result = list(df_train_diffed.iloc[-12:].values)

    for t in range(len(df_test)):
        result.append(result[t]+draw[t])

    full_sample_toy.append(np.exp(np.array((np.log(df_train.values[-1])+np.cumsum(result[12:]))).reshape(-1,1)))

    
predictions_toy = np.concatenate(full_sample_toy,1)
means_toy = np.mean(predictions_toy,1)
lowers_toy = np.quantile(predictions_toy,0.05,1)
uppers_toy = np.quantile(predictions_toy,0.95,1)

plt.figure(figsize = (18,7))

plt.grid(alpha=0.5)

plt.plot(df.iloc[-120:], label = "Training observations (truncated)")
plt.plot(df_test, color = "blue", label = "Out-of-sample observations", ls="dashed")

plt.plot(df_test.index,means_toy,color="red", label = "Benchmark mean forecast")

plt.fill_between(df_test.index, lowers_toy, uppers_toy, color="red", alpha=0.5, label = "Benchmark 90% forecast inverval")

plt.legend(fontsize=13)
plt.margins(x=0)

Apparently, the benchmark intervals are much worse than for the Random Forest. The mean forecast starts out reasonably but quickly deteriorates after a few steps.

Let’s compare both mean forecasts in a single chart:

plt.figure(figsize = (18,7))

plt.grid(alpha=0.5)

plt.plot(df.iloc[-120:], label = "Training observations (truncated)")
plt.plot(df_test, color = "blue", label = "Out-of-sample observations", ls="dashed")

plt.plot(df_test.index,means_forest,color="purple", label = "RF mean forecast",lw = 3)
plt.plot(df_test.index,means_toy,color="red", label = "Benchmark mean forecast", lw = 3)

plt.legend(fontsize=13)
plt.margins(x=0)

rmse_forest = np.sqrt(np.mean((df_test.values[:,0] - means_forest)**2))
rmse_toy = np.sqrt(np.mean((df_test.values[:,0] - means_toy)**2))

print("Random Forest: {}".format(rmse_forest))
print("Benchmark: {}".format(rmse_toy))

Random Forest: 909.7996221364062
Benchmark: 6318.1429838549

Clearly, the Random Forest is far superior for longer horizon forecasts.

Conclusion

Hopefully, this article gave you some insights on the do’s and dont’s of forecasting with tree models. While a single Decision Tree might be useful sometimes, Random Forests are usually more performant. That is, unless your dataset is very tiny in which case you could still reduce max_depth of your forest trees.

Obviously, you could add easily add external regressors to either model to improve performance further. As an example, adding monthly indicators to our model might yield more accurate results than right now.

As an alternative to Random Forests, Gradient Boosting could be considered. Nixtla’s mlforecast package has a very powerful implementation - besides all their other great tools for forecasting. Keep in mind however, that we cannot transfer the algorithm for forecast intervals to Gradient Boosting.

On another note, keep in mind that forecasting with advanced machine learning is a double-edged sword. While powerful at the surface, ML for time-series can overfit much quicker than for cross-sectional problems. As long as you properly test your model against some benchmarks, though, they should not be overlooked either.

PS: You can find a full notebook for this article here.

References

[1] Breiman, Leo. Random forests. Machine learning, 2001, 45.1, p. 5-32.

[2] Breiman, Leo, et al. Classification and regression trees. Routledge, 2017.

[3] Hamilton, James Douglas. Time series analysis. Princeton university press, 2020.