# Better time series forecasting using expert knowledge

02/15/19

Methods for time series forecasting have become more and more powerful in recent decades, ranging from simple linear models to complex machine learning algorithms. Nevertheless, not only the quality of the forecasts is important, but also their acceptance by the staff. Automatic forecasts in particular can meet with distrust and incomprehension among long-serving dispatchers. Furthermore, long-standing senior employees often have a very good overview of customer behavior, the market situation and its development, economic conditions, and many other important factors. It therefore makes sense to include this expert knowledge in the predictions of machine learning algorithms.

The following blog post therefore shows a way to include expert knowledge in the predictions of arbitrary algorithms (Python source code: Maximum Entropy Example).

## Basic forecast using Facebook Prophet

We start with the famous air passengers time series, which shows the monthly totals of international airline passengers (in thousands) from 1949 to 1960, and from which we would like to predict the year 1960:

In order to do this, we use Facebook Prophet with multiplicative seasonality:

The red circles mark the forecasts for May and July 1960, which are visibly off. Fortunately, Facebook Prophet provides us not only with the point forecasts, but also with associated Markov chain Monte Carlo samples \(y^{\ast}_{i}\) from the posterior predictive distribution of each forecast step. Let’s take a look at the kernel density estimate of the posterior predictive distribution \(p_{0}\left(y\right)\) of the forecast for May 1960:

Calculating the integral \(\int_{-\infty}^{\infty} y \ p_{0}\left(y\right) dy \approx \frac{1}{n} \sum_{i=1}^{n} y^{\ast}_{i}\) yields the point forecast for May, which is \(\hat{y}_{\text{May}}= 440 \). In order to improve the forecast, it would be useful if we could enrich the posterior predictive distribution with expert views about future events.
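For illustration, this step can be sketched in a few lines of Python. The Gaussian samples below merely stand in for Prophet's posterior predictive MCMC samples (names and parameters are ours, not from the original source code):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the MCMC samples y*_i from the posterior predictive
# distribution of the May 1960 forecast (normally obtained from Prophet).
rng = np.random.default_rng(0)
samples = rng.normal(loc=440.0, scale=20.0, size=5000)

# Point forecast: Monte Carlo estimate of the integral E[y] = (1/n) * sum y*_i
point_forecast = samples.mean()

# Kernel density estimate of the posterior predictive density p0(y)
p0 = gaussian_kde(samples)
grid = np.linspace(360.0, 520.0, 200)
density = p0(grid)

print(f"point forecast: {point_forecast:.1f}")
```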

## Mathematical background

The starting point is the Kullback-Leibler divergence:

$$\text{KL}\left[p,p_{0}\right] = \int_{-\infty}^{\infty} p\left(y\right)\text{log}\frac{p\left(y\right)}{p_{0}\left(y\right)} dy.$$

Given the prior \(p_{0}\left(y\right)\), we seek the distribution \(p\left(y\right)\) that minimizes the functional \(\text{KL}\) subject to certain constraints. In other words: we are looking for the distribution \(p\left(y\right)\) that has some predefined properties and comes as close as possible to our prior knowledge \(p_{0}\left(y\right)\). The distribution \(p\left(y\right)\) is then called the Maximum Entropy distribution. What could these constraints look like? What could the expert say?

- “The probability of 400,000 or fewer passengers next July is, in my view, 5%.”

  \(\Leftrightarrow \int_{-\infty}^{400} p\left(y\right)dy \overset{!}{=} 0.05\)

- “We have a strongly growing economy, so I think that with 80% probability we will have between 440,000 and 480,000 passengers.”

  \(\Leftrightarrow \int_{440}^{480} p\left(y\right)dy \overset{!}{=} 0.8\)

- “I expect 460,000 passengers.”

  \(\Leftrightarrow \int_{-\infty}^{\infty} y \ p\left(y\right)dy \overset{!}{=} 460\)

Therefore, our constraints \(k=1,2,\dots,m\) are of the form

$$\int_{-\infty}^{\infty} F_{k}\left(y\right) \ p\left(y\right)dy \overset{!}{=} f_{k}.$$

What does \(F_{k}\left(y\right)\) mean? This is best understood by inspecting the second and third example constraints. For the second example, it is

$$F\left(y\right) = \begin{cases}
1, & \text{if } y \in \left[440,480\right]\\
0, & \text{else}
\end{cases}$$

and for the third example, we simply have

$$F\left(y\right) = y.$$
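In code, each constraint can be represented as a pair \(\left(F_{k}, f_{k}\right)\). A minimal sketch, using the interval bounds from the examples above (function names are ours, not from the original source code):

```python
import numpy as np

def F_tail(y):
    """Indicator of (-inf, 400]: tail-probability constraint."""
    return (np.asarray(y) <= 400).astype(float)

def F_interval(y):
    """Indicator of [440, 480]: interval-probability constraint."""
    y = np.asarray(y)
    return ((y >= 440) & (y <= 480)).astype(float)

def F_mean(y):
    """Identity: expected-value constraint."""
    return np.asarray(y, dtype=float)

# Constraints as (F_k, target value f_k) pairs
constraints = [(F_tail, 0.05), (F_interval, 0.8), (F_mean, 460.0)]
```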

In order to minimize \(\text{KL}\) under constraints, the Lagrange multipliers \(\boldsymbol{\lambda} = \lambda_{1}, \lambda_{2},\dots,\lambda_{m}\) have to be introduced. We arrive at the functional:

$$ L\left[p, \boldsymbol{\lambda}\right] = \int_{-\infty}^{\infty} p\left(y\right)\text{log}\frac{p\left(y\right)}{p_{0}\left(y\right)}dy-\lambda_{1}\left(\int_{-\infty}^{\infty} F_{1}\left(y\right) \ p\left(y\right)dy - f_{1}\right)-\dots-\lambda_{m}\left(\int_{-\infty}^{\infty} F_{m}\left(y\right) \ p\left(y\right)dy - f_{m}\right).$$

The first step is to calculate the derivatives of \(L\) with respect to \(p\) and \(\boldsymbol{\lambda}\) and to set them to zero. Beginning with the functional derivative with respect to \(p\), we get

$$\frac{\delta L}{\delta p} = \text{log}\frac{p\left(y\right)}{p_{0}\left(y\right)}+1 - \lambda_{1}F_{1}\left(y\right)-\dots-\lambda_{m}F_{m}\left(y\right)\overset{!}{=}0.$$ Solving for \(p\left(y\right)\) and normalizing the result, we arrive at the **Boltzmann distribution**

$$p_{B}\left(y\right) = \frac{1}{Z}p_{0}\left(y\right) e^{\lambda_{1}F_{1}\left(y\right)+\dots+\lambda_{m}F_{m}\left(y\right)}$$

with the normalizing constant

$$Z\left(\boldsymbol{\lambda}\right)=\int_{-\infty}^{\infty} p_{0}\left(y\right) e^{\lambda_{1}F_{1}\left(y\right)+\dots+\lambda_{m}F_{m}\left(y\right)}dy.$$

The partial derivatives of \(L\) with respect to \(\boldsymbol{\lambda}\) read

$$\frac{\partial L}{\partial \lambda_{k}} = \int_{-\infty}^{\infty} F_{k}\left(y\right) \ p\left(y\right)dy-f_{k}\overset{!}{=} 0,\ k=1,…,m.$$

As we already have calculated our normalized solution to \(p\left(y\right)\), which is \(p_{B}\left(y\right)\), we can insert this result into the derivatives:

$$\frac{\partial L}{\partial \lambda_{k}} = \int_{-\infty}^{\infty} F_{k}\left(y\right) \ \underbrace{\frac{1}{Z}p_{0}\left(y\right) e^{\lambda_{1}F_{1}\left(y\right)+\dots+\lambda_{m}F_{m}\left(y\right)}}_{p_{B}\left(y\right)}dy-f_{k}\overset{!}{=} 0,\ k=1,\dots,m.$$

This, however, means nothing more than: \(E_{p_{B}}\left[F_{k}\right]\overset{!}{=}f_{k},\ k=1,\dots,m\).
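Numerically, these expectations are cheap to evaluate: since we already have samples \(y^{\ast}_{i}\) from \(p_{0}\), the expectation of each \(F_{k}\) under \(p_{B}\) becomes a self-normalized importance-weighted average with weights proportional to \(e^{\sum_{k} \lambda_{k} F_{k}\left(y^{\ast}_{i}\right)}\). A sketch (function name is ours, not from the original source code):

```python
import numpy as np

def expectations_under_pB(samples, Fs, lam):
    """Estimate E[F_k] under p_B(y) ~ p0(y) * exp(sum_k lam_k F_k(y))
    by reweighting prior samples y*_i drawn from p0(y)."""
    samples = np.asarray(samples, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # Rows: constraints k, columns: samples i
    F = np.stack([Fk(samples) for Fk in Fs])
    log_w = lam @ F                   # log of the unnormalized weights
    w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    w /= w.sum()                      # self-normalized importance weights
    return F @ w                      # vector of E[F_k] estimates
```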

**We are finally there**: we have to find \(\boldsymbol{\lambda}\) such that the expected values of the functions \(F_{k}\) under \(p_{B}\) match the given values \(f_{k}\).

As the number of constraints rises, the numerical solution of this system of equations becomes increasingly hard to find. Because of the problem of multiple local minima, we refrain from using a gradient-based algorithm and instead use a heuristic one: the particle swarm algorithm (Python package pyswarm).
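The fitting step can be sketched as minimizing the sum of squared constraint violations with a gradient-free optimizer. The original source code uses pyswarm's `pso(objective, lb, ub)`; for a self-contained illustration we substitute SciPy's `differential_evolution`, another gradient-free heuristic, and a single hypothetical tail constraint:

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(1)
# Stand-in prior samples y*_i ~ p0 (here: a Gaussian around 440)
samples = rng.normal(loc=440.0, scale=20.0, size=4000)

def F_tail(y):
    """Indicator of (-inf, 420]."""
    return (np.asarray(y) <= 420).astype(float)

Fs, targets = [F_tail], np.array([0.01])  # P(y <= 420) should be 1%

def violation(lam):
    """Sum of squared deviations E[F_k] - f_k under p_B(lambda)."""
    F = np.stack([Fk(samples) for Fk in Fs])
    log_w = np.asarray(lam) @ F
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return float(np.sum((F @ w - targets) ** 2))

# With pyswarm this would be pso(violation, lb, ub); here:
result = differential_evolution(violation, bounds=[(-10.0, 10.0)], seed=0)
lam_opt = result.x
```

A negative \(\lambda\) is expected here, since the constraint pushes probability mass out of the left tail relative to the prior.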

## Improving the forecasts for May and July

In this section we will make up expert assessments for May and July 1960 and show how the forecasts are affected.

The expert assessment for May:

“This May we had 420,000 passengers, and we will definitely not have fewer in May 1960 (probability of fewer: 1%). Furthermore, given the numbers of the last three years, I am sure that a growth rate of between 7.5% and 15% compared to this May is extremely probable (probability 80%). An increase of 15% or more compared to this May, however, is in my opinion unrealistic (probability 1%).”

This results in the following constraints:

- \(\int_{-\infty}^{420} p_{B}\left(y\right)dy \overset{!}{=} 0.01\)
- \(\int_{451}^{483} p_{B}\left(y\right)dy \overset{!}{=} 0.8\)
- \(\int_{483}^{\infty} p_{B}\left(y\right)dy \overset{!}{=} 0.01\)

The expert assessment for July:

“This July we had 448,000 passengers. Comparing the Julys of the past five years, we see an average increase of 50,000 passengers per year. Due to the good economic situation, I am sure that we will at least match this growth (probability 80%).” This yields the constraint: \(\int_{498}^{\infty} p_{B}\left(y\right)dy \overset{!}{=} 0.8\).

The following two figures show the distributions of the Facebook Prophet forecasts and the associated Maximum Entropy distributions. As can be seen, the expert’s assessments lead to distributions that differ significantly from the prior distributions. Nevertheless, the Maximum Entropy distributions have the smallest possible distance to the priors, while maintaining the given constraints.

In the last figure, the forecasts resulting from the Maximum Entropy distributions as well as the Facebook Prophet forecasts are shown. The RMSE of the Facebook Prophet forecasts is 64.90. Using the Maximum Entropy approach leads to an RMSE of 30.94, which corresponds to a reduction of approximately 52%.

This artificial example is intended to show that the inclusion of expert assessments, which in many cases may reflect only gut instincts or common sense, can be useful to improve the forecasts of complex machine learning algorithms. In addition, the inclusion of employee opinions may also increase the general acceptance of forecasts.

### References:

Kullback, S., Information Theory and Statistics, John Wiley & Sons, 1959.

Singer, H., Maximum entropy inference for mixed continuous-discrete variables, International Journal of Intelligent Systems, John Wiley & Sons, 2010.