Time Series Regression Model

1. GDP By Region

The GDP by Region (in millions) data obtained from Statistics New Zealand is annual, so it needs to be converted into monthly data. The conversion is done in two parts: interpolate the annual data into quarterly data, then interpolate the resulting quarterly data into monthly data. The formula used for converting the annual data into quarterly data is as follows:

\[\text{GDP}_{year,quarter} = \frac{\text{GDP}_{year + 1}}{4} + \frac{\Delta_{year + 1}\times \text{nb}}{10}, \hspace{.3in} \Delta_{year + 1} = \text{GDP}_{year + 1} - \text{GDP}_{year}\] where \(\text{nb} = 1,2,3,4\) corresponding to \(quarter = Q1,Q2,Q3,Q4\), respectively for each \(year = 2008,...,2018\).
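A minimal sketch of this step in R, assuming the annual values are stored in a numeric vector gdp.annual (a hypothetical name) with one element per year:

# Convert annual GDP values into a quarterly ts object using the formula above
annual_to_quarterly <- function(gdp.annual, start.year = 2008) {
  quarterly <- numeric(0)
  for (i in 1:(length(gdp.annual) - 1)) {
    delta <- gdp.annual[i + 1] - gdp.annual[i]                  # Delta_{year+1}
    quarterly <- c(quarterly,
                   gdp.annual[i + 1] / 4 + delta * (1:4) / 10)  # nb = 1,...,4
  }
  ts(quarterly, start = c(start.year, 1), frequency = 4)
}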



The quarterly data can then be converted into monthly data using cubic interpolation (spline() function).
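A sketch of this second step, assuming gdp.quarterly is the quarterly series produced above; spline() is from the stats package:

# Cubic spline interpolation at three points per quarter, i.e. roughly monthly
gdp.monthly <- ts(spline(time(gdp.quarterly), gdp.quarterly,
                         n = 3 * length(gdp.quarterly))$y,
                  start = start(gdp.quarterly), frequency = 12)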



2. Seasonal Dummy

Figure 1. Monthly seasonal time series data


Comment: The boxplot shows that Family A frequently bought oranges in most months over the past eight years, with purchases high in the majority of months; the less popular months are April, June and July.

The aim is to forecast the number of oranges bought by Family A in 2018. We can model this data using a regression model with a linear trend and monthly dummy variables, \[y_t = \beta_0 + \beta_1t + \beta_2m_{2,t} + \beta_3m_{3,t} + \dots + \beta_{12}m_{12,t} + \epsilon_t\] where \(\beta_1\) is the coefficient of the linear trend and \(m_{2,t},\dots,m_{12,t}\) are the seasonal dummy variables for the twelve months. Notice that only eleven dummy variables are needed to code twelve categories: the first category (by default January) is captured by the intercept and corresponds to all dummy variables being set to zero. In R the trend predictor is coded as trend and the seasonal dummy predictor is coded as season; in the code below, relevel() instead sets the reference month to the month with the lowest total purchases.

sumVal <- tapply(training, cycle(training), FUN = sum)  # total oranges bought per calendar month
# Linear trend plus seasonal dummies, with the lowest-volume month as the reference level
fit.reg <- tslm(training ~ trend + relevel(season, ref = which.min(sumVal)))
# Shorten the long relevel() coefficient names to "month <number>"
names(fit.reg$coefficients)[3:13] <- paste("month", substr(names(fit.reg$coefficients)[3:13], 41, 42))
summary(fit.reg)
## 
## Call:
## tslm(formula = training ~ trend + relevel(season, ref = which.min(sumVal)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.167  -5.583   0.000   5.583  22.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  16.3056    14.8390   1.099    0.295  
## trend        -0.5278     0.6310  -0.836    0.421  
## month 1      24.8889    18.7190   1.330    0.211  
## month 2      17.4167    18.6444   0.934    0.370  
## month 3      12.9444    18.5909   0.696    0.501  
## month 4      22.9722    18.5588   1.238    0.242  
## month 6      15.0278    18.5588   0.810    0.435  
## month 7       8.0556    18.5909   0.433    0.673  
## month 8      13.0833    18.6444   0.702    0.497  
## month 9       9.6111    18.7190   0.513    0.618  
## month 10     36.1389    18.8145   1.921    0.081 .
## month 11     17.6667    18.9305   0.933    0.371  
## month 12     30.6944    19.0668   1.610    0.136  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.55 on 11 degrees of freedom
## Multiple R-squared:  0.3816, Adjusted R-squared:  -0.293 
## F-statistic: 0.5657 on 12 and 11 DF,  p-value: 0.8291


Comment: The p-value reported is the probability of the estimated \(\beta\) coefficient being as large as it is if there were no real relationship between the response variable and the corresponding predictor. In this case, no month is shown to have a significant effect on the number of oranges bought, implying the seasonality is not significant.


Figure 2. Forecasts from the fitted regression model



Comment: Figure 3 plots the actual data versus the fitted data. If the predictions are close to the actual values, \(\text{R}^2\) is expected to be close to 1. The Pearson correlation test shows that the correlation between these variables is r = 0.6177605, so \(\text{R}^2 = r^2 \approx 0.38\). The model therefore does not seem like an appropriate choice for fitting the data of Family A, as it only explains about 38% of the variation in the data.

Figure 3. Time plot of Family A's true and predicted bought oranges
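The output below comes from cor.test() applied to the training series and the fitted values:

cor.test(training, fitted(fit.reg))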

## 
##  Pearson's product-moment correlation
## 
## data:  training and fitted(fit.reg)
## t = 3.6847, df = 22, p-value = 0.001297
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2855148 0.8174473
## sample estimates:
##       cor 
## 0.6177605
Figure 4. Time plot of Tasman visitors and predicted Tasman visitors


Note: A plot of the residuals against the fitted values should show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors, which means that the variance of the residuals may not be constant.

Based on the residual plots, the residuals show no obvious patterns or trends, indicating constant variance.
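A sketch of these diagnostics in R (checkresiduals() is from the forecast package):

# Residuals against fitted values; a flat, patternless cloud suggests constant variance
plot(fitted(fit.reg), residuals(fit.reg), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
checkresiduals(fit.reg)  # time plot, ACF and histogram of residuals, plus a Breusch-Godfrey test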

Figure 5. Residuals of regression model including trend and seasonal dummy predictors

The CV() function (short for cross-validation statistic) computes the CV, AIC, AICc and BIC measures of predictive accuracy for a linear model. For these measures, lower values of CV, AIC, AICc and BIC indicate a better fit, while a higher Adjusted \(\text{R}^2\) indicates a better fit. Note: this is useful when studying the effect of each predictor, but is not particularly useful for forecasting.

CV(fit.reg)
##          CV         AIC        AICc         BIC       AdjR2 
## 750.6115702 149.4537085 196.1203751 165.9464621  -0.2929596

The purpose of these measures is to select the best predictors to use in a regression model when there are multiple candidates. Here the result (in particular the negative Adjusted \(\text{R}^2\)) suggests that the seasonal dummy variables are not required in the model.
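One way to check this is to refit the model without the seasonal dummies and compare the measures (a sketch):

fit.trend <- tslm(training ~ trend)  # linear trend only, no seasonal dummies
CV(fit.trend)                        # lower CV/AIC/AICc/BIC than CV(fit.reg) would confirm this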


Other methods

The following methods are fitted to the full data (2013-2017) instead of the training data. This is to see whether the prediction results improve now that there is more data to fit the models, or whether they stay the same, since the data resembles white noise. The 2018 data is then used as a test set to compute the error measures of each model and compare their performance.


1. STL-ETS model

The stlf() function decomposes the time series using STL, forecasts the seasonally adjusted data (the data without seasonality) and returns the ‘reseasonalized’ forecasts (forecasts that take the seasonality into account). If the method argument is not specified, the function applies the ETS approach to the seasonally adjusted series.

fulldata <- window(ts.A, start = c(2013, 1), end = c(2017, 12))       # full 2013-2017 series
STL <- stlf(fulldata, biasadj = TRUE, h = 12, level = c(30, 50, 70))  # STL decomposition + ETS forecasts
STL$model$aic  # AIC of the underlying ETS model
## [1] 566.1593


2. NNAR model

The nnetar() function in the forecast package for R fits a neural network autoregression (NNAR) model to a time series, with lagged values of the series as inputs (and possibly some other exogenous inputs). It is therefore a non-linear autoregressive model, allowing complex non-linear relationships between the response variable and its predictors. The NNAR model for seasonal data has the form \[NNAR(p,P,k)[m]\] where \(p\) is the number of lagged inputs, \(P\) is the number of seasonal lags, \(k\) is the number of nodes in the hidden layer and \(m\) is the seasonal period.

set.seed(2015)
nnetar.model <- nnetar(fulldata)
NNAR <- forecast(nnetar.model, h = 12, biasadj = TRUE, level = c(30,50,70))
nnetar.model
## Series: fulldata 
## Model:  NNAR(1,1,2)[12] 
## Call:   nnetar(y = fulldata)
## 
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units 
## 
## sigma^2 estimated as 193.6

Since NNAR models usually (and in this case) have no underlying statistical model, calculating an AIC/BIC does not make sense here. A possible way to select the best model is to fit various models to 90% of the data and use them to forecast the remaining 10%, i.e., use a holdout sample. Choose the model that performs best on the holdout sample (“best” depending on the chosen error measure(s)), then refit that model on the entire sample, as sketched below.
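A sketch of that idea applied to the NNAR model, using roughly the first 90% of fulldata for fitting and the remainder as the holdout:

n <- length(fulldata)
cut <- floor(0.9 * n)                       # last observation used for fitting
fit.hold <- nnetar(ts(fulldata[1:cut], start = start(fulldata), frequency = 12))
fc.hold <- forecast(fit.hold, h = n - cut)
accuracy(fc.hold, fulldata[(cut + 1):n])    # error measures on the holdout sample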


3. TBATS model

Both the NNAR and TBATS models are mainly used for series exhibiting multiple complex seasonalities. TBATS is short for Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components:

  • T for trigonometric regressors to model multiple-seasonalities
  • B for Box-Cox transformations
  • A for ARMA errors
  • T for trend
  • S for seasonality

The TBATS model can be fitted using the tbats() command in the forecast package for R. Fitting a TBATS model only returns the AIC value as a measure of fit, hence in this section models are compared using AIC. However, AIC is not valid for either NNAR or the Combination (the combination is not really a model but merely an average of all the methods’ forecasts), so these methods are compared using RMSE instead.

TBATS <- forecast(tbats(fulldata, biasadj = TRUE, use.box.cox = FALSE), h = 12, level = c(30, 50, 70))
TBATS$model$AIC  # AIC of the fitted TBATS model
## [1] 583.1142


4. Forecast Combinations

An easy way to improve forecast accuracy is to use several different methods on the same time series, and to average the resulting forecasts. The forecasts used in this example are from the following models: ETS, ARIMA, STL-ETS, NNAR, and TBATS.

# Fit ARIMA (full search, no approximations) and ETS models to the full series
arima.model <- auto.arima(fulldata, stepwise = FALSE, approximation = FALSE)

ets.model <- ets(fulldata)

ARIMA <- forecast(arima.model, h = 12, biasadj = TRUE, level = c(30, 50, 70))
ETS <- forecast(ets.model, h = 12, biasadj = TRUE, level = c(30, 50, 70))

# Combination forecast: simple average of the five point forecasts
Combination <- (ARIMA[["mean"]] + ETS[["mean"]] + STL[["mean"]] + NNAR[["mean"]] +
                  TBATS[["mean"]]) / 5


Comment: Although the Combination forecast performs reasonably well on this series, the ARIMA model has the smallest RMSE.

##         ARIMA      ETS  STL-ETS     NNAR    TBATS Combination
## RMSE 13.73726 13.73734 16.87187 15.32699 13.74732    14.19774
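The RMSE values above can be reproduced with accuracy() against the 2018 observations (a sketch, assuming the 2018 data are stored in test2018, a hypothetical name):

test2018 <- window(ts.A, start = c(2018, 1), end = c(2018, 12))
c(ARIMA = accuracy(ARIMA, test2018)["Test set", "RMSE"],
  ETS = accuracy(ETS, test2018)["Test set", "RMSE"],
  STL = accuracy(STL, test2018)["Test set", "RMSE"],
  NNAR = accuracy(NNAR, test2018)["Test set", "RMSE"],
  TBATS = accuracy(TBATS, test2018)["Test set", "RMSE"],
  Combination = accuracy(Combination, test2018)["Test set", "RMSE"])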



Results

Table 3: True values vs. predicted values for non-transformed data (months of 2018; Seasonal = linear regression with trend and seasonal dummies)

Month  TrueData  ARIMA  ETS  Seasonal  STL  NNAR  TBATS  Combination
Jan          10     25   25        27   33    28     26           27
Feb          42     25   25        19   18    25     26           23
Mar           8     25   25        15   21    35     26           26
Apr          37     25   25        24   36    25     26           27
May          31     25   25         0   12    35     26           24
Jun          19     25   25        15   21    21     26           23
Jul          16     25   25         7   16    25     26           23
Aug          45     25   25        12   25    24     26           25
Sep          14     25   25         8   28    25     26           26
Oct          43     25   25        34   29    24     26           26
Nov          33     25   25        15   24    36     26           27
Dec           9     25   25        27   39    25     26           28


Comment: The seasonal regression, STL, NNAR and Combination models are the ones whose predictions vary from month to month, while ARIMA, ETS and TBATS forecast an essentially constant value. Either way, the results are still far from the true values; overall, the forecasts of the orange data for Family A are not reliable.



To-Do List

  • Attempt R Shiny: write functions for applying the ARIMA method on multiple DoC data sets (Track levels).
  • Review Dynamic Regression Models as another possible model.
  • Impose a constraint on the forecasts to ensure they stay within some specified range [a,b].