The GDP by Region (in millions) data obtained from Statistics New Zealand are annual, so they need to be converted into monthly data. The conversion is split into two parts: interpolate the annual data into quarterly data, then interpolate the resulting quarterly data into monthly data. The formula used for converting the annual data into quarterly data is as follows:
\[\text{GDP}_{year,quarter} = \frac{\text{GDP}_{year + 1}}{4} + \frac{\Delta_{year + 1}\times \text{nb}}{10}, \hspace{.3in} \Delta_{year + 1} = \text{GDP}_{year + 1} - \text{GDP}_{year}\] where \(\text{nb} = 1,2,3,4\) corresponds to \(quarter = Q1,Q2,Q3,Q4\), respectively, for each \(year = 2008,\dots,2018\).
The quarterly data can then be converted into monthly data using cubic interpolation (the spline() function).
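Below is a minimal R sketch of the two-step conversion, assuming the annual figures are stored in a numeric vector gdp.annual (a hypothetical name) ordered from 2008 onwards; the exact objects used in the analysis are not shown here.

# Step 1: apply the annual-to-quarterly formula above
quarterly <- c()
for (j in 1:(length(gdp.annual) - 1)) {
  delta <- gdp.annual[j + 1] - gdp.annual[j]                  # Delta_{year+1}
  quarterly <- c(quarterly, gdp.annual[j + 1] / 4 + delta * (1:4) / 10)
}
gdp.quarterly <- ts(quarterly, start = c(2008, 1), frequency = 4)
# Step 2: cubic interpolation of the quarterly series onto a monthly grid
monthly <- spline(x = time(gdp.quarterly), y = gdp.quarterly,
                  n = 3 * length(gdp.quarterly))
gdp.monthly <- ts(monthly$y, start = c(2008, 1), frequency = 12)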
Comment: By looking at the boxplot, it is clear that Family A frequently bought oranges in most months over the past eight years. It can also be interpreted that Family A bought the most oranges in the majority of months, with April, June and July being the less popular months.
The aim is to forecast the number of oranges bought by Family A in 2018. We can model this data using a regression model with a linear trend and monthly dummy variables, \[y_t = \beta_0 + \beta_1t + \beta_2m_{2,t} + \beta_3m_{3,t} + \dots + \beta_{12}m_{12,t} + \epsilon_t\] where \(\beta_1\) is the trend predictor and \(\beta_2m_{2,t},\dots,\beta_{12}m_{12,t}\) are the seasonal dummy predictors for the 12 months. Notice that only eleven dummy variables are needed to code twelve categories. That is because the first category (in this case January) is captured by the intercept and is specified when the dummy variables are all set to zero. In R the trend predictor is coded as trend and the seasonal dummy predictor is coded as season.
# Total oranges bought per calendar month over the training period
sumVal <- tapply(training, cycle(training), FUN = sum)
# Linear trend plus seasonal dummies; the month with the smallest total is the baseline
fit.reg <- tslm(training ~ trend + relevel(season, ref = which.min(sumVal)))
# Shorten the long relevel(...) coefficient names to "month <number>"
names(fit.reg$coefficients)[3:13] <- paste("month", substr(names(fit.reg$coefficients)[3:13], 41, 42))
summary(fit.reg)
##
## Call:
## tslm(formula = training ~ trend + relevel(season, ref = which.min(sumVal)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.167 -5.583 0.000 5.583 22.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3056 14.8390 1.099 0.295
## trend -0.5278 0.6310 -0.836 0.421
## month 1 24.8889 18.7190 1.330 0.211
## month 2 17.4167 18.6444 0.934 0.370
## month 3 12.9444 18.5909 0.696 0.501
## month 4 22.9722 18.5588 1.238 0.242
## month 6 15.0278 18.5588 0.810 0.435
## month 7 8.0556 18.5909 0.433 0.673
## month 8 13.0833 18.6444 0.702 0.497
## month 9 9.6111 18.7190 0.513 0.618
## month 10 36.1389 18.8145 1.921 0.081 .
## month 11 17.6667 18.9305 0.933 0.371
## month 12 30.6944 19.0668 1.610 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.55 on 11 degrees of freedom
## Multiple R-squared: 0.3816, Adjusted R-squared: -0.293
## F-statistic: 0.5657 on 12 and 11 DF, p-value: 0.8291
Comment: The reported p-value is the probability of the estimated \(\beta\) coefficient being as large as it is if there were no real relationship between the response variable and the corresponding predictor. In this case, no month is shown to have an effect on the number of oranges bought, implying seasonality is not significant.
Comment: Figure 3 plots the actual data versus the fitted data. If the predictions are close to the actual values, \(\text{R}^2\) is expected to be close to 1. The Pearson correlation test shows that the correlation between these variables is \(r\) = 0.6177605, so \(\text{R}^2 \approx 0.38\). In this case the model does not seem like an appropriate choice for fitting the data of Family A, as it only explains about 38% of the variation in the data.
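The correlation output below can be reproduced with a call of the following form (a sketch; the exact call is not shown in the source):

cor.test(training, fitted(fit.reg))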
##
## Pearson's product-moment correlation
##
## data: training and fitted(fit.reg)
## t = 3.6847, df = 22, p-value = 0.001297
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2855148 0.8174473
## sample estimates:
## cor
## 0.6177605
Note: A plot of the residuals against the fitted values should show no pattern. If a pattern is observed, there may be "heteroscedasticity" in the errors, which means that the variance of the residuals may not be constant.
Based on the residual plots, the residuals do not show any obvious patterns or trends, indicating constant variance.
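A minimal sketch of how such residual diagnostics can be produced (not necessarily the exact code behind the figures above):

# Residuals against fitted values; the absence of a pattern suggests constant variance
plot(fitted(fit.reg), residuals(fit.reg), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# checkresiduals() from the forecast package adds a time plot, ACF and histogram of the residuals
checkresiduals(fit.reg)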
The CV() function (short for cross-validation statistic) computes the CV, AIC, AICc and BIC measures of predictive accuracy for a linear model. For these measures, the best-fitting model has the lowest values of CV, AIC, AICc and BIC, and the highest value of adjusted \(\text{R}^2\). Note: this is useful when studying the effect of each predictor, but is not particularly useful for forecasting.
CV(fit.reg)
## CV AIC AICc BIC AdjR2
## 750.6115702 149.4537085 196.1203751 165.9464621 -0.2929596
These measures are used to select the best predictors for a regression model when there are multiple candidate predictors. Here the results show that the seasonal dummy variables are not required in the model.
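As an illustrative sketch of that selection step (not taken from the source), the measures for a trend-only model can be compared against the full model fitted above:

# Drop the seasonal dummies and compare CV/AIC/AICc/BIC with CV(fit.reg)
fit.trend <- tslm(training ~ trend)
CV(fit.trend)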
The following methods are fitted using the full data (2013-2017) rather than the training data. This is so that we can see whether the predictions improve now that there is more data to fit the models, or whether they stay the same since the data resemble white noise. The 2018 data are then used as a "test" set to compute the error values of each model and check their performance.
The stlf() function decomposes the time series using STL, forecasts the seasonally adjusted data (data without seasonality) and returns the 'reseasonalized' forecasts (forecasts that take the seasonality into account). If the method argument is not specified, the function uses the ETS approach applied to the seasonally adjusted series.
# Full data: January 2013 to December 2017
fulldata <- window(ts.A, start = c(2013, 1), end = c(2017, 12))
# STL decomposition with ETS forecasts of the seasonally adjusted series
STL <- stlf(fulldata, biasadj = TRUE, h = 12, level = c(30,50,70))
STL$model$aic
## [1] 566.1593
The nnetar() function in the forecast package for R fits a Neural Network Model (NNAR) to a time series with lagged values of the time series as inputs (and possibly some other exogenous inputs). It is therefore a non-linear autoregressive model, allowing complex non-linear relationships between the response variable and its predictors. The NNAR model for seasonal data has the form \[NNAR(p,P,k)[m]\] where \(p\) is the number of non-seasonal lagged inputs, \(P\) the number of seasonal lagged inputs, \(k\) the number of nodes in the hidden layer and \(m\) the seasonal period.
set.seed(2015)
nnetar.model <- nnetar(fulldata)
NNAR <- forecast(nnetar.model, h = 12, biasadj = TRUE, level = c(30,50,70))
nnetar.model
## Series: fulldata
## Model: NNAR(1,1,2)[12]
## Call: nnetar(y = fulldata)
##
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units
##
## sigma^2 estimated as 193.6
Since NNAR models usually (and in this case) have no underlying statistical model, calculating an AIC/BIC does not make sense here. A possible solution for selecting the best model is to fit various models to 90% of the data and use them to forecast the remaining 10%, i.e., use a holdout sample (see the sketch below). Choose the model that performs best on the holdout sample ("best" will depend on the error measure(s)), then refit this model on the entire sample.
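A minimal sketch of that holdout procedure, using the same monthly series as above (the 90/10 split point shown here is illustrative):

# Hold out roughly the last 10% of the 2013-2017 sample (6 of 60 months)
train90 <- window(fulldata, end = c(2017, 6))
test10 <- window(fulldata, start = c(2017, 7))
# Fit the candidate model to the first 90% and forecast the holdout period
fit.hold <- nnetar(train90)
fc.hold <- forecast(fit.hold, h = length(test10))
accuracy(fc.hold, test10)  # compare RMSE/MAE across candidate models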
Both the NNAR and TBATS models are mainly used for series exhibiting multiple complex seasonalities. TBATS is short for Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components.
The TBATS model can be fitted using the tbats() command in the forecast package for R. The forecast function, when run on a TBATS model, only returns the AIC value, hence in this section we compare models using AIC. However, AIC is not valid for either NNAR or the Combination (the combination is not really a model but merely an average of all the methods' forecasts), so these methods are compared using RMSE.
TBATS <- forecast(tbats(fulldata, biasadj = TRUE, use.box.cox = FALSE), h = 12, level = c(30,50,70))
TBATS$model$AIC
## [1] 583.1142
An easy way to improve forecast accuracy is to use several different methods on the same time series, and to average the resulting forecasts. The forecasts used in this example are from the following models: ETS, ARIMA, STL-ETS, NNAR, and TBATS.
# Fit ARIMA and ETS models to the full data and produce 12-month forecasts
arima.model <- auto.arima(fulldata, stepwise = FALSE, approximation = FALSE)
ets.model <- ets(fulldata)
ARIMA <- forecast(arima.model, h = 12, biasadj = TRUE, level = c(30,50,70))
ETS <- forecast(ets.model, h = 12, biasadj = TRUE, level = c(30,50,70))
# Combination forecast: simple average of the five point forecasts
Combination <- (ARIMA[["mean"]] + ETS[["mean"]] + STL[["mean"]] + NNAR[["mean"]] +
  TBATS[["mean"]])/5
Comment: Although the Combination forecast performs particularly well in this series, the ARIMA model has the smallest RMSE.
## ARIMA ETS STL-ETS NNAR TBATS Combination
## RMSE 13.73726 13.73734 16.87187 15.32699 13.74732 14.19774
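The RMSE values above compare each set of forecasts with the 2018 observations. A sketch of how such a comparison can be computed, assuming the 2018 data are stored in a ts object test2018 (a hypothetical name):

test2018 <- window(ts.A, start = c(2018, 1), end = c(2018, 12))
rmse <- function(fc) sqrt(mean((fc - test2018)^2))
c(ARIMA = rmse(ARIMA[["mean"]]), ETS = rmse(ETS[["mean"]]),
  `STL-ETS` = rmse(STL[["mean"]]), NNAR = rmse(NNAR[["mean"]]),
  TBATS = rmse(TBATS[["mean"]]), Combination = rmse(Combination))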
Month (2018) | True data | ARIMA | ETS | Seasonal | STL | NNAR | TBATS | Combination |
---|---|---|---|---|---|---|---|---|
Jan | 10 | 25 | 25 | 27 | 33 | 28 | 26 | 27 |
Feb | 42 | 25 | 25 | 19 | 18 | 25 | 26 | 23 |
Mar | 8 | 25 | 25 | 15 | 21 | 35 | 26 | 26 |
Apr | 37 | 25 | 25 | 24 | 36 | 25 | 26 | 27 |
May | 31 | 25 | 25 | 0 | 12 | 35 | 26 | 24 |
June | 19 | 25 | 25 | 15 | 21 | 21 | 26 | 23 |
Jul | 16 | 25 | 25 | 7 | 16 | 25 | 26 | 23 |
Aug | 45 | 25 | 25 | 12 | 25 | 24 | 26 | 25 |
Sep | 14 | 25 | 25 | 8 | 28 | 25 | 26 | 26 |
Oct | 43 | 25 | 25 | 34 | 29 | 24 | 26 | 26 |
Nov | 33 | 25 | 25 | 15 | 24 | 36 | 26 | 27 |
Dec | 9 | 25 | 25 | 27 | 39 | 25 | 26 | 28 |
Comment: The linear regression model with seasonal dummy variables and STL are the only two models that produce a different prediction for each month. However, these predictions are still far from the true values. Overall, the forecasts of the orange data for Family A perform poorly.
https://www.stats.govt.nz/information-releases/regional-gross-domestic-product-year-ended-march-2018
https://robjhyndman.com/hyndsight/nnetar-prediction-intervals/