The GDP by Region (in millions) data obtained from Statistics New Zealand are annual, so they need to be converted into monthly data. The conversion is split into two parts: interpolate the annual data into quarterly data, then interpolate the resulting quarterly data into monthly data. The formula used for converting the annual data into quarterly data is as follows:
\[\text{GDP}_{year,quarter} = \frac{\text{GDP}_{year + 1}}{4} + \frac{\Delta_{year + 1}\times \text{nb}}{10}, \hspace{.3in} \Delta_{year + 1} = \text{GDP}_{year + 1} - \text{GDP}_{year}\] where \(\text{nb} = 1,2,3,4\) corresponds to \(quarter = Q1,Q2,Q3,Q4\), respectively, for each \(year = 2008,\dots,2018\).
The quarterly data can then be converted into monthly data using cubic interpolation (the spline() function).
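Below is a minimal R sketch of the two-step conversion, assuming the annual figures are stored in a numeric vector gdp.annual (a hypothetical name) ordered from 2008 onwards; the exact objects used in the analysis are not shown here.

# Step 1: apply the annual-to-quarterly formula above
quarterly <- c()
for (j in 1:(length(gdp.annual) - 1)) {
  delta <- gdp.annual[j + 1] - gdp.annual[j]                  # Delta_{year+1}
  quarterly <- c(quarterly, gdp.annual[j + 1] / 4 + delta * (1:4) / 10)
}
gdp.quarterly <- ts(quarterly, start = c(2008, 1), frequency = 4)
# Step 2: cubic interpolation of the quarterly series onto a monthly grid
monthly <- spline(x = time(gdp.quarterly), y = gdp.quarterly,
                  n = 3 * length(gdp.quarterly))
gdp.monthly <- ts(monthly$y, start = c(2008, 1), frequency = 12)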
Comment: By looking at the boxplot, it is clear that Family A frequently bought oranges in most months over the past eight years. It can also be interpreted that Family A bought the most oranges in the majority of months, with April, June and July being the less popular months.
The aim is to forecast the number of oranges bought by Family A in 2018. We can model this data using a regression model with a linear trend and monthly dummy variables, \[y_t = \beta_0 + \beta_1t + \beta_2m_{2,t} + \beta_3m_{3,t} + \dots + \beta_{12}m_{12,t} + \epsilon_t\] where \(\beta_1\) is the trend predictor and \(\beta_2m_{2,t},\dots,\beta_{12}m_{12,t}\) are the seasonal dummy predictors for the 12 months. Notice that only eleven dummy variables are needed to code twelve categories. That is because the first category (in this case January) is captured by the intercept and is specified when the dummy variables are all set to zero. In R the trend predictor is coded as trend and the seasonal dummy predictor is coded as season.
# Total oranges bought per calendar month over the training period
sumVal <- tapply(training, cycle(training), FUN = sum)
# Linear trend plus seasonal dummies; the month with the smallest total is the baseline
fit.reg <- tslm(training ~ trend + relevel(season, ref = which.min(sumVal)))
# Shorten the long relevel(...) coefficient names to "month <number>"
names(fit.reg$coefficients)[3:13] <- paste("month", substr(names(fit.reg$coefficients)[3:13], 41, 42))
summary(fit.reg)
##
## Call:
## tslm(formula = training ~ trend + relevel(season, ref = which.min(sumVal)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.167 -5.583 0.000 5.583 22.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3056 14.8390 1.099 0.295
## trend -0.5278 0.6310 -0.836 0.421
## month 1 24.8889 18.7190 1.330 0.211
## month 2 17.4167 18.6444 0.934 0.370
## month 3 12.9444 18.5909 0.696 0.501
## month 4 22.9722 18.5588 1.238 0.242
## month 6 15.0278 18.5588 0.810 0.435
## month 7 8.0556 18.5909 0.433 0.673
## month 8 13.0833 18.6444 0.702 0.497
## month 9 9.6111 18.7190 0.513 0.618
## month 10 36.1389 18.8145 1.921 0.081 .
## month 11 17.6667 18.9305 0.933 0.371
## month 12 30.6944 19.0668 1.610 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.55 on 11 degrees of freedom
## Multiple R-squared: 0.3816, Adjusted R-squared: -0.293
## F-statistic: 0.5657 on 12 and 11 DF, p-value: 0.8291
Comment: The reported p-value is the probability of the estimated \(\beta\) coefficient being as large as it is if there were no real relationship between the response variable and the corresponding predictor. In this case, no month is shown to have an effect on the number of oranges bought, implying seasonality is not significant.
Comment: Figure 3 plots the actual data versus the fitted data. If the predictions are close to the actual values, \(\text{R}^2\) is expected to be close to 1. The Pearson correlation test shows that the correlation between these variables is \(r\) = 0.6177605, so \(\text{R}^2 \approx 0.38\). In this case the model does not seem like an appropriate choice for fitting the data of Family A, as it only explains about 38% of the variation in the data.
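The correlation output below can be reproduced with a call of the following form (a sketch; the exact call is not shown in the source):

cor.test(training, fitted(fit.reg))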
##
## Pearson's product-moment correlation
##
## data: training and fitted(fit.reg)
## t = 3.6847, df = 22, p-value = 0.001297
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2855148 0.8174473
## sample estimates:
## cor
## 0.6177605
Note: A plot of the residuals against the fitted values should show no pattern. If a pattern is observed, there may be "heteroscedasticity" in the errors, which means that the variance of the residuals may not be constant.
Based on the residual plots, the residuals do not show any obvious patterns or trends, indicating constant variance.
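A minimal sketch of how such residual diagnostics can be produced (not necessarily the exact code behind the figures above):

# Residuals against fitted values; the absence of a pattern suggests constant variance
plot(fitted(fit.reg), residuals(fit.reg), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# checkresiduals() from the forecast package adds a time plot, ACF and histogram of the residuals
checkresiduals(fit.reg)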
The CV() function (short for cross-validation statistic) computes the CV, AIC, AICc and BIC measures of predictive accuracy for a linear model. For these measures, the best-fitting model has the lowest values of CV, AIC, AICc and BIC, and the highest value of adjusted \(\text{R}^2\). Note: this is useful when studying the effect of each predictor, but is not particularly useful for forecasting.
CV(fit.reg)
## CV AIC AICc BIC AdjR2
## 750.6115702 149.4537085 196.1203751 165.9464621 -0.2929596
These measures are used to select the best predictors for a regression model when there are multiple candidate predictors. Here the results show that the seasonal dummy variables are not required in the model.
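As an illustrative sketch of that selection step (not taken from the source), the measures for a trend-only model can be compared against the full model fitted above:

# Drop the seasonal dummies and compare CV/AIC/AICc/BIC with CV(fit.reg)
fit.trend <- tslm(training ~ trend)
CV(fit.trend)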
The following methods are fitted using the full data (2013-2017) rather than the training data. This is so that we can see whether the predictions improve now that there is more data to fit the models, or whether they stay the same since the data resemble white noise. The 2018 data are then used as a "test" set to compute the error values of each model and check their performance.
The stlf() function decomposes the time series using STL, forecasts the seasonally adjusted data (data without seasonality) and returns the 'reseasonalized' forecasts (forecasts that take the seasonality into account). If the method argument is not specified, the function uses the ETS approach applied to the seasonally adjusted series.
# Full data: January 2013 to December 2017
fulldata <- window(ts.A, start = c(2013, 1), end = c(2017, 12))
# STL decomposition with ETS forecasts of the seasonally adjusted series
STL <- stlf(fulldata, biasadj = TRUE, h = 12, level = c(30,50,70))
STL$model$aic
## [1] 566.1593
The nnetar() function in the forecast package for R fits a Neural Network Model (NNAR) to a time series with lagged values of the time series as inputs (and possibly some other exogenous inputs). It is therefore a non-linear autoregressive model, allowing complex non-linear relationships between the response variable and its predictors. The NNAR model for seasonal data has the form \[NNAR(p,P,k)[m]\] where \(p\) is the number of non-seasonal lagged inputs, \(P\) the number of seasonal lagged inputs, \(k\) the number of nodes in the hidden layer and \(m\) the seasonal period.
set.seed(2015)
nnetar.model <- nnetar(fulldata)
NNAR <- forecast(nnetar.model, h = 12, biasadj = TRUE, level = c(30,50,70))
nnetar.model
## Series: fulldata
## Model: NNAR(1,1,2)[12]
## Call: nnetar(y = fulldata)
##
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units
##
## sigma^2 estimated as 193.6
Since NNAR models usually (and in this case) have no underlying statistical model, calculating an AIC/BIC does not make sense here. A possible solution for selecting the best model is to fit various models to 90% of the data and use them to forecast the remaining 10%, i.e., use a holdout sample (see the sketch below). Choose the model that performs best on the holdout sample ("best" will depend on the error measure(s)), then refit this model on the entire sample.
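A minimal sketch of that holdout procedure, using the same monthly series as above (the 90/10 split point shown here is illustrative):

# Hold out roughly the last 10% of the 2013-2017 sample (6 of 60 months)
train90 <- window(fulldata, end = c(2017, 6))
test10 <- window(fulldata, start = c(2017, 7))
# Fit the candidate model to the first 90% and forecast the holdout period
fit.hold <- nnetar(train90)
fc.hold <- forecast(fit.hold, h = length(test10))
accuracy(fc.hold, test10)  # compare RMSE/MAE across candidate models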
Both the NNAR and TBATS models are mainly used for series exhibiting multiple complex seasonalities. TBATS is short for Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components.
The TBATS model can be fitted using the tbats() command in the forecast package for R. The forecast function, when run on a TBATS model, only returns the AIC value, hence in this section we compare models using AIC. However, AIC is not valid for either NNAR or the Combination (the combination is not really a model but merely an average of all the methods' forecasts), so these methods are compared using RMSE.
TBATS <- forecast(tbats(fulldata, biasadj = TRUE, use.box.cox = FALSE), h = 12, level = c(30,50,70))
TBATS$model$AIC
## [1] 583.1142
An easy way to improve forecast accuracy is to use several different methods on the same time series, and to average the resulting forecasts. The forecasts used in this example are from the following models: ETS, ARIMA, STL-ETS, NNAR, and TBATS.
# Fit ARIMA and ETS models to the full data and produce 12-month forecasts
arima.model <- auto.arima(fulldata, stepwise = FALSE, approximation = FALSE)
ets.model <- ets(fulldata)
ARIMA <- forecast(arima.model, h = 12, biasadj = TRUE, level = c(30,50,70))
ETS <- forecast(ets.model, h = 12, biasadj = TRUE, level = c(30,50,70))
# Combination forecast: simple average of the five point forecasts
Combination <- (ARIMA[["mean"]] + ETS[["mean"]] + STL[["mean"]] + NNAR[["mean"]] +
  TBATS[["mean"]])/5
Comment: Although the Combination forecast performs particularly well in this series, the ARIMA model has the smallest RMSE.
## ARIMA ETS STL-ETS NNAR TBATS Combination
## RMSE 13.73726 13.73734 16.87187 15.32699 13.74732 14.19774
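The RMSE values above compare each set of forecasts with the 2018 observations. A sketch of how such a comparison can be computed, assuming the 2018 data are stored in a ts object test2018 (a hypothetical name):

test2018 <- window(ts.A, start = c(2018, 1), end = c(2018, 12))
rmse <- function(fc) sqrt(mean((fc - test2018)^2))
c(ARIMA = rmse(ARIMA[["mean"]]), ETS = rmse(ETS[["mean"]]),
  `STL-ETS` = rmse(STL[["mean"]]), NNAR = rmse(NNAR[["mean"]]),
  TBATS = rmse(TBATS[["mean"]]), Combination = rmse(Combination))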
Month (2018) | True data | ARIMA | ETS | Seasonal | STL | NNAR | TBATS | Combination |
---|---|---|---|---|---|---|---|---|
Jan | 10 | 25 | 25 | 27 | 33 | 28 | 26 | 27 |
Feb | 42 | 25 | 25 | 19 | 18 | 25 | 26 | 23 |
Mar | 8 | 25 | 25 | 15 | 21 | 35 | 26 | 26 |
Apr | 37 | 25 | 25 | 24 | 36 | 25 | 26 | 27 |
May | 31 | 25 | 25 | 0 | 12 | 35 | 26 | 24 |
June | 19 | 25 | 25 | 15 | 21 | 21 | 26 | 23 |
Jul | 16 | 25 | 25 | 7 | 16 | 25 | 26 | 23 |
Aug | 45 | 25 | 25 | 12 | 25 | 24 | 26 | 25 |
Sep | 14 | 25 | 25 | 8 | 28 | 25 | 26 | 26 |
Oct | 43 | 25 | 25 | 34 | 29 | 24 | 26 | 26 |
Nov | 33 | 25 | 25 | 15 | 24 | 36 | 26 | 27 |
Dec | 9 | 25 | 25 | 27 | 39 | 25 | 26 | 28 |
Comment: The linear regression model with seasonal dummy variables and STL are the only two models that produce a different prediction for each month. However, these predictions are still far from the true values. Overall, the forecasts of the orange data for Family A perform poorly.
https://www.stats.govt.nz/information-releases/regional-gross-domestic-product-year-ended-march-2018
https://robjhyndman.com/hyndsight/nnetar-prediction-intervals/