Time Series Forecasting: KNN vs. ARIMA

It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data often expose to serial correlation. In this article, we will compare k nearest neighbor (KNN) regression which is a supervised machine learning method, with a more classical and stochastic process, autoregressive integrated moving average (ARIMA).

We will use the monthly prices of refined gold futures(XAUTRY) for one gram in Turkish Lira traded on BIST(Istanbul Stock Exchange) for forecasting. We created the data frame starting from 2013. You can download the relevant excel file from here.

#building the time series data
library(readxl)

df_xautry <- read_excel("xau_try.xlsx")
xautry_ts <- ts(df_xautry$price,start = c(2013,1),frequency = 12)

KNN Regression

We are going to use tsfknn package which can be used to forecast time series in R programming language. KNN regression process consists of instance, features, and targets components. Below is an example to understand the components and the process.

library(tsfknn)

pred <- knn_forecasting(xautry_ts, h = 6, lags = 1:12,k=3)
autoplot(pred, highlight = "neighbors",faceting = TRUE)

The lags parameter indicates the lagged values of the time series data. The lagged values are used as features or explanatory variables. In this example, because our time series data is monthly, we set the parameters to 1:12. The last 12 observations of the data build the instance, which is shown by purple points on the graph.

This instance is used as a reference vector to find features that are the closest vectors to that instance. The relevant distance metric is calculated by the Euclidean formula as shown below:

$\sqrt {\displaystyle\sum_{x=1}^{n} (f_{x}^i-q_x)^2 }$

$q_x$ denotes the instance and $f_{x}^i$ indicates the features that are ranked in order by the distance metric. The k parameter determines the number of k closest features vectors which are called k nearest neighbors.

nearest_neighbors function shows the instance, k nearest neighbors, and the targets.

nearest_neighbors(pred)

#$instance
#Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2 
#272.79 277.55 272.91 291.12 306.76 322.53 345.28 382.02 384.06 389.36 448.28 
# Lag 1 
#462.59 

#$nneighbors
#  Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#1 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12 306.76
#2 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12
#3 223.97 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91
#   Lag 1     H1     H2     H3     H4     H5     H6
#1 322.53 345.28 382.02 384.06 389.36 448.28 462.59
#2 306.76 322.53 345.28 382.02 384.06 389.36 448.28
#3 291.12 306.76 322.53 345.28 382.02 384.06 389.36

Targets are the time-series data that come right after the nearest neighbors and their number is the value of the h parameter. The targets of the nearest neighbors are averaged to forecast the future h periods.

$\displaystyle\sum_{i=1}^k \frac {t^i} {k}$

As you can see from the above plotting, features or targets might overlap the instance. This is because the time series data has no seasonality and is in a specific uptrend. This process we mentioned so far is called MIMO(multiple-input-multiple-output) strategy that is a forecasting method used as a default with KNN.

Decomposing and analyzing the time series data

Before we mention the model, we first analyze the time series data on whether there is seasonality. The decomposition analysis is used to calculate the strength of seasonality which is described as shown below:

#Seasonality and trend measurements
library(fpp2)

fit <- stl(xautry_ts,s.window = "periodic",t.window = 13,robust = TRUE)
seasonality <- fit %>% seasonal()
trend <- fit %>% trendcycle()
remain <- fit %>% remainder()

#Trend
1-var(remain)/var(trend+remain)
#[1] 0.990609
#Seasonality
1-var(remain)/var(seasonality+remain)
#[1] 0.2624522

The stl function is a decomposing time series method. STL is short for seasonal and trend decomposition using loess, which loess is a method for estimating nonlinear relationships. The t.window(trend window) is the number of consecutive observations to be used for estimating the trend and should be odd numbers. The s.window(seasonal window) is the number of consecutive years to estimate each value in the seasonal component, and in this example, is set to ‘periodic‘ to be the same for all years. The robust parameter is set to ‘TRUE‘ which means that the outliers won’t affect the estimations of trend and seasonal components.

When we examine the results from the above code chunk, it is seen that there is a strong uptrend with 0.99 and a seasonality strength of 0.26; because of any value less than 0.4 is accepted as a very weak seasonal effect, we will prefer the non-seasonal ARIMA model.

Non-seasonal ARIMA

This model consists of differencing with autoregression and moving average. Let’s explain each part of the model.

Differencing: First of all, we have to explain stationary data. If data doesn’t contain information pattern like trend or seasonality in other words is white noise that data is stationary. White noise time series has no autocorrelation at all.

Differencing is a simple arithmetic operation that extracts the difference between two consecutive observations to make that data stationary.

$y'_t=y_t-y_{t-1}$

The above equation shows the first differences that difference at lag 1. Sometimes, the first difference is not enough to obtain stationary data, hence, we might have to do differencing of the time series data one more time(second-order differencing).

In autoregressive models, our target variable is a linear combination of its own lagged variables. This means the explanatory variables of the target variable are past values of that target variable. The AR(p) notation denotes the autoregressive model of order p and the $\boldsymbol\epsilon_t$ denotes the white noise.

$y_t=c + \phi_1 y_{t-1}+ \phi_2y_{t-2}+...+\phi_py_{t-p}+\epsilon_t$

Moving average models, unlike autoregressive models, they use past error(white noise) values for predictor variables. The MA(q) notation denotes the autoregressive model of order q.

$y_t=c + \theta_1 \epsilon_{t-1}+ \theta_2\epsilon_{t-2}+...+\theta_q\epsilon_{t-q}$

If we integrate differencing with autoregression and the moving average model, we obtain a non-seasonal ARIMA model which is short for the autoregressive integrated moving average.

$y_t'$ is the differenced data and we must remember it may have been first and second order. The explanatory variables are both lagged values of $y_t$ and past forecast errors. This is denoted as ARIMA(p,d,q) where p; the order of the autoregressive; d, degree of first differencing; q, the order of the moving average.

Modeling with non-seasonal ARIMA

Before we model the data, first we split the data as train and test to calculate accuracy for the ARIMA model.

#Splitting time series into training and test data
test <- window(xautry_ts, start=c(2019,3))
train <- window(xautry_ts, end=c(2019,2))

#ARIMA modeling
library(fpp2)

fit_arima<- auto.arima(train, seasonal=FALSE,
                       stepwise=FALSE, approximation=FALSE)

fit_arima

#Series: train 
#ARIMA(0,1,2) with drift 

#Coefficients:
#          ma1      ma2   drift
#      -0.1539  -0.2407  1.8378
#s.e.   0.1129   0.1063  0.6554

#sigma^2 estimated as 86.5:  log likelihood=-264.93
#AIC=537.85   AICc=538.44   BIC=547.01

As seen above code chunk, stepwise=FALSE, approximation=FALSE parameters are used to amplify the searching for all possible model options. The drift component indicates the constant c which is the average change in the historical data. From the results above, we can see that there is no autoregressive part of the model, but a second-order moving average with the first differencing.

Modeling with KNN

#Modeling and forecasting
library(tsfknn)

pred <- knn_forecasting(xautry_ts, h = 18, lags = 1:12,k=3)

#Forecasting plotting for KNN
autoplot(pred, highlight = "neighbors", faceting = TRUE)

Forecasting and accuracy comparison between the models

#ARIMA accuracy
f_arima<- fit_arima %>% forecast(h =18) %>%
  accuracy(test)

f_arima[,c("RMSE","MAE","MAPE")]

#                 RMSE       MAE      MAPE      
#Training set  9.045488  5.529203  4.283023 
#Test set     94.788638 74.322505 20.878096

For forecasting accuracy, we take the results of the test set shown above. The shaded color indicates the prediction interval in the below plot; the lighter one shows 95%, and the darker inside part shows the 80% confidence interval. The inside line in the colored area indicates the point forecasts.

#Forecasting plot for ARIMA
fit_arima %>% forecast(h=18) %>% autoplot()+ autolayer(test)

#KNN Accuracy
ro <- rolling_origin(pred, h = 18,rolling = FALSE)
ro$global_accu

#  RMSE       MAE      MAPE 
#137.12465 129.77352  40.22795

The rolling_origin function is used to evaluate the accuracy based on rolling origin. The rolling parameter should be set to FALSE which makes the last 18 observations as the test set and the remaining as the training set; just like we did for ARIMA modeling before. The test set would not be a constant vector if we had set the rolling parameter to its default value of TRUE. Below, there is an example for h=6 that rolling_origin parameter set to TRUE. You can see the test set dynamically changed from 6 to 1 and they eventually build as a matrix, not a constant vector.

#Accuracy plot for KNN
plot(ro)

Conclusion

When we compare the results of the accuracy measurements like RMSE or MAPE, we can easily see that the ARIMA model is much better than the KNN model for our non-seasonal time series data.

References

Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos
Time Series Forecasting with KNN in R: the tsfknn Package, Francisco Martínez, María P. Frías, Francisco Charte, and Antonio J. Rivera
Autoregression as a means of assessing the strength of seasonality in a time series: Rahim Moineddin, Ross EG Upshur, Eric Crighton & Muhammad Mamdani

12 responses to “Time Series Forecasting: KNN vs. ARIMA”

Time Series Forecasting: KNN vs. ARIMA – Data Science Austria

September 30, 2020

[…] by data_admin [This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page […]

LikeLike

Time Series Forecasting: KNN vs. ARIMA | Authors Digest Daily

September 30, 2020

[…] article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page […]

LikeLike

Time Series Forecasting: KNN vs. ARIMA | Best New Authors Magazine

September 30, 2020

[…] article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page […]

LikeLike

Time Series Forecasting in R – Curated SQL

October 2, 2020

[…] Selcuk Disci contrasts a couple of methods for time series forecasting: […]

LikeLike

Approaches to Time Series Data with Weak Seasonality – DataGeeek

October 23, 2020

[…] In the previous article, we have tried to model the gold price in Turkey per gram. We will continue to do that to find the best fit for our data. When we chose the KNN and Arima model, we saw the traditional Arima model was much better than the KNN, which is a machine learning algorithm. This time we will try the regression model as a machine learning model and also try to improve our Arima model with some mathematical operations. […]

LikeLike

Approaches to Time Series Data with Weak Seasonality – Data Science Austria

October 23, 2020

[…] the previous article, we have calculated AICC values for non- seasonal ARIMA. The reason we choose the non-seasonal process was that our data had a very […]

LikeLike

Approaches to Time Series Data with Weak Seasonality

October 25, 2020

[…] to model the gold price in Turkey per gram. We will continue to do that to find the best fit for our data. When we chose the KNN and Arima model, we saw the traditional Arima model was much better than the […]

LikeLike

Approaches to Time Series Data with Weak Seasonality – Data Science Central

October 26, 2020

[…] In the previous article, we have tried to model the gold price in Turkey per gram. We will continue to do that to find the best fit for our data. When we chose the KNN and Arima model, we saw the traditional Arima model was much better than the KNN, which is a machine learning algorithm. This time we will try the regression model as a machine learning model and also try to improve our Arima model with some mathematical operations. […]

LikeLike

Time Series Forecasting: KNN vs. ARIMA — – Fayadh Notes

December 10, 2020

[…] Time Series Forecasting: KNN vs. ARIMA — […]

LikeLike

Bootstrapping Time Series for Gold Rush

December 14, 2020

[…] we remembered from the previous two articles, we have tried to model gold prices per gram in Turkey. We have determined the ARIMA model the best for forecasting. This time, we will try to improve […]

LikeLike

Bootstrapping Time Series for Gold Rush

December 20, 2020

[…] we remembered from the previous two articles, we have tried to model gold prices per gram in Turkey. We have determined the ARIMA model the best for forecasting. This time, we will try to improve […]

LikeLike

Bootstrapping Time Series for Gold Rush – Data Science Central

December 20, 2020

[…] we remembered from the previous two articles, we have tried to model gold prices per gram in Turkey. We have determined the ARIMA model the best for forecasting. This time, we will try to improve […]

LikeLike