In order to check that we have to model inflation rates with some variables. The most common view of the economic authorities is that the variables affecting the rates are currency exchange rates, and CDS(credit default swap). Of course, we will also add the funding rates variable, the president mentioned, to the model to compare with the other explanatory variables.

Because the variables can be highly correlated with each other, we will prefer the random forest model. This algorithm also has a built-in function to compute the feature importance.

**Random Forest**; for **regression**, constructs multiple decision trees and, inferring the average estimation result of each decision tree. This algorithm is more robust to overfitting than the classical decision trees.

The random forest algorithms average these results; that is, it reduces the variation by training the different parts of the train set. This increases the performance of the final model, although this situation creates a small increase in bias.

The random forest uses bootstrap aggregating(**bagging**) algortihms. We would take for training sample, ** X = x_{1}, …, x_{n}** and,

The unseen values, * x’*, would be fitted by

The **permutation feature importance** method would be used to determine the effects of the variables in the random forest model. This method calculates the increase in the prediction error(**MSE**) after permuting the feature values. If the permuting wouldn’t change the model error, the related feature is considered unimportant. We will use the ** varImp** function to calculate variable importance. It has a default parameter,

The data we are going to use can be download here. The variables we will examine:

: The annual consumer price index. It is also called the inflation rate. This is our target variable.*cpi*: The one-week repo rate, which determined by the Turkish Central Bank. It is also called the political rate.*funding_rate*: The currency exchange rates between Turkish Liras and American dollars.*exchange_rate*: The credit defaults swap. It kind of measures the investment risk of a country or company. It is mostly affected by foreign policy developments in Turkey. Although the most popular CDS is 5year, I will take CDS of 1 year USD in this article.*CDS*

```
library(readxl)
library(dplyr)
library(ggplot2)
library(randomForest)
library(varImp)
#Create the dataframe
df <- read_excel("cpi.xlsx")
#Random Forest Modelling
model <- randomForest(CPI ~ funding_rate+exchange_rate,
data = df, importance=TRUE)
#Conditional=True, adjusts for correlations between predictors.
i_scores <- varImp(model, conditional=TRUE)
#Gathering rownames in 'var' and converting it to the factor
#to provide 'fill' parameter for the bar chart.
i_scores <- i_scores %>% tibble::rownames_to_column("var")
i_scores$var<- i_scores$var %>% as.factor()
#Plotting the bar and polar charts for comparing variables
i_bar <- ggplot(data = i_scores) +
geom_bar(
stat = "identity",#it leaves the data without count and bin
mapping = aes(x = var, y=Overall, fill = var),
show.legend = FALSE,
width = 1
) +
labs(x = NULL, y = NULL)
i_bar + coord_polar() + theme_minimal()
i_bar + coord_flip() + theme_minimal()
```

When we examine the charts above, we can clearly see that the funding and exchange rates have similar effects on the model and, CDS has significant importance in the behavior of the CPI. So the theory of the president seems to fall short of what he claims.

**References**

**Neural Network Structure**

The neural network consists of three layers. The predictors form the bottom layer, the predicted output form the top layer and, the ones of the middle between them form the intermediate layers called the hidden neurons.

As we can see from the above graph, each layer of intermediate nodes takes the input from the previous layer. Each node and corresponding weight coefficients are combined in linear regression. Because of the hidden neurons(j), the output is recalculated by a nonlinear function. This is called the **multilayer feed-forward network**.

As mentioned before, the output is recalculated by the sigmoid function, which is nonlinear.

The modified output is given to the next layer as an input. This process is useful for the network to be robust to outliers. The weights () are randomly chosen and then learn from the observations by minimizing the cost function (e.g. MSE).

In time series regression, lagged values of data are used as input by a neural network, which is called the **neural network autoregression(NNAR)**. We will use feed-forward with one hidden layer, which is denoted NNAR(p, k). **p **indicates lagged values, and **k **denotes the nodes in the hidden layer.

For seasonal time series, it is better to include that the last observation from the same term from the previous year. In this case, it is denoted as indicates that:

- for the lagged inputs(predictors).
- for the seasonal terms.

For non-seasonal time series data, the default **p **value is determined by the optimal number of lagged inputs according to the Akaike’s information criterion.

**T**is the number of observations.**SSE**: sum of the squared of errors.**k**is the number of predictors.

If k is not selected at first, it is determined as . The network repeats 20 times as default and is trained for starting points which are different random weights. All the results are averaged for ultimate output. The process uses lagged values as inputs for a one-step forecast. This forecast along with the lagged values is used to predict a two-step forecast. This process continues until it meets the required forecasts.

We will prefer the non-seasonal approach because the data we will be using is less than the two periodic cycles required for detecting seasonality. We are going to predict daily cases in the UK for the next 30 days. The data file(**covid_uk.csv**) can be downloaded from here.

```
library(tidyverse)
library(forecast)
library(purrr)
library(lubridate)
#Building the dataframe
df <- read_csv("covid_uk.csv",
col_types = list(col_character(),
col_date(format = "%d/%m/%Y"),
col_double()))
#Converting to the time series data
inds <- seq(as.Date("2020-01-31"), as.Date("2021-05-21"), by = "day")
df_ts <- df %>%
.$new_cases %>%
ts(start = c(2020, as.numeric(format(inds[1], "%j"))),frequency = 365)
```

Before modeling the data, we will use bootstrap aggregating to prevent overfitting and improve accuracy. This process has been detailed in one of the previous articles. We will make a function to predict, and show the results in a plot, but before doing that, we create a date transform variable to be used for readable dates.

```
#The function to convert decimal date to the proper format
date_transform <- function(x) {format(date_decimal(x), "%b %Y")}
#Forecasting with Simulated Neural Network and plotting the results
sim_nn <- function(data, freq, h, n=100){
#For reproducible results
set.seed(123)
#Bootstraping the time series
sim <- bld.mbb.bootstrap(data, n)
#Bagged Neural Network
future <- sim %>%
map(function(x){simulate(nnetar(x, P=0),nsim=h)}) %>%
unlist() %>%
matrix(ncol = h, nrow = n, byrow = TRUE) %>%
pmax(0)#preventing the negative values
#The beginning date of the prediction
start <- tsp(data)[2]+1/freq
#Forecast object with prediction intervals
#at %5 significance level
sim_fc <- structure(list(
mean = ts(colMeans(future), start=start, frequency=freq),
lower = future %>% as.data.frame() %>%
map_dbl(quantile, prob = 0.025) %>%
ts(start = start,frequency = freq),
upper = future %>% as.data.frame() %>%
map_dbl(quantile, prob = 0.975) %>%
ts(start = start,frequency = freq),
level=95),
class="forecast")
#Making predictions global variable for easy access
assign("simfc",simfc,envir = .GlobalEnv)
#Plotting the prediction results with the observations
sim_plot <- autoplot(data)+
autolayer(simfc)+
scale_x_continuous(labels = date_transform,breaks = seq(2020,2022,0.2))+
ylab("")+xlab("")+
ggtitle("Daily Cases of United Kingdom")+
theme_light()+
theme(plot.title = element_text(hjust = 0.5))
#Converting time series with decimal date
#to the dataframe for proper demonstration
start <- start %>% as.numeric() %>% date_decimal() %>% as.Date()
h_date <- seq(start, start+29, by = "day")
simfc_df <- simfc %>% data.frame()
rownames(simfc_df)<- h_date
#The output list
list(simfc_df,sim_plot)
}
```

```
sim_nn(df_ts,365,30)
```

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Our World in Data

In order to that, we will compare the performance of some developed countries; these are France, Germany, the United Kingdom, and the United States. The dataset for this article can be found here.

In the dataset, we have the number of new cases and new deaths per day, and the population number of related countries. In the below code block, we are replacing blank and NA data with zeros for new cases and new deaths variables. We will limit the data to only this year.

```
library(tidyverse)
library(lubridate)
library(readxl)
library(plotly)
library(scales)
#Preparing the dataset
data <- read_excel("covid-19_comparing.xlsx")
data %>%
mutate(
date=as.Date(.$date),
new_deaths=if_else(is.na(.$new_deaths), 0, .$new_deaths),
new_cases=if_else(is.na(.$new_cases), 0 , .$new_cases)
) %>%
filter(.$date >= "2021-01-01", .$date <= "2021-04-26" ) -> df
```

First, we will compare the number of cases per day(new_cases) but we have to scale the related variable because each country has a different population. The scaling metric we are going to use is the normalized function. I preferred the ** ggplotly** function for ease of analysis on the graph because we can click on each legend to extract the related countries from the plot, and re-click to put it on back.

```
#Normalizing function
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
#Draw normalized results for comparising
df %>%
group_by(location) %>%
mutate(new_cases=normalize(new_cases)) %>%
ggplot(aes(date,new_cases,color=location))+
geom_line(size=1, alpha = 0.7) +
scale_y_continuous(limits = c(0, 1))+
guides(color=guide_legend(title="Countries"))+
labs(x="", y="New Cases")+
theme_minimal() -> cases_norm
ggplotly(cases_norm)
```

Since the beginning of the year, the spread of the disease has decreased dramatically in the UK and the USA, but the opposite has happened in France. As for Germany, it declined until March, but then their numbers rose. But in April, all countries declined radically.

To confirm the above trends for all countries, this time we will draw the number of daily cases per population on the graph. It looks like the same trends for the countries when we examine the below plot.

```
#New cases per population
df %>%
group_by(location) %>%
mutate(casesPerPop=new_cases/population) %>%
ggplot(aes(date,casesPerPop,color=location))+
geom_line(size=1, alpha = 0.7) +
scale_y_continuous(limits = c(0, 0.002),labels = percent_format())+
guides(color=guide_legend(title="Countries"))+
labs(x="",y="",title = "New Cases per Population")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5)) -> cases_perpop
ggplotly(cases_perpop)
```

I also want to know that the death rate, in order to shed on light the health system for the related countries. To do that we will plot the death toll per population and per case amount.

```
#Deaths per Population
df %>%
group_by(location) %>%
mutate(deathsPerPop=sum(new_deaths)/population) %>%
ggplot(aes(location,deathsPerPop,color=location))+
geom_bar(stat = "identity",position = "identity",fill=NA)+
scale_y_continuous(limits = c(0, NA),labels = percent_format())+
labs(x="",y="",title = "Deaths per Population")+
theme_minimal()+
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

When we examine the below graph, despite the UK is better for the spread of the virus, it looks worst for the death rate per population.

```
#Deaths per Case
df %>%
group_by(location) %>%
mutate(deathsPerCase=sum(new_deaths)/sum(new_cases)) %>%
ggplot(aes(location,deathsPerCase,color=location))+
geom_bar(stat = "identity",position = "identity",fill=NA)+
scale_y_continuous(limits = c(0, NA),labels = percent_format())+
labs(x="",y="",title = "Deaths per Case")+
theme_minimal()+
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

This time Germany performs worse than others. It should be noted that despite France less performed for the daily cases, it seems to have best results for both death rates.

While the effect of the pandemic decreases in developed countries, we had to be a full closure in Turkey by the government, unfortunately. As much as the virus is dangerous, bad management is also dangerous.

**References**

*The data source*: Our World in Data- R for Data Science:
*Hadley Wickham*& Garrett Grolemund - Predicting injuries for Chicago traffic crashes:
*Julia Silge*

Time series modeling, most of the time, uses past observations as predictor variables. But sometimes, we need external variables that affect the target variables. To include those variables, we have to use regression models. However, we are going to use **dynamic regression** to capture elaborated patterns; the difference from the orthodox regression models is that residuals are not white noise and are modeled by** ARIMA**.

The residuals we mentioned above, have autocorrelation, which means contain information. To indicate that, we will show as . In this way, the residuals term can be modeled by ARIMA. For instance, dynamic regression with ARIMA(1,1,1) as described:

denotes the white noise and **B**, the backshift notation. As we can see above equation, There two error terms: the one from regression model, , and the other from ARIMA model, .

In the previous article, we have created the dataset variable,* df_xautry*. We will transform it into the multivariate time series and split it as a test and training set. Finally, we will model the training data.

```
library(dplyr)
library(forecast)
#Building the multivariate time series
df <- df_xautry[-1]
df_mts <- df %>% ts(start=c(2013,1),frequency=12)
#Split the dataset
train <- df_mts %>% window(end=c(2020,12))
test <- df_mts %>% window(start=2021)
#Modeling the training data
fit_dynamic <- auto.arima(train[,"xau_try_gram"], xreg =train[,c(1,2)])
#Series: train[, "xau_try_gram"]
#Regression with ARIMA(1,0,2) errors
#Coefficients:
# ar1 ma1 ma2 intercept xe xau_usd_ounce
# 0.9598 -0.0481 0.4003 -150.8309 43.8402 0.1195
#s.e. 0.0390 0.0992 0.1091 23.7781 2.1198 0.0092
#sigma^2 estimated as 27.15: log likelihood=-293.31
#AIC=600.62 AICc=601.89 BIC=618.57
```

Based on the above results, we have ARIMA(1,0,2) model as described below:

Now, we will do forecasting and then calculate accuracy. The accuracy for xgboost will be calculated from the *forecast_xautrygram* variable.

```
#Forecasting
fcast_dynamic <- forecast(fit_dynamic, xreg = test[,1:2])
#Accuracy
acc_dynamic <- fcast_dynamic %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
acc_xgboost <- forecast_xautrygram %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
```

In order to visualize the accuracy results, we’re going to build the data frame and prepare it for a suitable bar chart.

```
#Tidying the dataframe
df_comparison <- data.frame(
"dynamic"=acc_dynamic,
"xgboost"=acc_xgboost
)
df_comparison
# dynamic.RMSE dynamic.MAPE xgboost.RMSE xgboost.MAPE
#Training set 5.044961 2.251683 0.001594868 0.000805107
#Test set 10.695489 2.501123 11.038134819 2.060825426
library(tidyr)
library(tibble)
df_comparison %>%
rownames_to_column(var = "data") %>%
gather(`dynamic.RMSE`, `dynamic.MAPE`,`xgboost.RMSE`, `xgboost.MAPE`,
key = "models", value="score") -> acc_comparison
acc_comparison
# data models score
#1 Training set dynamic.RMSE 5.044960948
#2 Test set dynamic.RMSE 10.695489161
#3 Training set dynamic.MAPE 2.251682989
#4 Test set dynamic.MAPE 2.501122965
#5 Training set xgboost.RMSE 0.001594868
#6 Test set xgboost.RMSE 11.038134819
#7 Training set xgboost.MAPE 0.000805107
#8 Test set xgboost.MAPE 2.060825426
```

```
#Plotting comparing models
ggplot(acc_comparison,aes(x=data,y=score,fill = models)) +
geom_bar(stat = "identity",position = "dodge") +
theme_bw()
```

**Conclusion**

When we examined the above results and the bar chart for unseen data, we are seeing some interesting results. For the training set, the xgboost model has near-zero accuracy rates which can lead to overfitting. The dynamic model looks slightly better for the RMSE but vice versa for the MAPE criteria.

This shed light that while the first-month dynamic model is better, for the second month the xgboost looks closer to the actual observation. Of course, the reason for that is maybe we have few test data but I wanted to predict the first two months of the current year. Maybe next time, if we have more test data, we can try again.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos*

In recent years, XGBoost is an uptrend machine learning algorithm in time series modeling. XGBoost (**Extreme Gradient Boosting**) is a supervised learning algorithm based on boosting tree models. This kind of algorithms can explain how relationships between features and target variables which is what we have intended. We will try this method for our time series data but first, explain the mathematical background of the related tree model.

- K represents the number of tree
- represents the basic tree model.

We need a function that trains the model by measuring how well it fits the training data. This is called** the objective function**.

- represents the loss function which is the error between the predicted values and observed values.
- is
**the regularization function**to prevent overfitting.

- T represents the leaves, the number of
**leaf nodes**of each tree. Leaf node or terminal node means that has no child nodes. - represents the score (weight) of the leaves of each tree, so it is calculated in euclidean norm.
- represents the learning rate which is also called the shrinkage parameter. With shrinking the weights, the model is more robust against the closeness to the observed values. This prevents overfitting. It is between 0 and 1. The lower values mean that the more trees, the better performance, and the longer training time.
- represents the splitting threshold. The parameter is used to prevent the growth of a tree so the model is less complex and more robust against overfitting. The leaf node would split if the information gain less than . Its range is at

Now, we can start to examine our case we mention at the beginning of the article. In order to do that, we are downloading the dataset we are going to use, from here.

```
#Building data frame
library(readxl)
df_xautry <- read_excel("datasource/xautry_reg.xlsx")
df_xautry$date <- as.Date(df_xautry$date)
#Splitting train and test data set
train <- df_xautry[df_xautry$date < "2021-01-01",]
test <- df_xautry[-(1:nrow(train)),]
```

We will transform the train and test dataset to the DMatrix object to use in the xgboost process. And we will get the target values of the train set in a different variable to use in training the model.

```
#Transform train and test data to DMatrix form
library(dplyr)
library(xgboost)
train_Dmatrix <- train %>%
dplyr::select(xe, xau_usd_ounce) %>%
as.matrix() %>%
xgb.DMatrix()
pred_Dmatrix <- test %>%
dplyr::select(xe, xau_usd_ounce) %>%
as.matrix() %>%
xgb.DMatrix()
targets <- train$xau_try_gram
```

We will execute the cross-validation to prevent overfitting, and set the parallel computing parameters enable because the xgboost algorithm needs it. We will adjust all the parameter we’ve just mentioned above with * trainControl* function in caret package.

We also will make a list of parameters to train the model. Some of them are:

: A maximum number of iterations. It was shown by**nrounds**at the tree model equation. We will set a vector of values. It executes the values separately to find the optimal result. Too large values can lead to overfitting however, too small values can also lead to underfitting.**t**: The maximum number of trees. The greater the value of depth, the more complex and robust the model is but also the more likely it would be overfitting.*max_depth*As we mentioned before with**min_child_weight:**in the objective function, it determines the minimum sum of weights of leaf nodes to prevent overfitting.*w*: It is by subsetting the train data before the boosting tree process, it prevents overfitting. It is executed once at every iteration.*subsample*

```
#Cross-validation
library(caret)
xgb_trcontrol <- trainControl(
method = "cv",
number = 10,
allowParallel = TRUE,
verboseIter = FALSE,
returnData = FALSE
)
#Building parameters set
xgb_grid <- base::expand.grid(
list(
nrounds = seq(100,200),
max_depth = c(6,15,20),
colsample_bytree = 1,
eta = 0.5,
gamma = 0,
min_child_weight = 1,
subsample = 1)
)
```

Now that all the parameters and needful variables are set, we can build our model.

```
#Building the model
model_xgb <- caret::train(
train_Dmatrix,targets,
trControl = xgb_trcontrol,
tuneGrid = xgb_grid,
method = "xgbTree",
nthread = 10
)
```

We can also see the best optimal parameters.

```
model_xgb$bestTune
# nrounds max_depth eta gamma colsample_bytree min_child_weight #subsample
#1 100 6 0.5 0 1 1 1
```

To do some visualization in the ** forecast** function, we have to transform the predicted results into the

```
#Making the variables used in forecast object
fitted <- model_xgb %>%
stats::predict(train_Dmatrix) %>%
stats::ts(start = c(2013,1),frequency = 12)
ts_xautrygram <- ts(targets,start=c(2013,1),frequency=12)
forecast_xgb <- model_xgb %>% stats::predict(pred_Dmatrix)
forecast_ts <- ts(forecast_xgb,start=c(2021,1),frequency=12)
#Preparing forecast object
forecast_xautrygram <- list(
model = model_xgb$modelInfo,
method = model_xgb$method,
mean = forecast_ts,
x = ts_xautrygram,
fitted = fitted,
residuals = as.numeric(ts_xautrygram) - as.numeric(fitted)
)
class(forecast_xautrygram) <- "forecast"
```

We will show train, unseen, and predicted values for comparison in the same graph.

```
#The function to convert decimal time label to wanted format
library(lubridate)
date_transform <- function(x) {format(date_decimal(x), "%Y")}
#Making a time series varibale for observed data
observed_values <- ts(test$xau_try_gram,start=c(2021,1),frequency=12)
#Plot forecasting
library(ggplot2)
library(forecast)
autoplot(forecast_xautrygram)+
autolayer(forecast_xautrygram$mean,series="Predicted",size=0.75) +
autolayer(forecast_xautrygram$x,series ="Train",size=0.75 ) +
autolayer(observed_values,series = "Observed",size=0.75) +
scale_x_continuous(labels =date_transform,breaks = seq(2013,2021,2) ) +
guides(colour=guide_legend(title = "Time Series")) +
ylab("Price") + xlab("Time") +
ggtitle("") +
theme_bw()
```

To satisfy that curiosity we mentioned at the very beginning of the article, we will find the ratio that affects the target variable of each explanatory variable separately.

```
#Feature importance
library(Ckmeans.1d.dp)
xgb_imp <- xgb.importance(
feature_names = colnames(train_Dmatrix),
model = model_xgb$finalModel)
xgb.ggplot.importance(xgb_imp,n_clusters = c(2))+
ggtitle("") +
theme_bw()+
theme(legend.position="none")
xgb_imp$Importance
#[1] 0.92995147 0.07004853
```

**Conclusion**

When we examine the above results and plot, contrary to popular belief, it is seen that the exchange rate has a more dominant effect than the price of ounce gold. In the next article, we will compare this method with the dynamic regression ARIMA model.

**References**

The reason for this statement was the pressure of the Istanbul Metropolitan Mayor. He has said that according to data released by the cemetery administration, a municipal agency, the daily number of infected deaths were nearly that two times the daily number of the death tolls explained by the ministry.

So, I decided to check the mayor’s claims. To do that, I have to do some predictions; but, not for the future, for the past. Fortunately, there is a method for this that is called **Backcasting**. Let’s take a vector of time series and estimate .

- One-step estimation for
**backcasting**: with .

- One- step estimation for
**forecasting**, with

As you can see above, the backcasting coefficients are the same as the forecasting coefficients(). For instance, in this case, the model for new cases is **ARIMA(0, 1, 2) with drift**:

- For
**forecasting**: - For
**backcasting**:

```
#Function to reverse the time series
reverse_ts <- function(y)
{
y %>%
rev() %>%
ts(start=tsp(y)[1L], frequency=frequency(y))
}
#Function to reverse the forecast
reverse_forecast <- function(object)
{
h <- object[["mean"]] %>% length()
f <- object[["mean"]] %>% frequency()
object[["x"]] <- object[["x"]] %>% reverse_ts()
object[["mean"]] <- object[["mean"]] %>% rev() %>%
ts(end=tsp(object[["x"]])[1L]-1/f, frequency=f)
object[["lower"]] <- object[["lower"]][h:1L,]
object[["upper"]] <- object[["upper"]][h:1L,]
return(object)
}
```

We would first reverse the time series and then make predictions and again reverse the forecast results. The data that we are going to model is the number of daily new cases and daily new deaths, between the day the health minister’s explanation was held and the day the vaccine process in Turkey has begun. We will try to predict the ten days before the date 26-11-2020.

```
#Creating datasets
df <- read_excel("datasource/covid-19_dataset.xlsx")
df$date <- as.Date(df$date)
#The data after the date 25-11-2020:Train set
df_after<- df[df$date > "2020-11-25",]
#The data between 15-11-2020 and 26-11-2020:Test set
df_before <- df[ df$date > "2020-11-15" & df$date < "2020-11-26",]
#Creating dataframes for daily cases and deaths
df_cases <- bc_cases %>% data.frame()
df_deaths <- bc_deaths %>% data.frame()
#Converting the numeric row names to date object
options(digits = 9)
date <- df_cases %>%
rownames() %>%
as.numeric() %>%
date_decimal() %>%
as.Date()
#Adding date object created above to the data frames
df_cases <- date %>% cbind(df_cases) %>% as.data.frame()
colnames(df_cases)[1] <- "date"
df_deaths <- date %>% cbind(df_deaths) %>% as.data.frame()
colnames(df_deaths)[1] <- "date"
#Convert date to numeric to use in ts function
n <- as.numeric(as.Date("2020-11-26")-as.Date("2020-01-01")) + 1
#Creating time series variables
ts_cases <- df_after$new_cases %>%
ts(start = c(2020, n),frequency = 365 )
ts_deaths <- df_after$new_deaths %>%
ts(start = c(2020, n),frequency = 365 )
#Backcast variables
ts_cases %>%
reverse_ts() %>%
auto.arima() %>%
forecast(h=10) %>%
reverse_forecast() -> bc_cases
ts_deaths %>%
reverse_ts() %>%
auto.arima() %>%
forecast(h=10) %>%
reverse_forecast() -> bc_deaths
```

It might be very useful to make a function to plot the comparison for backcast values and observed data.

```
#Plot function for comparison
plot_fun <- function(data,column){
ggplot(data = data,aes(x=date,y=Point.Forecast))+
geom_line(aes(color="blue"))+
geom_line(data = df_before,aes(x=date,y=.data[[column]],color="red"))+
geom_line(data = df_after,aes(x=date,y=.data[[column]],color="black"))+
geom_ribbon(aes(ymin=Lo.95, ymax=Hi.95), linetype=2,alpha=0.1,fill="blue")+
geom_ribbon(aes(ymin=Lo.80, ymax=Hi.80), linetype=2, alpha=0.1,fill="blue")+
scale_color_identity(name = "Lines",
breaks = c("black", "red", "blue"),
labels = c("After", "Real", "Backcast"),
guide = "legend")+
ylab(str_replace(column,"_"," "))+
theme_light()
}
```

```
plot_fun(df_cases, "new_cases")
```

```
plot_fun(df_deaths, "new_deaths")
```

**Conclusion**

When we examine the graph, the difference in death toll seems relatively close. However, the levels of daily cases are significantly different from each other. Although this estimate only covers ten days, it suggests that there is inconsistency in the numbers given.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* *Joan Bruna*, Stat 153: Introduction to Time Series- Our World in Data

Modeling time series data is difficult because the data are autocorrelated. In this case, **moving block bootstrap (MBB)** should be preferred because MBB resamples the data inside overlapping blocks to imitate the autocorrelation in the data. If the length of a time series, **n**, and the block size **l**, the number of overlapping blocks are found as below:

What we mean by overlapping block is that observation 1 to l would be block 1, observation 2 to l+1 would be block 2, etc. We should use a block size for at least two years(**l=24**) for monthly data because we have to be certain whether there is any remaining seasonality in the block.

From these **n-l+1** blocks, **n/l** blocks will be selected randomly and they will be gathered in order, to build the bootstrap observations. The time series values can be repetitive in different blocks.

This bootstrap process would be exercised to the remainder component after the time series decomposition. If there is seasonality it is used the stl function(trend, seasonal, remainder) otherwise the loess function(trend, remainder) is chosen for the decomposition. It should not be forgotten that the data has to be stationary in the first place.

Box-Cox transformation is made at the beginning but back-transformed at the end of the process; as we mentioned before, when we do average all the bootstrapped series, which is called **bagging**, we could handle the non-stability data problem and improve accuracy compared to the original series.

As we remembered from the previous two articles, we have tried to model gold prices per gram in Turkey. We have determined the ARIMA model the best for forecasting. This time, we will try to improve using the bagging mentioned above.

In order to that, we will create a function that makes bootstrapping simulations and builds the prediction intervals we want. We will adjust the simulation number, model, and confidence level as default. We will use the assign function to make the **bagged **data(**simfc**) as a global variable, so we will able to access it outside the function as well.

```
#Simulation function
library(purrr)
library(forecast)
sim_forecast <- function(data, nsim=100L, h, mdl=auto.arima, level=95){
sim <- bld.mbb.bootstrap(data, nsim)
h <- as.integer(h)
future <- matrix(0, nrow=nsim, ncol=h)
future <- sim %>% map(function(x){simulate(mdl(x),nsim=h)}) %>%
unlist() %>% matrix(ncol = h, nrow = nsim, byrow = TRUE)
start < - tsp(data)[2]+1/12
simfc <- structure(list(
mean = future %>% colMeans() %>% ts(start = start, frequency = 12),
lower = future %>% as.data.frame() %>%
map_dbl(quantile, prob = (1-level/100)/2) %>%
ts(start = start,frequency = 12),
upper = future %>% as.data.frame() %>%
map_dbl(quantile, prob = (1-level/100)/2+level/100) %>%
ts(start = start,frequency = 12),
level=level),
class="forecast")
assign("simfc",simfc,envir = .GlobalEnv)
simfc
}
```

Because of the averaging part of the bagging, we don’t use the lambda parameter of Box-Cox transformation for the stability of the variance. We can see the forecasting results for 18 months in %95 confidence interval for training set below. We can also change the model type or confidence level if we want.

```
sim_forecast(train, h=18)
Point Forecast Lo 95 Hi 95
#Mar 2019 242.3121 215.5464 268.2730
#Apr 2019 243.4456 206.4015 274.5155
#May 2019 249.9275 216.8712 283.4226
#Jun 2019 252.8518 219.7168 283.0535
#Jul 2019 259.0699 216.7776 302.4991
#Aug 2019 267.2599 219.5771 310.7458
#Sep 2019 270.8745 214.1733 324.4255
#Oct 2019 272.0894 215.1619 333.2733
#Nov 2019 275.5566 213.8802 337.9301
#Dec 2019 280.3914 219.2063 349.1284
#Jan 2020 291.4792 215.9117 364.1899
#Feb 2020 296.3475 221.9117 380.2887
#Mar 2020 302.0706 219.0779 399.1135
#Apr 2020 304.4595 217.5600 400.7724
#May 2020 310.8251 217.5561 420.6515
#Jun 2020 315.5942 221.5791 431.9727
#Jul 2020 322.4536 220.4798 452.4229
#Aug 2020 331.1163 223.3746 465.2015
```

We will create the Arima model as same as we did before and compare it with a bagged version of it in a graph.

```
arimafc <- train %>%
auto.arima(stepwise = FALSE,approximation = FALSE,
seasonal = FALSE, lambda = "auto") %>%
forecast(h=h,level=95)
autoplot(train) +
ggtitle("Monthly Golden Price per Gram") +
xlab("Year") + ylab("Price") +
autolayer(test, series="Real values",PI=FALSE) +
autolayer(simfc, series="Bagged ARIMA",PI=FALSE) +
autolayer(arimafc, series="ARIMA",PI=FALSE)+
theme_light()
```

When we examine the above plot, we can see that the bagged Arima model is smoother and more accurate compared to the classic version; but it is seen that when the forecasting horizon increases, both models are failed to capture the uptrend.

In the below, we are comparing the accuracy of models in numeric. We can easily see the difference in the accuracy level that we saw in the plot. The reason **NaN** values of the simulated version is that there is no estimation of fitted values(one-step forecasts) in the training set.

```
#Accuracy comparison
acc_arimafc <- arimafc %>%accuracy(test)
acc_arimafc[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set 9.045056 3.81892
#Test set 67.794358 14.87034
acc_simu <- simfc %>% accuracy(test)
acc_simu[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set NaN NaN
#Test set 54.46326 8.915361
```

**Conclusion**

When we examine the results we have found, it is seen that bootstrapping simulation with averaging (bagging) improves the accuracy significantly. Besides that, due to the simulation process, it can be very time-consuming.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Bagging Exponential Smoothing Methods usingSTL Decomposition and Box-Cox Transformation,
*Christoph Bergmeira, Rob J Hyndmanb, Jos ́e M Benitez* - Block Bootstrap, Wikipedia
- Bootstrap aggregating, Wikipedia

A regression that has Fourier terms is called **dynamic harmonic regression**. This harmonic structure is built of the successive Fourier terms that consist of sine and cosine terms to form a periodic function. These terms could catch seasonal patterns delicately.

, , ,

, , …

**m **is for the seasonal periods. If the number of terms increases, the period would converge to a square wave. While Fourier terms capture the seasonal pattern, the ARIMA model process the error term to determine the other dynamics like prediction intervals.

We will examine the regression models with K values from 1 to 6 and plot them down to compare corrected Akaike’s information criterion(**AICc**) measurement, which should be minimum. We will set the **seasonal parameter** to **FALSE**; because of that Fourier terms will catch the seasonality, we don’t want that the auto.arima function to search for seasonal patterns, and waste time. We should also talk about the transformation concept to understand **the lambda parameter** we are going to use in the models.

**Transformation**, just like differentiation, is a mathematical operation that simplifies the model and thus increases the prediction accuracy. In order to do that it stabilizes the variance so that makes the pattern more consistent.

These transformations can be automatically made by the **auto.arima function** based on the optimum value of the **lambda parameter** that belongs to the Box-Cox transformations which are shown below, if the lambda parameter set to “**auto**“.

; if

; if **otherwise**

```
#Comparing with plots
plots <- list()
for (i in seq(6)) {
fit <- train %>%
auto.arima(xreg = fourier(train, K = i), seasonal = FALSE, lambda = "auto")
plots[[i]] <- autoplot(forecast(fit, xreg=fourier(train, K=i, h=18))) +
xlab(paste("K=",i," AICC=",round(fit[["aicc"]],2))) +
ylab("") +
theme_light()
}
gridExtra::grid.arrange(
plots[[1]],plots[[2]],plots[[3]],
plots[[4]],plots[[5]],plots[[6]], nrow=3)
```

You can also see from the above plots that the more K value the more toothed point forecasting line and prediction intervals we get. It is seen that after the K=3, AICC values increase significantly. Hence, K should be equals to 2 for the minimum AICC value.

```
#Modeling with Fourier Regression
fit_fourier <- train %>%
auto.arima(xreg = fourier(train,K=2), seasonal = FALSE, lambda = "auto")
#Accuracy
f_fourier<- fit_fourier %>%
forecast(xreg=fourier(train,K=2,h=18)) %>%
accuracy(test)
f_fourier[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set 8.586783 4.045067
#Test set 74.129452 17.068118
```

```
#Accuracy plot of the Fourier Regression
fit_fourier %>% forecast(xreg=fourier(train,K=2,h=18)) %>%
autoplot() +
autolayer(test) +
theme_light() +
ylab("")
```

In the previous article, we have calculated AICC values for non- seasonal ARIMA. The reason we choose the non-seasonal process was that our data had a very weak seasonal pattern, but the pairs of Fourier terms have caught this weak pattern very subtly. We can see from the above results that the Fourier regression is much better than the non-seasonal ARIMA for RMSE and MAPE accuracy measurements of the test set.

Since we are also taking into account the seasonal pattern even if it is weak, we should also examine the **seasonal ARIMA** process. This model is built by adding seasonal terms in the non-seasonal ARIMA model we mentioned before.

: non-seasonal part.

: seasonal part.

: the number of observations before the next year starts; **seasonal period**.

The seasonal parts have term non-seasonal components with backshifts of the seasonal period. For instance, we take model for monthly data, m=12. This process can be written as:

for **p = 1**,

for **P=1**,

for **first difference**(d=lag1),

for **seasonal difference**(D=lag12),

for **q**=1,

for **Q**=1.

```
#Modeling the Arima model with transformed data
fit_arima<- train %>%
auto.arima(stepwise = FALSE, approximation = FALSE, lambda = "auto")
#Series: .
#ARIMA(3,1,2) with drift
#Box Cox transformation: lambda= -0.7378559
#Coefficients:
# ar1 ar2 ar3 ma1 ma2 drift
# 0.8884 -0.8467 -0.1060 -1.1495 0.9597 3e-04
#s.e. 0.2463 0.1557 0.1885 0.2685 0.2330 2e-04
#sigma^2 estimated as 2.57e-06: log likelihood=368.02
#AIC=-722.04 AICc=-720.32 BIC=-706.01
```

Despite the seasonal parameter set to **TRUE **as default, **the** **auto.arima** **function **couldn’t find a model with seasonality because the time series data has a very weak seasonal strength level as we mentioned before. Unlike the Arima model that we did in the previous article, we set to **lambda parameter** to “**auto**“. It makes the data transformed with =-0.7378559.

```
#Accuracy of the Arima model with transformed data
fit_arima %>% forecast(h=18) %>%
autoplot() +
autolayer(test) +
theme_light()
```

```
f_arima<- fit_arima %>% forecast(h =18) %>%
accuracy(test)
f_arima[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set 9.045056 3.81892
#Test set 67.794358 14.87034
```

As we will be remembered, the RMSE and MAPE values of the Arima model without transformation were 94.788638 and 20.878096 respectively. We can easily confirm from the above results that the transformation improves the accuracy if the time series have an unstabilized variance.

**Conclusion**

The time-series data with weak seasonality like our data has been modeled with dynamic harmonic regression, but the accuracy results were worst than Arima models without seasonality.

In addition to that, the transformed data has been modeled with the Arima model more accurately than the one not transformed; because our data has the variance that has changed with the level of time series. Another important thing is that when we take a look at the accuracy plots of both the Arima model and Fourier regression, we can clearly see that as the forecast horizon increased, the prediction error increased with it.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Statistic How To: Box-Cox Transformation
- Wikipedia: Fourier Series

]]>

We will use the monthly prices of refined gold futures(**XAUTRY**) for one gram in Turkish Lira traded on **BIST(Istanbul Stock Exchange)** for forecasting. We created the data frame starting from 2013. You can download the relevant excel file from here.

```
#building the time series data
library(readxl)
df_xautry <- read_excel("xau_try.xlsx")
xautry_ts <- ts(df_xautry$price,start = c(2013,1),frequency = 12)
```

**KNN Regression**

We are going to use **tsfknn** package which can be used to forecast time series in R programming language. KNN regression process consists of **instance, features, and targets** components. Below is an example to understand the components and the process.

```
library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 6, lags = 1:12,k=3)
autoplot(pred, highlight = "neighbors",faceting = TRUE)
```

The **lags **parameter indicates the lagged values of the time series data. The lagged values are used as features or explanatory variables. In this example, because our time series data is monthly, we set the parameters to 1:12. The last 12 observations of the data build the **instance**, which is shown by purple points on the graph.

This **instance **is used as a reference vector to find **features **that are the closest vectors to that instance. The relevant distance metric is calculated by the Euclidean formula as shown below:

denotes the instance and indicates the features that are ranked in order by the distance metric. The k parameter determines the number of k closest features vectors which are called **k nearest neighbors**.

**nearest_neighbors** function shows the instance, k nearest neighbors, and the targets.

```
nearest_neighbors(pred)
#$instance
#Lag 12 Lag 11 Lag 10 Lag 9 Lag 8 Lag 7 Lag 6 Lag 5 Lag 4 Lag 3 Lag 2
#272.79 277.55 272.91 291.12 306.76 322.53 345.28 382.02 384.06 389.36 448.28
# Lag 1
#462.59
#$nneighbors
# Lag 12 Lag 11 Lag 10 Lag 9 Lag 8 Lag 7 Lag 6 Lag 5 Lag 4 Lag 3 Lag 2
#1 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12 306.76
#2 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12
#3 223.97 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91
# Lag 1 H1 H2 H3 H4 H5 H6
#1 322.53 345.28 382.02 384.06 389.36 448.28 462.59
#2 306.76 322.53 345.28 382.02 384.06 389.36 448.28
#3 291.12 306.76 322.53 345.28 382.02 384.06 389.36
```

**Targets **are the time-series data that come right after the nearest neighbors and their number is the value of the **h parameter**. The targets of the nearest neighbors are averaged to forecast the future h periods.

As you can see from the above plotting, features or targets might overlap the instance. This is because the time series data has no seasonality and is in a specific uptrend. This process we mentioned so far is called **MIMO**(multiple-input-multiple-output) strategy that is a forecasting method used as a default with KNN.

**Decomposing and** **analyzing the time series data**

Before we mention the model, we first analyze the time series data on whether there is seasonality. The decomposition analysis is used to calculate the strength of seasonality which is described as shown below:

```
#Seasonality and trend measurements
library(fpp2)
fit <- stl(xautry_ts,s.window = "periodic",t.window = 13,robust = TRUE)
seasonality <- fit %>% seasonal()
trend <- fit %>% trendcycle()
remain <- fit %>% remainder()
#Trend
1-var(remain)/var(trend+remain)
#[1] 0.990609
#Seasonality
1-var(remain)/var(seasonality+remain)
#[1] 0.2624522
```

The **stl **function is a decomposing time series method. STL is short for seasonal and trend decomposition using loess, which loess is a method for estimating nonlinear relationships. The **t.window**(trend window) is the number of consecutive observations to be used for estimating the trend and should be odd numbers. The s.window(seasonal window) is the number of consecutive years to estimate each value in the seasonal component, and in this example, is set to ‘**periodic**‘ to be the same for all years. The **robust parameter** is set to ‘**TRUE**‘ which means that the outliers won’t affect the estimations of trend and seasonal components.

When we examine the results from the above code chunk, it is seen that there is a strong uptrend with 0.99 and a seasonality strength of 0.26; because of any value less than 0.4 is accepted as a very weak seasonal effect, we will prefer the non-seasonal ARIMA model.

**Non-seasonal ARIMA**

This model consists of differencing with autoregression and moving average. Let’s explain each part of the model.

**Differencing**: First of all, we have to explain **stationary **data. If data doesn’t contain information pattern like trend or seasonality in other words is white noise that data is stationary. **White noise** time series has no autocorrelation at all.

Differencing is a simple arithmetic operation that extracts the difference between two consecutive observations to make that data stationary.

The above equation shows the first differences that difference at lag 1. Sometimes, the first difference is not enough to obtain stationary data, hence, we might have to do differencing of the time series data one more time(**second-order differencing**).

In **autoregressive models**, our target variable is a linear combination of its own lagged variables. This means the explanatory variables of the target variable are past values of that target variable. The **AR(p)** notation denotes the autoregressive model of order **p** and the denotes the white noise.

**Moving average models**, unlike autoregressive models, they use past error(white noise) values for predictor variables. The **MA(q)** notation denotes the autoregressive model of order **q**.

If we integrate differencing with autoregression and the moving average model, we obtain a non-seasonal ARIMA model which is short for the autoregressive integrated moving average.

is the differenced data and we must remember it may have been first and second order. The explanatory variables are both lagged values of and past forecast errors. This is denoted as **ARIMA(p,d,q)** where **p**; the order of the autoregressive; **d**, degree of first differencing; **q**, the order of the moving average.

**Modeling with non-seasonal ARIMA**

Before we model the data, first we split the data as train and test to calculate accuracy for the ARIMA model.

```
#Splitting time series into training and test data
test <- window(xautry_ts, start=c(2019,3))
train <- window(xautry_ts, end=c(2019,2))
```

```
#ARIMA modeling
library(fpp2)
fit_arima<- auto.arima(train, seasonal=FALSE,
stepwise=FALSE, approximation=FALSE)
fit_arima
#Series: train
#ARIMA(0,1,2) with drift
#Coefficients:
# ma1 ma2 drift
# -0.1539 -0.2407 1.8378
#s.e. 0.1129 0.1063 0.6554
#sigma^2 estimated as 86.5: log likelihood=-264.93
#AIC=537.85 AICc=538.44 BIC=547.01
```

As seen above code chunk, `stepwise=FALSE, approximation=FALSE`

parameters are used to amplify the searching for all possible model options. The **drift** component indicates the constant **c** which is the average change in the historical data. From the results above, we can see that there is no autoregressive part of the model, but a second-order moving average with the first differencing.

**Modeling with KNN**

```
#Modeling and forecasting
library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 18, lags = 1:12,k=3)
```

```
#Forecasting plotting for KNN
autoplot(pred, highlight = "neighbors", faceting = TRUE)
```

**Forecasting and** **accuracy comparison between the models**

```
#ARIMA accuracy
f_arima<- fit_arima %>% forecast(h =18) %>%
accuracy(test)
f_arima[,c("RMSE","MAE","MAPE")]
# RMSE MAE MAPE
#Training set 9.045488 5.529203 4.283023
#Test set 94.788638 74.322505 20.878096
```

For forecasting accuracy, we take the results of the test set shown above. The shaded color indicates the prediction interval in the below plot; the lighter one shows 95%, and the darker inside part shows the 80% confidence interval. The inside line in the colored area indicates the point forecasts.

```
#Forecasting plot for ARIMA
fit_arima %>% forecast(h=18) %>% autoplot()+ autolayer(test)
```

```
#KNN Accuracy
ro <- rolling_origin(pred, h = 18,rolling = FALSE)
ro$global_accu
# RMSE MAE MAPE
#137.12465 129.77352 40.22795
```

The **rolling_origin** function is used to evaluate the accuracy based on rolling origin. The **rolling** parameter should be set to **FALSE **which makes the last 18 observations as the test set and the remaining as the training set; just like we did for ARIMA modeling before. The test set would not be a constant vector if we had set the rolling parameter to its default value of TRUE. Below, there is an example for h=6 that rolling_origin parameter set to TRUE. You can see the test set dynamically changed from 6 to 1 and they eventually build as a matrix, not a constant vector.

```
#Accuracy plot for KNN
plot(ro)
```

**Conclusion**

When we compare the results of the accuracy measurements like RMSE or MAPE, we can easily see that the ARIMA model is much better than the KNN model for our non-seasonal time series data.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Time Series Forecasting with KNN in R: the tsfknn Package,
*Francisco Martínez, María P. Frías, Francisco Charte, and Antonio J. Rivera* - Autoregression as a means of assessing the strength of seasonality in a time series:
*Rahim Moineddin, Ross EG Upshur, Eric Crighton & Muhammad Mamdani*

In my previous article, we analyzed the COVID-19 data of Turkey and selected the cubic model for predicting the spread of disease. In this article, we will show in detail why we selected the cubic model for prediction and see whether our decision was right or not.

When we analyze the regression trend models we should consider overfitting and underfitting situations;** underfitting** indicates high bias and low variance while **overfitting** indicates low bias and high variance.

**Bias** is the difference between the expected value of fitted values and observed values:

The **variance** of fitted values is the expected value of squared deviation from the mean of fitted values:

The **adjusted coefficient of determination** is used in the different degrees of polynomial trend regression models comparing. In the below formula** p** denotes the number of explanatory terms and** n** denotes the number of observations.

is the residual sum of squares:

is the total sum of squares:

When we examine the above formulas, we can notice the similarity between SSE and bias. We can easily say that if bias decreases, SSE will decrease and will increase. So we will use instead of bias to balance with variance and find the optimal degree of the polynomial regression.

The dataset and data frame(tur) we are going to use can be found from the previous article. First of all, we will create all the polynomial regression models which we’re going to compare.

```
a <- 1:15
models <- list()
for (i in seq_along(a)){
models[[i]] <- assign(paste0('model_', i),lm(new_cases ~ poly(index, i), data=tur))
}
```

The variances of fitted values of all the degrees of polynomial regression models:

```
variance <- c()
for (i in seq_along(a)) {
variance[i] <- mean((models[[i]][["fitted.values"]]-mean(models[[i]][["fitted.values"]]))^2)
}
```

To create an adjusted R-squared object we first create a summary object of the trend regression models; because the adj.r.squared feature is calculated in summary function.

```
models_summ <- list()
adj_R_squared <- c()
for (i in seq_along(a)) {
models_summ[[i]] <- summary(models[[i]])
adj_R_squared[i] <- (models_summ[[i]][["adj.r.squared"]])
}
```

Before analyzing numeric results of variance and we will show all the trend regression lines in separate plots for comparing models.

```
# Facetting models by degree for finding the best fit
library(ggplot2)
dat <- do.call(rbind, lapply(1:15, function(d) {
x <- tur$index
preds <- predict(lm(new_cases ~ poly(x, d), data=tur), newdata=data.frame(x=x))
data.frame(cbind(y=preds, x=x, degree=d))
}))
ggplot(dat, aes(x,y)) +
geom_point(data=tur, aes(index, new_cases)) +
geom_line(color="steelblue", lwd=1.1)+
facet_wrap(~ degree,nrow = 3)
```

When we examine the above plots, we should pay attention to the curved of the tails; because it indicates overfitting which shows extreme sensitivity to the observed data points. In light of this approach, the second and third-degree of models appear to be more convenient to the data.

Let’s examine -variance tradeoff on the plot we created below.

```
library(gridExtra)
plot_variance <- ggplot(df_tradeoff,aes(degree,variance))+
geom_line(size=2,colour="orange")+
scale_x_continuous(breaks = 1:15)+
theme_bw(base_line_size = 2)
plot_adj.R.squared <- ggplot(df_tradeoff,aes(degree,adj_R_squared))+
geom_line(size=2,colour="steelblue")+
labs(y="Adjusted R-Squared")+
scale_x_continuous(breaks = 1:15)+
theme_bw(base_line_size = 2)
grid.arrange(plot_variance,plot_adj.R.squared,ncol=1)
```

As we can see above, adjusted R-squared and variance have very similar trend lines. The more adjusted R-squared means the more complexity and low bias, but we have to take into account the variance; otherwise, we fall into the overfitting trap. So we need to look for low bias (or high ) and low variance as much as possible for optimal selection.

When we examine the variance plot, we can see that there is not much difference between second and third-degree; but after the third degree, there seems to be a more steep increase; it’s some kind of breaking. This might lead to overfitting. Thus the most reasonable option seems to be the 3rd degree. This approach is not a scientific fact but could be used for an optimal solution.

]]>