```
library(tidyverse)
library(tidymodels)
library(tsibble)
library(C50)
library(caret)
library(pROC)
data <- readr::read_csv("https://raw.githubusercontent.com/mesdi/trees/main/credit.csv")
```

The first algorithm we will talk about is the **C5.0 decision tree**. This method uses entropy to select the optimal features to split upon. High entropy means that there is little or no **homogeneity** within the set of class values. The algorithm aims to find splits that reduce the entropy. (**Lantz, 2015: 133**)

Entropy(S) = -∑_{i=1}^{c} p_i log2(p_i)

- *c* refers to the number of class levels
- *p_i* refers to the proportion of values falling in class level *i*

The algorithm calculates the difference between the entropy of the segment before a split and the (weighted) entropy of the segments after the split on a feature, F, to determine the change in homogeneity: InfoGain(F) = Entropy(S1) - Entropy(S2). This is called **information gain**. The higher the information gain, the better the homogeneity of the groups after the split. (**Lantz, 2015: 134**)
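As a concrete illustration of these two formulas, here is a minimal base-R sketch (the helper names `entropy` and `info_gain` are mine, not part of the C50 package):

```r
# Shannon entropy of a vector of class labels (log base 2)
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

# Information gain of splitting `before` into the partitions in `splits`:
# entropy before the split minus the weighted entropy after it
info_gain <- function(before, splits) {
  w <- sapply(splits, length) / length(before)
  entropy(before) - sum(w * sapply(splits, entropy))
}

entropy(c("no", "no", "yes", "yes"))            # 1 bit: maximal impurity
info_gain(c("no", "no", "yes", "yes"),
          list(c("no", "no"), c("yes", "yes"))) # 1: a perfect split
```

A split that produces pure groups recovers the full entropy of the parent set as information gain, which is exactly what C5.0 is searching for at each node.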

Now after some brief information, we can begin the fitting process.

```
#Building the tibble dataset
df <-
  data %>%
  mutate_if(is.character, as_factor) %>%
  mutate(default = factor(default, labels = c("no", "yes")))

#Creating the train and test sets
set.seed(36894)
df_split <- initial_split(df, prop = 9/10, strata = "default")
df_train <- training(df_split)
df_test <- testing(df_split)
```

```
#C5.0 model
model_c50 <- C5.0(df_train[-17], df_train$default, trials = 10)
pred_c50 <- predict(model_c50, df_test)
confusionMatrix(pred_c50, df_test$default)

#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  65  12
#       yes  5  18
#
#               Accuracy : 0.83
#                 95% CI : (0.7418, 0.8977)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.002163
```

The no information rate (**NIR**) is the largest proportion of the class levels in our response variable. Looking at the above results, we can see this rate is 70%, and our accuracy is higher than the NIR at the 5% significance level (**P-Value: 0.002163 < 0.05**). Nevertheless, the type 2 error rate is quite troubling: 12 of the 30 actual defaulters are missed (**12/30**).
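To make the NIR and its p-value concrete, here is a hedged base-R reconstruction from the confusion-matrix counts above (caret reports a one-sided exact binomial test of accuracy against the NIR; treat this as a sketch of that logic, not caret's internals):

```r
# Confusion-matrix counts from the C5.0 result above
tab <- matrix(c(65, 5, 12, 18), nrow = 2,
              dimnames = list(Prediction = c("no", "yes"),
                              Reference  = c("no", "yes")))

n   <- sum(tab)               # 100 test observations
acc <- sum(diag(tab)) / n     # accuracy: 0.83
nir <- max(colSums(tab)) / n  # largest class share: 0.70

# One-sided exact binomial test of the accuracy against the NIR
p <- binom.test(sum(diag(tab)), n, p = nir, alternative = "greater")$p.value
```

A small p-value here says the classifier beats simply predicting the majority class every time.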

The second tree algorithm we are going to use is decision_tree() from parsnip.

```
#Decision tree model
model_dt <-
  decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart") %>%
  fit(default ~ ., data = df_train)

pred_dt <- predict(model_dt, df_test)
confusionMatrix(pred_dt$.pred_class, df_test$default)

#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  65  17
#       yes  5  13
#
#               Accuracy : 0.78
#                 95% CI : (0.6861, 0.8567)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.04787
```

The accuracy and the type 2 error rate look much worse than those of the C5.0 model. The final algorithm is boosted trees.

```
#Boosted trees model
model_bt <-
  boost_tree(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("C5.0") %>%
  fit(default ~ ., data = df_train)

pred_bt <- predict(model_bt, df_test)
confusionMatrix(pred_bt$.pred_class, df_test$default)

#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  63  14
#       yes  7  16
#
#               Accuracy : 0.79
#                 95% CI : (0.6971, 0.8651)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.02883
```

A slight improvement is seen over the plain decision tree, but not enough to match the first model.

Finally, we will be comparing ROC curves with the help of the ggroc function from the pROC package.

```
#Multiple ROC curves
roc_df <- tibble(
  c50 = pred_c50 %>% ordered(),
  dt = pred_dt$.pred_class %>% ordered(),
  bt = pred_bt$.pred_class %>% ordered(),
  response = df_test$default
)

roc.list <- roc(response ~ dt + bt + c50, data = roc_df)

roc_values <-
  lapply(roc.list, "[[", "auc") %>%
  as.numeric() %>%
  round(2)

ggroc(roc.list) +
  geom_line(size = 2) +
  geom_abline(slope = 1,
              intercept = 1,
              linetype = "dashed",
              alpha = 0.5,
              size = 2,
              color = "grey") +
  scale_color_brewer(palette = "Set2",
                     labels = c(
                       paste0("DT: ", roc_values[1]),
                       paste0("BT: ", roc_values[2]),
                       paste0("C50: ", roc_values[3]))) +
  annotate("text",
           x = 0.5,
           y = 0.5,
           label = "no predictive value",
           angle = 45) +
  guides(color = guide_legend(title = "Models AUC Scores")) +
  theme_minimal() +
  coord_equal()
```

The ROC curves and AUC results above confirm the accuracy results we found earlier: except for the C5.0 model, the other two models seem to be inefficient.

We will model gas prices in terms of the exchange rate (**xe**); but this time, not only the current exchange rate: I also want to add lagged values of xe to the model, because a change in the xe rate might not affect prices immediately. Our model should be as below:

y_t = β_0 + β_1 x_t + β_2 x_{t-1} + … + β_{k+1} x_{t-k} + η_t, where η_t is the ARIMA error process and k is the lag level, which can be determined by AICc.

The dataset we are going to use contains gas prices and xe rates from 2019 to the present. In the code block below, we build a tsibble object and separate it into train and test sets.

```
library(tidyverse)
library(tidymodels)
library(readxl)
library(tsibble)
library(fable)
library(feasts)

df <- read_excel("gasoline.xlsx")

#Building the tsibble dataset
df_tsbl <- df %>%
  mutate(date = as.Date(date)) %>%
  as_tsibble() %>%
  fill_gaps() %>% #makes a regular time series by filling the time gaps
  #fills in NAs with the previous values
  fill(xe, gasoline, .direction = "down")

#Separating into train and test sets without random selection
#Takes the first 75% for the training set and the rest for the test set
df_split <- initial_time_split(df_tsbl, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
```

First of all, we will check the stationarity of the dependent and independent variables of the model, as always.

```
#Checking stationarity
df_train %>%
  pivot_longer(gasoline:xe) %>%
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_grid(vars(name), scales = "free_y") +
  xlab("") + ylab("") +
  theme_minimal()
```

It seems we had better difference both the gasoline and xe variables.

```
#Comparing the probable lagged models
model_lags_diff <- df_train %>%
  #Excludes the first three days so all versions cover the same time period
  mutate(gasoline = c(NA, NA, NA, gasoline[4:788])) %>%
  model(
    lag0 = ARIMA(gasoline ~ pdq(d = 1) + xe),
    lag1 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe)),
    lag2 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe) +
                   lag(xe, 2)),
    lag3 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe) +
                   lag(xe, 2) + lag(xe, 3))
  )

glance(model_lags_diff) %>%
  arrange(AICc) %>%
  select(.model:AICc)

# A tibble: 4 x 5
#  .model  sigma2 log_lik    AIC   AICc
#  <chr>    <dbl>   <dbl>  <dbl>  <dbl>
#1 lag1   0.00310   1152. -2294. -2294.
#2 lag2   0.00311   1152. -2292. -2292.
#3 lag3   0.00311   1152. -2290. -2290.
#4 lag0   0.00313   1148. -2288. -2288.
```

According to the AICc results, it is better to select the one-lag model.

```
#Creating the model variable according to the above results
model_lag1 <- df_train %>%
  model(ARIMA(gasoline ~ pdq(d = 1) + xe + lag(xe)))
```

```
#Checking the model residuals consistency with white noise
model_lag1 %>% gg_tsresiduals()
```

All the spikes seem to be within the threshold lines in the ACF plot, so the residuals can be considered white noise and we can be comfortable with the model we chose.

```
report(model_lag1)
#Series: gasoline
#Model: LM w/ ARIMA(0,1,0)(2,0,0)[7] errors
#Coefficients:
#        sar1    sar2       xe  lag(xe)
#      0.0920  0.1065  -0.0436   0.1060
#s.e.  0.0356  0.0359   0.0361   0.0368
#
#sigma^2 estimated as 0.003103: log likelihood=1155.84
#AIC=-2301.69 AICc=-2301.61 BIC=-2278.35
```

The above results indicate that the model has weekly seasonal **AR(1)** and **AR(2)** components. Now, we will predict on the test set using the above regression equation to assess the accuracy of our model.

```
#Lagged forecasts
model_pred <- forecast(model_lag1, df_test)

#Accuracy of the lagged model
cor(df_test$gasoline, model_pred$.mean) %>% round(2)
#[1] 0.77
```

The result seems fine, but we can try to improve it. To do so, we will use bootstrapping and bagging.
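Before applying it to the gasoline series, the idea behind bagging can be shown with a toy base-R example (assumed synthetic data, independent of our dataset): averaging several noisy simulations of the same signal gives a smaller error than any single simulation.

```r
set.seed(1)
signal <- sin(seq(0, 2 * pi, length.out = 50))        # the "true" path
sims   <- replicate(10, signal + rnorm(50, sd = 0.5)) # 10 noisy simulations
bagged <- rowMeans(sims)                              # the bagged mean

mse_single <- mean((sims[, 1] - signal)^2)  # error of one simulation
mse_bagged <- mean((bagged - signal)^2)     # error of the average
mse_bagged < mse_single                     # averaging reduces the variance
```

This is exactly what we hope the bagged mean of the bootstrapped forecast paths will do below.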

```
set.seed(456) #for reproducibility of the results

#Bootstrapped forecasts
sim_fc <- model_lag1 %>%
  fabletools::generate(new_data = df_test,
                       times = 10,
                       bootstrap_block_size = 14) %>%
  select(-.model)

#Drawing the bootstrapped forecasts
sim_fc %>%
  autoplot(.sim) +
  autolayer(df_train, gasoline) +
  guides(color = "none") +
  xlab("") + ylab("") +
  theme_minimal()
```

```
#Bagged forecasts
bagged_fc <- sim_fc %>%
summarise(bagged_mean=mean(.sim))
#Accuracy of bagging model
cor(df_test$gasoline,bagged_fc$bagged_mean) %>% round(2)
#[1] 0.85
```

This is quite a significant improvement in the accuracy of our model; the bootstrap simulation seems to work well in this process.

The dataset we have built for this article consists of the number of formal education **students** and the housing **capacity** provided by the Higher Education Loans and Dormitories Institution (**KYK**), which is a state organization.

We will determine whether there is a capacity shortage this year based on historical data. The model we are going to use is the **dynamic regression model with ARIMA errors**, because we will model the dormitories' **capacity** in terms of the number of **students** using the historical data between 1992 and 2020.

First of all, we will check the stationarity of the variables. If one of them is not stationary, all the variables have to be differenced. This modeling process is called a model in differences.

```
#Creating the tsibble variables and examining stationarity
library(tidyverse)
library(fable)
library(feasts)
library(tsibble)
library(readxl)

df <- read_excel("kyk.xlsx")

df_ts <- df %>%
  as_tsibble(index = date)

train <- df_ts %>% filter_index(. ~ "2020")
test <- df_ts %>% filter_index("2021" ~ .)

train %>%
  pivot_longer(c(capacity, students),
               names_to = "var", values_to = "value") %>%
  ggplot(aes(x = date, y = value)) +
  geom_line(size = 1.5, color = "blue") +
  facet_grid(vars(var), scales = "free_y") +
  scale_x_continuous(breaks = seq(1990, 2020, 5)) +
  scale_y_continuous(labels = scales::label_number_si()) +
  labs(title = "The Number of Capacity and Students", y = "", x = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold.italic", hjust = 0.5),
        panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

When we examine the above plots, it seems that both variables need differencing. After we fit the model, we will check the residuals for stationarity by drawing the innovation residuals.

```
#Modeling
fit <- train %>%
  model(ARIMA(capacity ~ students + pdq(d = 1)))

#Checking the residuals for stationarity
fit %>%
  gg_tsresiduals()
```

In the ACF graph, we can clearly see that the spikes are within the thresholds, which means the residuals are white noise. In the code block below, we confirm that the residuals have no remaining patterns by using the Ljung-Box test, whose **p-value** is greater than 0.05 at the 5% significance level.

```
#Ljung-Box independence test for white noise
augment(fit) %>%
  features(.innov, ljung_box)

#  .model                                  lb_stat lb_pvalue
#  <chr>                                     <dbl>     <dbl>
#1 ARIMA(capacity ~ students + pdq(d = 1))    1.32     0.251
```
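The same test is available in base R as `Box.test`; a quick sketch on simulated white noise (assumed data, not our residuals) shows the shape of its output:

```r
set.seed(123)
wn <- rnorm(200)                                 # simulated white noise
lb <- Box.test(wn, lag = 10, type = "Ljung-Box")
lb$statistic  # chi-squared statistic over the first 10 autocorrelations
lb$p.value    # large p-values are consistent with white noise
```

The `ljung_box` feature above wraps the same statistic into the fable workflow.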

We run the command below to examine the components of the model.

```
report(fit)
#Series: capacity
#Model: LM w/ ARIMA(0,1,1) errors
#Coefficients:
#         ma1  students  intercept
#      0.8672    -0.022  22964.621
#s.e.  0.1177     0.006   8369.892
```

The differenced model can be written as capacity′_t = -0.022 · students′_t + η′_t, where η′_t = ε_t + 0.8672 ε_{t-1} is an **MA(1)** error. The interesting thing is that the **students** component has a negative impact on the capacity.

Before finding the expected value of the model, we will take a look at the relationship between the capacity and the number of students over time.

```
#Comparing students and capacity
ggplot(df_ts, aes(x = date)) +
  geom_bar(aes(y = students), stat = "identity", fill = "skyblue") +
  geom_line(aes(y = students), size = 1.5, color = "orange") +
  geom_line(aes(y = (capacity / students) * 10^7), size = 1.5, color = "red") +
  scale_x_continuous(breaks = seq(1991, 2021, 5)) +
  scale_y_continuous(labels = scales::label_number_si(),
                     sec.axis = sec_axis(~ . / 10^7,
                                         labels = scales::percent_format(accuracy = 1),
                                         name = "The rate of capacity per student")) +
  xlab("") + ylab("The number of students") +
  theme_minimal() +
  theme(panel.background = element_rect(fill = "#f0f0f0", color = NA),
        #Main y-axis
        axis.title.y = element_text(color = "skyblue",
                                    #sets the position of the axis title
                                    margin = margin(r = 0.5, unit = "cm"),
                                    hjust = 1),
        #Second y-axis
        axis.title.y.right = element_text(color = "red",
                                          #sets the position of the second axis title
                                          margin = margin(l = 0.5, unit = "cm"),
                                          hjust = 0.01))
```

Before analyzing the above graph, we have to keep in mind that, according to common opinion, approximately half of higher education students study outside their hometown. The orange line shows the trend of the number of students.

It is understood that while the number of students increased, the capacity per student decreased until 2014. Although the capacity rate has increased since that year, the reason is clearly the decrease in the number of students. Even so, the rate remains well below 50%.

Now that we’ve found our model, we can predict this year’s capacity value and see whether the government falls short of this expected value.

```
#Forecasting the current year
capacity_fc <- forecast(fit, new_data = test)

#Setting legends for separate columns of the plot
colors <- c("Actual" = "red", "Predicted" = "blue")

#Plotting the predicted and actual values
ggplot(train, mapping = aes(x = date, y = capacity)) +
  geom_line(size = 1.5) +
  geom_point(capacity_fc, mapping = aes(x = date, y = .mean, color = "Predicted")) +
  geom_point(test, mapping = aes(x = date, y = capacity, color = "Actual")) +
  scale_x_continuous(breaks = seq(1991, 2021, 5)) +
  scale_y_continuous(breaks = seq(200000, 750000, 100000),
                     labels = scales::label_number_si()) +
  scale_color_manual(values = colors, name = "Points") +
  theme_minimal() +
  theme(axis.title.y = element_text(vjust = 2),
        panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

According to the above plot, the capacity provided by the government falls short of the expected value of the model. Students seem right to complain about dormitory capacity, considering that half of them study outside their hometown.

In order to answer this question, we have to choose some variables that monitor economic conditions, to seek a proper election date from the government's perspective. The variables we are going to use are the chain-weighted GDP, which removes the inflation effect from GDP, and the CPI.

We will predict the CPI and GDP for the next 8 quarters with the ARIMA method from the fable package. The dataset contains the quarterly values of the related variables between 2010 Q1 and 2021 Q2.

```
library(tidyverse)
library(readxl)
library(fable)
library(tsibble)
library(feasts)

#Building and tidying the dataset
df <- read_excel("election.xlsx")
options(digits = 11) #for printing the decimal digits of the GDP variable

df %>%
  mutate(cpi = round(as.numeric(cpi), 2)) %>%
  na.omit() -> df

#Creating tsibble variables for the CPI and GDP series
cpi_ts <- df %>%
  select(cpi) %>%
  ts(start = 2010, frequency = 4) %>%
  as_tsibble()

gdp_ts <- df %>%
  select(gdp) %>%
  ts(start = 2010, frequency = 4) %>%
  as_tsibble()
```

```
#Analyzing the GDP data
gdp_ts %>%
  autoplot(color = "blue", size = 1) +
  scale_y_continuous(labels = scales::label_number_si()) + #for SI prefixes
  xlab("") + ylab("") +
  theme_minimal()
```

When we examine the above plot, there is no significant indication of changing variance, so we won't take a logarithmic transformation of the data. On the other hand, there is clearly strong seasonality, so we will first take a seasonal difference. Although the mean of the data also seems non-stationary, when a series shows strong seasonality we take the seasonal difference first, because it may stabilize the mean as well.
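What a seasonal difference does can be seen with a tiny base-R example (assumed toy numbers, not the GDP data): `diff(x, lag = 4)` subtracts the value from the same quarter of the previous year, removing both the seasonal pattern and a linear trend in one step.

```r
# Quarterly seasonal pattern (10, 20, 30, 40) plus a trend of +1 per quarter
x <- rep(c(10, 20, 30, 40), times = 3) + 1:12

diff(x, lag = 4)  # the lag-4 difference is constant: the series is stabilized
```

Here every lag-4 difference equals 4 (four quarters of trend), so both the seasonality and the trend are gone after one seasonal difference.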

```
#Seasonally differenced data
gdp_ts %>%
  gg_tsdisplay(difference(value, 4),
               plot_type = 'partial') +
  labs(title = "Seasonally differenced", y = "")
```

According to the plot above, the mean seems to be stabilized; we don't need to take a further difference. In order to find an appropriate ARIMA model, we first look at the **ACF** graph. There is a spike at lag 4, which suggests a seasonal **MA(1)** component. On the other hand, the **PACF** has a significant spike at lag 4, which suggests a seasonal **AR(1)** component.
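The "significant spike" judgment comes from the usual white-noise bands drawn at ±1.96/√n; with roughly 46 quarterly observations (2010 Q1 through 2021 Q2), that threshold is about ±0.29. A quick base-R check:

```r
n <- 46                      # quarters from 2010 Q1 through 2021 Q2
threshold <- 1.96 / sqrt(n)  # white-noise band used in ACF/PACF plots
round(threshold, 2)
```

Spikes inside this band are consistent with zero autocorrelation at the 5% level.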

We can start with the below models according to the ACF/PACF graphs mentioned above, and we will also add an automatic selection process to find the best results. We will set the **stepwise** parameter to FALSE so that a larger set of candidate models is searched.

```
#Modeling the GDP
fit_gdp <- gdp_ts %>%
  model(
    arima000011 = ARIMA(value ~ pdq(0,0,0) + PDQ(0,1,1)),
    arima000110 = ARIMA(value ~ pdq(0,0,0) + PDQ(1,1,0)),
    auto = ARIMA(value, stepwise = FALSE)
  )
```

```
#Shows all models in detail
fit_gdp %>%
  pivot_longer(everything(),
               names_to = "Model name",
               values_to = "Orders")

# A mable: 3 x 2
# Key: Model name [3]
#  `Model name` Orders
#  <chr>        <model>
#1 arima000011  <ARIMA(0,0,0)(0,1,1)[4] w/ drift>
#2 arima000110  <ARIMA(0,0,0)(1,1,0)[4] w/ drift>
#3 auto         <ARIMA(1,0,0)(0,1,1)[4] w/ drift>
```

```
#Ranks the models by the AICc criterion
fit_gdp %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc)

# A tibble: 3 x 2
#  .model       AICc
#  <chr>       <dbl>
#1 auto        1518.
#2 arima000011 1520.
#3 arima000110 1524.
```

If we analyze the above results, we conclude that the auto model is better than the others by AICc. The model consists of non-seasonal **AR(1)** and seasonal **MA(1)** components on the seasonally differenced data, which is very similar to our guess based on the **ACF**. **[4]** indicates a quarterly time series, and drift indicates the intercept in the model formula.

After finding our model, it is good practice to check the residuals to see whether any correlation remains in the data.

```
#Checks the residuals
fit_gdp %>%
  select(auto) %>%
  gg_tsresiduals(lag = 12)
```

From the ACF graph above, we clearly see that no spike exceeds the threshold limits, which means the residuals are white noise. Now we can calculate forecasts with confidence.

```
#Forecasting GDP
(fit_gdp %>%
  forecast(h = 8) %>%
  filter(.model == "auto") %>%
  as.data.frame() %>% #converted to a data frame to use the select function below
  select(index, .mean) -> f_gdp)

#    index        .mean
#1 2021 Q3 521226607.35
#2 2021 Q4 526962302.61
#3 2022 Q1 454107505.86
#4 2022 Q2 476945092.38
#5 2022 Q3 536502403.29
#6 2022 Q4 544456726.15
#7 2023 Q1 472308990.28
#8 2023 Q2 495371912.09
```

Now, we will do the same steps for the CPI data.

```
#Analyzing the CPI data
cpi_ts %>%
  autoplot(color = "red", size = 1) +
  scale_y_continuous(labels = scales::label_number_si()) + #for SI prefixes
  xlab("") + ylab("") +
  theme_minimal()
```

A very strong uptrend is seen in the data, while no seasonality is detected on the graph. Therefore, we will take the first difference to stabilize the mean. But because of the exponential uptrend, we had better log-transform the data to stabilize the variance as well.
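The first difference of the log has a useful interpretation: it approximates the period-over-period growth rate. A small base-R check with an assumed series growing exactly 2% per period:

```r
x <- 100 * 1.02^(0:5)   # series growing 2% per period
growth <- diff(log(x))  # log-differences: all equal log(1.02)
exp(growth) - 1         # back-transformed: the 2% growth rate, recovered
```

This is why modeling `difference(log(value))` amounts to modeling the inflation growth rate itself.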

```
#The first difference of the log-transformed data
cpi_ts %>%
  gg_tsdisplay(difference(log(value)),
               plot_type = 'partial') +
  labs(title = "", y = "")
```

The data now seem stationary in terms of the mean. Analyzing the **ACF/PACF** plots, we see spikes at the seasonal lag (**4**) in both, which suggests seasonal **AR(1)** and seasonal **MA(1)** candidates.

```
#Modeling the CPI
fit_cpi <- cpi_ts %>%
  model(
    arima010001 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(0,0,1)),
    arima010100 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(1,0,0)),
    auto = ARIMA(log(value), stepwise = FALSE)
  )

#Shows all CPI models in detail
fit_cpi %>%
  pivot_longer(everything(),
               names_to = "Model name",
               values_to = "Orders")

# A mable: 3 x 2
# Key: Model name [3]
#  `Model name` Orders
#  <chr>        <model>
#1 arima010001  <ARIMA(0,1,0)(0,0,1)[4] w/ drift>
#2 arima010100  <ARIMA(0,1,0)(1,0,0)[4] w/ drift>
#3 auto         <ARIMA(0,2,1)(1,0,0)[4]>

#Ranks the models by the AICc criterion
fit_cpi %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc)

# A tibble: 3 x 2
#  .model       AICc
#  <chr>       <dbl>
#1 arima010100 -246.
#2 arima010001 -245.
#3 auto        -240.
```

According to the **AICc** results, the best model has a seasonal **AR(1)** part with a first difference. As the last step in assessing the fit of the model, we will check the residuals.

```
#Checks the residuals
fit_cpi %>%
  select(arima010100) %>%
  gg_tsresiduals(lag = 12)
```

The residuals are within the thresholds of the ACF plot, which means no information remains in the data. We can safely forecast with our model.

```
#Forecasting CPI
(fit_cpi %>%
  forecast(h = 8) %>%
  filter(.model == "arima010100") %>%
  as.data.frame() %>% #converted to a data frame to use the select function below
  select(index, .mean) -> f_cpi)

#    index    .mean
#1 2021 Q3 552.9687
#2 2021 Q4 572.1427
#3 2022 Q1 590.3115
#4 2022 Q2 608.4114
#5 2022 Q3 624.0214
#6 2022 Q4 642.0021
#7 2023 Q1 659.8191
#8 2023 Q2 677.8676
```

Now we can investigate a convenient time for the election from the ruling party's perspective. A graph might be useful, but first we need a decisive variable. The ratio of GDP to CPI would be an effective indicator of people's perception of the country's situation, because it relates both to consumer prices and to the country's growth rate.

```
#Finding a proper time on the plot for the election
tibble(
  ratio = f_gdp$.mean / f_cpi$.mean,
  date = f_gdp$index
) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = ratio),
            size = 1,
            color = "green") +
  geom_point(aes(x = date[5], y = ratio[5]), #red point
             color = "red",
             size = 2) +
  ylab("GDP/CPI") +
  xlab("") +
  theme_minimal() +
  theme(axis.title.y = element_text(margin = margin(r = 0.5, unit = "cm"))) #moves the title away from the tick numbers
```

When we examine the above plot, a significant decrease is observed from the third quarter of this year to the end of the first quarter of next year. That period looks too risky for an election, but after the first quarter of 2022, an increasing trend is seen. From the third quarter of 2022, however, there is again a big decrease lasting until the officially announced election date. The ideal election window therefore seems to be the third quarter of 2022 for the ruling parties.

The countries we will compare are Greece, Italy, and Turkey, which are all in the same Mediterranean climate zone and have suffered brutal fire damage. We will compare the hectares (ha) of damaged areas and the number of fires. The dataset has annual observations for the 2009-2021 period. Of course, the data for this year could only be taken up to August 20 from the EFFIS database, but that doesn't change the underlying point of this article.

```
library(tidyverse)
library(readxl) #for read_excel below
library(tsfknn)
library(forecast)

df <- read_excel("fires_statistics.xlsx")
```

For comparison purposes, we will use a second y-axis in our plot. We have to use a transformation rate to create it, because the **sec.axis** function, which we are going to use for building the second y-axis, cannot build an independent y-axis; it creates one relative to the main y-axis via a mathematical transformation. Here the transformation rate can be taken as approximately 100.
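The mechanics are purely arithmetic: the fire counts are multiplied by the rate before being drawn against the main axis, and `sec_axis(~ . / rate)` divides them back for the labels. A minimal check with assumed numbers:

```r
rate <- 100                   # the transformation rate used below
fires_number <- c(5, 12, 20)  # assumed counts
drawn  <- fires_number * rate # positions on the main (burnt-areas) axis
labels <- drawn / rate        # what sec_axis(~ . / rate) would display

identical(labels, fires_number)  # the round trip is exact here
```

Any invertible pair of transformations works; a simple multiply/divide is enough when the two scales differ by roughly a constant factor.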

```
fun <- function(country = "Greece"){
  df %>%
    .[.$country == country, ] %>%
    ggplot(aes(x = year)) +
    geom_bar(aes(y = burnt_areas_ha), stat = "identity", fill = "orange") +
    #Bar labels (burnt_areas_ha)
    geom_text(aes(y = burnt_areas_ha, label = burnt_areas_ha),
              vjust = -0.2, color = "orange") +
    #multiplied by the transformation rate to tune in with the main y-axis (burnt areas)
    geom_line(aes(y = fires_number * 100), size = 2, color = "lightblue") +
    #Line labels (fires_number), placed on the rescaled line
    geom_text(aes(y = fires_number * 100, label = fires_number),
              vjust = 1, color = "lightblue") +
    scale_x_continuous(breaks = seq(2009, 2021, 1)) +
    scale_y_continuous(
      #for the main y-axis
      labels = scales::label_number_si(),
      breaks = scales::pretty_breaks(n = 7),
      #for the second y-axis
      sec.axis = sec_axis(~ . / 100, #divided by the transformation rate to be read against the first y-axis
                          name = "Number of Fires",
                          breaks = scales::pretty_breaks(7)),
    ) +
    xlab("") + ylab("Burnt Areas(ha)") +
    ggtitle(paste("The comparison of the volume(ha) and numbers of Forest Fires in",
                  country)) +
    theme_minimal() +
    theme(
      #Main y-axis
      axis.title.y = element_text(color = "orange",
                                  #puts distance between title and axis labels
                                  margin = margin(r = 0.5, unit = "cm"),
                                  hjust = 1), #sets the main axis title top left
      #Second y-axis
      axis.title.y.right = element_text(color = "lightblue",
                                        #puts distance between title and axis labels
                                        margin = margin(l = 0.5, unit = "cm"),
                                        hjust = 0.01), #sets the second axis title bottom right
      #rotates the year labels to avoid overlap
      axis.text.x = element_text(angle = 30,
                                 vjust = 1,
                                 hjust = 1),
      panel.grid.minor = element_blank(), #removes the minor grid of the panel
      plot.title = element_text(face = "bold.italic", hjust = 0.5)
    )
}
```

```
fun("Turkey")
```

```
fun("Italy")
```

When we take a look at the above plots, the damaged areas have increased considerably compared to past years in all three countries. Probably what confused Turkish people was the high ratio of damaged area to the number of fires; the other two countries seem to have the same problem compared to past years. Undoubtedly, global warming lies behind this. Of course, the governments' insufficient responses to the fires also played a part.

Now let's forecast the number of fires this year to determine whether there is an anomaly in Turkey. Because our dataset is very small, the k-nearest neighbors (KNN) algorithm is a convenient choice. We will set **k** to 3 and 4 for optimization, because the square root of the number of observations is approximately 3.6.

Since the PACF order of the pattern is zero, we set the **lags** parameter to 1:5. To determine the PACF order, we first check whether the data are stationary.
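A minimal base-R sketch of what `knn_forecasting` does with the `lags` and `k` parameters (the helper `knn_one_step` is mine, a simplified stand-in for the tsfknn implementation): the last `lags` values form a query vector, and the forecast averages the values that followed the `k` most similar historical lag-vectors.

```r
knn_one_step <- function(x, lags = 2, k = 2) {
  feats    <- embed(x, lags + 1)        # rows: x[t], x[t-1], ..., x[t-lags]
  targets  <- feats[, 1]                # the value that followed each lag-vector
  examples <- feats[, -1, drop = FALSE] # the lag-vectors themselves
  query    <- rev(tail(x, lags))        # most recent lag-vector, newest first
  d  <- sqrt(rowSums((examples - matrix(query, nrow(examples), lags,
                                        byrow = TRUE))^2))
  nn <- order(d)[1:k]                   # the k nearest neighbors
  mean(targets[nn])                     # forecast: average of their successors
}

# On a perfectly repeating pattern, the forecast continues the cycle
knn_one_step(c(1, 2, 3, 1, 2, 3, 1, 2), lags = 2, k = 2)  # 3
```

With so few observations, this neighbor-averaging scheme avoids estimating the parameters a regression model would need.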

```
#Checking whether the pattern is stationary
df %>%
  filter(country == "Turkey") %>%
  ggplot(aes(x = year, y = fires_number)) +
  geom_line(size = 1) +
  xlab("") + ylab("") +
  theme_minimal()
```

There seems to be exponential growth in the above graph; hence, we will take the first difference of the data to stabilize the mean.

```
#Making the time series stationary and determining the order of ACF and PACF
df %>%
  filter(country == "Turkey") %>%
  select(fires_number) %>%
  pull() %>% #converts the dataframe column to a vector
  ts(start = 2009, frequency = 1) %>% #converts the vector to a time series
  diff() %>% #takes the first difference of the time series
  ggtsdisplay() #draws the ACF/PACF plots
```

It seems the mean is stabilized, so we don't have to take a second difference. In the PACF plot, there is no significant spike beyond the limits; that is why we took the PACF order as zero.

```
#Forecasting with KNN
df %>%
  filter(country == "Turkey") %>%
  select(fires_number) %>%
  pull() %>% #converts the dataframe column to a vector
  head(-1) %>% #removes the last observation for prediction
  ts(start = 2009, frequency = 1) %>% #converts the vector to a time series
  knn_forecasting(h = 1, lags = 1:5, k = c(3, 4)) %>%
  #.$pred (the prediction value for 2021, which is 424.575)
  plot()

#actual value (the last observation)
points(x = 2021,
       y = last(df$fires_number[df$country == "Turkey"]),
       pch = 16,
       col = "blue")
legend("topleft",
       legend = c("Predicted Value", "Actual Value"),
       col = c("red", "blue"),
       pch = c(16, 16)
)
```

When we analyze the plot, we see that the predicted value for this year is considerably higher than the actual value, which suggests the speculation about the number of fires in Turkey is baseless.

The dataset we're going to use for this article is from Our World in Data. What we want to do here is determine the partial and full vaccination rates as percentages of the relevant population, and then examine the relationship between the vaccination rates and the numbers of cases and deaths over monthly intervals.

```
library(readxl)
library(dplyr)
library(lubridate)
library(rlang)
library(scales)
library(tidyr)
library(stringr)
library(purrr)
library(plotly)

df_raw <- read_excel("covid19_vacc.xlsx")

cement <- function(...){
  args <- ensyms(...)
  paste(purrr::map(args, as_string))
}
```

The **cement** function we've built is taken from the Advanced R book. Its purpose is to save us from writing our arguments in string form, which can be quite annoying; with this function, we can write our arguments without quotes.
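For readers without rlang at hand, an equivalent can be sketched in base R with `substitute` and `deparse` (the name `cement_base` is mine):

```r
cement_base <- function(...) {
  # capture the unevaluated arguments and turn each symbol into a string
  vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
}

cement_base(Turkey, new_cases)  # c("Turkey", "new_cases")
```

Either version lets us call `fun(Turkey, new_cases)` below instead of `fun("Turkey", "new_cases")`.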

```
fun <- function(...){
  #Assigning the symbols as strings
  var_loc <- cement(...)[1]
  var_toll <- cement(...)[2]

  df <- df_raw %>%
    .[.$location == var_loc, ] %>%
    transmute(
      date = date %>% as.Date(),
      #First-phase (partly) vaccinated people
      partly = people_vaccinated - people_fully_vaccinated,
      #Fully vaccinated people
      fully = people_fully_vaccinated,
      total_vaccinations,
      new_cases,
      new_deaths,
      population
    ) %>% na.omit() %>%
    group_by(month = format(date, "%Y%m")) %>%
    mutate(
      #The monthly amounts
      new_cases = cumsum(new_cases),
      new_deaths = cumsum(new_deaths)
    ) %>%
    #The last observation for each month
    slice(which.max(cumsum(partly & fully))) %>%
    ungroup() %>%
    select(-month) %>%
    #For percentage demonstration
    mutate(partly_rate = round(partly / population, 4) * 100,
           fully_rate = round(fully / population, 4) * 100)

  fig <- plot_ly(df)
  fig %>%
    add_trace(x = ~date, y = ~partly_rate, type = 'bar',
              name = 'partly',
              marker = list(color = 'lightblue'),
              hoverinfo = "text",
              text = ~paste(partly_rate, '%')) %>%
    add_trace(x = ~date, y = ~fully_rate, type = 'bar',
              name = 'fully',
              marker = list(color = 'orange'),
              hoverinfo = "text",
              text = ~paste(fully_rate, '%')) %>%
    add_trace(x = ~date, y = ~.data[[var_toll]], type = 'scatter', mode = 'lines',
              name = str_replace(var_toll, "_", " "),
              yaxis = "y2",
              line = list(color = 'red', width = 4),
              hoverinfo = "text",
              #.data is the pronoun provided by data masks: https://adv-r.hadley.nz/evaluation.html#pronouns
              text = ~paste(comma(.data[[var_toll]]))) %>%
    layout(barmode = "stack",
           legend = list(orientation = 'h', #horizontal
                         xanchor = "center", #centered on the x-axis
                         x = 0.5),
           xaxis = list(title = ""),
           yaxis = list(side = 'left',
                        ticksuffix = "%",
                        title = 'Total Percentage of Vaccinations'),
           yaxis2 = list(overlaying = "y",
                         side = 'right',
                         automargin = TRUE, #prevents the axis title from overlapping the labels
                         title = paste("The number of",
                                       str_replace(var_toll, "_", " "),
                                       "given monthly interval")))
}
```

```
fun(Turkey,new_cases)
```

```
fun(Germany,new_cases)
```

When we examine the graphs, we see that total vaccinations are below 50% in Turkey and above 60% in Germany. Both countries peaked in the number of cases in May, but there has been a steady decline in the cases line since then as the proportion of fully vaccinated people has risen. We can clearly see that the vaccines have worked so far. So get vaccinated as soon as possible, please!

In order to check that, we have to model inflation rates with some variables. The most common view among economic authorities is that the variables affecting the rates are currency exchange rates and the CDS (credit default swap). Of course, we will also add the funding rate, the variable the president mentioned, to compare it with the other explanatory variables.

Because the variables can be highly correlated with each other, we will prefer the random forest model, which also has a built-in way to compute feature importance.

**Random forest** for **regression** constructs multiple decision trees and averages the estimation results of the individual trees. This algorithm is more robust to overfitting than classical decision trees.

The random forest algorithm averages these results; that is, it reduces the variance by training the trees on different parts of the training set. This increases the performance of the final model, although it creates a small increase in bias.

The random forest uses the bootstrap aggregating (**bagging**) algorithm. Given a training sample, **X = x_{1}, …, x_{n}**, with responses **Y = y_{1}, …, y_{n}**, bagging repeatedly (B times) selects a random sample with replacement and fits a tree f_b to each.

Unseen values, *x′*, are then fitted by averaging the individual predictions: f̂(x′) = (1/B) ∑_{b=1}^{B} f_b(x′).

The **permutation feature importance** method will be used to determine the effects of the variables in the random forest model. This method calculates the increase in the prediction error (**MSE**) after permuting the feature's values. If permuting doesn't change the model error, the related feature is considered unimportant. We will use the **varImp** function to calculate variable importance; its conditional parameter, when set to TRUE, adjusts for correlations between predictors.
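The logic of permutation importance can be demonstrated without randomForest, using a plain linear model on assumed synthetic data (all names here are mine): shuffle one predictor, re-measure the MSE, and see how much the error grows.

```r
set.seed(42)
n   <- 200
x1  <- rnorm(n)  # strong predictor
x2  <- rnorm(n)  # weak predictor
dat <- data.frame(x1, x2, y = 3 * x1 + 0.1 * x2 + rnorm(n))

fit <- lm(y ~ x1 + x2, data = dat)
mse <- function(d) mean((d$y - predict(fit, d))^2)
base_mse <- mse(dat)

perm_increase <- function(var) {
  d <- dat
  d[[var]] <- sample(d[[var]])  # destroy the feature's information
  mse(d) - base_mse             # error increase serves as importance
}

perm_increase("x1") > perm_increase("x2")  # x1 is far more important
```

Shuffling the strong predictor inflates the error dramatically, while shuffling the weak one barely moves it; that gap is exactly what the importance scores below quantify.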

The data we are going to use can be downloaded here. The variables we will examine:

- *cpi*: The annual consumer price index. It is also called the inflation rate. This is our target variable.
- *funding_rate*: The one-week repo rate, which is determined by the Turkish Central Bank. It is also called the policy rate.
- *exchange_rate*: The currency exchange rate between the Turkish lira and the US dollar.
- *CDS*: The credit default swap. It measures, in a sense, the investment risk of a country or company, and is mostly affected by foreign policy developments in Turkey. Although the most popular CDS is the 5-year, I will take the 1-year USD CDS in this article.

```
library(readxl)
library(dplyr)
library(ggplot2)
library(randomForest)
library(varImp)
#Create the dataframe
df <- read_excel("cpi.xlsx")

#Random forest modeling
model <- randomForest(CPI ~ funding_rate + exchange_rate + CDS,
                      data = df, importance = TRUE)

#conditional = TRUE adjusts for correlations between predictors
i_scores <- varImp(model, conditional = TRUE)

#Gathering row names in 'var' and converting them to a factor
#to provide the 'fill' parameter for the bar chart
i_scores <- i_scores %>% tibble::rownames_to_column("var")
i_scores$var <- i_scores$var %>% as.factor()

#Plotting the bar and polar charts for comparing variables
i_bar <- ggplot(data = i_scores) +
  geom_bar(
    stat = "identity", #leaves the data without count and bin
    mapping = aes(x = var, y = Overall, fill = var),
    show.legend = FALSE,
    width = 1
  ) +
  labs(x = NULL, y = NULL)

i_bar + coord_polar() + theme_minimal()
i_bar + coord_flip() + theme_minimal()
```

When we examine the charts above, we can clearly see that the funding and exchange rates have similar effects on the model, and that CDS has significant importance in the behavior of the CPI. So the president's theory seems to fall short of what he claims.


**Neural Network Structure**

The neural network consists of three kinds of layers. The predictors form the bottom layer, the predicted output forms the top layer, and the nodes in between form the intermediate layers, called hidden neurons.

As we can see from the above graph, each layer of intermediate nodes takes its input from the previous layer. At each node, the inputs and their corresponding weight coefficients are combined linearly, as in linear regression. In the hidden neurons (j), this result is then transformed by a nonlinear function. This is called a **multilayer feed-forward network**.

As mentioned before, the output is recalculated by the sigmoid function, which is nonlinear:

$$s(z) = \frac{1}{1 + e^{-z}}$$

The modified output is given to the next layer as input. This process helps make the network robust to outliers. The weights are chosen randomly at first and then learned from the observations by minimizing a cost function (e.g. MSE).
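A minimal sketch of one forward pass through such a network, with randomly initialized weights as described above (the layer sizes here are arbitrary, chosen only for illustration):

```r
# Sigmoid activation: squashes any input into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# One forward pass through a 3-input, 2-hidden-node, 1-output network
forward <- function(x, W1, b1, w2, b2) {
  h <- sigmoid(W1 %*% x + b1)  # hidden layer: linear combination + sigmoid
  sum(w2 * h) + b2             # output layer: linear combination of hidden nodes
}

set.seed(7)
W1 <- matrix(rnorm(2 * 3), nrow = 2)  # random initial weights, as in the text
b1 <- rnorm(2)
w2 <- rnorm(2)
b2 <- rnorm(1)

x <- c(0.2, -1.0, 0.5)  # three predictors (the bottom layer)
forward(x, W1, b1, w2, b2)
```

Training would then adjust `W1`, `b1`, `w2`, `b2` to minimize the cost function; only the forward pass is shown here.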

In time series regression, lagged values of the series are used as inputs to a neural network; this is called **neural network autoregression (NNAR)**. We will use a feed-forward network with one hidden layer, denoted NNAR(p, k): **p** indicates the number of lagged inputs, and **k** denotes the number of nodes in the hidden layer.

For seasonal time series, it is better to also include the last observed values from the same season of previous years. In this case, the model is denoted NNAR(p, P, k)m, which indicates that:

- **p** is the number of lagged inputs (predictors).
- **P** is the number of seasonal lags.

For non-seasonal time series data, the default **p** value is the optimal number of lagged inputs according to Akaike's information criterion:

$$\text{AIC} = T\log\left(\frac{SSE}{T}\right) + 2(k+2)$$

- **T** is the number of observations.
- **SSE** is the sum of squared errors.
- **k** is the number of predictors.
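Under these definitions, the criterion can be computed directly from a fitted model. A base-R sketch, using linear AR models on a built-in series as an illustration (note that R's built-in `AIC()` uses a different constant, and that a rigorous comparison would fit all candidate models on the same sample):

```r
# AIC = T * log(SSE / T) + 2 * (k + 2), per the definitions above
aic_fpp <- function(fit) {
  T_  <- length(residuals(fit))
  SSE <- sum(residuals(fit)^2)
  k   <- length(coefficients(fit)) - 1  # predictors, excluding the intercept
  T_ * log(SSE / T_) + 2 * (k + 2)
}

# Choosing the number of lags p by minimum AIC over AR(p) models
y    <- as.numeric(LakeHuron)            # a built-in example series
fits <- lapply(1:5, function(p) {
  X <- embed(y, p + 1)                   # column 1 = y_t, the rest = its lags
  lm(X[, 1] ~ X[, -1])
})
sapply(fits, aic_fpp)  # pick the p with the smallest value
```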

If k is not specified, it is set to (p + P + 1)/2, rounded to the nearest integer. By default, the network is trained 20 times from different random starting weights, and all the results are averaged for the final output. For forecasting, the network uses the lagged values as inputs for a one-step forecast; this forecast, along with the lagged values, is then used as input for the two-step forecast, and the process continues until the required forecasts are produced.
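The iterative one-step scheme can be sketched in base R; here a linear autoregression on a built-in series stands in for the trained neural network (illustrative only):

```r
# Recursive forecasting: each one-step prediction is fed back as a lagged input
y <- as.numeric(AirPassengers)
p <- 3                          # number of lagged inputs

X   <- embed(y, p + 1)          # y_t alongside its p lags
fit <- lm(X[, 1] ~ X[, -1])     # stand-in for the trained network

h    <- 5
lags <- rev(tail(y, p))         # most recent lag first
fc   <- numeric(h)
for (i in seq_len(h)) {
  fc[i] <- sum(coef(fit) * c(1, lags))  # one-step forecast from current lags
  lags  <- c(fc[i], lags[-p])           # slide the window: forecast becomes a lag
}
fc
```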

We will prefer the non-seasonal approach because the data we will be using covers less than the two periodic cycles required for detecting seasonality. We are going to predict daily cases in the UK for the next 30 days. The data file (**covid_uk.csv**) can be downloaded from here.

```
library(tidyverse)
library(forecast)
library(purrr)
library(lubridate)
#Building the dataframe
df <- read_csv("covid_uk.csv",
               col_types = list(col_character(),
                                col_date(format = "%d/%m/%Y"),
                                col_double()))

#Converting to a time series
inds <- seq(as.Date("2020-01-31"), as.Date("2021-05-21"), by = "day")
df_ts <- df %>%
  .$new_cases %>%
  ts(start = c(2020, as.numeric(format(inds[1], "%j"))), frequency = 365)
```

Before modeling the data, we will use bootstrap aggregating (bagging) to prevent overfitting and improve accuracy. This process was detailed in one of the previous articles. We will create a function that makes predictions and plots the results, but before that, we define a date transformation function to produce readable dates.
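For intuition, the core of the moving-block bootstrap behind `bld.mbb.bootstrap()` can be sketched in base R (the real function additionally applies a Box-Cox transformation and an STL decomposition, and resamples only the remainder component):

```r
set.seed(123)
# Moving-block bootstrap: resample overlapping blocks so that the
# autocorrelation inside each block is preserved
mbb <- function(x, block_size) {
  n      <- length(x)
  starts <- sample(seq_len(n - block_size + 1),
                   ceiling(n / block_size), replace = TRUE)
  blocks <- lapply(starts, function(s) x[s:(s + block_size - 1)])
  head(unlist(blocks), n)  # trim to the original length
}

x   <- as.numeric(LakeHuron)                  # 98 annual observations
sim <- replicate(5, mbb(x, block_size = 10))  # five bootstrapped series
dim(sim)
```

Each simulated series is a plausible alternative history of the data; fitting a model to each and averaging the forecasts is what makes the procedure "bagged".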

```
#The function to convert a decimal date to a proper format
date_transform <- function(x) {format(date_decimal(x), "%b %Y")}

#Forecasting with a simulated neural network and plotting the results
sim_nn <- function(data, freq, h, n = 100){
  #For reproducible results
  set.seed(123)
  #Bootstrapping the time series
  sim <- bld.mbb.bootstrap(data, n)
  #Bagged neural network
  future <- sim %>%
    map(function(x){simulate(nnetar(x, P = 0), nsim = h)}) %>%
    unlist() %>%
    matrix(ncol = h, nrow = n, byrow = TRUE) %>%
    pmax(0) #preventing negative values
  #The beginning date of the prediction
  start <- tsp(data)[2] + 1/freq
  #Forecast object with prediction intervals
  #at the 5% significance level
  simfc <- structure(list(
    mean = ts(colMeans(future), start = start, frequency = freq),
    lower = future %>% as.data.frame() %>%
      map_dbl(quantile, prob = 0.025) %>%
      ts(start = start, frequency = freq),
    upper = future %>% as.data.frame() %>%
      map_dbl(quantile, prob = 0.975) %>%
      ts(start = start, frequency = freq),
    level = 95),
    class = "forecast")
  #Making the predictions a global variable for easy access
  assign("simfc", simfc, envir = .GlobalEnv)
  #Plotting the prediction results with the observations
  sim_plot <- autoplot(data) +
    autolayer(simfc) +
    scale_x_continuous(labels = date_transform, breaks = seq(2020, 2022, 0.2)) +
    ylab("") + xlab("") +
    ggtitle("Daily Cases of United Kingdom") +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5))
  #Converting the time series with decimal dates
  #to a dataframe for proper demonstration
  start <- start %>% as.numeric() %>% date_decimal() %>% as.Date()
  h_date <- seq(start, by = "day", length.out = h)
  simfc_df <- simfc %>% data.frame()
  rownames(simfc_df) <- h_date
  #The output list
  list(simfc_df, sim_plot)
}
```

```
sim_nn(df_ts,365,30)
```

**References**

- Forecasting: Principles and Practice, *Rob J Hyndman and George Athanasopoulos*
- Our World in Data

In order to do that, we will compare the performance of some developed countries: France, Germany, the United Kingdom, and the United States. The dataset for this article can be found here.

In the dataset, we have the number of new cases and new deaths per day, and the population of the related countries. In the code block below, we replace blank and NA values with zeros for the new cases and new deaths variables. We will limit the data to this year only.

```
library(tidyverse)
library(lubridate)
library(readxl)
library(plotly)
library(scales)
#Preparing the dataset
data <- read_excel("covid-19_comparing.xlsx")

data %>%
  mutate(
    date = as.Date(date),
    new_deaths = if_else(is.na(new_deaths), 0, new_deaths),
    new_cases = if_else(is.na(new_cases), 0, new_cases)
  ) %>%
  filter(date >= "2021-01-01", date <= "2021-04-26") -> df
```

First, we will compare the number of cases per day (new_cases), but we have to scale this variable because each country has a different population. The scaling method we are going to use is min-max normalization. I preferred the **ggplotly** function for ease of analysis on the graph, because we can click on each legend entry to remove the related country from the plot, and click again to put it back.

```
#Normalizing function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

#Drawing normalized results for comparison
df %>%
  group_by(location) %>%
  mutate(new_cases = normalize(new_cases)) %>%
  ggplot(aes(date, new_cases, color = location)) +
  geom_line(size = 1, alpha = 0.7) +
  scale_y_continuous(limits = c(0, 1)) +
  guides(color = guide_legend(title = "Countries")) +
  labs(x = "", y = "New Cases") +
  theme_minimal() -> cases_norm

ggplotly(cases_norm)
```

Since the beginning of the year, the spread of the disease has decreased dramatically in the UK and the USA, but the opposite has happened in France. As for Germany, it declined until March, but then the numbers rose again. In April, however, all countries declined sharply.

To confirm the above trends for all countries, this time we will plot the number of daily cases per population. When we examine the plot below, we see the same trends for each country.

```
#New cases per population
df %>%
  group_by(location) %>%
  mutate(casesPerPop = new_cases/population) %>%
  ggplot(aes(date, casesPerPop, color = location)) +
  geom_line(size = 1, alpha = 0.7) +
  scale_y_continuous(limits = c(0, 0.002), labels = percent_format()) +
  guides(color = guide_legend(title = "Countries")) +
  labs(x = "", y = "", title = "New Cases per Population") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) -> cases_perpop

ggplotly(cases_perpop)
```

I also want to examine the death rate, in order to shed light on the health systems of the related countries. To do that, we will plot the death toll per population and per case count.

```
#Deaths per population
df %>%
  group_by(location) %>%
  mutate(deathsPerPop = sum(new_deaths)/population) %>%
  ggplot(aes(location, deathsPerPop, color = location)) +
  geom_bar(stat = "identity", position = "identity", fill = NA) +
  scale_y_continuous(limits = c(0, NA), labels = percent_format()) +
  labs(x = "", y = "", title = "Deaths per Population") +
  theme_minimal() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

When we examine the graph below, we see that although the UK is doing better in terms of the spread of the virus, it looks the worst for the death rate per population.

```
#Deaths per case
df %>%
  group_by(location) %>%
  mutate(deathsPerCase = sum(new_deaths)/sum(new_cases)) %>%
  ggplot(aes(location, deathsPerCase, color = location)) +
  geom_bar(stat = "identity", position = "identity", fill = NA) +
  scale_y_continuous(limits = c(0, NA), labels = percent_format()) +
  labs(x = "", y = "", title = "Deaths per Case") +
  theme_minimal() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

This time, Germany performs worse than the others. It should be noted that although France performed worse for daily cases, it seems to have the best results for both death rates.

While the effect of the pandemic decreases in developed countries, the government has unfortunately put Turkey into a full lockdown. As much as the virus is dangerous, bad management is dangerous too.

**References**

- *The data source*: Our World in Data
- R for Data Science, *Hadley Wickham & Garrett Grolemund*
- Predicting injuries for Chicago traffic crashes, *Julia Silge*

Time series models, most of the time, use past observations as predictor variables. But sometimes we need external variables that affect the target variable. To include those variables, we have to use regression models. Here, we are going to use **dynamic regression** to capture more elaborate patterns; the difference from orthodox regression models is that the residuals are not assumed to be white noise, and are instead modeled by **ARIMA**.

The residuals we mentioned above have autocorrelation, which means they contain information. To indicate that, we will denote them as $\eta_t$ instead of $\varepsilon_t$. In this way, the residual term can be modeled by ARIMA. For instance, dynamic regression with ARIMA(1,1,1) errors is described as:

$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t$$

$$(1-\phi_1 B)(1-B)\eta_t = (1+\theta_1 B)\varepsilon_t$$

Here $\varepsilon_t$ denotes white noise and **B** the backshift operator. As we can see from the above equations, there are two error terms: one from the regression model, $\eta_t$, and the other from the ARIMA model, $\varepsilon_t$.
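A minimal sketch of the same idea on simulated data; base R's `arima()` also accepts external regressors via `xreg` (here with AR(1) errors rather than ARIMA(1,1,1), to keep the example short):

```r
set.seed(1)
# Simulate: y depends on a regressor x, with autocorrelated (non-white-noise) errors
n   <- 200
x   <- rnorm(n)
eta <- arima.sim(list(ar = 0.7), n = n)  # AR(1) residual term eta_t
y   <- 5 + 2 * x + eta

# Regression with ARIMA(1,0,0) errors: eta_t is modeled instead of being ignored
fit <- arima(y, order = c(1, 0, 0), xreg = x)
coef(fit)  # ar1 should be near 0.7, the x coefficient near 2
```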

In the previous article, we created the dataset variable *df_xautry*. We will transform it into a multivariate time series and split it into test and training sets. Finally, we will model the training data.

```
library(dplyr)
library(forecast)
#Building the multivariate time series
df <- df_xautry[-1]
df_mts <- df %>% ts(start=c(2013,1),frequency=12)
#Split the dataset
train <- df_mts %>% window(end=c(2020,12))
test <- df_mts %>% window(start=2021)
#Modeling the training data
fit_dynamic <- auto.arima(train[,"xau_try_gram"], xreg =train[,c(1,2)])
#Series: train[, "xau_try_gram"]
#Regression with ARIMA(1,0,2) errors
#Coefficients:
# ar1 ma1 ma2 intercept xe xau_usd_ounce
# 0.9598 -0.0481 0.4003 -150.8309 43.8402 0.1195
#s.e. 0.0390 0.0992 0.1091 23.7781 2.1198 0.0092
#sigma^2 estimated as 27.15: log likelihood=-293.31
#AIC=600.62 AICc=601.89 BIC=618.57
```

Based on the above results, we have a regression model with ARIMA(1,0,2) errors, as described below:

$$y_t = -150.83 + 43.84\,xe_t + 0.12\,xau\_usd\_ounce_t + \eta_t$$

$$(1 - 0.9598B)\eta_t = (1 - 0.0481B + 0.4003B^2)\varepsilon_t$$

Now we will make forecasts and then calculate accuracy. The accuracy for xgboost will be calculated from the *forecast_xautrygram* variable.

```
#Forecasting
fcast_dynamic <- forecast(fit_dynamic, xreg = test[,1:2])
#Accuracy
acc_dynamic <- fcast_dynamic %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
acc_xgboost <- forecast_xautrygram %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
```

In order to visualize the accuracy results, we will build a data frame and reshape it into a form suitable for a bar chart.

```
#Tidying the dataframe
df_comparison <- data.frame(
"dynamic"=acc_dynamic,
"xgboost"=acc_xgboost
)
df_comparison
# dynamic.RMSE dynamic.MAPE xgboost.RMSE xgboost.MAPE
#Training set 5.044961 2.251683 0.001594868 0.000805107
#Test set 10.695489 2.501123 11.038134819 2.060825426
library(tidyr)
library(tibble)
df_comparison %>%
rownames_to_column(var = "data") %>%
gather(`dynamic.RMSE`, `dynamic.MAPE`,`xgboost.RMSE`, `xgboost.MAPE`,
key = "models", value="score") -> acc_comparison
acc_comparison
# data models score
#1 Training set dynamic.RMSE 5.044960948
#2 Test set dynamic.RMSE 10.695489161
#3 Training set dynamic.MAPE 2.251682989
#4 Test set dynamic.MAPE 2.501122965
#5 Training set xgboost.RMSE 0.001594868
#6 Test set xgboost.RMSE 11.038134819
#7 Training set xgboost.MAPE 0.000805107
#8 Test set xgboost.MAPE 2.060825426
```

```
#Plotting comparing models
ggplot(acc_comparison,aes(x=data,y=score,fill = models)) +
geom_bar(stat = "identity",position = "dodge") +
theme_bw()
```

**Conclusion**

When we examine the above results and the bar chart, we see some interesting things for the unseen data. On the training set, the xgboost model has near-zero error rates, which suggests overfitting. The dynamic model looks slightly better on RMSE, but vice versa for the MAPE criterion.

This suggests that while the dynamic model is better for the first month, the xgboost forecasts look closer to the actual observations for the second month. Of course, the reason may be that we have very little test data, but I wanted to predict the first two months of the current year. Next time, if we have more test data, we can try again.

**References**

- Forecasting: Principles and Practice, *Rob J Hyndman and George Athanasopoulos*