Of course, this is not a good development for the world, and it made me wonder which motives drive countries' military spending. I will take the G20 countries' annual military expenditure as the target variable, and as the predictors:

All the data we will use is from Our World in Data. We are going to fit a linear model via simulation, in which the Markov chain Monte Carlo (MCMC) method generates samples from the Bayesian posterior distributions. The program that does this is called JAGS, and to drive it from R we will use the *rjags* package; note that the JAGS library itself has to be installed separately.

First, we create the data frame that we are going to use in the analysis and modeling.

```
library(tidyverse)
library(readxl)
library(sysfonts)
library(showtext)
library(rjags)
library(coda)

df <- read_excel("military_expenditure.xlsx")

#Creating the region variable for comparison analysis and
#the region_int variable for JAGS modeling
df <-
  df %>%
  mutate(region = if_else(country %in% c("Australia", "China", "India",
                                         "Indonesia", "Japan", "Russia",
                                         "Saudi Arabia", "South Africa",
                                         "South Korea", "Turkey"),
                          "East", "West") %>% as_factor(),
         region_int = region %>% as.integer()) %>%
  na.omit() %>%
  as.data.frame()
```

For comparison, we will examine the expenditure of the G20 countries from the end of the Cold War to the present, colorizing each region separately.

```
#load fonts (google)
font_add_google("Roboto Slab")
showtext_auto()

#Comparison plot
df %>%
  filter(year > 1991) %>%
  mutate(country = fct_reorder(country, expenditure)) %>%
  ggplot(aes(expenditure, country, fill = region, color = region)) +
  geom_boxplot(position = position_dodge(), alpha = 0.5) +
  scale_x_log10(labels = scales::label_number_si()) +
  scale_fill_brewer(palette = "Pastel1") +
  scale_color_brewer(palette = "Pastel1") +
  labs(y = NULL, color = NULL, fill = NULL,
       x = "Military Expenditures per Country ($)") +
  theme_minimal() +
  theme(legend.position = "right",
        panel.background = element_rect(fill = "#e5f5f9", color = NA),
        axis.text = element_text(family = "Roboto Slab", size = 10),
        axis.title = element_text(family = "Roboto Slab"))
```

Interestingly, Germany doesn't appear much different from Russia.

First, we will build the model with the region variable we created before, so we can see its effect on military expenditure. Because the *jags.model* function cannot employ non-numeric data as predictors, we have to use the integer version of the region variable (**region_int**) we created above.

```
#modeling with region
set.seed(123)
model_with <-
  lm(expenditure ~ civlib + gdp + region, data = df)

model_with
#Call:
#lm(formula = expenditure ~ civlib + gdp + region, data = df)
#Coefficients:
#(Intercept)       civlib          gdp   regionEast
# -3.052e+10    7.380e+10    5.367e-02   -4.726e+10
```

As the linear model above shows, the **region** factor enters the regression as a dummy variable (**regionEast**); we will use the estimated coefficients as prior guesses for the JAGS model.

```
#the dataset for the jags model
lt_with <- list(x = df[, c(3, 4, 7)], y = df[, 5],
                n = nrow(df))

#estimated coefficients as prior guesses
lt_with$b_guess <- model_with$coefficients

#Model string
jags.script_with <-
  "#--------------------------------------------------
  model{
    # likelihood
    for(i in 1:n) {
      y[i] ~ dnorm(mu[i], tau)
      mu[i] <- intercept + civlib*x[i,1] + gdp*x[i,2] + regionEast*x[i,3]
    }
    # priors
    intercept ~ dnorm(b_guess[1], 0.1)
    civlib ~ dnorm(b_guess[2], 0.1)
    gdp ~ dnorm(b_guess[3], 0.1)
    regionEast ~ dnorm(b_guess[4], 0.1)
    tau ~ dgamma(.01, .01)
    # transform
    sigma <- 1 / sqrt(tau)
  }#--------------------------------------------------
  "

#compiling
mod_with <- jags.model(textConnection(jags.script_with),
                       data = lt_with,
                       n.chains = 4, n.adapt = 2000)

#burn-in process
update(mod_with, 1000)

#posterior sampling
mod.samples_with <- coda.samples(
  model = mod_with,
  variable.names = c("intercept", "civlib", "gdp", "regionEast", "sigma"),
  n.iter = 1000)

#descriptive statistics of the posteriors
sm_with <- summary(mod.samples_with)
sm_with[1]
#$statistics
#                    Mean           SD     Naive SE Time-series SE
#civlib      7.379818e+10 3.127715e+00 4.945351e-02   4.885913e-02
#gdp         7.665518e-02 2.448805e-03 3.871901e-05   3.598913e-05
#intercept  -3.051892e+10 3.197261e+00 5.055314e-02   5.056658e-02
#regionEast -4.726244e+10 3.099717e+00 4.901083e-02   4.900839e-02
#sigma       1.182377e+11 2.493767e+09 3.942992e+07   3.798331e+07
```

As you can see above, the GDP variable has practically no effect, while the eastern region (**regionEast**) is a powerful predictor. This time we will exclude the region variable and compare the two models.

```
#without region
model_without <-
  lm(expenditure ~ civlib + gdp, data = df)

#the dataset for the jags model
lt_without <- list(x = df[, c(3, 4)], y = df[, 5],
                   n = nrow(df))

#estimated coefficients as prior guesses
lt_without$b_guess <- model_without$coefficients

#Model string
jags.script_without <-
  "#--------------------------------------------------
  model{
    # likelihood
    for(i in 1:n) {
      y[i] ~ dnorm(mu[i], tau)
      mu[i] <- intercept + civlib*x[i,1] + gdp*x[i,2]
    }
    # priors
    intercept ~ dnorm(b_guess[1], 0.1)
    civlib ~ dnorm(b_guess[2], 0.1)
    gdp ~ dnorm(b_guess[3], 0.1)
    tau ~ dgamma(.01, .01)
    # transform
    sigma <- 1 / sqrt(tau)
  }#--------------------------------------------------
  "

#compiling
mod_without <- jags.model(textConnection(jags.script_without),
                          data = lt_without,
                          n.chains = 4, n.adapt = 2000)

#burn-in process
update(mod_without, 1000)

#posterior sampling
mod.samples_without <- coda.samples(
  model = mod_without,
  variable.names = c("intercept", "civlib", "gdp", "sigma"),
  n.iter = 1000)

#descriptive statistics of the posteriors
sm_without <- summary(mod.samples_without)

#The plot of the effects of the terms on the change of expenditure
bind_rows(
  sm_with$statistics %>%
    as_tibble(rownames = "terms") %>%
    rename("mean" = Mean,
           "std.error" = `Time-series SE`) %>%
    select(c(1, 2, 5)) %>%
    mutate(regions = "include"),
  sm_without$statistics %>%
    as_tibble(rownames = "terms") %>%
    rename("mean" = Mean,
           "std.error" = `Time-series SE`) %>%
    select(c(1, 2, 5)) %>%
    mutate(regions = "not include")
) %>%
  filter(terms != "intercept", terms != "sigma") %>%
  ggplot(aes(mean, terms, color = regions)) +
  geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
  geom_errorbar(size = 2, alpha = 0.7,
                aes(xmin = mean - 1.96 * std.error,
                    xmax = mean + 1.96 * std.error)) +
  geom_point(size = 3) +
  scale_x_continuous(labels = scales::dollar,
                     expand = expansion(mult = c(.1, .1))) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "", y = "", color = "Region in Model?") +
  theme_minimal() +
  theme(legend.position = "right",
        panel.background = element_rect(fill = "#e5f5f9", color = NA))
```

According to the plot above, being in the east decreases military expenditure, while the effect of civil liberties increases the military budget, which was unexpected for me. This anomaly can be explained by the United States' extreme values, which we saw in the first plot. Also, the credible intervals appear very narrow because of the large number of MCMC iterations in the simulation.


Because of the nuclear danger we've just mentioned, I have wondered about the number of nuclear weapons countries hold nowadays, and especially how that number has changed since the end of the Cold War.

The dataset we are going to work on is from Our World in Data. For ease of writing, we will change the names of some of the countries.

```
library(tidyverse)
library(fable)
library(sysfonts)
library(showtext)
library(plotly)
library(readxl)
library(rlang)
library(glue)

df <-
  read_excel("nuclear.xlsx") %>%
  mutate(country = case_when(
    country == "United Kingdom" ~ "UK",
    country == "United States" ~ "USA",
    country == "North Korea" ~ "PRK",
    TRUE ~ country))
```

We will subset the dataset to the years we want to examine: 1991, the end of the Cold War, and the present day.

```
#Tidying the dataset for the plot
df_tidy <-
  df %>%
  filter(year == 1991 | year == 2022) %>%
  pivot_wider(names_from = year, values_from = stockpile) %>%
  mutate(
    color = case_when(
      `1991` - `2022` > 0 ~ "darkgrey",
      `1991` - `2022` < 0 ~ "red"
    )) %>%
  na.omit() %>%
  pivot_longer(c(`1991`, `2022`),
               names_to = "year",
               values_to = "stockpile") %>%
  mutate(stockpile_pct = percent_rank(stockpile))
```

Now we can draw the plot showing how each country's number of nuclear warheads changed between the two years of interest. We will make it dynamic with *ggplotly*, so that hovering the cursor over the plot and the country names reveals the corresponding numbers. Countries on the rise are drawn in red and those in decline in dark grey.

```
#Comparing the change in the numbers of nuclear warheads
#load fonts (google)
font_add_google("Roboto Slab")
showtext_auto()

p <-
  df_tidy %>%
  ggplot(aes(x = year,
             y = stockpile_pct,
             group = country,
             color = color,
             fill = color,
             text = stockpile)) +
  geom_line(size = 1.5) +
  geom_point(size = 5, stroke = 1) +
  geom_text(
    data = filter(df_tidy, year == "2022"),
    aes(label = country),
    hjust = c(0.1),
    nudge_x = c(0.18),
    family = "Roboto Slab",
    size = 6
  ) +
  coord_cartesian(clip = "off") +
  scale_y_continuous(limits = c(0, 1),
                     breaks = scales::pretty_breaks(n = 5)) +
  scale_color_identity() +
  scale_fill_identity() +
  scale_size_identity() +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(family = "Roboto Slab", size = 20),
    axis.title = element_blank(),
    panel.background = element_rect(fill = "lightblue", color = NA)
  )

#setting the font family for ggplotly
font <- list(family = "Roboto Slab")
ggplotly(p, tooltip = "text") %>%
  hide_legend() %>%
  layout(font = font)
```

It's interesting that Pakistan and India have entered the arms race since 1991, starting from zero. We can also clearly see that while Russia and the USA have decreased their numbers of warheads, China has increased its stockpile.

Another important issue is how the stockpiles of each nuclear power will change over the next decade. Hence, we will write a function that forecasts the next 10 years and shows the results on a plot for each country.

The algorithm we will use is neural network autoregression (NNAR), and we prefer the *NNETAR* function from the fable package to implement it. We use the *ggplotly* function again, as before.

```
#The function that predicts 10 years ahead for each country
fun_nuc <-
  function(...){
    #A function that automatically quotes the arguments for easy writing.
    #(source: https://adv-r.hadley.nz/quasiquotation.html#quasi-motivation)
    cement <- function(...) {
      args <- ensyms(...)
      paste(purrr::map(args, as_string), collapse = " ")
    }
    var_loc <- cement(...) #country value in quotation

    #Modeling
    set.seed(1234)
    fit <-
      df %>%
      filter(country == var_loc) %>%
      as_tsibble(index = "year") %>%
      select(year, stockpile) %>%
      model(NNETAR(stockpile ~ year))

    #Forecasting with NNAR
    pred <-
      fit %>%
      forecast(h = 10)

    #Plot variable
    p_pred <-
      df[df$country == var_loc, ] %>%
      ggplot(aes(x = year, y = stockpile)) +
      geom_line(size = 1) +
      geom_point(data = pred,
                 mapping = aes(y = round(.mean)),
                 color = df_tidy[["color"]][df_tidy$country == var_loc][1]) +
      scale_x_continuous(breaks = scales::pretty_breaks(n = 7)) +
      scale_y_continuous(breaks = scales::pretty_breaks()) +
      xlab("") +
      ylab("") +
      labs(title = var_loc) +
      theme_minimal() +
      theme(panel.grid.minor = element_blank(),
            plot.title = element_text(size = 15,
                                      colour = df_tidy[["color"]][df_tidy$country == var_loc][1],
                                      hjust = 0.5))

    #Interactive plotting
    ggplotly(p_pred) %>%
      style(
        text = glue("stockpile: {round(pred$.mean)}\nyear: {pred$year}"),
        traces = 2 #the predicted time series
      ) %>%
      layout(font = font)
  }
```

```
fun_nuc(China)
```

```
fun_nuc(USA)
```

```
fun_nuc(Russia)
```

Hovering over the lines and points shows the corresponding year and stockpile numbers, thanks to the interactive plotting. It looks like the nuclear danger will shift to the Asia-Pacific once the crisis in Eastern Europe is over; if we survive.

One of the first ensemble algorithms we're going to use is **bagging** (bootstrap aggregating). Bagging produces many training datasets by bootstrapping the original training data. These datasets are used to train a set of models with a single learning algorithm, usually decision trees (the building blocks of random forests), because they are **unstable learners**: small changes in the input cause significantly different predictions.
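As a toy illustration of the mechanism (not part of the analysis below), the bagging idea can be sketched in base R on made-up data: resample the rows with replacement, fit one learner per resample, and aggregate the predictions by averaging. The `bagged_pred` helper and the toy data are hypothetical.

```
#A minimal base-R sketch of bagging on made-up data
set.seed(1)
toy <- data.frame(x = runif(100))
toy$y <- 3 * toy$x + rnorm(100, sd = 0.5)

bagged_pred <- function(data, newdata, B = 25) {
  preds <- replicate(B, {
    idx <- sample(nrow(data), replace = TRUE) #bootstrap resample
    fit <- lm(y ~ x, data = data[idx, ])      #one learner per resample
    predict(fit, newdata)
  })
  rowMeans(preds) #aggregate by averaging (majority vote for classification)
}

bagged_pred(toy, data.frame(x = c(0.2, 0.8)))
```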

The other ensemble algorithm we will use is **boosting**. Unlike bagging, this method improves performance by adding models that complement the ones already in the ensemble, forming complementary learning algorithms. Besides that, boosting weights the learners by their past performance, so a model that performs better has greater influence over the ensemble's vote.
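The performance-based weighting can be sketched with the classic AdaBoost vote weight (an illustration of the principle; the adabag package implements a variant of this scheme): a learner's vote weight alpha grows as its error rate shrinks.

```
#AdaBoost-style vote weight: alpha = 0.5 * log((1 - err) / err)
alpha <- function(err) 0.5 * log((1 - err) / err)
alpha(0.10) #accurate learner -> large positive weight
alpha(0.45) #near-random learner -> weight close to zero
#misclassified observations are then up-weighted before the next learner is fit:
#w_new = w_old * exp(alpha) on the misclassified cases
```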

After these explanations of the algorithms, we can build the dataset for our models. It consists of gasoline prices in Turkish lira, Brent spot prices in US dollars, and the USD/TRY exchange rate between 2019 and 2022.

We will do some wrangling to adjust the dataset for our purpose. We create a three-level factor for the changes in gasoline prices, which will be our target variable. We will also add timestamps to construct regular time series data; to do that, we can use the *tsibble* package.

```
library(tidyverse)
library(tsibble)
library(readxl)

df <- read_excel("gasoline.xlsx")

#Building the dataset
df_tidy <-
  df %>%
  mutate(
    gasoline = case_when(
      gasoline - lag(gasoline) < 0 ~ "down",
      gasoline - lag(gasoline) > 0 ~ "up",
      TRUE ~ "steady"
    ) %>% as.factor(),
    xe_lag = lag(xe),
    brent_lag = lag(brent),
    date = as.Date(date)
  ) %>%
  as_tsibble() %>%
  fill_gaps() %>% #makes a regular time series by filling the time gaps
  #fills in NAs with the previous values
  fill(-date, .direction = "down") %>%
  na.omit() %>%
  as.data.frame()
```

Before modeling, we will look at the ratio of the factor levels in the target variable, which gives us the no-information rate (NIR) needed to judge whether accuracy results are significant. We will draw it with the *treemap* function.

```
#Treemap of the factor levels
library(treemap)

df_tidy %>%
  count(gasoline) %>%
  mutate(label = paste(gasoline, scales::percent(n / sum(n)), sep = "\n")) %>%
  treemap(
    index = "label",
    vSize = "n",
    type = "index",
    title = "",
    palette = "Accent",
    border.col = c("black"),
    border.lwds = 1,
    fontcolor.labels = "white",
    fontface.labels = 1,
    inflate.labels = F
  )
```

We can see from the diagram above that any model we create must have an accuracy rate above 65.7% to be significant. The **Kappa** statistic is an effective accuracy measure relative to this no-information rate (65.7%).

- kappa = (Pₒ − Pₑ) / (1 − Pₑ)
- Pₒ is the **sum** of the true positive and true negative rates (the actual, observed agreement).
- Pₑ is the expected agreement: the probability of agreeing by chance, given the marginal class frequencies.
- We will use **unweighted kappa** as the measurement value, because the factor levels of the target variable are nominal; there are no **varying degrees of agreement** between them.
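To make the definition concrete, unweighted kappa can be computed by hand in base R from a confusion matrix; the counts below are made up for illustration.

```
#rows = predicted, columns = actual; hypothetical counts
cm <- matrix(c(60,  5, 3,
                4, 15, 2,
                2,  1, 8), nrow = 3, byrow = TRUE)
n <- sum(cm)
p_o <- sum(diag(cm)) / n                     #observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  #expected (chance) agreement
kappa <- (p_o - p_e) / (1 - p_e)
round(kappa, 3)
# 0.655
```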

Now, we can build our models and compare the kappa results for different seed values on a *plotly* plot.

```
library(adabag)#Boosting
library(ipred)#Bagging
library(caret)#Bagging control object
library(vcd)#Kappa
library(plotly)#interactive plot
#Bagging
ctrl <- trainControl(method = "cv", number = 10)
kappa_bagg <-
lapply(
1:20,
function(x){
set.seed(x)
train(gasoline ~ .,
data = df_tidy,
method = "treebag",
trControl = ctrl)$results[["Kappa"]]}
) %>%
unlist()
#Boosting
kappa_boost <-
lapply(
1:20,
function(x){
set.seed(x)
boosting.cv(gasoline ~ ., data = df_tidy) %>%
.$confusion %>%
Kappa() %>%
.$Unweighted %>%
.[[1]]}
) %>%
unlist()
#Kappa simulation on a plotly chart
kappa_plot <-
data.frame(
seed=rep(1:20, 2),
kappa= c(kappa_bagg, kappa_boost),
ensembles=c(rep("Bagging", 20), rep("Boosting", 20))
) %>%
ggplot(aes(x=seed, y=kappa, color=ensembles))+
geom_point(size=3)+
geom_line(size=1)+
theme_minimal()+
theme(panel.background = element_rect(fill = "midnightblue", color = NA),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
ggplotly(kappa_plot) %>%
layout(legend=list(orientation="v", y=0.5))
```

As can be seen above, the boosting algorithm seems to be slightly better than bagging, but in general both algorithms achieve very high accuracy.

**References**

- Lantz, Brett (2015). Evaluating Model Performance. *Machine Learning with R*. 323-326
- Lantz, Brett (2015). Improving Model Performance. *Machine Learning with R*. 362-369

Market basket analysis works with **items**, shown in brackets, which together form an **itemset** for each **transaction** (e.g. ). An association rule relates itemsets as shown below:

{butter, jam} → {bread}

This association rule indicates that if butter and jam are purchased together, then bread is likely to be purchased too. Since association rule mining is unsupervised, the data does not need to be labeled. The rules are created from subsets of itemsets: the left-hand side (**LHS**) denotes the condition that needs to be met, and the right-hand side (**RHS**) indicates the expected outcome of meeting that condition.

To mine association rules among massive transactional data, we will use the **apriori algorithm**. This algorithm works on the principle that an itemset can only be frequent if all of its component items are frequent separately; hence, any itemset containing a rare item is excluded from the search.

The apriori algorithm uses threshold measures called **support** and **confidence** to reduce the number of rules. The **support** of an itemset or rule measures its frequency in the data.

- support(X) = count(X) / N
- **N** is the number of transactions.
- count(X) is the number of transactions which contain that itemset.

The **confidence** measures the power of a rule; the closer to 1, the stronger the relationship. It is the support of the itemset containing both X and Y divided by the support of the itemset containing only X, as shown below:

- confidence(X → Y) = support(X, Y) / support(X)

The apriori algorithm determines all potential itemsets using the minimum support threshold and builds rules that meet the minimum confidence level. Besides that, we will use a different metric to rank the importance of the association rules, one that indicates true connections between items.

The metric we've just mentioned is the **lift**, which measures how much more likely the items of a rule are to be found together than we would expect if they occurred independently:

- lift(X → Y) = confidence(X → Y) / support(Y)
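These three measures can be computed by hand on a tiny basket to make the definitions concrete; the five transactions below are hypothetical.

```
#A made-up five-transaction basket
transactions <- list(
  c("bread", "butter", "jam"),
  c("bread", "butter", "jam"),
  c("butter", "jam"),
  c("milk"),
  c("eggs", "milk")
)
#support = share of transactions containing all the given items
supp <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}
#Rule {butter, jam} => {bread}
conf <- supp(c("butter", "jam", "bread")) / supp(c("butter", "jam")) # 2/3
lift <- conf / supp("bread")                                         # (2/3)/0.4 ~ 1.67
```

A lift above 1, as here, means butter and jam buyers pick up bread more often than shoppers in general.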

To see all the concepts mentioned so far in action, we will examine the grocery data from the arules package. Let's take a look at the first three transactions of the market basket data. The data is stored in **sparse matrix** format, as seen below; this format stores only the items that are present and drops the missing cells to save memory.

```
library(arules)
library(tidyverse)
library(scales)
library(glue)
library(ggiraph)
library(ggiraphExtra)

data("Groceries")
Groceries
#transactions in sparse format with
# 9835 transactions (rows) and
# 169 items (columns)

Groceries %>%
  .[1:3] %>%
  inspect()
#    items
#[1] {citrus fruit,
#     semi-finished bread,
#     margarine,
#     ready soups}
#[2] {tropical fruit,
#     yogurt,
#     coffee}
#[3] {whole milk}
```

Before mining the rules, we will look at the support (frequency) rates of the items. We will keep only those with support of at least 0.1 for a readable chart.

```
#frequency/support plot
Groceries %>%
  itemFrequency() %>%
  as_tibble(rownames = "items") %>%
  rename("support" = "value") %>%
  filter(support >= 0.1) %>%
  arrange(-support) %>%
  ggDonut(aes(donuts = items, count = support),
          explode = c(2, 4, 6, 8),
          labelposition = 0)
```

**Whole milk** appears to be the most frequent item across all transactions. Let's see if this superiority is reflected in the association rules.

```
#Mining the rules
groceryrules <-
  Groceries %>%
  apriori(parameter = list(
    support = 0.006,
    minlen = 2,
    confidence = 0.25))

#Rules plot
gg_rules <-
  groceryrules %>%
  arules::sort(by = "lift") %>%
  DATAFRAME() %>%
  as_tibble() %>%
  head(10) %>%
  mutate(ruleName = paste(LHS, "=>", RHS) %>% fct_reorder(lift),
         support = support %>% percent(),
         lift = lift %>% round(2)) %>%
  select(ruleName, support, lift) %>%
  ggplot(aes(x = ruleName, y = lift)) +
  geom_segment(aes(xend = ruleName, yend = 0),
               color = "orange",
               size = 1) +
  geom_point_interactive(aes(tooltip = glue("Support: {support}\nLift: {lift}"),
                             data_id = support),
                         size = 3,
                         color = "lightblue") +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.background = element_rect(fill = "#e0ddd7", color = NA),
    plot.background = element_rect(fill = "#f5cecb", color = NA)
  ) +
  xlab("") +
  ylab("")

girafe(ggobj = gg_rules)
```

In the code chunk above, we employed the apriori function to find the association rules and sorted them by the top 10 lift values. When we hover over the blue points, a pop-up (tooltip) displays the lift and support values; this interactivity is provided by the ggiraph package.

The top rule indicates that those who buy herbs are about four times more likely to also buy root vegetables than the baseline rate of buying root vegetables alone. Note that the support value in the tooltip is the frequency of the whole rule, not of the individual itemsets.

**References**

- Lantz, Brett (2015). *Machine Learning with R*. Packt Publishing

```
library(tidyverse)
library(tidymodels)
library(tsibble)
library(C50)
library(caret)
library(pROC)
data <- readr::read_csv("https://raw.githubusercontent.com/mesdi/trees/main/credit.csv")
```

The first algorithm we will talk about is the **C5.0 decision tree**. This method uses entropy to select the optimal features to split on. High entropy means there is little or no **homogeneity** within the set of class values; the algorithm aims to find splits that reduce the entropy. (**Lantz, 2015: 133**)

- Entropy(S) = −Σᵢ pᵢ · log₂(pᵢ), summed over the class levels i = 1, …, c
- c refers to the number of class levels
- pᵢ refers to the proportion of values falling into class level i

The algorithm calculates the difference between the entropy of the segment before the split (S₁) and the combined entropy of the partitions after the split (S₂) on a feature, F, to determine the change in homogeneity. This is called **information gain**: InfoGain(F) = Entropy(S₁) − Entropy(S₂). The higher the information gain, the better the homogeneity of the groups after the split. (**Lantz, 2015: 134**)
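Both quantities are easy to verify in base R; the class proportions below are hypothetical.

```
#Shannon entropy of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0] #0 * log2(0) is taken as 0
  -sum(p * log2(p))
}
#before the split: 60% "no", 40% "yes"
e_before <- entropy(c(0.6, 0.4))
#after splitting on a feature F into two equally sized segments
e_after <- 0.5 * entropy(c(0.9, 0.1)) + 0.5 * entropy(c(0.3, 0.7))
info_gain <- e_before - e_after
round(info_gain, 2)
# 0.3
```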

Now after some brief information, we can begin the fitting process.

```
#Building the tibble dataset
df <-
  data %>%
  mutate(across(where(is.character), as_factor)) %>%
  mutate(default = default %>% factor(labels = c("no", "yes")))

#Creating the train and test sets
set.seed(36894)
df_split <- initial_split(df, prop = 9/10, strata = "default")
df_train <- training(df_split)
df_test <- testing(df_split)
```

```
#C5.0 model
model_c50 <- C5.0(df_train[-17], df_train$default, trials = 10)
pred_c50 <- predict(model_c50, df_test)
confusionMatrix(pred_c50, df_test$default)
#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  65  12
#       yes  5  18
#               Accuracy : 0.83
#                 95% CI : (0.7418, 0.8977)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.002163
```

The no-information rate (**NIR**) is the largest proportion of the class levels in the response variable. Looking at the results above, this rate is 70%, and our accuracy is higher than the NIR at the 5% significance level (**P-Value: 0.002163 < 0.05**). Nevertheless, the type 2 error rate is quite annoying (**12/30** of the actual defaulters are missed).
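To my knowledge, caret derives that p-value from a one-sided exact binomial test of the observed accuracy against the NIR, which we can reproduce in base R:

```
#83 correct predictions out of 100 test cases, tested against NIR = 0.7
binom.test(x = 83, n = 100, p = 0.7, alternative = "greater")$p.value
```

The result should match the `P-Value [Acc > NIR]` reported above.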

The second tree algorithm we are going to use is *decision_tree()* from the parsnip package.

```
#Decision tree model
model_dt <-
  decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart") %>%
  fit(default ~ ., data = df_train)

pred_dt <- predict(model_dt, df_test)
confusionMatrix(pred_dt$.pred_class, df_test$default)
#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  65  17
#       yes  5  13
#               Accuracy : 0.78
#                 95% CI : (0.6861, 0.8567)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.04787
```

It looks like both the accuracy and the type 2 error rate are worse than with the C5.0 function. The final algorithm is boosted trees.

```
#Boosted trees model
model_bt <-
  boost_tree(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("C5.0") %>%
  fit(default ~ ., data = df_train)

pred_bt <- predict(model_bt, df_test)
confusionMatrix(pred_bt$.pred_class, df_test$default)
#Confusion Matrix and Statistics
#          Reference
#Prediction no yes
#       no  63  14
#       yes  7  16
#               Accuracy : 0.79
#                 95% CI : (0.6971, 0.8651)
#    No Information Rate : 0.7
#    P-Value [Acc > NIR] : 0.02883
```

We see a slight improvement, but not to the level of the first model.

Finally, we will compare the ROC curves with the help of the *ggroc* function from the pROC package.

```
#Multiple ROC curves
roc_df <- tibble(
  c50 = pred_c50 %>% ordered(),
  dt = pred_dt$.pred_class %>% ordered(),
  bt = pred_bt$.pred_class %>% ordered(),
  response = df_test$default
)

roc.list <- roc(response ~ dt + bt + c50, data = roc_df)

roc_values <-
  lapply(roc.list, "[[", "auc") %>%
  as.numeric() %>%
  round(2)

ggroc(roc.list) +
  geom_line(size = 2) +
  geom_abline(slope = 1,
              intercept = 1,
              linetype = "dashed",
              alpha = 0.5,
              size = 2,
              color = "grey") +
  scale_color_brewer(palette = "Set2",
                     labels = c(
                       paste0("DT: ", roc_values[1]),
                       paste0("BT: ", roc_values[2]),
                       paste0("C50: ", roc_values[3]))) +
  annotate("text",
           x = 0.5,
           y = 0.5,
           label = "no predictive value",
           angle = 45) +
  guides(color = guide_legend(title = "Models AUC Scores")) +
  theme_minimal() +
  coord_equal()
```

The ROC curves and AUC results above confirm the accuracy results we found earlier: except for the C5.0 model, the other two models seem to be inefficient.

We will model the gas prices in terms of the exchange rate (**xe**), but this time I want to add not only the current exchange rate but also lagged values of xe to the model, because a change in the xe rate might not affect prices immediately. Our model should be as below:

yₜ = β₀ + β₁xₜ + β₂xₜ₋₁ + … + βₖ₊₁xₜ₋ₖ + ηₜ, where ηₜ denotes the ARIMA errors and k is the lag level, which can be determined by AICc.

The dataset we are going to use covers gas prices and xe rates from 2019 to the present. In the code block below, we build a tsibble data frame object and separate it into train and test sets.

```
library(tidyverse)
library(tidymodels)
library(readxl)
library(tsibble)
library(fable)
library(feasts)

df <- read_excel("gasoline.xlsx")

#Building the tsibble dataset
df_tsbl <- df %>%
  mutate(date = as.Date(date)) %>%
  as_tsibble() %>%
  fill_gaps() %>% #makes a regular time series by filling the time gaps
  #fills in NAs with the previous values
  fill(xe, gasoline, .direction = "down")

#Separating into train and test sets without random selection:
#takes the first 75% for training and the rest for the test set
df_split <- initial_time_split(df_tsbl, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
```

First of all, we will check the stationarity of the dependent and independent variables of the model, as always.

```
#Checking stationarity
df_train %>%
  pivot_longer(gasoline:xe) %>%
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_grid(vars(name), scales = "free_y") +
  xlab("") + ylab("") +
  theme_minimal()
```

It seems we had better difference the gasoline and xe variables.

```
#Comparing the candidate lagged models
model_lags_diff <- df_train %>%
  #Excludes the first three days so all versions cover the same time period
  mutate(gasoline = c(NA, NA, NA, gasoline[4:788])) %>%
  model(
    lag0 = ARIMA(gasoline ~ pdq(d = 1) + xe),
    lag1 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe)),
    lag2 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe) +
                   lag(xe, 2)),
    lag3 = ARIMA(gasoline ~ pdq(d = 1) +
                   xe + lag(xe) +
                   lag(xe, 2) + lag(xe, 3))
  )

glance(model_lags_diff) %>%
  arrange(AICc) %>%
  select(.model:AICc)
# A tibble: 4 x 5
#  .model  sigma2 log_lik    AIC   AICc
#  <chr>    <dbl>   <dbl>  <dbl>  <dbl>
#1 lag1   0.00310   1152. -2294. -2294.
#2 lag2   0.00311   1152. -2292. -2292.
#3 lag3   0.00311   1152. -2290. -2290.
#4 lag0   0.00313   1148. -2288. -2288.
```

By the AICc results, it is better to select the one-level lagged model.

```
#Fitting the one-level lagged model chosen by the results above
model_lag1 <- df_train %>%
  model(ARIMA(gasoline ~ pdq(d = 1) + xe + lag(xe)))
```

```
#Checking that the model residuals are consistent with white noise
model_lag1 %>% gg_tsresiduals()
```

All the spikes in the ACF plot seem to be within the threshold lines, so the residuals can be considered white noise and we are comfortable with the chosen model.

```
report(model_lag1)
#Series: gasoline
#Model: LM w/ ARIMA(0,1,0)(2,0,0)[7] errors
#Coefficients:
#        sar1    sar2       xe  lag(xe)
#      0.0920  0.1065  -0.0436   0.1060
#s.e.  0.0356  0.0359   0.0361   0.0368
#sigma^2 estimated as 0.003103: log likelihood=1155.84
#AIC=-2301.69  AICc=-2301.61  BIC=-2278.35
```

The results above indicate that the model has weekly seasonal AR components (**sar1** and **sar2**). Now we will predict on the test set using the regression equation above to assess the accuracy of our model.

```
#Lagged Forecasts
model_pred <- forecast(model_lag1,df_test)
#Accuracy of the lagged model
cor(df_test$gasoline,model_pred$.mean) %>% round(2)
#[1] 0.77
```

The result seems fine, but we can try to improve it. To do that, we will use bootstrapping and bagging methods.

```
set.seed(456) #for reproducibility of the results
#Bootstrapped forecasts
sim_fc <- model_lag1 %>%
  fabletools::generate(new_data = df_test,
                       times = 10,
                       bootstrap_block_size = 14) %>%
  select(-.model)

#Drawing the bootstrapped forecasts
sim_fc %>%
  autoplot(.sim) +
  autolayer(df_train, gasoline) +
  guides(color = "none") +
  xlab("") + ylab("") +
  theme_minimal()
```

```
#Bagged forecasts
bagged_fc <- sim_fc %>%
  summarise(bagged_mean = mean(.sim))

#Accuracy of the bagged model
cor(df_test$gasoline, bagged_fc$bagged_mean) %>% round(2)
#[1] 0.85
```

Quite a significant improvement in the accuracy of our model; it seems the bootstrapping simulation works well in this process.

The dataset we have built for this article consists of the number of formal education **students** and the housing **capacity** provided by the Higher Education Loans and Dormitories Institution (**KYK**), which is a state organization.

We will determine whether there is a capacity shortage this year based on historical data. The model we are going to use is the **dynamic regression model with ARIMA errors**, because we will model the dormitories' **capacity** in terms of the number of **students** using the historical data between 1992 and 2020.

First of all, we will check the stationarity of the variables. If one of them is not stationary, all the variables have to be differenced; this is called a model in differences.

```
#Creating the tsibble variables and examining stationarity
library(tidyverse)
library(fable)
library(feasts)
library(tsibble)
library(readxl)

df <- read_excel("kyk.xlsx")

df_ts <- df %>%
  as_tsibble(index = date)

train <- df_ts %>% filter_index(. ~ "2020")
test <- df_ts %>% filter_index("2021" ~ .)

train %>%
  pivot_longer(c(capacity, students),
               names_to = "var", values_to = "value") %>%
  ggplot(aes(x = date, y = value)) +
  geom_line(size = 1.5, color = "blue") +
  facet_grid(vars(var), scales = "free_y") +
  scale_x_continuous(breaks = seq(1990, 2020, 5)) +
  scale_y_continuous(labels = scales::label_number_si()) +
  labs(title = "The Number of Capacity and Students", y = "", x = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold.italic", hjust = 0.5),
        panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

When we examine the plots above, it seems both variables need differencing. After fitting the model, we will check the residuals for stationarity by drawing the innovation residuals.

```
#Modeling
fit <- train %>%
  model(ARIMA(capacity ~ students + pdq(d = 1)))

#Checking the residuals for stationarity
fit %>%
  gg_tsresiduals()
```

In the ACF graph we can clearly see that the spikes are within the thresholds, which means the residuals are white noise. In the code block below, we confirm that the residuals have no remaining patterns with the Ljung-Box test, whose **p-value** is greater than 0.05 at the 5% significance level.

```
#Ljung-Box independence test for white noise
augment(fit) %>%
features(.innov, ljung_box)
#  .model                                  lb_stat lb_pvalue
#  <chr>                                     <dbl>     <dbl>
#1 ARIMA(capacity ~ students + pdq(d = 1))    1.32     0.251
```
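
For reference, the same kind of check can be done in base R with `Box.test()` (a sketch on simulated data, not the fitted residuals):

```r
set.seed(1)
e <- rnorm(100)  # white noise by construction
bt <- Box.test(e, lag = 10, type = "Ljung-Box")
bt$p.value       # for true white noise this is typically > 0.05
```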

We run the command below to examine the components of the model.

```
report(fit)
#Series: capacity
#Model: LM w/ ARIMA(0,1,1) errors
#Coefficients:
# ma1 students intercept
# 0.8672 -0.022 22964.621
#s.e. 0.1177 0.006 8369.892
```

The differenced model can be written as capacity'_t = -0.022 students'_t + η'_t, where η'_t = ε_t + 0.8672 ε_{t-1} is an **MA(1)** error. The interesting thing is that the **students** component has a negative impact on **capacity**.
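
To make the MA(1) error structure concrete, here is a base-R sketch that builds η_t = ε_t + 0.8672·ε_{t-1} from a white-noise series (0.8672 is the ma1 estimate above; the ε values are made up):

```r
eps <- c(1, 2, -1, 0.5)  # hypothetical white-noise shocks
# one-sided convolution filter: eta_t = eps_t + 0.8672 * eps_{t-1}
eta <- stats::filter(eps, c(1, 0.8672), method = "convolution", sides = 1)
eta  # NA (no eps_0), then 2.8672, 0.7344, -0.3672
```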

Before finding the expected value from the model, we will take a look at the relationship between capacity and the number of students over time.

```
#Comparing students and capacity
ggplot(df_ts,aes(x=date))+
geom_bar(aes(y=students),stat = "identity",fill="skyblue")+
geom_line(aes(y=students),size=1.5,color="orange")+
geom_line(aes(y=(capacity/students)*10^7),size=1.5,color="red")+
scale_x_continuous(breaks = seq(1991,2021,5))+
scale_y_continuous(labels = scales::label_number_si(),
sec.axis = sec_axis(~./10^7,
labels = scales::percent_format(accuracy = 1),
name = "The rate of capacity per student"))+
xlab("")+ylab("The number of students")+
theme_minimal()+
theme(panel.background = element_rect(fill = "#f0f0f0", color = NA),
#Main y-axis
axis.title.y = element_text(color = "skyblue",
#setting positions of axis title
margin = margin(r=0.5,unit = "cm"),
hjust = 1),
#Second y-axis
axis.title.y.right = element_text(color = "red",
#setting positions of second axis title
margin = margin(l=0.5,unit = "cm"),
hjust = 0.01))
```

Before analyzing the above graph, we have to keep in mind that, according to common opinion, approximately half of higher-education students study outside their hometown. The orange line shows the trend of the number of students.

It is understood that while the number of students increased, the capacity per student decreased until 2014. Although the capacity rate has increased since then, it is clear that the reason is the decrease in the number of students. Even so, the rate remains well below 50%.

Now that we’ve found our model, we can predict this year’s capacity value and see whether the government falls short of this expected value.

```
#Forecasting the current year
capacity_fc<- forecast(fit, new_data = test)
#Setting legends for separate columns for the plot
colors <- c("Actual"="red","Predicted"="blue")
#Plotting predicted and the actual value
ggplot(train,mapping=aes(x=date,y=capacity))+
geom_line(size=1.5)+
geom_point(capacity_fc,mapping=aes(x=date,y=.mean,color="Predicted"))+
geom_point(test,mapping=aes(x=date,y=capacity,color="Actual"))+
scale_x_continuous(breaks = seq(1991,2021,5))+
scale_y_continuous(breaks = seq(200000,750000,100000),
labels = scales::label_number_si())+
scale_color_manual(values = colors,name="Points")+
theme_minimal() +
theme(axis.title.y=element_text(vjust = 2),
panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

According to the above plot, the capacity provided by the government falls short of the value expected by the model. Students seem right to complain about dormitory capacity, considering that half of them study outside their hometown.

In order to answer this question, we have to choose some variables that monitor economic conditions, to find a proper election date from the government’s point of view. The variables we are going to use are chain-weighted GDP, which removes the inflation effect from GDP, and the CPI.

We will predict the CPI and GDP for the next 8 quarters with the ARIMA method from the fable package. The dataset contains quarterly values of the related variables between 2010-Q1 and 2021-Q2.

```
library(tidyverse)
library(readxl)
library(fable)
library(tsibble)
library(feasts)
#Building and tidying dataset
df <- read_excel("election.xlsx")
options(digits = 11)#for printing the decimal digits of the gdp variables
df %>%
mutate(cpi=round(as.numeric(cpi),2)) %>%
na.omit() -> df
#Creating tsibble variables for CPI and GDP variables
cpi_ts <- df %>%
select(cpi) %>%
ts(start = 2010,frequency = 4) %>%
as_tsibble()
gdp_ts <- df %>%
select(gdp) %>%
ts(start = 2010, frequency = 4) %>%
as_tsibble()
```

```
#Analyzing the GDP data
gdp_ts %>%
autoplot(color="blue",size=1) +
scale_y_continuous(labels = scales::label_number_si()) + #for SI prefix
xlab("") + ylab("") +
theme_minimal()
```

When we examine the above plot, there is no significant indication of changing variance, so we won’t take a logarithmic transformation of the data. Besides that, strong seasonality is clearly seen, so we will first take a seasonal difference of the data. Although the mean of the data also seems non-stationary, when there is strong seasonality we take the seasonal difference first, because seasonal differencing may stabilize the mean as well.
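
A seasonal difference at lag 4 simply subtracts the value from the same quarter a year earlier; a minimal base-R sketch on made-up quarterly values:

```r
x <- c(100, 80, 120, 90, 110, 88, 133, 99)  # two hypothetical years of quarters
diff(x, lag = 4)  # y_t - y_{t-4}: 10, 8, 13, 9
```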

```
#Seasonally differenced of the data
gdp_ts %>%
gg_tsdisplay(difference(value, 4),
plot_type='partial') +
labs(title="Seasonally differenced", y="")
```

According to the plot above, the mean seems to be stabilized; we don’t need to take a further difference. To find an appropriate ARIMA model, we first look at the **ACF** graph: there is a spike at lag 4, which suggests a seasonal **MA(1)** component. On the other hand, the **PACF** shows a significant spike at lag 4, which suggests a seasonal **AR(1)** component.

We can start with the below models according to the ACF/PACF graphs mentioned above, and we will also add an automatic selection process to find the best result. We will set the **stepwise** parameter to `FALSE` so that a larger set of models is searched.

```
#Modeling the GDP
fit_gdp <- gdp_ts %>%
model(
arima000011 = ARIMA(value ~ pdq(0,0,0) + PDQ(0,1,1)),
arima000110 = ARIMA(value ~ pdq(0,0,0) + PDQ(1,1,0)),
auto = ARIMA(value, stepwise = FALSE)
)
```

```
#Shows all models detailed
fit_gdp %>%
pivot_longer(everything(),
names_to = "Model name",
values_to = "Orders")
# A mable: 3 x 2
# Key: Model name [3]
# `Model name` Orders
# <chr> <model>
#1 arima000011 <ARIMA(0,0,0)(0,1,1)[4] w/ drift>
#2 arima000110 <ARIMA(0,0,0)(1,1,0)[4] w/ drift>
#3 auto <ARIMA(1,0,0)(0,1,1)[4] w/ drift>
```

```
#Ranks the models by AICc criteria
fit_gdp %>%
glance() %>%
arrange(AICc) %>%
select(.model,AICc)
# A tibble: 3 x 2
# .model AICc
# <chr> <dbl>
#1 auto 1518.
#2 arima000011 1520.
#3 arima000110 1524.
```

If we analyze the above results, we conclude that the auto model is better than the others by AICc. The model consists of a non-seasonal **AR(1)** and a seasonal **MA(1)** component on the seasonally differenced data, which is very similar to our guess based on the **ACF**. **[4]** indicates a quarterly time series, and drift indicates the constant in the differenced model.
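
An equivalent specification can also be fitted outside fable with base R’s `arima()`. A sketch on a simulated quarterly series (not the GDP data); the drift is supplied as an `xreg` trend, since `arima()` drops the constant when differencing:

```r
set.seed(123)
trend <- 1:60
# seasonal random-walk noise so the seasonally differenced series is well behaved
e <- stats::filter(rnorm(60, sd = 2), filter = c(0, 0, 0, 1),
                   method = "recursive")
x <- ts(50 + 0.5 * trend + rep(c(8, -3, -9, 4), 15) + e, frequency = 4)
fit <- arima(x, order = c(1, 0, 0),
             seasonal = list(order = c(0, 1, 1), period = 4),
             xreg = trend)  # trend regressor plays the role of the drift
names(coef(fit))  # coefficient names include "ar1", "sma1" and "trend"
```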

After finding our model, it is better to check the residuals to see whether any correlation remains in the data.

```
#Checks the residuals
fit_gdp %>%
select(auto) %>%
gg_tsresiduals(lag=12)
```

In the ACF graph above, we clearly see that no spike exceeds the threshold limits, which means the residuals are white noise. Now we can move on to calculating the forecasts with confidence.

```
#Forecasting GDP
(fit_gdp %>%
forecast(h=8) %>%
filter(.model=="auto") %>%
as.data.frame() %>% #to use the select function below, we convert data to a data frame
select(index,.mean) -> f_gdp)
# index .mean
#1 2021 Q3 521226607.35
#2 2021 Q4 526962302.61
#3 2022 Q1 454107505.86
#4 2022 Q2 476945092.38
#5 2022 Q3 536502403.29
#6 2022 Q4 544456726.15
#7 2023 Q1 472308990.28
#8 2023 Q2 495371912.09
```

Now, we will do the same steps for the CPI data.

```
#Analyzing the CPI data
cpi_ts %>%
autoplot(color="red",size=1) +
scale_y_continuous(labels = scales::label_number_si()) + #for SI prefix
xlab("") + ylab("") +
theme_minimal()
```

A very strong uptrend is seen in the data, while no seasonality is detected on the graph. Therefore, we will take the first difference to stabilize the mean. But because of the exponential uptrend, we had better log-transform the data to stabilize the variance as well.

```
#The first difference of the log-transformed data
cpi_ts %>%
gg_tsdisplay(difference(log(value)),
plot_type='partial') +
labs(title="", y="")
```

The data seem to be stationary in terms of the mean. Analyzing the **ACF/PACF** plots, we see candidate seasonal **AR(1)** and seasonal **MA(1)** components, because both plots have spikes at the seasonal lag (**4**).
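
A useful side note: the first difference of a logged series is approximately its period-over-period growth rate, which is why the log transform pairs naturally with differencing for an exponentially trending index (a base-R sketch on made-up index values):

```r
p <- c(100, 105, 110.25)  # hypothetical index growing 5% per period
diff(log(p))              # both differences equal log(1.05), about 0.0488
```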

```
#Modeling the CPI
fit_cpi <- cpi_ts %>%
model(
arima010001 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(0,0,1)),
arima010100 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(1,0,0)),
auto = ARIMA(log(value), stepwise = FALSE)
)
#Shows all CPI models detailed
fit_cpi %>%
pivot_longer(everything(),
names_to = "Model name",
values_to = "Orders")
# A mable: 3 x 2
# Key: Model name [3]
# `Model name` Orders
# <chr> <model>
#1 arima010001 <ARIMA(0,1,0)(0,0,1)[4] w/ drift>
#2 arima010100 <ARIMA(0,1,0)(1,0,0)[4] w/ drift>
#3 auto <ARIMA(0,2,1)(1,0,0)[4]>
#Ranks the models by AICc criteria
fit_cpi %>%
glance() %>%
arrange(AICc) %>%
select(.model,AICc)
# A tibble: 3 x 2
# .model AICc
# <chr> <dbl>
#1 arima010100 -246.
#2 arima010001 -245.
#3 auto -240.
```

According to the **AICc** results, the best model has a seasonal **AR(1)** part with a first difference. As the last step in assessing the fit of the model, we will check the residuals.

```
#Checks the residuals
fit_cpi %>%
select(arima010100) %>%
gg_tsresiduals(lag=12)
```

The residuals are within the thresholds in the ACF plot, which means no information remains in the data. We can safely forecast with our model.

```
#Forecasting CPI
(fit_cpi %>%
forecast(h=8) %>%
filter(.model=="arima010100") %>%
as.data.frame() %>% #to use the select function below, we convert data to a data frame
select(index,.mean) -> f_cpi)
# index .mean
#1 2021 Q3 552.9687
#2 2021 Q4 572.1427
#3 2022 Q1 590.3115
#4 2022 Q2 608.4114
#5 2022 Q3 624.0214
#6 2022 Q4 642.0021
#7 2023 Q1 659.8191
#8 2023 Q2 677.8676
```

Now we can investigate a convenient time for the election from the ruling party’s point of view; a graph might be useful for that. But first of all, we need a decisive variable. The ratio of GDP to CPI would be an effective indicator of people’s perception of the country’s situation, because it is related both to people’s consumption level and to the country’s growth rate.

```
#Finding the proper time on the plot for the election
tibble(
ratio=f_gdp$.mean/f_cpi$.mean,
date=f_gdp$index
) %>%
ggplot(aes(x=date))+
geom_line(aes(y=ratio),
size=1,
color="green")+
geom_point(aes(x=date[5],y=ratio[5]),#red point
color="red",
size=2)+
ylab("GDP/CPI")+
xlab("")+
theme_minimal()+
theme(axis.title.y = element_text(margin = margin(r=0.5, unit = "cm")))#moves away from the tick numbers
```

When we examine the above plot, a significant decrease is observed from the third quarter of this year to the end of the first quarter of next year. That period is too risky for an election, but after the first quarter of 2022 an increasing trend is seen. However, from the third quarter of 2022 there is again a big decrease until the quarter in which the election date was officially announced. The ideal election calendar therefore seems to be the third quarter of 2022 for the ruling parties.
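
This reading can be checked numerically with the forecast means printed earlier (values copied from the f_gdp and f_cpi outputs above):

```r
gdp <- c(521226607.35, 526962302.61, 454107505.86, 476945092.38,
         536502403.29, 544456726.15, 472308990.28, 495371912.09)
cpi <- c(552.9687, 572.1427, 590.3115, 608.4114,
         624.0214, 642.0021, 659.8191, 677.8676)
ratio <- gdp / cpi
which.max(ratio)  # 1: 2021 Q3 is the peak of the horizon
which.min(ratio)  # 7: 2023 Q1 is the trough
# ratio[5] (2022 Q3, the red point) is a local peak after the early-2022 dip
ratio[5] > ratio[4] && ratio[5] > ratio[6]  # TRUE
```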

The countries that we will compare are Greece, Italy, and Turkey, which are all in the same Mediterranean climate zone and have suffered brutal fire damage. We will compare the damaged areas in hectares (ha) and the number of fires. The dataset has annual observations for the 2009-2021 period. Of course, the data for this year only runs through 20 August in the EFFIS database, but that doesn’t change the underlying point of this article.

```
library(tidyverse)
library(tsfknn)
library(forecast)
library(readxl)
df <- read_excel("fires_statistics.xlsx")
```

For comparison purposes, we will use a second y-axis in our plot. We have to use a transformation rate to create it, because the **sec.axis** function, which we use to build the second y-axis, cannot build an independent axis; it is defined relative to the main y-axis via a mathematical transformation. The transformation rate can be taken as approximately 100 here.
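
The mechanics of the shared axis are just multiplication and division by the rate: the secondary series is plotted at `value * rate` in primary-axis units, and `sec_axis(~ . / rate)` relabels those positions back to the original scale. A tiny base-R sketch with a hypothetical rate of 100:

```r
rate  <- 100
fires <- c(50, 120, 333)  # hypothetical fire counts
pos   <- fires * rate     # positions on the primary (hectares) axis
pos / rate                # the sec_axis transformation recovers the counts
```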

```
fun <- function(country="Greece"){
df %>%
.[.$country==country,] %>%
ggplot(aes(x=year))+
geom_bar(aes(y=burnt_areas_ha),stat = "identity",fill="orange")+
#Bar labels(burnt_areas_ha)
geom_text(aes(y=burnt_areas_ha,label=burnt_areas_ha),vjust=-0.2,color="orange")+
#multiplied by transformation rate to tune in with the main y-axis(Burnt Areas)
geom_line(aes(y=fires_number*100),size=2,color="lightblue")+
#Line labels(fires_number)
geom_text(aes(y=fires_number*100,label=fires_number),vjust=1,color="lightblue")+#scaled by the transformation rate so the labels sit on the line
scale_x_continuous(breaks = seq(2009,2021,1))+
scale_y_continuous(
#for the main y-axis
labels = scales::label_number_si(),
breaks = scales::pretty_breaks(n=7),
#for the second y-axis
sec.axis = sec_axis(~./100,#divided by transformation rate, in order to be represented based on the first y-axis
name = "Number of Fires",
breaks = scales::pretty_breaks(7)),
)+
xlab("")+ylab("Burnt Areas(ha)")+
ggtitle(paste("The comparison of the volume(ha) and numbers of Forest Fires in",
country))+
theme_minimal()+
theme(
#Main y-axis
axis.title.y = element_text(color = "orange",
#puts distance with the main axis labels
margin = margin(r=0.5,unit = "cm"),
hjust = 1),#sets the main axis title top left
#Second y-axis
axis.title.y.right = element_text(color = "lightblue",
#puts distance with the second axis labels
margin = margin(l=0.5,unit = "cm"),
hjust = 0.01),#sets the second axis title bottom right
#adjusts the date labels to avoid overlap line numbers
axis.text.x = element_text(angle = 30,
vjust = 1,
hjust = 1),
panel.grid.minor = element_blank(),#removes the minor grid of panel
plot.title=element_text(face = "bold.italic",hjust = 0.5)
)
}
```

```
fun("Turkey")
```

```
fun("Italy")
```

When we look at the above plots, the damaged areas have increased considerably compared to past years in all countries. Probably the underlying reason that confused Turkish people was the high ratio of damaged area to the number of fires. The other two countries seem to have the same problem compared to past years. Undoubtedly, the reason behind that is global warming. Of course, the governments’ insufficient responses to the fires also played a part.

Now let’s forecast this year’s number of fires to determine whether there is an anomaly in Turkey. Because our dataset is quite small, the k-nearest neighbors (KNN) algorithm is a convenient choice. We will set **k** to 3 and 4 for optimization, because the square root of the number of observations is approximately 3.6.

Since the PACF order of the pattern is zero, we set the **lags** parameter to 1:5. To determine the PACF order, we first check the stationarity of the data.
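
The k ≈ √n rule of thumb behind the choice of k, checked in base R (13 annual observations, 2009-2021):

```r
n <- 13   # annual observations, 2009-2021
sqrt(n)   # about 3.61, so k = 3 and k = 4 are the natural candidates
```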

```
#Checking the pattern if it is stationary
df %>%
filter(country=="Turkey") %>%
ggplot(aes(x=year,y=fires_number))+
geom_line(size=1)+
xlab("")+ylab("")+
theme_minimal()
```

There seems to be exponential growth in the above graph. Hence, we will take the first difference of the data to stabilize the mean.

```
#Making the time series stationary and determining the order of ACF and PACF
df %>%
filter(country=="Turkey") %>%
select(fires_number) %>%
pull() %>% #converts the dataframe column to a vector
ts(start = 2009,frequency = 1) %>% #converts vector to a time series
diff() %>% #gets first difference of the time series
ggtsdisplay()# gets ACF/PACF plots
```

It seems the mean is stabilized, so we don’t have to take a second difference. According to the PACF plot, there is no significant spike beyond the limits; that is why we took the PACF order as zero.

```
#Forecasting with KNN
df %>%
filter(country=="Turkey") %>%
select(fires_number) %>%
pull() %>% #converts the dataframe column to a vector
head(-1) %>% #removes the last observation for prediction
ts(start = 2009,frequency = 1) %>% #converts the vector to a time series
knn_forecasting(h=1,lags = 1:5, k=c(3,4)) %>%
#.$pred (gets prediction value for 2021, which is 424.575)
plot()
#actual value(the last observation)
points(x = 2021,
y = last(df$fires_number[df$country=="Turkey"]),
pch = 16,
col="blue")
legend("topleft",
legend = c("Predicted Value","Actual Value"),
col = c("red","blue"),
pch = c(16,16)
)
```

When we analyze the plot, we see that the predicted value for this year is considerably higher than the actual value, which suggests that the speculation about the number of fires in Turkey is baseless.

The dataset we’re going to use for this article is from Our World in Data. What we want to do here is determine the partial and full vaccination rates as percentages of the relevant population, and then examine the relationship between the vaccination rates and the numbers of cases and deaths over monthly intervals.

```
library(readxl)
library(dplyr)
library(lubridate)
library(rlang)
library(scales)
library(tidyr)
library(stringr)
library(purrr)
library(plotly)
df_raw <- read_excel("covid19_vacc.xlsx")
cement <- function(...){
args <- ensyms(...)
paste(purrr::map(args,as_string))
}
```

The **cement** function we’ve built is taken from the Advanced R book. Its purpose is to save us from writing our arguments in string form, which can sometimes be quite annoying; with this function, we can write our arguments without quotes.
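
The same idea can be sketched without rlang, using base R’s `substitute()` (a hypothetical alternative shown only to demystify the mechanism; the post’s `ensyms()` version is the more robust one):

```r
cement_base <- function(...) {
  # capture the unevaluated arguments and deparse each symbol to a string
  vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
}
cement_base(Turkey, new_cases)  # c("Turkey", "new_cases")
```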

```
fun <- function(...){
#Assigning the symbols as strings
var_loc <- cement(...)[1]
var_toll <- cement(...)[2]
df <- df_raw %>%
.[.$location==var_loc,] %>%
transmute(
date=date %>% as.Date(),
#First phase vaccinated people
partly=people_vaccinated-people_fully_vaccinated,
#Fully vaccinated people
fully=people_fully_vaccinated,
total_vaccinations,
new_cases,
new_deaths,
population
) %>% na.omit() %>%
group_by(month = format(date, "%Y%m")) %>%
mutate(
#The monthly amount
new_cases=cumsum(new_cases),
new_deaths=cumsum(new_deaths),
) %>%
#The last observation for each month
slice(which.max(cumsum(partly & fully))) %>%
ungroup() %>%
select(-month) %>%
#For percentage demonstration
mutate(partly_rate=round(partly/population,4)*100,
fully_rate=round(fully/population,4)*100)
fig <- plot_ly(df)
fig %>%
add_trace(x = ~date, y = ~partly_rate, type = 'bar',
name = 'partly',
marker = list(color = 'lightblue'),
hoverinfo = "text",
text = ~paste(partly_rate, '%')) %>%
add_trace(x = ~date, y = ~fully_rate, type = 'bar',
name = 'fully',
marker = list(color = 'orange'),
hoverinfo = "text",
text = ~paste(fully_rate, '%')) %>%
add_trace(x=~date, y=~.data[[var_toll]], type='scatter',mode='lines',
name=str_replace(var_toll,"_"," "),
yaxis="y2",
line = list(color = 'red',width=4),
hoverinfo="text",
#.data is the pronoun provided by data masks:https://adv-r.hadley.nz/evaluation.html#pronouns
text=~paste(comma(.data[[var_toll]]))) %>%
layout(barmode = "stack",
legend=list(orientation='h', #horizontal
xanchor="center", #center of x axis
x=0.5),
xaxis=list(title=""),
yaxis=list(side='left',
ticksuffix="%",
title='Total Percentage of Vaccinations'),
yaxis2=list(overlaying="y",
side='right',
automargin=TRUE, #prevents axis title from overlapping the labels
title=paste("The number of",
str_replace(var_toll,"_"," "),
"given monthly interval")))
}
```

```
fun(Turkey,new_cases)
```

```
fun(Germany,new_cases)
```

When we examine the graphs, we see that total vaccination is below 50% in Turkey and above 60% in Germany. Both countries peaked in the number of cases in May, but there has been a steady decline in cases since then as the proportion of fully vaccinated people has risen. We can clearly see that the vaccine has worked so far. So please get vaccinated as soon as possible!
