Our World in Data, from which we'll take the data for this article, claims that lower-middle-income countries have the worst waste management systems. Therefore, we will examine the 20 countries that produce the most mismanaged plastic waste, grouped by their GNI income status.

```
library(tidyverse)
library(tidymodels)
library(janitor)
library(rvest)
library(scales)
library(bbplot)
library(vip)
#Scraping the GNI data set
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GNI_(nominal)_per_capita"
list_html <-
read_html(url) %>% #scraping the interested web page
rvest::html_elements("table.wikitable") %>% # takes the tables we want
rvest::html_table()
#gni income levels vector
status <- c("high","high","upper_middle","upper_middle",
"lower_middle","lower_middle","low","low")
#Building and tidying GNI dataframe
df_gni <-
#assigning the gni income levels to the corresponding countries
map(1:8,function(x) list_html[[x]] %>% mutate(gni_status=status[x])) %>%
do.call(what=rbind) %>% #unlisting and merging
clean_names() %>%#replaces spaces with underscores and lowercases the names
mutate(gni_status = as_factor(gni_status)) %>%
select(country, gni_metric=gni_per_capita_us_1, gni_status) %>%
mutate(gni_metric = str_remove(gni_metric,",") %>% as.numeric())
#Building and tidying waste dataframe
df_waste <- read_csv("https://raw.githubusercontent.com/mesdi/plastic-pollution/main/plastic-waste-mismanaged.csv")
df_waste <-
df_waste %>%
clean_names() %>%
na.omit() %>%
select(country=entity,
waste_metric= mismanaged_plastic_waste_metric_tons_year_1)#renaming
#Binding waste and gni dataframes by country
df_tidy <-
df_waste %>%
left_join(df_gni) %>%
distinct(country, .keep_all = TRUE) %>% #removes duplicated countries
filter(country != "World") %>%
na.omit()
#Top 6 countries in terms of the amounts of mismanaged plastic waste
#to put their amount on the graph
df_6 <-
df_tidy %>%
slice_max(order_by=waste_metric, n=6) %>%
mutate(waste= paste0(round(waste_metric/10^6,2)," million t"))
```

We can now plot the mismanaged plastic waste versus GNI per capita.

```
#The chart of the top 20 countries in terms of the amounts of plastic wastes
#versus GNI income status
df_tidy %>%
slice_max(order_by= waste_metric, n=20) %>%
ggplot(aes(x= gni_metric, y= waste_metric, color= gni_status))+
geom_text(aes(label= country),
hjust= 0,
vjust= -0.5,
size=4,
#legend key type
key_glyph= "rect")+
geom_text(data = df_6,
aes(x=gni_metric, y= waste_metric,label=waste),
hjust=0,
vjust=1.2)+
#Using scale_*_log10 to zoom in data on the plot
scale_x_log10(labels = scales::label_dollar(accuracy = 2))+
scale_y_log10(labels = scales::label_number(scale_cut = cut_si("t")))+
scale_color_discrete(labels = c("Upper-Middle","Lower-Middle","Low"))+
labs(title= "Mismanaged plastic waste(2019) vs. GNI income")+
coord_fixed(ratio = 0.5, clip = "off")+#fits the text labels to the panel
bbc_style()+
theme(
legend.position = "bottom",
legend.text = element_text(size=12),
plot.title = element_text(hjust=0.5)#centers the plot title
)
```

When we examine the graph above, we can see that there are no high-income countries in the top 20, and China leads the list by far. The top six countries' amounts are written next to their labels.

```
#Modeling with tidymodels
#preprocessing
df_rec <-
recipe(waste_metric ~ gni_status, data = df_tidy) %>%
step_log(waste_metric, base = 10) %>% #stabilizing the variance
step_dummy(gni_status)
#modeling
lm_model <-
linear_reg() %>%
set_engine("lm")
#workflow set
lm_wflow <-
workflow() %>%
add_model(lm_model) %>%
add_recipe(df_rec)
lm_fit <- fit(lm_wflow, df_tidy)
#Descriptive and inferential statistics
lm_fit %>%
extract_fit_engine() %>%
summary()
#Call:
#stats::lm(formula = ..y ~ ., data = data)
#Residuals:
# Min 1Q Median 3Q Max
#-3.2425 -0.6261 0.1440 0.8061 2.6911
#Coefficients:
# Estimate Std. Error t value
#(Intercept) 3.5666 0.1583 22.530
#gni_status_upper_middle 0.8313 0.2375 3.501
#gni_status_lower_middle 1.5452 0.2409 6.414
#gni_status_low 1.0701 0.3485 3.071
# Pr(>|t|)
#(Intercept) < 2e-16 ***
#gni_status_upper_middle 0.000627 ***
#gni_status_lower_middle 2.13e-09 ***
#gni_status_low 0.002577 **
#---
#Signif. codes:
#0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 1.119 on 137 degrees of freedom
#Multiple R-squared: 0.2384, Adjusted R-squared: 0.2218
#F-statistic: 14.3 on 3 and 137 DF, p-value: 3.694e-08
```

According to the results above, the model is statistically significant in terms of its p-value (**3.694e-08 < 0.05**). All the coefficients are statistically significant as well, judging by their p-values. Looking at the R-squared, the model explains about 23% of the variation in the target variable.
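The model-level numbers quoted above can be pulled programmatically rather than read off the printed summary. A minimal base-R sketch on the built-in mtcars data (a toy example, not the waste data):

```
#Toy sketch on mtcars: extracting R-squared and the overall p-value
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)
r_squared <- s$r.squared #share of variance explained
f <- s$fstatistic #F-statistic with its degrees of freedom
p_value <- pf(f[1], f[2], f[3], lower.tail = FALSE) #overall model p-value
```

With the article's fitted workflow, the same values could be obtained from `extract_fit_engine()` followed by `summary()` as above, or tidily with `broom::glance()`.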

Finally, we will check the opening claim by asking which GNI income status has the strongest effect on how countries handle plastic waste. We will employ a model-based approach with the *vip* package.

```
#model-based variable importance
lm_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(color = "lightblue", fill = "lightblue")) +
theme_minimal()
```

According to the graph above, the lower-middle income level is by far the most influential category in terms of mismanaged plastic waste.
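For a plain lm engine, vip's model-based importance is, to my knowledge, the absolute t-statistic of each coefficient. That ranking can be sketched by hand on the built-in mtcars data (a toy example, not the waste model):

```
#Ranking lm predictors by |t-statistic|, as vip does for linear models
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
imp <- abs(coef(summary(fit))[-1, "t value"]) #drop the intercept
sort(imp, decreasing = TRUE) #largest |t| first
```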

```
library(tidyverse)
library(fpp3)
library(bbplot)
library(plotly)
food <- read_csv("https://raw.githubusercontent.com/mesdi/wheat/main/ukraine-russia-food.csv")
#G20 countries
g20 <- c("Argentina", "Australia", "Brazil", "Canada", "China", "France",
"Germany", "India", "Indonesia", "Italy", "Japan", "South Korea",
"Mexico", "Saudi Arabia", "South Africa", "Turkey", "United Kingdom",
"United States")
```

We will take a look at the 2010-2019 totals of the import shares of barley, maize (corn), sunflower oil, and wheat in the domestic supply. If we hover over the chart bars, we can see the latest (2019) import rates for the given country and product.

```
#Comparing the G20 Countries' Imported Crops
#for Domestic Supply(%) from Russia or Ukraine
p <-
food %>%
select(Entity,
Year,
`Barley imports from Ukraine + Russia (% supply)`,
`Maize imports from Ukraine + Russia (% supply)`,
`Sunflower oil imports from Ukraine + Russia (% supply)`,
`Wheat imports from Ukraine + Russia (% supply)`) %>%
rename(Barley=`Barley imports from Ukraine + Russia (% supply)`,
Maize=`Maize imports from Ukraine + Russia (% supply)`,
`Sunflower oil`=`Sunflower oil imports from Ukraine + Russia (% supply)`,
Wheat=`Wheat imports from Ukraine + Russia (% supply)`) %>%
mutate(across(where(is.numeric), ~replace_na(.,0))) %>%
filter(Entity %in% g20) %>%
pivot_longer(cols = -c(Entity,Year),names_to = "Crops",values_to = "n") %>%
group_by(Entity,Crops) %>%
summarise(total=sum(n),
#tooltip text showing the last year's values for plotly chart
percent=paste0(round(last(n),2),"%\nin 2019")) %>%
arrange(desc(total),.by_group = TRUE) %>% #ranking stacked fill descending
ggplot(aes(Entity, total, group=Entity, fill= Crops, text=percent)) +
geom_bar(stat = "identity", position = "fill")+
bbc_style()+
scale_y_continuous(labels = scales::percent)+
coord_flip()+
theme(legend.position = "top",
legend.justification = "center",
panel.grid.major.y = element_blank(),
title = element_text(size=5))
#Plotly Graph
ggplotly(p,tooltip = "text") %>%
#adjusting legend position to top and center
layout(legend = list(orientation = "h",
y =1.1,
x=0.25,
title=""),
margin=list(r=50) #gives space to the hover information
)
```

It seems that sunflower oil is the most dominant imported crop; specifically, when we hover over China, we see that 86.47% of its domestic supply was acquired from those countries in 2019. It is also clear that Brazil and Argentina do not depend on those crops arriving from Russia and Ukraine.

We will now examine the impact of this situation on US wheat futures prices. We will predict weekly prices for the next year with the neural network autoregression model (NNAR).

```
wheat_future <- read_csv("https://raw.githubusercontent.com/mesdi/wheat/main/us_wheat_futures.csv")
wheat_ts <-
wheat_future %>%
#removes comma and then converts to the proper date format
mutate(Date= parse_date(str_remove(Date,"\\,"),"%b %d %Y")) %>%
as_tsibble() %>%
fill_gaps() %>% #fills the gaps with NAs to provide regular time series
fill(Price,.direction = "down") #replaces the NAs with the previous value
```

We can see from the results below that our model has 19 lagged inputs and ten nodes in the hidden layer.

```
set.seed(123)
#NNAR modeling
fit_wheat <-
wheat_ts %>%
model(NNETAR(Price))
# A mable: 1 x 1
# `NNETAR(Price)`
# <model>
#1 <NNAR(19,10)>
```
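The NNAR(19,10) notation means the last 19 observations feed a single-hidden-layer network with 10 nodes. As an illustration of how the lagged inputs are formed (a sketch of the idea, not fable's internals), base R's embed() builds exactly such a lag matrix:

```
#embed() turns a series into rows of [y_t, y_t-1, ..., y_t-p]
y <- 1:10
p <- 3 #number of lagged inputs (19 in the fitted model)
m <- embed(y, p + 1) #first column is the target, the rest are the lags
head(m, 2)
```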

To check our model, we should run residual diagnostics: we want to see whether any autocorrelation remains and whether the residuals are approximately normally distributed.

```
#Model residuals diagnostics
fit_wheat %>%
gg_tsresiduals()
```

All the lags stay between the threshold lines on the ACF plot, so there is no significant autocorrelation. The residual time plot shows a mean around zero and roughly constant variance across the historical data, suggesting the distribution is close to normal.
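The visual check can be backed by a portmanteau test such as Ljung-Box; a self-contained sketch on simulated white noise (in practice the model's innovation residuals would be used instead):

```
#Ljung-Box test: a large p-value means no evidence of autocorrelation
set.seed(1)
resid_wn <- rnorm(200)
lb <- Box.test(resid_wn, lag = 10, type = "Ljung-Box")
lb$p.value
```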

Now we can forecast; to do that, we will run simulations with our model.

```
#Projection of prices of the US wheat futures
fit_wheat %>%
generate(times = 100, h = 52) %>%
autoplot(.sim) +
autolayer(wheat_ts, Price) +
scale_x_yearweek(date_breaks="104 week")+
labs(title="1-Year Projection of US Wheat Futures (ZWU2) Prices",
x="",
y="")+
bbc_style()+
theme(legend.position = "none",
axis.text.x=element_text(size=15),
plot.title = element_text(size=20,hjust=0.5))
```

We can clearly see that prices rose above the 1200s because of the war and then came back to around the 900s. For the next one-year period, however, it seems prices will fall to about the 600-700s.

I want to check the accuracy of these claims. First, we will compare the foreign-born population of Turkey with those of other countries with large immigrant populations: France, Germany, Greece, and the United Kingdom.

To do that, we have to download the migration dataset from Our World in Data. We will draw waffle plots showing the countries' proportions relative to one another.

```
library(tidyverse)
library(readxl)
library(waffle)
library(sysfonts)
library(showtext)
library(tidymodels)
library(tidyquant)
df_migration <-
read_excel("migration.xlsx") %>%
na.omit()
#loading Google fonts
font_add_google("Roboto Slab")
#To support and make Google fonts work
showtext_auto()
#Proportions of foreign-born populations in the given countries
df_migration %>%
mutate(country = fct_reorder(country, migrants_rate)) %>%
group_by(year) %>%
ggplot(aes(fill = country, values = migrants_rate)) +
geom_waffle(color = "white",
size = 0.5,
n_rows = 3,
flip = TRUE,
make_proportional = TRUE) +
facet_wrap(~year, nrow = 1, strip.position = "bottom") +
scale_x_discrete(breaks = scales::pretty_breaks()) +
labs(title = "Proportions of foreign-born populations in the given countries",
caption = "Source: United Nations Department of Economic and Social Affairs (UN DESA)") +
theme_minimal(base_family = "Roboto Slab") +
theme(
axis.text.y = element_blank(),
panel.grid = element_blank(),
legend.title = element_blank(),
text = element_text(size=15),
plot.title = element_text(hjust = 0.5,
size = 14,
face = "bold"),
plot.caption = element_text(size = 10,
color = "blue",
face = "bold"),
plot.caption.position = "plot"
)
```

Although there seems to be a significant increase in Turkey after the Syrian Civil War (2011), the rate still seems quite low compared to other countries, especially Germany.

Now, let's turn to the other claim, the one Turkish people voice as "The immigrants are taking our jobs!" The dataset for this task is foreign-born employment, taken from the OECD.

```
#Comparing employment rates of the given countries
df_employment <- read_excel("foreign_born_employment.xlsx")
df_employment %>%
mutate(country = fct_reorder(country, employment_rate)) %>%
ggplot(aes(year, employment_rate, fill = country)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(breaks = c(seq(0, 75, 25)))+
labs(
caption = "Source: OECD",
title="Foreign-born employment aged 15-64\nin total foreign-born population of that same age ") +
theme_minimal(base_family = "Roboto Slab") +
theme(
axis.ticks.y = element_line(),
axis.title = element_blank(),
panel.grid = element_blank(),
legend.title = element_blank(),
text = element_text(size=15),
plot.title = element_text(hjust = 0.5,
size = 14,
face = "bold"),
plot.caption = element_text(size = 12,
color = "blue",
face = "bold"),
plot.caption.position = "plot"
)
```

It seems that less than half of the immigrant population in Turkey has had a job, which is a low rate compared to the other countries.

The last thing we are going to do is try to find a relationship between conflicts (especially the Syrian Civil War) and the flow of immigrants to the countries we examined above.

For this task, we will add the conflict data to the migration data via the *VLOOKUP* function. We will model the number of migrants with an interaction term between total conflict-related deaths in the Middle East and the relevant countries. Our goal is to find out whether there is a significant relationship between the number of immigrants and the total conflict deaths for each country.

```
df_conflict <- read_excel("conflict_toll.xlsx")
df_conflict <-
df_migration %>%
#Using the VLOOKUP function of excel to match the relevant time values
#with the help of tidyquant package
mutate(deaths = VLOOKUP(df_migration$year, df_conflict, year, deaths)) %>%
#the cumulative sum of deaths corresponds to the cumulative sum of the number
#of immigrants for the interested years
mutate(deaths = cumsum(deaths))
#Simple linear regression model with interaction terms
model_lm <-
linear_reg() %>%
set_engine("lm")
model_fit <-
model_lm %>%
fit(migrants_rate ~ country:deaths, data = df_conflict)
#Comparing the countries to their p-values for the cause-effect
model_fit %>%
tidy() %>%
#simplifying the term names
mutate(term= case_when(
str_detect(term, "France") ~ "France",
str_detect(term, "Germany") ~ "Germany",
str_detect(term, "Greece") ~ "Greece",
str_detect(term, "Turkey") ~ "Turkey",
str_detect(term, "United Kingdom") ~ "United Kingdom"
)) %>%
.[-1,] %>% #removing the intercept
ggplot(aes(term, p.value)) +
geom_point(aes(color = term) , size = 3) +
geom_hline(yintercept = 0.05,
linetype = "dashed",
alpha = 0.5,
size = 1,
color = "red") +
labs(title ="Comparing the countries to their p-values\nat a significance level of 0.05 (red dashed line)" ,
color = "",
x = "",
y = "") +
theme_minimal(base_family = "Roboto Slab") +
theme(
panel.grid = element_blank(),
axis.text = element_blank(),
panel.background = element_rect(fill = "lightgrey", color = NA),
text = element_text(size = 15),
plot.title = element_text(
hjust = 0.5,
face = "bold",
size = 14
)
)
```

The red dashed line marks the p-value of 0.05: below it, there is a statistically significant relationship between conflict in the Middle East and immigration into the relevant country. According to the plot above, Turkey, Germany, and to some extent France show that cause-effect relationship.

Of course, this is not a good development for the world. For this reason, I wondered which motives affect countries' military spending. I will take the G20 countries' annual military expenditure as the target variable. And as the predictors:

All the data we will use is from Our World in Data. We are going to fit a linear model via simulation, in which the Markov chain Monte Carlo (MCMC) method generates samples from Bayesian posterior distributions. This process runs on JAGS, which we access through the *rjags* package; note that the JAGS library itself has to be installed separately.

First, we create a data frame that we are going to use in analysis and modeling.

```
library(tidyverse)
library(readxl)
library(sysfonts)
library(showtext)
library(rjags)
library(coda)
df <- read_excel("military_expenditure.xlsx")
#Creating region variable for comparing analysis and
#region_int variable for jags modeling
df <-
df %>%
mutate(region=if_else(country %in% c("Australia","China","India","Indonesia",
"Japan","Russia","Saudi Arabia",
"South Africa","South Korea","Turkey"),
"East","West") %>% as_factor(),
region_int = region %>% as.integer()) %>%
na.omit() %>%
as.data.frame()
```

For comparison, we will examine the expenditure of the G20 countries from the end of the Cold War until recent years, colorizing each region separately.

```
#load fonts(google)
font_add_google("Roboto Slab")
showtext_auto()
#Comparison Plot
df %>%
filter(year > 1991) %>%
mutate(country = fct_reorder(country, expenditure)) %>%
ggplot(aes(expenditure, country, fill = region, color = region)) +
geom_boxplot(position = position_dodge(), alpha = 0.5) +
scale_x_log10(labels = scales::label_number_si()) +
scale_fill_brewer(palette = "Pastel1") +
scale_color_brewer(palette = "Pastel1") +
labs(y = NULL, color = NULL, fill = NULL, x = "Military Expenditures per Country($)")+
theme_minimal()+
theme(legend.position = "right",
panel.background = element_rect(fill = "#e5f5f9",color = NA),
axis.text = element_text(family = "Roboto Slab", size = 10),
axis.title = element_text(family = "Roboto Slab"))
```

Interestingly, Germany does not appear much different from Russia.

First, we will build the model with the region variable we created before, so we can see its effect on the change in military expenditure. Because the *jags.model* function cannot use non-numeric predictors, we have to use the integer version of the region variable (**region_int**) we created before.

```
#modeling with region
set.seed(123)
model_with <-
lm(expenditure ~ civlib + gdp + region, data = df)
model_with
#Call:
#lm(formula = expenditure ~ civlib + gdp + region, data = df)
#Coefficients:
#(Intercept) civlib gdp regionEast
# -3.052e+10 7.380e+10 5.367e-02 -4.726e+10
```

Based on the linear model above, we include the **region** variable in the JAGS model as the dummy term *regionEast*.

```
#the dataset for jags model
lt_with <- list(x = df[,c(3, 4, 7)], y = df[,5],
n = nrow(df))
#estimation coefficients
lt_with$b_guess <- model_with$coefficients
#Model string
jags.script_with <-
"#--------------------------------------------------
model{
# likelihood
for( i in 1:n) {
y[i] ~ dnorm(mu[i], tau)
mu[i] <- intercept + civlib*x[i,1] + gdp*x[i,2]+ regionEast*x[i,3]
}
# priors
intercept ~ dnorm(b_guess[1], 0.1)
civlib ~ dnorm(b_guess[2], 0.1)
gdp ~ dnorm(b_guess[3], 0.1)
regionEast ~ dnorm(b_guess[4], 0.1)
tau ~ dgamma(.01,.01)
# transform
sigma <- 1 / sqrt(tau)
}#--------------------------------------------------
"
#compiling
mod_with <- jags.model(textConnection(jags.script_with),
data = lt_with,
n.chains = 4, n.adapt = 2000)
#burn-in process
update(mod_with, 1000)
#posterior sampling
mod.samples_with <- coda.samples(
model = mod_with,
variable.names = c("intercept", "civlib", "gdp", "regionEast", "sigma"),
n.iter = 1000)
# descriptive statistics of posteriors
sm_with <- summary(mod.samples_with)
sm_with[1]
#$statistics
# Mean SD Naive SE Time-series SE
#civlib 7.379818e+10 3.127715e+00 4.945351e-02 4.885913e-02
#gdp 7.665518e-02 2.448805e-03 3.871901e-05 3.598913e-05
#intercept -3.051892e+10 3.197261e+00 5.055314e-02 5.056658e-02
#regionEast -4.726244e+10 3.099717e+00 4.901083e-02 4.900839e-02
#sigma 1.182377e+11 2.493767e+09 3.942992e+07 3.798331e+07
```

As you can see above, the GDP variable has almost no effect, while the eastern region (**regionEast**) is a powerful predictor. This time we will exclude the **region** variable from the model.

```
#without region
model_without <-
lm(expenditure ~ civlib + gdp, data = df)
#the dataset for jags model
lt_without <- list(x = df[,c(3, 4)], y = df[,5],
n = nrow(df))
#estimation coefficients
lt_without$b_guess <- model_without$coefficients
# Model string
jags.script_without <-
"#--------------------------------------------------
model{
# likelihood
for( i in 1:n) {
y[i] ~ dnorm(mu[i], tau)
mu[i] <- intercept + civlib*x[i,1] + gdp*x[i,2]
}
# priors
intercept ~ dnorm(b_guess[1], 0.1)
civlib ~ dnorm(b_guess[2], 0.1)
gdp ~ dnorm(b_guess[3], 0.1)
tau ~ dgamma(.01,.01)
# transform
sigma <- 1 / sqrt(tau)
}#--------------------------------------------------
"
#compiling
mod_without <- jags.model(textConnection(jags.script_without),
data = lt_without,
n.chains = 4, n.adapt = 2000)
#burn-in process
update(mod_without, 1000)
#posterior sampling
mod.samples_without <- coda.samples(
model = mod_without,
variable.names = c("intercept", "civlib", "gdp", "sigma"),
n.iter = 1000)
# descriptive statistics of posteriors
sm_without <- summary(mod.samples_without)
#The plot of the effects of the terms on the change of expenditure
bind_rows(
sm_with$statistics %>%
as_tibble(rownames = "terms") %>%
rename("mean"= Mean,
"std.error" = `Time-series SE`) %>%
select(c(1,2,5)) %>%
mutate(regions = "include"),
sm_without$statistics %>%
as_tibble(rownames = "terms") %>%
rename("mean"= Mean,
"std.error" = `Time-series SE`) %>%
select(c(1,2,5)) %>%
mutate(regions = "not include")
) %>%
filter(terms != "intercept" , terms != "sigma") %>%
ggplot(aes(mean, terms, color = regions)) +
geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
geom_errorbar(size = 2, alpha = 0.7,
aes(xmin = mean - 1.96 * std.error,
xmax = mean + 1.96 * std.error)) +
geom_point(size = 3) +
scale_x_continuous(labels = scales::dollar,
expand = expansion(mult = c(.1, .1))) +
scale_color_brewer(palette = "Dark2") +
labs(x="", y = "", color = "Region in Model?")+
theme_minimal()+
theme(legend.position="right",
panel.background = element_rect(fill = "#e5f5f9", color = NA))
```

According to the plot above, being in the East decreases military expenditure, while civil liberties increase the military budget, which was unexpected for me. This anomaly may be explained by the United States' extreme values, which we saw in the first plot. Also, the credible intervals appear very narrow because of the high iteration count of the MCMC simulation.
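Those narrow intervals are worth sanity-checking with a convergence diagnostic; coda provides gelman.diag() for this. The potential scale reduction factor (R-hat) it reports can be sketched by hand on toy chains (stand-ins for the posterior samples, so the snippet stays self-contained):

```
#Hand-rolled R-hat on two toy "chains"; values near 1 mean the chains agree
set.seed(123)
chains <- replicate(2, rnorm(1000), simplify = FALSE)
n <- length(chains[[1]])
W <- mean(sapply(chains, var)) #within-chain variance
B <- n * var(sapply(chains, mean)) #between-chain variance
rhat <- sqrt(((n - 1) / n * W + B / n) / W)
rhat
```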


Because of the nuclear danger we just mentioned, I wondered about the number of nuclear weapons countries hold nowadays, and especially the change since the end of the Cold War.

The dataset we are going to work on is from Our World in Data. For ease of writing, we will shorten some of the country names.

```
library(tidyverse)
library(fable)
library(sysfonts)
library(showtext)
library(plotly)
library(readxl)
library(rlang)
library(glue)
df <-
read_excel("nuclear.xlsx") %>%
mutate(country=case_when(
country == "United Kingdom" ~ "UK",
country == "United States" ~ "USA",
country == "North Korea" ~ "PRK",
country == country ~ country))
```

We will subset the dataset to the dates we want to examine: 1991, the end of the Cold War, and the present day.

```
#Tidying the dataset for the plot
df_tidy <-
df %>%
filter(year==1991 | year==2022) %>%
pivot_wider(names_from = year, values_from = stockpile) %>%
mutate(
color= case_when (
`1991` - `2022` > 0 ~ "darkgrey",
`1991` - `2022` < 0 ~ "red"
)) %>%
na.omit() %>%
pivot_longer(c(`1991`, `2022`),
names_to = "year",
values_to = "stockpile") %>%
mutate(stockpile_pct=percent_rank(stockpile))
```

Now we can draw the plot showing how countries' numbers of nuclear warheads changed between those years. We will use *ggplotly* for dynamic plotting, so when we hover the cursor over the plot and country names, we can see the corresponding numbers. Rising stockpiles are indicated in red, declining ones in dark grey.

```
#Comparing the change of the numbers of nuclear warheads
#load fonts(google)
font_add_google("Roboto Slab")
showtext_auto()
p <-
df_tidy %>%
ggplot(aes(x = year,
y = stockpile_pct,
group = country,
color = color,
fill = color,
text=stockpile)) +
geom_line(size= 1.5)+
geom_point(size = 5,stroke = 1) +
geom_text(
data= filter(df_tidy,year == "2022"),
aes(label = country),
hjust = c(0.1),
nudge_x = c(0.18),
family = "Roboto Slab",
size = 6
) +
coord_cartesian(clip = "off")+
scale_y_continuous(limits = c(0, 1),
breaks = scales::pretty_breaks(n = 5))+
scale_color_identity()+
scale_fill_identity()+
scale_size_identity()+
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(family = "Roboto Slab", size=20),
axis.title = element_blank(),
panel.background = element_rect(fill = "lightblue",color = NA)
)
#setting font family for ggplotly
font <- list(family= "Roboto Slab")
ggplotly(p, tooltip = "text") %>%
hide_legend() %>%
layout(font=font)
```

It's interesting that Pakistan and India have entered the arms race since 1991, starting from zero. We can also clearly see that while Russia and the USA have decreased their numbers of warheads, China has increased its stockpile.

Another important issue is how the stockpiles of each nuclear power will change over the next decade. Hence, we will write a function that forecasts the next ten years and plots the results for each country.

The algorithm we will use is neural network autoregression (NNAR), implemented via the *NNETAR* function from the fable package. We again use the *ggplotly* function as before.

```
#The function that predicts 10-years ahead for each country.
fun_nuc <-
function(...){
#A function that automatically quotes the arguments for easy writing.
#(source: https://adv-r.hadley.nz/quasiquotation.html#quasi-motivation)
cement <- function(...) {
args <- ensyms(...)
paste(purrr::map(args, as_string), collapse = " ")
}
var_loc <- cement(...) #country value in quotation
#Modeling
set.seed(1234)
fit <-
df %>%
filter(country==var_loc) %>%
as_tsibble(index = "year") %>%
select(year, stockpile) %>%
model(NNETAR(stockpile ~ year))
#Forecasting with NNAR
pred <-
fit %>%
forecast(h=10)
#Plot variable
p_pred <-
df[df$country==var_loc,] %>%
ggplot(aes(x=year,y=stockpile))+
geom_line(size=1)+
geom_point(data=pred,
mapping = aes(y=round(.mean)),
color=df_tidy[["color"]][df_tidy$country == var_loc][1])+
scale_x_continuous(breaks = scales::pretty_breaks(n=7))+
scale_y_continuous(breaks = scales::pretty_breaks())+
xlab("")+
ylab("")+
labs(title = var_loc)+
theme_minimal()+
theme(panel.grid.minor = element_blank(),
plot.title = element_text(size=15,
colour = df_tidy[["color"]][df_tidy$country== var_loc][1],
hjust = 0.5))
#Interactive plotting
ggplotly(p_pred) %>%
style(
text=glue("stockpile: {round(pred$.mean)}\nyear: {pred$year}"),
traces = 2 #for the predicted time series
) %>%
layout(font = font)
}
```

```
fun_nuc(China)
```

```
fun_nuc(USA)
```

```
fun_nuc(Russia)
```

Thanks to the interactive plotting, hovering over the lines and points shows the corresponding year and stockpile numbers. It looks like the nuclear danger will move to the Asia-Pacific once the crisis in Eastern Europe is over; if we survive it.

One of the first ensemble algorithms we're going to use is **bagging** (bootstrap aggregating). Bagging produces many training datasets by bootstrapping the original training data. These datasets are used to train a set of models with a single learning algorithm, usually decision trees (as in random forests), because trees are **unstable learners**: small changes in the input can cause significantly different predictions.

The other ensemble algorithm we will use is **boosting**. Unlike bagging, this method improves performance by sequentially adding models that complement one another. In addition, boosting weights the learners by their past performance, so a model that performs better has greater influence over the ensemble.
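The bagging idea can be sketched by hand in a few lines of base R: resample, refit, aggregate. lm on the built-in mtcars data is used here only for brevity; unstable learners such as trees are where bagging actually pays off:

```
#Hand-rolled bagging on mtcars: bootstrap, fit, aggregate by averaging
set.seed(42)
B <- 50
fits <- lapply(1:B, function(b) {
  idx <- sample(nrow(mtcars), replace = TRUE) #bootstrap sample
  lm(mpg ~ wt, data = mtcars[idx, ])
})
preds <- sapply(fits, predict, newdata = data.frame(wt = 3))
bagged <- mean(preds) #the ensemble prediction
```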

After these explanations, we can build the dataset for the models. It consists of gasoline prices in Turkish lira, Brent spot prices in US dollars, and the USD/TRY exchange rate between 2019 and 2022.

We will do some wrangling to adjust the dataset for our purpose. We create a three-level factor for the changes in gasoline prices, which will be our target variable. We also add timestamps to construct regular time series data, using the *tsibble* package.

```
library(tidyverse)
library(tsibble)
library(readxl)
df <- read_excel("gasoline.xlsx")
#Building the dataset
df_tidy <-
df %>%
mutate(
gasoline = case_when(
gasoline - lag(gasoline) < 0 ~ "down",
gasoline - lag(gasoline) > 0 ~ "up",
TRUE ~ "steady"
) %>% as.factor(),
xe_lag = lag(xe),
brent_lag = lag(brent),
date=as.Date(date)
) %>%
as_tsibble() %>%
fill_gaps() %>% #makes regular time series by filling the time gaps
#fills in NAs with previous values
fill(-date,.direction = "down") %>%
na.omit() %>%
as.data.frame()
```

Before modeling, we will look at the ratio of the factor levels in the target variable, which tells us the no information rate (NIR) a model must beat for its accuracy to be meaningful. We will draw this with the *treemap* function.

```
#Treemap of factors
library(treemap)
df_tidy %>%
count(gasoline) %>%
mutate(label=paste(gasoline,scales::percent(n/sum(n)),sep = "\n")) %>%
treemap(
index="label",
vSize="n",
type="index",
title="",
palette="Accent",
border.col=c("black"),
border.lwds=1,
fontcolor.labels="white",
fontface.labels=1,
inflate.labels=F
)
```

As the diagram above shows, any model we create must beat an accuracy rate of 65.7% to be significant. The **Kappa** statistic is an effective accuracy measure with respect to this no information rate (65.7%).

- Pr(a) is the **sum** of the true positive and true negative rates (the observed agreement).
- Pr(e) is the expected agreement: the agreement that would occur by chance, given the class frequencies.
- kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))
- We will use the **unweighted kappa** as the measurement value, because the levels of the target factor are nominal and there are no **varying degrees of agreement** among them.
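These definitions can be verified by hand on a made-up confusion matrix (the numbers below are purely illustrative):

```
#Cohen's kappa from a 3x3 confusion matrix, following Pr(a) and Pr(e) above
cm <- matrix(c(50, 5, 3,
               4, 30, 6,
               2, 5, 20), nrow = 3, byrow = TRUE)
pr_a <- sum(diag(cm)) / sum(cm) #observed agreement
pr_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2 #chance agreement
kappa <- (pr_a - pr_e) / (1 - pr_e)
nir <- max(colSums(cm)) / sum(cm) #no information rate: majority class share
```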

Now, we can build our models and compare the kappa results for different seed values on a *plotly* plot.

```
library(adabag)#Boosting
library(ipred)#Bagging
library(caret)#Bagging control object
library(vcd)#Kappa
library(plotly)#interactive plot
#Bagging
ctrl <- trainControl(method = "cv", number = 10)
kappa_bagg <-
lapply(
1:20,
function(x){
set.seed(x)
train(gasoline ~ .,
data = df_tidy,
method = "treebag",
trControl = ctrl)$results[["Kappa"]]}
) %>%
unlist()
#Boosting
kappa_boost <-
lapply(
1:20,
function(x){
set.seed(x)
boosting.cv(gasoline ~ ., data = df_tidy) %>%
.$confusion %>%
Kappa() %>%
.$Unweighted %>%
.[[1]]}
) %>%
unlist()
#Kappa simulation on a plotly chart
kappa_plot <-
data.frame(
seed=rep(1:20, 2),
kappa= c(kappa_bagg, kappa_boost),
ensembles=c(rep("Bagging", 20), rep("Boosting", 20))
) %>%
ggplot(aes(x=seed, y=kappa, color=ensembles))+
geom_point(size=3)+
geom_line(size=1)+
theme_minimal()+
theme(panel.background = element_rect(fill = "midnightblue", color = NA),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
ggplotly(kappa_plot) %>%
layout(legend=list(orientation="v", y=0.5))
```

As can be seen above, the boosting algorithm seems slightly better than bagging, but both algorithms achieve very high accuracy results.

**References**

- Lantz, Brett (2015). "Evaluating Model Performance." *Machine Learning with R*, 323-326.
- Lantz, Brett (2015). "Improving Model Performance." *Machine Learning with R*, 362-369.

Market basket analysis works with **items**, written in brackets, which form an **itemset** for each **transaction** (e.g. {bread, butter, jam}). Association rules express relationships between itemsets, for example {butter, jam} → {bread}.

This association rule indicates that if butter and jam are purchased together, then bread is likely to be purchased too. Since association rule mining is unsupervised, the data does not need to be labeled. Rules are created from subsets of itemsets: the left-hand side (**LHS**) denotes the condition that needs to be met, and the right-hand side (**RHS**) indicates the expected outcome of meeting that condition.

To mine association rules from massive transactional data, we will use the **apriori algorithm**. It rests on the principle that an itemset can only be frequent if all of its components (items) are frequent separately. Hence, any itemset that contains a rare item is excluded from the search.

The apriori algorithm uses threshold measurements, called support and confidence, to reduce the number of rules. The **support** of an itemset or rule measures how frequently it appears in the data:

support(X) = count(X) / N

- **N** is the number of transactions
- **count(X)** is the number of transactions which contain that itemset

The **confidence** measures the strength of a rule; the closer to 1, the stronger the relationship. The support of the itemset containing both X and Y is divided by the support of the itemset containing only X, as shown below:

confidence(X → Y) = support(X ∪ Y) / support(X)

The apriori algorithm uses the minimum support threshold to determine all potential itemsets and builds rules that meet the minimum confidence level. In addition, we will use a different metric to rank the importance of the association rules, one that indicates true connections between items.

The metric we’ve just mentioned is the **lift**, which measures how much more likely the items of an itemset are to be found together than would be expected if they were purchased independently:

lift(X → Y) = confidence(X → Y) / support(Y)
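Before moving to real data, the three measures can be computed by hand on a handful of hypothetical transactions (the item names and counts below are purely illustrative):

```r
# Toy transactions to illustrate support, confidence, and lift
transactions <- list(
  c("bread", "butter", "jam"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "jam"),
  c("bread", "butter", "jam", "milk")
)

# support: share of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

# the rule {butter, jam} => {bread}
conf <- support(c("butter", "jam", "bread")) / support(c("butter", "jam"))
lift <- conf / support("bread")

support(c("butter", "jam"))  # 0.6  (3 of 5 transactions)
conf                         # 2/3  (2 of those 3 also contain bread)
lift                         # (2/3) / 0.8, slightly below 1
```

A lift below 1, as here, would mean the items co-occur less often than their individual frequencies suggest; the rules we mine later will be ranked by exactly this quantity.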

To put all the concepts mentioned so far into practice, we will examine the grocery data from the arules package. Let’s take a look at the first three transactions of the market basket data. The data is stored in **sparse matrix** format, as seen below; this format stores only the items that are present and omits the empty cells to save memory.

```
library(arules)
library(tidyverse)
library(scales)
library(glue)
library(ggiraph)
library(ggiraphExtra)
data("Groceries")
Groceries
#transactions in sparse format with
# 9835 transactions (rows) and
# 169 items (columns)
Groceries %>%
.[1:3] %>%
inspect()
# items
#[1] {citrus fruit,
# semi-finished bread,
# margarine,
# ready soups}
#[2] {tropical fruit,
# yogurt,
# coffee}
#[3] {whole milk}
```

Before mining the rules, we will look at the support (frequency) rates of the items. To keep the chart readable, we will keep only the ones greater than 0.1.

```
#frequency/support plot
Groceries %>%
itemFrequency() %>%
as_tibble(rownames = "items") %>%
rename("support"="value") %>%
filter(support >= 0.1) %>%
arrange(-support) %>%
ggDonut(aes(donuts=items,count=support),
explode = c(2,4,6,8),
labelposition=0)
```

**Whole milk** appears to be the most frequent item across all transactions. Let’s see whether this dominance is reflected in the association rules.

```
#Mining rules
groceryrules <-
Groceries %>%
apriori(parameter = list(
support =0.006,
minlen = 2,
confidence=0.25))
#Rules plot
gg_rules <-
groceryrules %>%
arules::sort(by="lift") %>%
DATAFRAME() %>%
as_tibble() %>%
head(10) %>%
mutate(ruleName = paste(LHS,"=>",RHS) %>% fct_reorder(lift),
support = support %>% percent(),
lift = lift %>% round(2)) %>%
select(ruleName, support, lift) %>%
ggplot(aes(x=ruleName,y=lift))+
geom_segment(aes(xend=ruleName, yend=0),
color="orange",
size=1) +
geom_point_interactive(aes(tooltip=glue("Support: {support}\nLift: {lift}"),
data_id=support),
size=3,
color="lightblue") +
coord_flip() +
theme_minimal() +
theme(
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.background = element_rect(fill = "#e0ddd7", color = NA),
plot.background = element_rect(fill = "#f5cecb", color = NA)
) +
xlab("") +
ylab("")
girafe(ggobj = gg_rules)
```

In the above code chunk, we have employed the apriori function to find the association rules, sorted them by lift, and kept the top 10. When we hover over the blue points, the pop-up (tooltip) displays the lift and support values. This interactivity is provided by the ggiraph package.

The top rule indicates that those who buy herbs are about four times more likely to also buy root vegetables than the overall rate of buying root vegetables alone. The support value in the tooltip indicates the frequency of the rule, not the frequencies of the individual itemsets.

**References**

- Lantz, Brett (2015). *Machine Learning with R*. Packt Publishing.

```
library(tidyverse)
library(tidymodels)
library(tsibble)
library(C50)
library(caret)
library(pROC)
data <- readr::read_csv("https://raw.githubusercontent.com/mesdi/trees/main/credit.csv")
```

The first algorithm we will talk about is the **C5.0 decision tree**. This method uses entropy to select the optimal feature to split on. High entropy means that there is little or no **homogeneity** within the set of class values. The algorithm aims to find splits that reduce the entropy. (**Lantz, 2015: 133**)

Entropy(S) = Σ −p_i·log2(p_i), summing over the class levels i = 1, …, c

- c refers to the number of class levels
- p_i refers to the proportion of values in class level i

The algorithm calculates the change in homogeneity from a split on a feature, F, as the difference between the entropy of the segment before the split (S1) and the weighted entropy of the segments after it (S2); this is called **information gain**. The higher the information gain, the better the homogeneity of the groups after the split. (**Lantz, 2015: 134**)
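As a minimal base-R sketch of both quantities (the class proportions below are hypothetical, chosen only to illustrate the arithmetic):

```r
# Entropy of a set, given the class proportions
entropy <- function(p) {
  p <- p[p > 0]       # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

# hypothetical parent node: 60% "no", 40% "yes"
e_parent <- entropy(c(0.6, 0.4))

# a candidate split producing two equally sized segments
e_left  <- entropy(c(0.9, 0.1))  # fairly pure segment
e_right <- entropy(c(0.3, 0.7))

info_gain <- e_parent - (0.5 * e_left + 0.5 * e_right)
info_gain > 0  # TRUE: the split increases homogeneity
```

A maximally mixed binary node has entropy 1 (`entropy(c(0.5, 0.5))`), while a pure node has entropy 0; C5.0 repeatedly picks the split that moves the segments toward the pure end.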

Now after some brief information, we can begin the fitting process.

```
#Building tibble dataset
df <-
data %>%
mutate_if(sapply(data,is_character),as_factor) %>%
mutate(default= default %>% factor(labels=c("no","yes")))
#Creating train and test set
set.seed(36894)
df_split <- initial_split(df, prop = 9/10, strata = "default")
df_train <- training(df_split)
df_test <- testing(df_split)
```

```
#C5.0 model
model_c50 <- C5.0(df_train[-17],df_train$default,trials = 10)
pred_c50 <- predict(model_c50, df_test)
confusionMatrix(pred_c50,df_test$default)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 65 12
# yes 5 18
# Accuracy : 0.83
# 95% CI : (0.7418, 0.8977)
# No Information Rate : 0.7
# P-Value [Acc > NIR] : 0.002163
```

The no-information rate (**NIR**) is the largest proportion of the class levels in our response variable. Looking at the results above, we can see this rate is 70%, and our accuracy is higher than the NIR at the 5% significance level (**P-Value: 0.002163 < 0.05**). Nevertheless, the type 2 error rate is quite troubling (**12/30**).
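The NIR itself is easy to reproduce by hand; a hypothetical response vector with the same 70/30 class balance as our test set gives the 0.7 that confusionMatrix() reports:

```r
# NIR = proportion of the majority class in the response
response <- factor(rep(c("no", "yes"), times = c(70, 30)))
nir <- max(prop.table(table(response)))
nir  # 0.7
```

Any classifier that simply predicted "no" for everyone would reach this accuracy, which is why beating the NIR significantly matters more than the raw accuracy number.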

The second tree algorithm we are going to use is decision_tree() from parsnip.

```
#Decision trees model
model_dt <-
decision_tree() %>%
set_mode("classification") %>%
set_engine("rpart") %>%
fit(default ~ ., data=df_train)
pred_dt <- predict(model_dt, df_test)
confusionMatrix(pred_dt$.pred_class, df_test$default)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 65 17
# yes 5 13
# Accuracy : 0.78
# 95% CI : (0.6861, 0.8567)
# No Information Rate : 0.7
# P-Value [Acc > NIR] : 0.04787
```

It looks like the accuracy and type 2 error rate are much worse than those of the C5.0 model. The final algorithm is boosted trees.

```
#Boosted trees model
model_bt <-
boost_tree(trees = 100) %>%
set_mode("classification") %>%
set_engine("C5.0") %>%
fit(default ~ ., data=df_train)
pred_bt <- predict(model_bt, df_test)
confusionMatrix(pred_bt$.pred_class, df_test$default)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 63 14
# yes 7 16
# Accuracy : 0.79
# 95% CI : (0.6971, 0.8651)
# No Information Rate : 0.7
# P-Value [Acc > NIR] : 0.02883
```

We see a slight improvement over the plain decision tree, but not enough to match the first model.

Finally, we will be comparing ROC curves with the help of the ggroc function from the pROC package.

```
#Multiple ROC curve
roc_df <- tibble(
c50=pred_c50 %>% ordered(),
dt=pred_dt$.pred_class %>% ordered(),
bt=pred_bt$.pred_class %>% ordered(),
response=df_test$default
)
roc.list <- roc(response ~ dt + bt + c50, data = roc_df)
roc_values <-
lapply(roc.list,"[[","auc") %>%
as.numeric() %>%
round(2)
ggroc(roc.list) +
geom_line(size=2)+
geom_abline(slope=1,
intercept = 1,
linetype = "dashed",
alpha=0.5,
size=2,
color = "grey") +
scale_color_brewer(palette = "Set2",
labels=c(
paste0("DT: ", roc_values[1]),
paste0("BT: ", roc_values[2]),
paste0("C50: ",roc_values[3]))) +
annotate("text",
x=0.5,
y=0.5,
label="no predictive value",
angle=45)+
guides(color= guide_legend(title = "Models AUC Scores")) +
theme_minimal() +
coord_equal()
```

The ROC curves and AUC results above confirm the accuracy results we found earlier: except for the C5.0 model, the other two models seem inefficient.

We will model the gas prices in terms of the exchange rate (**xe**), but this time not only with the current exchange rate: I also want to add lagged values of xe to the model, because changes in the xe rate might not affect prices immediately. Our model should be as below:

gasoline_t = β0 + β1·xe_t + β2·xe_(t−1) + … + β(k+1)·xe_(t−k) + η_t

where η_t denotes the ARIMA errors and k is the lag level, which can be determined by AICc.

The dataset we are going to use contains gas prices and xe rates from 2019 to the present. In the code block below, we build a tsibble data frame object and split it into train and test sets.

```
library(tidyverse)
library(tidymodels)
library(readxl)
library(tsibble)
library(fable)
library(feasts)
df <- read_excel("gasoline.xlsx")
#Building tsibble dataset
df_tsbl <- df %>%
mutate(date=as.Date(date)) %>%
as_tsibble() %>%
fill_gaps() %>% #makes regular time series by filling the time gaps
#fills in NAs with previous values
fill(xe,gasoline,.direction = "down")
#Separating into train and test set without random selection
#Takes the first 75% for training and the rest for the test set
df_split <- initial_time_split(df_tsbl, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
```

First of all, we will check the stationarity of the dependent and independent variables of the model, as always.

```
#Checking the stationary
df_train %>%
pivot_longer(gasoline:xe) %>%
ggplot(aes(x=date,y=value))+
geom_line()+
facet_grid(vars(name),scales = "free_y")+
xlab("")+ylab("")+
theme_minimal()
```

It seems better to difference both the gasoline and xe variables.

```
#Comparing the probable lagged models
model_lags_diff <- df_train %>%
#Excludes the first three days so all versions are in the same time period
mutate(gasoline=c(NA,NA,NA,gasoline[4:788])) %>%
model(
lag0 = ARIMA(gasoline ~ pdq(d = 1) + xe),
lag1 = ARIMA(gasoline ~ pdq(d = 1) +
xe + lag(xe)),
lag2 = ARIMA(gasoline ~ pdq(d = 1) +
xe + lag(xe) +
lag(xe, 2)),
lag3 = ARIMA(gasoline ~ pdq(d = 1) +
xe + lag(xe) +
lag(xe, 2) + lag(xe, 3))
)
glance(model_lags_diff) %>%
arrange(AICc) %>%
select(.model:AICc)
# A tibble: 4 x 5
# .model sigma2 log_lik AIC AICc
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 lag1 0.00310 1152. -2294. -2294.
#2 lag2 0.00311 1152. -2292. -2292.
#3 lag3 0.00311 1152. -2290. -2290.
#4 lag0 0.00313 1148. -2288. -2288.
```

Based on the AICc results, the one-lag model is the best choice.

```
#Creates the model variable to the above results
model_lag1 <- df_train %>%
model(ARIMA(gasoline ~ pdq(d = 1) + xe + lag(xe)))
```

```
#Checking the model residuals consistency with white noise
model_lag1 %>% gg_tsresiduals()
```

In the ACF plot, all the spikes seem to be within the threshold lines, so the residuals can be considered white noise and we can be comfortable with the model we chose.
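The visual check can also be backed up numerically. As a self-contained base-R sketch, the Ljung-Box test applied to simulated white noise shows what we expect from well-behaved residuals (fabletools' `ljung_box` feature wraps the same statistic for the fitted model):

```r
# Ljung-Box test on simulated white noise; for true white noise the
# p-value is typically well above 0.05, so we fail to reject independence
set.seed(36894)
wn <- rnorm(200)
p_val <- Box.test(wn, lag = 10, type = "Ljung-Box")$p.value
p_val
```

A small p-value here would instead indicate leftover autocorrelation, meaning the ARIMA error structure had not captured all the dynamics.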

```
report(model_lag1)
#Series: gasoline
#Model: LM w/ ARIMA(0,1,0)(2,0,0)[7] errors
#Coefficients:
# sar1 sar2 xe lag(xe)
# 0.0920 0.1065 -0.0436 0.1060
#s.e. 0.0356 0.0359 0.0361 0.0368
#sigma^2 estimated as 0.003103: log likelihood=1155.84
#AIC=-2301.69 AICc=-2301.61 BIC=-2278.35
```

The above results indicate that the model has the weekly seasonal **AR(1)** and **AR(2)** components. Now, we will predict with the test set by using the above regression equations to assess the accuracy of our model.

```
#Lagged Forecasts
model_pred <- forecast(model_lag1,df_test)
#Accuracy of the lagged model
cor(df_test$gasoline,model_pred$.mean) %>% round(2)
#[1] 0.77
```

The result seems fine, but we can try to improve it. To do that, we will use bootstrapping and bagging methods.

```
set.seed(456)#getting the optimal and similar results for reproducibility
#Bootstrapped forecasts
sim_fc <- model_lag1 %>%
fabletools::generate(new_data=df_test,
times=10,
bootstrap_block_size=14) %>%
select(-.model)
#Drawing bootstrapped forecasts
sim_fc %>%
autoplot(.sim)+
autolayer(df_train,gasoline)+
guides(color="none")+
xlab("")+ylab("")+
theme_minimal()
```

```
#Bagged forecasts
bagged_fc <- sim_fc %>%
summarise(bagged_mean=mean(.sim))
#Accuracy of bagging model
cor(df_test$gasoline,bagged_fc$bagged_mean) %>% round(2)
#[1] 0.85
```

This is quite a significant improvement in the accuracy of our model; the bootstrapping simulation clearly works in this process.

The dataset we have built for this article consists of the number of formal education **students** and the housing **capacity** provided by the Higher Education Loans and Dormitories Institution (**KYK**), which is a state organization.

We will determine whether there is a capacity shortage this year based on historical data. The model we are going to use is the **dynamic regression model with ARIMA errors**, because we will model the dormitories’ **capacity** in terms of the number of **students** using historical data from 1992 to 2020.

First of all, we will check the stationarity of the variables. If one of them is not stationary, all the variables have to be differenced; this modeling process is called a model in differences.

```
#Creating the tsibble variables and examining the stationary
library(tidyverse)
library(fable)
library(feasts)
library(tsibble)
library(readxl)
df <- read_excel("kyk.xlsx")
df_ts <- df %>%
as_tsibble(index = date)
train <- df_ts %>% filter_index(.~ "2020")
test <- df_ts %>% filter_index("2021"~.)
train %>%
pivot_longer(c(capacity, students),
names_to = "var", values_to = "value") %>%
ggplot(aes(x = date, y = value)) +
geom_line(size=1.5, color="blue") +
facet_grid(vars(var), scales = "free_y") +
scale_x_continuous(breaks = seq(1990,2020,5))+
scale_y_continuous(labels = scales::label_number_si()) +
labs(title = "The Number of Capacity and Students",y = "",x="") +
theme_minimal() +
theme(plot.title=element_text(face = "bold.italic",hjust = 0.5),
panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

When we examine the above plots, it seems that both variables need differencing. After we fit the model, we will check the residuals for stationarity by plotting the innovation residuals.

```
#Modeling
fit <- train %>%
model(ARIMA(capacity ~ students + pdq(d=1)))
#Checking residuals for stationary
fit %>%
gg_tsresiduals()
```

In the ACF graph, we can clearly see that the spikes are within the thresholds, which means the residuals are white noise. In the code block below, we confirm that the residuals have no remaining patterns by using the Ljung-Box test, whose **p-value** is greater than 0.05 at the 5% significance level.

```
#Ljung-Box independence test for white noise
augment(fit) %>%
features(.innov, ljung_box)
#.model lb_stat
# <chr> <dbl>
#1 ARIMA(capacity ~ students + pdq(d = 1)) 1.32
# lb_pvalue
# <dbl>
#1 0.251
```

We run the command below to examine the components of the model.

```
report(fit)
#Series: capacity
#Model: LM w/ ARIMA(0,1,1) errors
#Coefficients:
# ma1 students intercept
# 0.8672 -0.022 22964.621
#s.e. 0.1177 0.006 8369.892
```

Using the coefficients above, the differenced model can be written as capacity_t′ = −0.022·students_t′ + η_t′, where η_t′ = ε_t + 0.8672·ε_(t−1) is an **MA(1)** error. The interesting thing is that the **students** variable has a negative impact on the **capacity**.
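To get a feel for the MA(1) error term, here is an illustrative base-R simulation (the coefficient is the reported ma1 estimate; the innovations are made up). For an MA(1) process, the lag-1 autocorrelation should be close to θ/(1 + θ²) ≈ 0.495:

```r
# Simulating eta_t = e_t + 0.8672 * e_{t-1}
set.seed(123)
e   <- rnorm(5000)
eta <- e + 0.8672 * c(0, head(e, -1))

# sample lag-1 autocorrelation vs the theoretical value theta / (1 + theta^2)
r1    <- acf(eta, plot = FALSE)$acf[2]
theta <- 0.8672
c(sample = r1, theory = theta / (1 + theta^2))
```

Higher-order autocorrelations of an MA(1) series are essentially zero, which is why the ACF of the model's residuals cuts off so quickly.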

Before finding the expected value of the model, we will take a look at the relation between the number of capacity and students over time.

```
#Comparing students and capacity
ggplot(df_ts,aes(x=date))+
geom_bar(aes(y=students),stat = "identity",fill="skyblue")+
geom_line(aes(y=students),size=1.5,color="orange")+
geom_line(aes(y=(capacity/students)*10^7),size=1.5,color="red")+
scale_x_continuous(breaks = seq(1991,2021,5))+
scale_y_continuous(labels = scales::label_number_si(),
sec.axis = sec_axis(~./10^7,
labels = scales::percent_format(accuracy = 1),
name = "The rate of capacity per student"))+
xlab("")+ylab("The number of students")+
theme_minimal()+
theme(panel.background = element_rect(fill = "#f0f0f0", color = NA),
#Main y-axis
axis.title.y = element_text(color = "skyblue",
#setting positions of axis title
margin = margin(r=0.5,unit = "cm"),
hjust = 1),
#Second y-axis
axis.title.y.right = element_text(color = "red",
#setting positions of second axis title
margin = margin(l=0.5,unit = "cm"),
hjust = 0.01))
```

Before analyzing the graph above, we have to keep in mind that, according to common opinion, approximately half of higher education students study outside their hometown. The orange line shows the trend in the number of students.

It is understood that while the number of students increased, the capacity per student decreased until 2014. Although the capacity rate has increased since then, it is clear that the reason is the decrease in the number of students. Even so, the rate remains well below 50%.

Now that we’ve found our model, we can predict this year’s capacity value and see whether the government falls short of this expected value.

```
#Forecasting the current year
capacity_fc<- forecast(fit, new_data = test)
#Setting legends for separate columns for the plot
colors <- c("Actual"="red","Predicted"="blue")
#Plotting predicted and the actual value
ggplot(train,mapping=aes(x=date,y=capacity))+
geom_line(size=1.5)+
geom_point(capacity_fc,mapping=aes(x=date,y=.mean,color="Predicted"))+
geom_point(test,mapping=aes(x=date,y=capacity,color="Actual"))+
scale_x_continuous(breaks = seq(1991,2021,5))+
scale_y_continuous(breaks = seq(200000,750000,100000),
labels = scales::label_number_si())+
scale_color_manual(values = colors,name="Points")+
theme_minimal() +
theme(axis.title.y=element_text(vjust = 2),
panel.background = element_rect(fill = "#f0f0f0", color = NA))
```

According to the plot above, the capacity provided by the government falls short of the model’s expected value. Students seem right to complain about dormitory capacity, considering that half of them study outside their hometown.
