WARNING: IN THE INTERVIEW AT 25/11/2020, THE MINISTER OF HEALTH OF THE REPUBLIC OF TURKEY CONFESSED THAT THE PEOPLE WHOSE TESTS(COVID-19) WERE POSITIVE BUT DID NOT SHOW THE SYMPTOMS, HAD NOT BEEN INCLUDED IN THE DAILY NEW CASES NUMBERS UNTIL NOW; IN THIS CASE THE DATASOURCE WE USED FOR TURKEY IS MISLEADING. I APOLOGIZE FOR THIS SITUATION THAT IS CAUSED BY REASONS BEYOND MY CONTROL.

In my previous article, we analyzed the COVID-19 data of Turkey and selected the cubic model for predicting the spread of disease. In this article, we will show in detail why we selected the cubic model for prediction and see whether our decision was right or not.

When we analyze the regression trend models we should consider overfitting and underfitting situations; underfitting indicates high bias and low variance while overfitting indicates low bias and high variance.

Bias is the difference between the expected value of fitted values and observed values:

$E(\hat y) - y$

The variance of fitted values is the expected value of squared deviation from the mean of fitted values:

$E[(\hat y-E(\hat y))^2]$

The adjusted coefficient of determination is used in the different degrees of polynomial trend regression models comparing. In the below formula p denotes the number of explanatory terms and n denotes the number of observations.

$\bar R^2=1-\frac{SSE} {SST} (\frac {n-1} {n-p-1})$

$SSE$ is the residual sum of squares:

$\sum (y-\hat y)^2$

$SST$ is the total sum of squares:

$\sum (y-\bar y)^2$

When we examine the above formulas, we can notice the similarity between SSE and bias. We can easily say that if bias decreases, SSE will decrease and $\bar R^2$ will increase. So we will use $\bar R^2$ instead of bias to balance with variance and find the optimal degree of the polynomial regression.

The dataset and data frame(tur) we are going to use can be found from the previous article. First of all, we will create all the polynomial regression models which we’re going to compare.

a <- 1:15
models <- list()

for (i in seq_along(a)){

models[[i]] <-  assign(paste0('model_', i),lm(new_cases ~ poly(index, i), data=tur))

}


The variances of fitted values of all the degrees of polynomial regression models:

variance <- c()

for (i in seq_along(a)) {

variance[i] <- mean((models[[i]][["fitted.values"]]-mean(models[[i]][["fitted.values"]]))^2)

}


To create an adjusted R-squared object we first create a summary object of the trend regression models; because the adj.r.squared feature is calculated in summary function.

models_summ <- list()
for (i in seq_along(a)) {

models_summ[[i]] <- summary(models[[i]])

}


Before analyzing numeric results of variance and $\bar R^2$ we will show all the trend regression lines in separate plots for comparing models.

# Facetting models by degree for finding the best fit
library(ggplot2)

dat <- do.call(rbind, lapply(1:15, function(d) {
x <- tur\$index
preds <- predict(lm(new_cases ~ poly(x, d), data=tur), newdata=data.frame(x=x))
data.frame(cbind(y=preds, x=x, degree=d))
}))

ggplot(dat, aes(x,y)) +
geom_point(data=tur, aes(index, new_cases)) +
geom_line(color="steelblue", lwd=1.1)+
facet_wrap(~ degree,nrow = 3)


When we examine the above plots, we should pay attention to the curved of the tails; because it indicates overfitting which shows extreme sensitivity to the observed data points. In light of this approach, the second and third-degree of models appear to be more convenient to the data.

Let’s examine $\bar R^2$-variance tradeoff on the plot we created below.

library(gridExtra)

geom_line(size=2,colour="orange")+
scale_x_continuous(breaks = 1:15)+
theme_bw(base_line_size = 2)

geom_line(size=2,colour="steelblue")+

As we can see above, adjusted R-squared and variance have very similar trend lines. The more adjusted R-squared means the more complexity and low bias, but we have to take into account the variance; otherwise, we fall into the overfitting trap. So we need to look for low bias (or high $\bar R^2$) and low variance as much as possible for optimal selection.