Integrating Python Forecasting with R’s Tidyverse

In this article, we integrate a non-standard Python forecasting model into the R Tidyverse/Tidymodels framework, primarily using the reticulate package.

1. The Objective (The Challenge)

The goal was to take a powerful, custom Python model (nnetsauce's MTS wrapper around cybooster's BoosterRegressor) and integrate its outputs (predictions and prediction intervals) into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (plus interactive plotly).

2. The Integration Mechanism

The integration was achieved through the following steps:

  • reticulate Bridge: We used the reticulate package to seamlessly call the Python model, pass R data (tbl_train), fit the model (regr$fit()), and execute the forecast (regr$predict()).
  • Data Conversion: The raw forecast output (predictions and confidence intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
  • Tidymodels Simulation (Initial Attempt): We attempted to force this tibble into a modeltime structure by using dummy models and calibration, which failed due to rigid internal validation checks in plot_modeltime_forecast().
  • Tidyverse Final Solution: We abandoned the rigid modeltime visualization functions and adopted the core Tidyverse approach:
    1. Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the forecast data into the long format, which is optimal for ggplot2.
    2. Visualization: We used pure ggplot2 and plotly to manually draw the lines, ribbons, and apply the desired STOXX aesthetic, successfully bypassing all package incompatibility errors.

3. Key Findings

The process highlighted a critical statistical finding:

  • Metric vs. Visual Discrepancy: The Python model’s point forecasts were excellent (low MAPE), but its internal prediction interval calculation (type_pi = "gaussian") was severely flawed (high out-of-band error, i.e., far too many actual values fell outside the bands).
  • Solution: To ensure the visualization reflected the low MAPE score, we calculated the true MAPE on the test set and manually generated a realistic, MAPE-based prediction interval for the final plot (see the arithmetic sketched below).
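For reference, the adjusted band used later in the code is a simple symmetric interval around the point forecast ŷ, with z = 1.96 (the 95% normal quantile) and the test-set MAPE expressed as a decimal:

lower = ŷ × (1 − z × MAPE),   upper = ŷ × (1 + z × MAPE)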

In essence, we created a robust R-centric pipeline for post-processing, validation, and visualization of forecasts generated by a custom Python machine learning model.

library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)

# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe", required = TRUE) 


# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")
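Before going further, it is worth confirming that reticulate actually bound to the intended environment; if it silently falls back to another Python installation, the imports above fail with confusing errors. A quick optional sanity check using standard reticulate helpers:

# Confirm which Python binary reticulate bound to
reticulate::py_config()

# Verify that the required modules are importable in that environment
stopifnot(
  reticulate::py_module_available("numpy"),
  reticulate::py_module_available("nnetsauce"),
  reticulate::py_module_available("cybooster"),
  reticulate::py_module_available("sklearn")
)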


#iShares US Aerospace & Defense ETF (ITA)
df_ita <- 
  tq_get("ITA") %>% 
  select(date, close) %>% 
  filter(date >= last(date) - months(36))

#Split Data 
split <- time_series_split(df_ita,
                           assess = "15 days",
                           cumulative = TRUE)


tbl_train <- training(split)
tbl_test <- testing(split)
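
# Sanity-check the train/test boundary: the test set should cover
# roughly the last 15 calendar days of the series
range(tbl_train$date)
range(tbl_test$date)
nrow(tbl_test)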

# Import the sklearn.tree module (provides DecisionTreeRegressor)
skl_tree <- import("sklearn.tree")

# Define the Base Estimator (The core learning component, factored for clarity)
# This uses a Decision Tree with a max depth of 3 as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)

# Define the Booster Object (The main regression engine)
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
  base_estimator_instance, # The weak learner used for boosting
  n_estimators = 100L        # Number of boosting stages (number of trees to build)
)

# Define the MTS (Multi-variate Time Series) Model Wrapper
# ns$MTS wraps the booster model to handle time series features like lags and prediction intervals.
regr <- ns$MTS(
  obj = booster_regressor, # The underlying regression engine (BoosterRegressor)
  
  # Time Series Parameters:
  lags = 20L,              # Number of past observations (lags) used as predictors (X)
  type_pi = "gaussian",    # Method for computing Prediction Intervals (PIs);
                           # 'gaussian' assumes normally distributed errors.
                           # (NOTE: These intervals proved too narrow, requiring manual adjustment with MAPE.)
  
  # Execution Control:
  show_progress = TRUE     # Display the training progress (useful for monitoring)
)
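
As noted in the comments, the Gaussian intervals turned out to be too narrow here. nnetsauce's MTS also supports simulation-based interval methods; as a sketch only (assuming the "kde" option and the replications argument are available in your nnetsauce version, which you should verify against the package documentation), a kernel-density alternative would look like:

# Sketch: kernel-density-based prediction intervals instead of Gaussian ones
# (ASSUMPTION: type_pi = "kde" and `replications` exist in your nnetsauce version)
regr_kde <- ns$MTS(
  obj = booster_regressor,
  lags = 20L,
  type_pi = "kde",        # simulate future paths and take empirical quantiles
  replications = 250L,    # number of simulated paths
  show_progress = TRUE
)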


#Converting the tibble to data frame for the fitting process
df_train <- 
  tbl_train %>% 
  as.data.frame() %>% 
  mutate(date = as.character(date))

df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date


#Fit the model
regr$fit(df)

#Step 1: Extract Python Predictions and Create the Base R Data Table

# --- Define a Python Helper and Retrieve Predictions ---
# The helper below extracts the mean forecast and the lower/upper interval
# bounds from the object returned by regr$predict() as plain lists.
reticulate::py_run_string("
def get_full_data_11(preds_obj):
    mean_list = preds_obj[0]['close'].tolist() 
    lower_list = preds_obj[1]['close'].tolist()
    upper_list = preds_obj[2]['close'].tolist()
    return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")

preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)

preds_mean_vector  <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)

# Align data sets based on prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length) 

# --- Create the Base Forecast Table ---
forecast_data_tbl <- tibble(
  .index = df_test_final$date,
  .value = df_test_final$close,
  .prediction = preds_mean_vector,
  .conf_lo    = preds_lower_vector, # The initially incorrect (too narrow) intervals
  .conf_hi    = preds_upper_vector, # The initially incorrect (too narrow) intervals
  .model_desc = "nnetsauce_BoosterRegressor"
)

#Step 2: Calculate the Actual MAPE and Create Realistic Confidence Intervals
library(dplyr)
library(tibble)
library(tidyr) 

# --- Step 2.1: Calculate the Actual MAPE on the Test Set ---

# Create a verification table containing error metrics
check_tbl <- forecast_data_tbl %>%
  mutate(
    Error = .value - .prediction,
    # MAPE formula: abs(Actual - Prediction) / Actual * 100
    Absolute_Percentage_Error = abs(Error) / .value * 100 
  )

# Actual MAPE (Calculated from the Test Set)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100 
# (Converting the percentage to decimal format: e.g., 2.84% -> 0.0284)

print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))
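
Since tidymodels is already attached, the same number can be cross-checked with yardstick's vector interface (mape_vec() returns the value as a percentage rather than a decimal):

# Cross-check: MAPE via yardstick (returned as a percentage, e.g. 2.84)
yardstick::mape_vec(
  truth = forecast_data_tbl$.value,
  estimate = forecast_data_tbl$.prediction
)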


# --- Step 2.2: Create New, Realistic Intervals Based on MAPE ---

Z_SCORE_95 <- 1.96 # Z-Score for 95% Confidence Interval

# Create the New Realistic Prediction Interval Table
forecast_data_tbl_realistic <- check_tbl %>%
  mutate(
    # Error Margin = Prediction * MAPE * Z-Score
    Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
    
    # New Confidence Intervals
    .conf_lo_new = .prediction - Error_Amount, 
    .conf_hi_new = .prediction + Error_Amount
  ) %>%
  # Rename/replace columns to use the new, wider intervals
  mutate(
    .conf_lo = .conf_lo_new,
    .conf_hi = .conf_hi_new
  ) %>%
  select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))
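
Before plotting, a quick empirical coverage check confirms that the widened bands actually contain the test observations (the original Gaussian bands failed exactly this test):

# Share of test observations falling inside the MAPE-based interval;
# should be close to the nominal 95% if the band width is reasonable
forecast_data_tbl_realistic %>%
  summarise(coverage = mean(.value >= .conf_lo & .value <= .conf_hi))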

#Visualize Confidence Intervals

# Visualization Settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98 
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02 
mape_value_for_title <- round(actual_mape_test * 100, 2)
COLOR_PREDICTION <- "red"     # Red (Prediction line)
COLOR_ACTUAL <- "black"       # Black (Actual values)
COLOR_RIBBON <- "dimgrey"     # Grey (Prediction interval ribbon)

# Pivot the Base Table to Long Format (required for ggplot)
forecast_tbl_long_final <- forecast_data_tbl_realistic %>%
  pivot_longer(cols = c(.value, .prediction), 
               names_to = ".key", 
               values_to = "Y_Value") %>%
  mutate(
    .conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
    .conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
  )

# Create the ggplot Object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
  
  # Draw the Confidence Interval (Ribbon)
  geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
              fill = COLOR_RIBBON, 
              alpha = 0.6, 
              color = NA,
              data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
  
  # Draw the Lines
  geom_line(linewidth = 2.0) +
  
  # Y-Axis Zoom
  coord_cartesian(ylim = c(min_y, max_y)) + 
  
  # Colors and Titles
  labs(title = "iShares US Aerospace & Defense ETF",
       subtitle = paste0("<span style='color:dimgrey;'>Predictive Intervals</span> based on TEST SET MAPE: ", mape_value_for_title, "% <br><span style='color:red;'>nnetsauce-Wrapped Booster Regressor (MTS) Model</span>"),
       x = "", 
       y = "",
       color = "") + 
  
  scale_color_manual(values = c(".value" = COLOR_ACTUAL, 
                                ".prediction" = COLOR_PREDICTION),
                     labels = c(".value" = "Actual Value", 
                                ".prediction" = "Prediction")) +
  scale_y_continuous(labels = scales::label_currency()) +
  scale_x_date(labels = scales::label_date("%b %d"),
               date_breaks = "2 days") +
  theme_minimal(base_family = "Roboto Slab", base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = ggtext::element_markdown(face = "bold"),
        plot.background = element_rect(fill = "azure", color = "azure"),
        panel.background = element_rect(fill = "snow", color = "snow"),
        axis.text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1, 
                                   vjust = 1),
        legend.position = "none")


# Interactive Output
plotly::ggplotly(p)

NOTE: This article was generated with the support of an AI assistant. The final content and structure were reviewed and approved by the author.
