r – Training on entire dataset in AutoML function of h2o

I am using the h2o.automl function in R. My call is shown below:

h2o.automl(
  x = x_name,
  y = y_name,
  training_frame = as.h2o(train),
  leaderboard_frame = as.h2o(test),
  max_runtime_secs = 20*60,
  exclude_algos = c("XGBoost")
)

I’m confused about whether a final fit on the entire dataset happens after this function selects the leader model. As I understand it, cross-validation is applied to the training data to find the best models, and leaderboard_frame is used only for scoring, so the test subset is never used in any training step. Is that correct? And after finding the best model with the cross-validation folds, does h2o.automl fit a model on the entire dataset?

I ask because I would like to use this model operationally and train it on the entire dataset, so that the operational model does not lose any information. What if I don’t provide a leaderboard_frame? I know that performance on the cross-validation folds is reported in that case, but does h2o.automl then fit a final model to the entire dataset after selecting the best models and hyperparameters with the cross-validation folds?
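To make the second variant concrete, here is a rough sketch of the call I have in mind. It assumes that train and test are data frames with identical columns and that h2o::h2o.init() has already been run; the helper name fit_on_all_rows and the choice of nfolds = 5 are mine, for illustration only. Whether the leader returned by this call is actually refit on all rows is exactly what I am asking.

```r
# Sketch only: assumes the h2o package is installed, h2o::h2o.init() has been
# called, and `train` and `test` are data frames with the same columns.
# `fit_on_all_rows` is a hypothetical helper name, not part of h2o.
fit_on_all_rows <- function(train, test, x_name, y_name) {
  # Stack both subsets so that no rows are held out of the AutoML run.
  full <- h2o::as.h2o(rbind(train, test))

  # No leaderboard_frame is passed, so the leaderboard is ranked on
  # cross-validation metrics computed on `full`.
  h2o::h2o.automl(
    x = x_name,
    y = y_name,
    training_frame = full,
    nfolds = 5,                      # illustrative value
    max_runtime_secs = 20 * 60,
    exclude_algos = c("XGBoost")
  )
}
```

I would then call it as aml <- fit_on_all_rows(train, test, x_name, y_name) and take the leader model from the returned object.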

In other words, in a Kaggle competition, how can I make sure that h2o.automl uses the entire dataset when predicting unseen data? Incidentally, it is a time-series forecasting competition, and the time of year has a crucial effect on the target. The hosts have provided 10 years of hourly time-series data, and June is the month they want predicted. I would like my h2o.automl model to perform well in June. What do you suggest in this case?

One last question: for a July-specific model, would you train by filtering the July months out of the training dataset and finding the hyperparameters that perform best on those July months, or would you keep the July months in the training data? In either case, what would your train/test/validation and cross-validation subsets be? Since I would like to use h2o.automl, could you please frame your answer in terms of that function?
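For reference, this is one split I am considering, sketched in base R on synthetic timestamps. The date range, the column names, and the target_month value are placeholders rather than the real competition data. It holds out the target month of the most recent year for scoring and trains only on earlier rows.

```r
# Synthetic stand-in for the competition data: hourly timestamps over 10 years.
# The dates and column names are made up for illustration.
ts <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"),
          as.POSIXct("2019-12-31 23:00", tz = "UTC"),
          by = "hour")
df <- data.frame(time = ts)
df$year  <- as.integer(format(df$time, "%Y"))
df$month <- as.integer(format(df$time, "%m"))

target_month <- 6L  # placeholder: set this to the month the model must predict

# Hold out the target month of the most recent year for scoring.
last_year  <- max(df$year)
is_holdout <- df$month == target_month & df$year == last_year
holdout    <- df[is_holdout, ]

# Train only on rows strictly before the holdout period, so that no
# information from the future leaks into training.
train <- df[df$time < min(holdout$time), ]
```

If this split is reasonable, train would go to training_frame and holdout to leaderboard_frame. I am unsure whether the default random cross-validation folds are appropriate for hourly data or whether I should instead supply my own folds through the fold_column argument.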
