Cross-Validation and Validation Set Approach – September 22, 2023

I watched a video today about cross-validation and the bootstrap. I learned that we can estimate a model’s test error using the training error as a rough starting point. Typically, the test error is higher than the training error because the model faces unseen data during testing. To refine this estimate, we can apply methods like the Cp statistic, AIC, and BIC, which add a penalty based on model size to the training error so that it better reflects the test error.
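To make this concrete for myself, here is a minimal sketch (my own illustration, not from the video) of the Cp and BIC adjustments. The function names, and the assumption that we already have the training residual sum of squares (rss), the number of observations (n), the number of predictors (d), and an error-variance estimate (sigma2), are mine; different textbooks scale these formulas slightly differently.

```python
import numpy as np

def cp_statistic(rss, n, d, sigma2):
    # Cp adds a penalty of 2 * d * sigma2 to the training RSS,
    # so larger models must justify their extra predictors.
    return (rss + 2 * d * sigma2) / n

def bic(rss, n, d, sigma2):
    # BIC uses a log(n) * d * sigma2 penalty, which is heavier than Cp's
    # once n is larger than about 7, so it tends to favor smaller models.
    return (rss + np.log(n) * d * sigma2) / n

# For least-squares models, AIC is proportional to Cp,
# so it ranks candidate models in the same order.
```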

The video also introduced the Validation Set Approach. It involves splitting the data into two parts: the training set and the validation set. The model is trained on the training set, and then we use this trained model to predict outcomes for the validation set. The resulting validation set error gives us an estimate of the test error.
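As a quick sanity check for myself, here is a runnable sketch of the approach on a synthetic regression problem; the made-up data, the 50/50 split, and the linear model are illustrative choices of mine, not something taken from the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: 200 observations, 3 predictors, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Split the data into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# Fit the model on the training set only.
model = LinearRegression().fit(X_train, y_train)

# The error on the validation set serves as an estimate of the test error.
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation MSE: {val_mse:.3f}")
```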

However, there are some downsides to this approach. The validation set error can vary significantly depending on how the data happen to be split, so the estimate is not very stable. Additionally, since the model is trained on only a subset of the data, it tends to perform worse than a model trained on the full dataset, which means the validation error tends to overestimate the test error of the model we eventually fit to the entire dataset.
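To see the first downside in action, I repeated the split several times with different random seeds (again on made-up data of my own, the same setup as the sketch above) and looked at how much the validation MSE moves around from split to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Repeat the validation-set split with different seeds and record each estimate.
val_errors = []
for seed in range(10):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=seed)
    m = LinearRegression().fit(X_tr, y_tr)
    val_errors.append(mean_squared_error(y_va, m.predict(X_va)))

# The spread across seeds illustrates how unstable a single split can be.
print(f"min={min(val_errors):.3f}  max={max(val_errors):.3f}  "
      f"std={np.std(val_errors):.3f}")
```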

In summary, while the Validation Set Approach is a simple way to estimate test error, it has two main limitations: the estimate varies with the random split, and it tends to overestimate the test error because the model is trained on only part of the data. Care should be taken when interpreting its results, especially when the final model will be fit to the entire dataset.
