In continuation of my previous post, I performed multiple linear regression using three datasets: %diabetes, %inactivity, and %obesity. From my analysis, I identified 354 county codes common to all three sheets. The regression yielded a standard error of 0.59, a multiple R of 0.58, and an R-squared of 0.34, meaning the model explains only about a third of the variation in the outcome. These metrics suggest that the model is not especially reliable or effective in explaining the observed outcomes.
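For reference, here is a minimal sketch of the kind of regression described above. It assumes the three sheets have been exported to CSV and that each carries a county-code column; the file names and the "FIPS", "%diabetes", "%inactivity", and "%obesity" column names are placeholders, not the actual dataset's labels.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical exports of the three sheets, each keyed by a county code.
diabetes = pd.read_csv("diabetes.csv")      # columns: FIPS, %diabetes
inactivity = pd.read_csv("inactivity.csv")  # columns: FIPS, %inactivity
obesity = pd.read_csv("obesity.csv")        # columns: FIPS, %obesity

# Keep only the county codes present in all three sheets.
merged = diabetes.merge(inactivity, on="FIPS").merge(obesity, on="FIPS")

# Regress %diabetes on %inactivity and %obesity with an intercept term.
X = sm.add_constant(merged[["%inactivity", "%obesity"]])
y = merged["%diabetes"]
model = sm.OLS(y, X).fit()

# The summary reports R-squared, the standard error, and the coefficients.
print(model.summary())
```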
Upon closer inspection of the data, I noticed an anomaly: certain values that are not true duplicates were being treated as duplicates in the regression model. To address this, I am considering implementing cross-validation. My plan is to label each of these "duplicate" data points uniquely, ensuring that the model recognizes them as distinct observations.
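One possible way to assign those unique labels is sketched below. The column names and the "dup_id" label are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical frame with the predictor columns used in the regression.
df = pd.DataFrame({
    "%inactivity": [15.2, 15.2, 18.4],
    "%obesity":    [30.1, 30.1, 34.7],
})

# Rows repeating the same (%inactivity, %obesity) pair get a running
# occurrence index, so identical-looking points stay distinguishable.
df["dup_id"] = df.groupby(["%inactivity", "%obesity"]).cumcount()
print(df)
```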
Later, to enhance the robustness of the model, I will employ k-fold cross-validation, dividing the dataset into five segments. In this approach, 4/5ths of the data serve as the training set while the model is tested against the remaining 1/5th, and the held-out fold rotates so that each segment is used for testing exactly once. By averaging the results from these five models, I aim to obtain a more accurate estimate of the test error.
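A minimal sketch of that 5-fold procedure follows, assuming the merged data has been saved with the same placeholder column names as above; "merged_counties.csv" is a hypothetical file name.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical merged frame with the placeholder column names used earlier.
merged = pd.read_csv("merged_counties.csv")
X = merged[["%inactivity", "%obesity"]]
y = merged["%diabetes"]

# 5-fold cross-validation: in each of the five rounds the model is fit on
# four folds and scored on the held-out fold; the five scores are averaged
# to estimate the test error.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("Estimated test MSE:", -scores.mean())
```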