1. Training Error:
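- Definition: The training error is the error of your model on the data it was trained on. It is measured by comparing the predicted values with the actual values in the training dataset, usually summarized by a single metric such as the mean squared error.
- Ideal Value: Ideally, the training error should be low, indicating that the model has learned well from the training data. A low training error alone is not enough, however, because a model can memorize its training data and still perform poorly on new data.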
2. Testing Error:
- Definition: The testing error is the error of your model on unseen data (i.e., data it wasn’t trained on). It is measured by comparing the predicted values with the actual values in the testing dataset, using the same metric as for the training error.
- Ideal Value: Ideally, the testing error should also be low. Just as important, the testing error should be close to the training error, indicating good generalization. A testing error much higher than the training error is a sign of overfitting.
3. Cross-Validation:
- Definition: Cross-validation is a technique for assessing how well the results of a model will generalize to an independent dataset. It involves splitting the data into subsets, training the model on some subsets (training data), and evaluating it on other subsets (validation data).
- Values: The values obtained from cross-validation estimate how well the model is likely to perform on unseen data. Cross-validation is often used to select model settings (for example, the degree of the polynomial) that minimize the average validation error, so that the test set is not used for tuning.
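As a rough illustration of the three ideas above, the following sketch runs 5-fold cross-validation for a degree-2 polynomial fit and records both the training and the validation MSE for each fold. It assumes NumPy is available. The data are synthetic and the helper name `k_fold_mse` is made up for this example, so this is not the code behind the results reported in this article.

```python
import numpy as np


def k_fold_mse(x, y, degree=2, k=5, seed=0):
    """Return per-fold (training MSEs, validation MSEs) for a polynomial fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))      # shuffle before splitting
    folds = np.array_split(indices, k)     # k roughly equal index groups
    train_mses, val_mses = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit only on the training folds, then score on both parts.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_pred = np.polyval(coeffs, x[train_idx])
        val_pred = np.polyval(coeffs, x[val_idx])
        train_mses.append(float(np.mean((y[train_idx] - train_pred) ** 2)))
        val_mses.append(float(np.mean((y[val_idx] - val_pred) ** 2)))
    return train_mses, val_mses


# Synthetic data (an assumption for illustration): a quadratic trend plus noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0.0, 10.0, size=100)
y = 1.0 + 2.0 * x - 0.1 * x**2 + data_rng.normal(0.0, 1.0, size=100)

train_mses, val_mses = k_fold_mse(x, y)
print("mean training MSE:  ", np.mean(train_mses))
print("mean validation MSE:", np.mean(val_mses))
```

If the two averages are close, the model generalizes about as well as it fits. A validation MSE far above the training MSE would point to overfitting.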
The results from the 5-fold cross-validation for the polynomial regression model are as follows:
- Average Mean Squared Error (MSE): 3.276
- Standard Deviation of MSE: 0.363
The Mean Squared Error (MSE) is a measure of the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual values. A lower MSE value indicates a better fit of the model to the data.
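The definition can be written out directly. The following is a minimal sketch using only the standard library; the function name and the three sample values are invented for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the errors are 0, -1 and 2, so the squares are 0, 1 and 4.
actual = [2.0, 4.0, 6.0]
predicted = [2.0, 5.0, 4.0]
mse = mean_squared_error(actual, predicted)
print(mse)  # (0 + 1 + 4) / 3
```

Because the errors are squared, one large miss raises the MSE more than several small ones.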
The standard deviation of the MSE gives us an idea of the variability in the MSE across the different folds in the cross-validation. A lower standard deviation suggests that the model’s performance is more consistent across different subsets of the data.
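The two summary numbers can be computed from the per-fold scores as sketched below. The fold values here are hypothetical placeholders, not the fold results behind the figures above, and the text does not say whether the population or the sample standard deviation was used, so both are shown.

```python
import statistics

# Hypothetical per-fold MSE values, chosen only to make the arithmetic easy.
fold_mses = [1.0, 2.0, 3.0, 4.0, 5.0]

mean_mse = statistics.mean(fold_mses)
population_std = statistics.pstdev(fold_mses)  # divides by k
sample_std = statistics.stdev(fold_mses)       # divides by k - 1

print("average MSE:", mean_mse)
print("population standard deviation:", population_std)
print("sample standard deviation:", sample_std)
```

With only five folds the two standard deviations differ noticeably, so it helps to state which one is being reported.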
These values provide a quantitative measure of how well the polynomial regression model is likely to perform on unseen data.
Polynomial Regression Model:
This section explores the relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC) using a polynomial regression model.
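The polynomial regression model is expressed by the equation:
$$\%\,\text{DIABETIC} = \beta_0 + \beta_1\,(\%\,\text{OBESE}) + \beta_2\,(\%\,\text{OBESE})^2$$
Here:
- The intercept ($\beta_0$) is approximately −58.13
- The coefficient for the linear term ($\beta_1$) is approximately 7.65
- The coefficient for the quadratic term ($\beta_2$) is approximately −0.22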
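To see what these coefficients imply, the fitted quadratic can be evaluated directly. This is a minimal sketch: the constant and function names are invented here, the inputs 15 and 17 are arbitrary illustrative values, and because the coefficients above are rounded the outputs are only approximate.

```python
# Rounded coefficients quoted in the text above.
INTERCEPT = -58.13
LINEAR = 7.65
QUADRATIC = -0.22


def predict_diabetic_pct(obese_pct):
    """Evaluate intercept + linear * x + quadratic * x**2 at x = obese_pct."""
    return INTERCEPT + LINEAR * obese_pct + QUADRATIC * obese_pct ** 2


for obese_pct in (15.0, 17.0):  # arbitrary example inputs
    print(obese_pct, round(predict_diabetic_pct(obese_pct), 2))
```

Predictions far outside the range of the data used for fitting should not be trusted, since a quadratic can take implausible values there.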
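The negative coefficient for the quadratic term means that the fitted curve is concave (it opens downward), so it has a single turning point, a maximum, where its slope is zero. Below that point the predicted percentage of diabetic individuals rises with the percentage of obese individuals; beyond it the fitted curve would start to decline.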
The plotted polynomial curve shows a non-linear relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC). Over the plotted range, as the percentage of obese individuals increases, the percentage of diabetic individuals also increases, but at a decreasing rate, which produces a plateauing effect at higher percentages of obesity.
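The decreasing rate can be checked numerically from the slope of the fitted quadratic, which is the linear coefficient plus twice the quadratic coefficient times the obesity percentage. The sketch below uses the rounded coefficients quoted earlier; the names and the sample inputs are assumptions made for illustration, and the turning point is sensitive to that rounding.

```python
# Rounded linear and quadratic coefficients quoted earlier in the text.
LINEAR = 7.65
QUADRATIC = -0.22


def slope(obese_pct):
    """Derivative of the fitted quadratic: change in % DIABETIC per point of % OBESE."""
    return LINEAR + 2.0 * QUADRATIC * obese_pct


# Arbitrary example inputs: the slope stays positive but keeps shrinking.
slopes = [slope(x) for x in (10.0, 13.0, 16.0)]
print([round(s, 2) for s in slopes])

# The slope reaches zero at the turning point of the curve.
turning_point = -LINEAR / (2.0 * QUADRATIC)
print(round(turning_point, 1))
```

With these rounded values the slope reaches zero at an obesity percentage of roughly 17, which is where the curve flattens out.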