Back To Basics – What Are Training and Testing Errors and Cross-Validation? Along with a Simple Polynomial Regression Model Test

1. Training Error:

  • Definition: The training error is the error of your model on the data it was trained on. It’s the difference between the predicted values and the actual values in the training dataset.
  • Ideal Value: Ideally, the training error should be low, indicating that the model has learned well from the training data.

2. Testing Error:

  • Definition: The testing error is the error of your model on unseen data (i.e., data it wasn’t trained on). It’s the difference between the predicted values and the actual values in the testing dataset.
  • Ideal Value: Ideally, the testing error should also be low. However, a more important consideration is that the testing error should be close to the training error, indicating good generalisation.

3. Cross-Validation:

  • Definition: Cross-validation is a technique for assessing how well the results of a model will generalize to an independent dataset. It involves splitting the data into subsets, training the model on some subsets (training data), and evaluating it on other subsets (validation data).
  • Values: The values obtained from cross-validation provide insights into how well the model is likely to perform on unseen data. It’s often used to select model parameters that minimize the testing error.

The results from the 5-fold cross-validation for the polynomial regression model are as follows:

  • Average Mean Squared Error (MSE): 3.276
  • Standard Deviation of MSE: 0.363
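For reference, a minimal sketch of how numbers like these could be produced with scikit-learn is shown below. It assumes a pandas DataFrame containing the % OBESE and % DIABETIC columns used later in this post; the file name is a placeholder, not the actual dataset path.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical file name; the real data is the county-level diabetes dataset.
df = pd.read_csv("diabetes_counties.csv").dropna(subset=["% OBESE", "% DIABETIC"])
X = df[["% OBESE"]].values
y = df["% DIABETIC"].values

# Degree-2 polynomial regression wrapped in a pipeline.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# scikit-learn reports negated MSE, so flip the sign back.
mse_per_fold = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(f"Average MSE: {mse_per_fold.mean():.3f}")
print(f"Std of MSE:  {mse_per_fold.std():.3f}")
```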

The Mean Squared Error (MSE) is a measure of the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual values. A lower MSE value indicates a better fit of the model to the data.

The standard deviation of the MSE gives us an idea of the variability in the MSE across the different folds in the cross-validation. A lower standard deviation suggests that the model’s performance is more consistent across different subsets of the data.

These values provide a quantitative measure of how well the polynomial regression model is likely to perform on unseen data.

Polynomial Regression Model:

We explore the relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC) using a polynomial regression model.

The polynomial regression model is expressed by the equation:

y = β₀ + β₁x + β₂x²

where y is the percentage of diabetic individuals (% DIABETIC) and x is the percentage of obese individuals (% OBESE). Here:

  • The intercept (β₀) is approximately −58.13
  • The coefficient for the linear term (β₁) is approximately 7.65
  • The coefficient for the quadratic term (β₂) is approximately −0.22

The negative coefficient for the quadratic term suggests that there is a turning point in the relationship between the percentage of obese individuals and the percentage of diabetic individuals.

After plotting the graph, the polynomial curve suggests a non-linear relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC). As the percentage of obese individuals increases, the percentage of diabetic individuals also increases, but at a decreasing rate, indicating a plateauing effect at higher percentages of obesity.
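A minimal sketch of how such a quadratic fit and plot could be produced is below, assuming the same DataFrame df with % OBESE and % DIABETIC columns as in the earlier sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

x = df["% OBESE"].values
y = df["% DIABETIC"].values

# np.polyfit returns coefficients from the highest degree down: [β2, β1, β0].
beta2, beta1, beta0 = np.polyfit(x, y, deg=2)
print(f"intercept ≈ {beta0:.2f}, linear ≈ {beta1:.2f}, quadratic ≈ {beta2:.2f}")

# Plot the data and the fitted curve.
xs = np.linspace(x.min(), x.max(), 200)
plt.scatter(x, y, s=10, alpha=0.5, label="counties")
plt.plot(xs, beta0 + beta1 * xs + beta2 * xs**2, color="red", label="degree-2 fit")
plt.xlabel("% OBESE")
plt.ylabel("% DIABETIC")
plt.legend()
plt.show()
```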

K-Fold Cross-Validation on the Diabetes Dataset

In this post we perform polynomial regression with %diabetic as the response variable and %inactivity and %obesity as the predictor variables. The error is measured twice: with and without k-fold cross-validation.

  1. Model Training (Pre k-fold):
    • A polynomial regression model of degree 2 was trained on the entire dataset.
    • The Mean Squared Error (MSE) was calculated on the entire dataset to obtain a pre k-fold error estimate of 0.326.
  2. Cross-validation (Post k-fold):
    • K-fold cross-validation with k = 5 was applied to the entire dataset to obtain a more robust estimate of the model’s performance.
    • The average MSE across the 5 folds was calculated to obtain a post k-fold error estimate of 0.364 (a rough sketch of this comparison is shown below).
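As referenced in the list above, here is a rough sketch of the pre/post k-fold comparison. It assumes X (the inactivity and obesity predictors) and y (the %diabetic response) are already prepared as NumPy arrays; the exact preprocessing is not shown.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Pre k-fold: fit and score on the entire dataset (an optimistic estimate).
model.fit(X, y)
pre_kfold_mse = mean_squared_error(y, model.predict(X))

# Post k-fold: average the MSE over 5 held-out folds.
fold_mses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(f"pre k-fold MSE:  {pre_kfold_mse:.3f}")
print(f"post k-fold MSE: {np.mean(fold_mses):.3f}")
```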

Results and Discussion:

  • The pre k-fold error was lower compared to the post k-fold error (0.326 vs 0.364). This suggests that the model may have initially overfit to the specific structure of the entire dataset.
  • The post k-fold error, which is higher, indicates a more realistic estimate of the model’s performance across different subsets of the data. This suggests that the model’s performance has some variance across different subsets of the data.
  • K-fold cross-validation provides a more robust and conservative estimate of the model’s performance, which is crucial for understanding the model’s generalisation ability.

The increase in error from pre k-fold to post k-fold does not necessarily indicate that k-fold cross-validation is not good for this data. Instead, it provides a more realistic estimate of the model’s performance across different subsets of the data. Here are a few points to consider:

  1. Overfitting:
    • Without cross-validation, the model might be overfitting to the specific structure of the entire dataset. The lower error in the pre k-fold scenario might be because the model has learned the noise in the data rather than the underlying pattern.
  2. Robustness:
    • Cross-validation provides a more robust estimate of the model’s performance. It does this by training and evaluating the model on different subsets of the data, which gives a better understanding of how the model performs across a variety of data scenarios.
  3. Variance:
    • The higher error in the post k-fold scenario suggests that the model’s performance has some variance across different subsets of the data. This is valuable information as it indicates that the model’s performance is not consistent across the dataset.
  4. Generalisation:
    • Cross-validation helps to assess the model’s ability to generalize to unseen data, which is crucial for building models that perform well in real-world scenarios.
  5. Better Model Selection:
    • Cross-validation can be particularly useful in model selection, where different models or hyperparameters are being compared. It provides a more reliable basis for choosing the model that is likely to perform best on unseen data.

 

Conclusion: The analysis illustrates the importance of cross-validation in providing a more reliable estimate of model performance. Although the post k-fold error was higher, it offers a more realistic insight into how well the model may perform across various data scenarios, thus underscoring the value of cross-validation in model evaluation and selection.

Unveiling the Mystique: Prediction Error Estimation and Validation Techniques in Model Building

In the realm of data science, building a model is only half the battle won. The crux lies in understanding its performance and reliability when exposed to unseen data. This is where the concepts of estimating prediction error and validation techniques enter the fray.

Estimating Prediction Error and Validation Set Approach:

Prediction error refers to the discrepancy between the predicted values generated by a model and the actual values. It’s crucial to estimate this error to understand how well the model will perform on unseen data. One common approach to estimate prediction error is by splitting the available data into two parts: a training set and a validation set.

  • Training Set: This part of the data is used to train the model.
  • Validation Set: This part of the data is held back and not shown to the model during training. It is used to evaluate the model’s performance and estimate the prediction error.

The validation set approach provides an unbiased estimate of the model’s performance as it’s evaluated on data it hasn’t seen during training.
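A minimal sketch of the validation set approach is below, assuming a feature matrix X and target y are already prepared.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hold back 20% of the data as a validation set the model never sees in training.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation-set MSE: {val_mse:.3f}")
```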

K-fold Cross-Validation: The Validation Set Approach is a solid starting point but might not be data efficient, especially when the dataset is limited. This is where K-fold Cross-Validation (CV) steps in, providing a more reliable performance estimate. In K-fold CV, data is divided into K equal-sized “folds.” The model is trained K times, each time using K−1 folds for training and the remaining fold for validation. This mechanism ensures that each data point is used for validation exactly once. However, a meticulous approach is essential. The Right Way is to ensure that data preprocessing steps are applied separately within each fold to prevent data leakage, and parameter tuning should be based solely on the validation sets. On the flip side, The Wrong Way could involve preprocessing the entire dataset upfront or tuning parameters based on the test set, both of which can lead to misleading performance estimates.

Cross-Validation: The Right and Wrong Ways: Implementing cross-validation correctly is a cornerstone for an unbiased evaluation of the model’s performance. A correct implementation entails fitting all preprocessing steps on the training data and tuning parameters based on validation sets. In contrast, incorrect implementations might involve preprocessing the entire dataset before splitting or tuning parameters based on the test set, which can lead to optimistic, and often misleading, performance estimates.
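The sketch below contrasts the two approaches with scikit-learn; the scaler and ridge model are chosen purely for illustration, since the point is only where the preprocessing is fitted.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Right way: the scaler is fitted inside each training fold, so no statistics
# from the validation fold leak into preprocessing.
right = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
right_mse = -cross_val_score(right, X, y, cv=5, scoring="neg_mean_squared_error")

# Wrong way: scaling the full dataset up front lets every fold "see" summary
# statistics computed partly from its own validation data.
X_leaky = StandardScaler().fit_transform(X)
wrong_mse = -cross_val_score(Ridge(alpha=1.0), X_leaky, y, cv=5,
                             scoring="neg_mean_squared_error")

print(f"right way mean MSE: {right_mse.mean():.3f}")
print(f"wrong way mean MSE: {wrong_mse.mean():.3f}")
```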

I applied this on a sample crime-rate prediction dataset from the internet. On a single validation split the model gave an MSE of 45.0, which might suggest that the model performs somewhat poorly on unseen data. However, after applying K-fold Cross-Validation (K = 5), the average MSE across all folds is 38.2, which might suggest that the model’s performance is slightly better than initially thought. Moreover, the variation in MSE across the folds in K-fold CV can also provide insight into the consistency of the model’s performance across different subsets of the data.

In conclusion, the journey from model building to deployment is laden with critical assessments, where validation techniques play a pivotal role. By astutely employing techniques like the Validation Set Approach and K-fold CV, and adhering to the principles of correct cross-validation, data scientists can significantly enhance the reliability and efficacy of their models. Through these validation lenses, one can not only measure but also improve the model’s performance, steering it closer to the coveted goal of accurately interpreting unseen data.

 

 

Understanding Polynomial Regression and Step Function

What is Polynomial Regression?:

Polynomial regression is an extension of linear regression where we model the relationship between a dependent variable and one or more independent variables using a polynomial. Unlike linear regression, which models the relationship using a straight line, polynomial regression models the relationship using a curve.

The general form of a polynomial regression equation of degree n is:

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε

where:

  • y is the dependent variable,
  • x is the independent variable,
  • β₀, β₁, …, βₙ are the coefficients, and
  • ε is the error term.

Polynomial regression is used when the relationship between the variables is nonlinear and can be captured better with a curved line.

What is Step Function (Piecewise Constant Regression)?:

Step functions allow us to model the relationship between a dependent variable and one or more independent variables using step-like or piecewise constant functions. In this approach, the range of the independent variable(s) is divided into disjoint segments, and a separate constant value is fit to the dependent variable within each segment.

The general form of a step function model is:

y = β₀ + β₁I(x ∈ C₁) + β₂I(x ∈ C₂) + … + βₖI(x ∈ Cₖ) + ε

where:

  • y is the dependent variable,
  • x is the independent variable,
  • I(·) is an indicator function that equals 1 if the condition inside the parentheses is true and 0 otherwise,
  • C₁, C₂, …, Cₖ are the disjoint segments, and
  • ε is the error term.

Step functions are used when the relationship between the variables is better captured by different constants in different ranges of the independent variable.
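A minimal sketch of a piecewise constant fit with pandas is below, assuming the county DataFrame with % OBESE and % DIABETIC columns used earlier; the number of segments is arbitrary.

```python
import pandas as pd

# Cut the predictor's range into 4 disjoint segments and fit one constant
# (the segment mean of the response) within each segment.
segments = pd.cut(df["% OBESE"], bins=4)
segment_means = df.groupby(segments)["% DIABETIC"].mean()
print(segment_means)

# The step-function "prediction" for each county is its segment's mean.
df["step_prediction"] = df.groupby(segments)["% DIABETIC"].transform("mean")
```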

Both these plots provide different ways of understanding the relationships between the rates of obesity and inactivity with the rate of diabetes across different counties.

Polynomial Regression Analysis on the Diabetes Data:
– Our analysis suggested that a linear model (polynomial of degree 1) provided the lowest test error, indicating that the relationships between obesity/inactivity rates and diabetes rates are relatively linear within the examined range.
– However, the analysis didn’t provide a clear indication of a non-linear (polynomial) relationship as higher-degree polynomials did not yield a lower test error.

 Step Function (Piecewise Constant Regression) Analysis on Diabetes data:
– The plots represent piecewise constant models (step functions) that divide the range of obesity and inactivity rates into distinct segments. Within each segment, the rate of diabetes is approximated as a constant value.
– The breakpoints between segments are determined to minimize the within-segment variance in diabetes rates. This suggests that there could be specific thresholds in obesity and inactivity rates where the rate of diabetes changes.
– For example, in the first plot, as the obesity rate increases, we see a step-wise increase in the diabetes rate. Similarly, in the second plot, as the inactivity rate increases, the diabetes rate also increases in a step-wise manner.

Interpretation
– These analyses help to visualize and understand how obesity and physical inactivity may relate to diabetes rates across different counties.
– While the polynomial regression suggested a linear relationship, the step function analysis revealed that there might be specific ranges of obesity and inactivity rates that correspond to different levels of diabetes rates.
– The step function analysis provides a more segmented view which could be indicative of thresholds beyond which the rate of diabetes significantly increases.

Addressing High Variability, High Kurtosis, and Non-Normality in the Data

High variability (often indicated by a large variance or standard deviation) and high kurtosis (indicating heavy tails or outliers in the distribution) can pose challenges when performing hypothesis testing, as they might violate the assumptions of certain tests. Here are some strategies to deal with these challenges:

  1. Transformation:
    • If the data is positively skewed, consider applying a transformation like the square root, logarithm, or inverse to make the distribution more symmetric. For negatively skewed data, you might consider a squared transformation.
    • The Box-Cox transformation is a general family of power transformations that can stabilize variance and make the data more normally distributed.
  2. Use Non-Parametric Tests:
    • When the assumptions of parametric tests (like the t-test) are violated, consider using non-parametric tests. For comparing two groups, the Mann-Whitney U test can be used instead of an independent t-test. For related samples, the Wilcoxon signed-rank test can be used.
  3. Robust Statistical Methods:
    • Some statistical methods are designed to be “robust” against violations of assumptions. For example, instead of the standard t-test, you can use Yuen’s t-test, which is robust against non-normality and heteroscedasticity.
  4. Bootstrap Methods:
    • Bootstrap resampling involves repeatedly sampling from the observed dataset (with replacement) and recalculating the test statistic for each sample. This method can provide an empirical distribution of the test statistic under the null hypothesis, which can be used to compute p-values.
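To make the bootstrap point concrete, here is a small hypothetical sketch of a bootstrap test for a difference in means; group_a and group_b are assumed to be 1-D NumPy arrays of observations from the two groups being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
observed_diff = group_a.mean() - group_b.mean()

# Resample under the null hypothesis by pooling the groups and redrawing
# with replacement, then recomputing the difference in means each time.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
boot_diffs = np.empty(10_000)
for i in range(boot_diffs.size):
    sample = rng.choice(pooled, size=pooled.size, replace=True)
    boot_diffs[i] = sample[:n_a].mean() - sample[n_a:].mean()

# Two-sided p-value: how often a resampled difference is at least as extreme
# as the one actually observed.
p_value = np.mean(np.abs(boot_diffs) >= abs(observed_diff))
print(f"bootstrap p-value: {p_value:.4f}")
```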

The t-test is a parametric hypothesis test used to determine if there is a significant difference between the means of two groups. It’s one of the most commonly used hypothesis tests and comes with its own set of assumptions, which include:

  1. Independence of observations: The observations between and within groups are assumed to be independent of each other.
  2. Normality: The data for each of the two groups should be approximately normally distributed.
  3. Homogeneity of variances: The variances of the two groups should be approximately equal, though the t-test is somewhat robust to violations of this assumption, especially with equal sample sizes.

When dealing with high variability and high kurtosis:

  1. Impact on Normality Assumption: High kurtosis, especially leptokurtosis (kurtosis greater than 3), suggests that data might have heavy tails or sharp peaks, which is an indication of non-normality. Since the t-test assumes the data to be normally distributed, this can be a violation.
  2. Impact on Variance Assumption: High variability might also indicate potential issues with the assumption of homogeneity of variances, especially if the variability is significantly different between the two groups.

Given these challenges, here’s how the t-test fits into the picture:

  1. Welch’s t-test: If there’s a concern about the equality of variances, you can use Welch’s t-test, which is an adaptation of the student’s t-test and does not assume equal variances.
  2. Transformations: As mentioned, transformations (e.g., logarithmic, square root) can be used to stabilize variance and make the data more normally distributed, making it more suitable for a t-test.
  3. Non-parametric Alternatives: If the data is non-normal and transformations don’t help, consider using non-parametric tests like the Mann-Whitney U test instead of the t-test.
  4. Bootstrap Methods: For data with high variability and kurtosis, bootstrapping can be used to estimate the sampling distribution of the mean difference, and a t-statistic can be computed based on this empirical distribution.
  5. Effect Size: Regardless of the test used, always report the effect size (like Cohen’s d for t-test) as it provides a measure of the magnitude of the difference and is not as dependent on sample size as p-values.
  6. Diagnostic Checks: Before performing a t-test, always check its assumptions using diagnostic tools. For normality, use Q-Q plots or tests like Shapiro-Wilk. For homogeneity of variances, use Levene’s test.

In conclusion, while the t-test is a powerful tool for comparing means, its assumptions must be met for the results to be valid. High variability and kurtosis can challenge these assumptions, but with the right strategies and alternative methods, you can ensure robust and reliable results.

Exploring the Relationship Between Diabetes and Social Vulnerability: A Data-Driven Approach

In this post, we’ll delve into a dataset that represents the diagnosed diabetes percentage in various counties and its relationship with the Social Vulnerability Index (SVI). By understanding this relationship, we can gain insights into how social factors might influence health outcomes. We’ll be using various statistical and machine learning techniques to dissect this relationship.

Interactions and Nonlinearity:

  • We checked for interactions between ‘Diagnosed Diabetes Percentage’ and ‘Overall SVI’. This involves understanding if the effect of one variable changes based on the value of the other.
  • This means we want to see if the effect of ‘Diagnosed Diabetes Percentage’ on a dependent variable changes based on the value of ‘Overall SVI’, and vice versa.

Moving Beyond Linearity

  • We plotted ‘Diagnosed Diabetes Percentage’ against ‘Overall SVI’. From the plot, there seems to be a positive correlation between the two variables, but the relationship doesn’t appear to be strictly linear.

 

Polynomial Regression:

  • We introduced squared terms for both predictors to capture any quadratic relationships.
  • The mean squared error (MSE) for the polynomial regression model is 0.0. This indicates a perfect fit, which is unusual and could suggest overfitting.

Step Functions:

  • The ‘Diagnosed Diabetes Percentage’ was divided into intervals (bins), and a separate constant was fit for each interval.
  •  The step function model also resulted in an MSE of 0.0, reinforcing concerns about overfitting or data quality.

Our exploration suggests a non-linear relationship between ‘Diagnosed Diabetes Percentage’ and ‘Overall SVI’. However, the unusually perfect fits from our models warrant caution. In real-world scenarios, deeper data diagnostics, validation on separate datasets, and domain expertise are crucial to validate findings.

Data Analytics on Diabetes using Multiple Linear Regression

I performed some data analytics for insights and performed some hypothesis testing and presented the results.

From our state wise analysis, here are the states with the highest and lowest average percentages for each category:

  1. Diabetes:
    • Highest average percentage: South Carolina
    • Lowest average percentage: Colorado
  2. Obesity:
    • Highest average percentage: Washington
    • Lowest average percentage: Wyoming
  3. Inactivity:
    • Highest average percentage: Florida
    • Lowest average percentage: Colorado

The correlation coefficients between the percentages of diabetes, obesity, and inactivity at the county level are as follows:

  1. Correlation between Diabetes and Obesity:
    • This indicates a moderate positive correlation between the percentage of the population with diabetes and the percentage of the population that is obese.
  2. Correlation between Diabetes and Inactivity:
    • This indicates a strong positive correlation between the percentage of the population with diabetes and the percentage of the population that is inactive.
  3. Correlation between Obesity and Inactivity:
    • This indicates a moderate positive correlation between the percentage of the population that is obese and the percentage of the population that is inactive.

These correlations suggest that counties with higher percentages of inactivity and obesity tend to have higher percentages of diabetes. This aligns with existing knowledge about the risk factors for type 2 diabetes. However, as the correlations are below 0.7, we can say the relationships are positive but not strongly correlated.

Hypotheses:

  1. Diabetes and Obesity:
    • Null Hypothesis (Ho): There is no difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Alternative Hypothesis (Ha): There is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Distribution Choice: Two-sample t-test (assuming independent samples and approximately normal distribution)
  2. Diabetes and Inactivity:
    • Null Hypothesis (Ho): There is no difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (Ha): There is a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test
  3. Obesity and Inactivity:
    • Null Hypothesis (Ho): There is no difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (Ha): There is a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test

For each hypothesis, we’ll use a significance level (α) of 0.05. If the p-value is less than α, we will reject the null hypothesis.
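For the first hypothesis, a minimal sketch of such a test looks like the following; it assumes a county-level DataFrame with % OBESE and % DIABETIC columns, and the same pattern applies to the other two hypotheses.

```python
from scipy import stats

data = df[["% OBESE", "% DIABETIC"]].dropna()

# Split counties into above-average and below-average obesity groups.
above = data.loc[data["% OBESE"] > data["% OBESE"].mean(), "% DIABETIC"]
below = data.loc[data["% OBESE"] <= data["% OBESE"].mean(), "% DIABETIC"]

t_stat, p_value = stats.ttest_ind(above, below)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
print("Reject Ho" if p_value < 0.05 else "Fail to reject Ho")
```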

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates:

Test Statistics:

P-value:

Since the p-value is less than our significance level (α = 0.05), we reject the null hypothesis. This means there is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value: 2.38 × 10⁻¹⁵

Since the p-value is much less than our significance level (α = 0.05), we reject the null hypothesis. This suggests a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.

For the hypothesis test comparing the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value:

Given that the p-value is significantly less than our chosen significance level (α = 0.05), we reject the null hypothesis. This indicates a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.

Summary:

  • All three hypotheses tests resulted in rejecting the null hypothesis, suggesting significant associations between diabetes, obesity, and inactivity at the county level.
  • The tests were conducted using the two-sample t-test, which assumes that the samples are approximately normally distributed and independent.

 

Breusch-Pagan Test and Hypothesis Testing for Heteroscedasticity

Today I learnt about hypothesis testing and calculating the p-value, or more importantly, what a p-value actually is. The best and simplest explanation I came across is:

“The probability of an event happening if the null hypothesis were true.”

I watched a few videos on YouTube regarding hypothesis testing and the Breusch-Pagan test, and I implemented the Breusch-Pagan test with the null hypothesis (Ho) that the model is homoscedastic, then checked the p-values.

The Breusch-Pagan test is a diagnostic tool used to determine the presence of heteroscedasticity in a regression model. Heteroscedasticity refers to a situation where the variance of the residuals (or errors) from a regression model is not constant across all levels of the independent variables. The null hypothesis (Ho) for the Breusch-Pagan test is that the error variances are homoscedastic (constant across all levels of the independent variables), whereas the alternative hypothesis (Ha) posits that the error variances are heteroscedastic (not constant). The p-value associated with the test statistic provides the probability of observing the given data (or something more extreme) if the null hypothesis were true. A low p-value indicates evidence against the null hypothesis, suggesting heteroscedasticity.
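A minimal sketch of running this test with statsmodels is below; the column names are assumed to match those used elsewhere in these posts.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

data = df[["% INACTIVE", "% DIABETIC"]].dropna()
X = sm.add_constant(data["% INACTIVE"])
ols_fit = sm.OLS(data["% DIABETIC"], X).fit()

# het_breuschpagan returns the LM statistic and its p-value, then the
# F statistic and its p-value.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, ols_fit.model.exog)
print(f"LM statistic: {lm_stat:.3f}, LM p-value: {lm_pvalue:.3g}")
print(f"F statistic:  {f_stat:.3f}, F p-value:  {f_pvalue:.3g}")
```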

For our regression model with %Inactivity as the predictor and %Diabetic as the response variable, the Breusch-Pagan test was applied. The Lagrange Multiplier (LM) statistic was found to be significant with a p-value close to 0 (approximately 3.607 × 10⁻¹³), indicating strong evidence against the null hypothesis of homoscedasticity. This suggests the presence of heteroscedasticity in the model. The F-test associated with the Breusch-Pagan test yielded unusual results, with a p-value of 1, which typically would not suggest heteroscedasticity. However, given the strong evidence from the LM test, it’s recommended to consider the model as having heteroscedastic errors.

This may necessitate adjustments or alternative modelling techniques to ensure valid inference and predictions.

Regression analysis on Diabetes Dataset

The provided Excel file consists of %diabetic data for the year 2018 for the counties of the United States, along with %obesity and %inactivity.

According to my initial analysis, 88.45% of the obesity data and 56.39% of the inactivity data were missing relative to the diabetic data.

For %Inactive, the skewness is 1.96. This value indicates a right-skewed distribution, implying that there are more counties with lower inactivity percentages and a few counties with very high inactivity percentages. The kurtosis is 4.27; a value greater than 3 suggests that the distribution has heavier tails and a sharper peak than the normal distribution.

For %Obese, after removing missing data, the skewness is 0.45. This value suggests a slight right skew, but it’s closer to a symmetric distribution. The kurtosis is −0.86; a value less than 3 indicates that the distribution is platykurtic, meaning it has lighter tails and a less sharp peak compared to the normal distribution.
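These summary statistics can be reproduced roughly as sketched below (column names are assumed). Note that the two common kurtosis conventions differ by 3: excess kurtosis is 0 for a normal distribution, while Pearson kurtosis is 3; the sketch prints both.

```python
from scipy.stats import skew, kurtosis

inactive = df["% INACTIVE"].dropna()
obese = df["% OBESE"].dropna()

for name, series in [("Inactivity", inactive), ("Obesity", obese)]:
    print(name,
          "skew:", round(skew(series), 2),
          "excess kurtosis:", round(kurtosis(series), 2),                 # normal = 0
          "Pearson kurtosis:", round(kurtosis(series, fisher=False), 2))  # normal = 3
```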

Prior to performing the analysis and calculating the different values, I read about simple regression and watched a few videos on the assumptions of multiple linear regression and how they function.

As all three variables are numeric, I initially performed two different linear regression analyses on the dataset, using %diabetic as my dependent variable and %inactivity and %obesity as my independent variables for the two models. Despite the data loss, I removed the null values and continued the regression analysis for the remaining counties.

For the model with %inactive, the R² is 0.1951 with an F-statistic of 331.6 and a p-value < 0.001, indicating that the model is statistically significant but that only about 19.5% of the variance in diabetes prevalence is explained by inactivity levels in the dataset. For the model with %obese, the R² is 0.1484 with an F-statistic of 62.95 and a p-value < 0.001, indicating that the model is statistically significant but that only about 14.8% of the variance in diabetes prevalence is explained by obesity levels. Since the level of variability explained is not very good, I checked the assumptions of regression with this data.

Both models exhibit potential violations of the homoscedasticity and normality assumptions. The cone-shaped patterns in the residual plots suggest that transforming the data or considering a different type of regression model may be necessary. The deviations from normality, especially in the tails, also suggest potential outliers or the need for transformation.
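A sketch of the diagnostic plots behind these conclusions is below: residuals versus fitted values for homoscedasticity and a Q-Q plot for normality. It assumes an already-fitted statsmodels OLS result, such as the ols_fit object from the Breusch-Pagan sketch above.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fitted = ols_fit.fittedvalues
residuals = ols_fit.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: a cone/funnel shape points to heteroscedasticity.
axes[0].scatter(fitted, residuals, s=10, alpha=0.5)
axes[0].axhline(0, color="red", linewidth=1)
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs fitted")

# Q-Q plot: deviations in the tails point to non-normal residuals.
sm.qqplot(residuals, line="45", fit=True, ax=axes[1])
axes[1].set_title("Q-Q plot of residuals")

plt.tight_layout()
plt.show()
```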

Because of these broken assumptions we are not able to get good regression models. To get better predictions we can look for more data or use alternative prediction models like Generalized Linear Models (GLM), Random Forests, Gradient Boosted Trees, Neural Networks, etc.