Data Analytics on Diabetes Using Multiple Linear Regression

I performed some exploratory data analysis for insights, ran several hypothesis tests, and present the results below.

From our state-wise analysis, here are the states with the highest and lowest average percentages for each category:

  1. Diabetes:
    • Highest average percentage: South Carolina
    • Lowest average percentage: Colorado
  2. Obesity:
    • Highest average percentage: Washington
    • Lowest average percentage: Wyoming
  3. Inactivity:
    • Highest average percentage: Florida
    • Lowest average percentage: Colorado
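A state-wise comparison like the one above can be sketched as a group-and-average step. This is only an illustration using the standard library: the field names `state` and `pct_diabetic` and the sample rows are made up for the example and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

def state_extremes(rows, column):
    """Return (highest_state, lowest_state) by the mean of `column` across counties."""
    by_state = defaultdict(list)
    for row in rows:
        value = row.get(column)
        if value is not None:  # skip counties with no measurement
            by_state[row["state"]].append(value)
    averages = {state: mean(values) for state, values in by_state.items()}
    highest = max(averages, key=averages.get)
    lowest = min(averages, key=averages.get)
    return highest, lowest

# Hypothetical county rows, purely illustrative (not values from the dataset).
rows = [
    {"state": "A", "pct_diabetic": 9.0},
    {"state": "A", "pct_diabetic": 11.0},
    {"state": "B", "pct_diabetic": 6.0},
    {"state": "B", "pct_diabetic": None},
    {"state": "C", "pct_diabetic": 8.0},
]
print(state_extremes(rows, "pct_diabetic"))
```

Averaging county percentages weights every county equally regardless of population, which is one possible reading of "average percentage"; a population-weighted average could rank states differently.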

The correlation coefficients between the percentages of diabetes, obesity, and inactivity at the county level are as follows:

  1. Correlation between Diabetes and Obesity:
    • This indicates a moderate positive correlation between the percentage of the population with diabetes and the percentage of the population that is obese.
  2. Correlation between Diabetes and Inactivity:
    • This indicates a strong positive correlation between the percentage of the population with diabetes and the percentage of the population that is inactive.
  3. Correlation between Obesity and Inactivity:
    • This indicates a moderate positive correlation between the percentage of the population that is obese and the percentage of the population that is inactive.

These correlations suggest that counties with higher percentages of inactivity and obesity tend to have higher percentages of diabetes. This aligns with existing knowledge about the risk factors for type 2 diabetes. However, because the correlation coefficients are all below 0.7, none of these relationships is strong in the strict sense; what the data show is a consistent positive correlation.
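For reference, here is a minimal sketch of how a Pearson correlation coefficient can be computed and compared against the 0.7 cut-off mentioned above. The cut-off is a rule of thumb rather than a formal threshold, and the two short lists are invented for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def is_strong(r, cutoff=0.7):
    """Apply the rule-of-thumb cut-off for a 'strong' correlation."""
    return abs(r) >= cutoff

# Made-up values, not taken from the county data.
pct_inactive = [1, 2, 3, 4, 5]
pct_diabetic = [1, 3, 2, 5, 4]
r = pearson_r(pct_inactive, pct_diabetic)
print(round(r, 2), is_strong(r))
```

With real data, the same calculation would be applied only to counties that have both measurements available.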

Hypotheses:

  1. Diabetes and Obesity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Distribution Choice: Two-sample t-test (assuming independent samples and approximately normal distribution)
  2. Diabetes and Inactivity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test
  3. Obesity and Inactivity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test

For each hypothesis, we’ll use a significance level (α) of 0.05. If the p-value is less than α, we will reject the null hypothesis.
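The procedure described above (split counties at the average of one variable, then compare the mean of another) can be sketched as follows. This is an illustration under stated assumptions, not the exact computation behind the results: it uses the pooled-variance two-sample t statistic, and because the standard library has no t distribution, the p-value is a normal approximation that is only reasonable when both groups are large. A statistics package such as SciPy (`scipy.stats.ttest_ind`) would give the exact t-based p-value. The data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def split_by_mean(group_values, outcome_values):
    """Split outcomes into counties above vs. at-or-below the mean of the grouping variable."""
    cutoff = mean(group_values)
    above = [y for g, y in zip(group_values, outcome_values) if g > cutoff]
    below = [y for g, y in zip(group_values, outcome_values) if g <= cutoff]
    return above, below

def pooled_t_test(a, b):
    """Pooled-variance two-sample t statistic, degrees of freedom, and an approximate p-value."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    # Assumption: two-sided p-value from the normal distribution (large-sample approximation).
    p_approx = 2 * NormalDist().cdf(-abs(t))
    return t, df, p_approx

# Made-up values, far too few for the approximation to be trusted in practice.
pct_obese = [10, 12, 14, 16, 18, 20]
pct_diabetic = [5, 6, 5, 8, 9, 8]
above, below = split_by_mean(pct_obese, pct_diabetic)
t, df, p = pooled_t_test(above, below)
alpha = 0.05
print("reject H0" if p < alpha else "fail to reject H0")
```

If the two groups have clearly different variances, Welch's unequal-variance version of the test is usually the safer choice.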

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates:

Test Statistics:

P-value:

Since the p-value is less than our significance level (α = 0.05), we reject the null hypothesis. This means there is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value: 2.38 × 10^-15

Since the p-value is much less than our significance level (α = 0.05), we reject the null hypothesis. This suggests a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.

For the hypothesis test comparing the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value:

Given that the p-value is much less than our chosen significance level (α = 0.05), we reject the null hypothesis. This indicates a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.

Summary:

  • All three hypothesis tests resulted in rejecting the null hypothesis, suggesting significant associations between diabetes, obesity, and inactivity at the county level.
  • The tests were conducted using the two-sample t-test, which assumes that the samples are approximately normally distributed and independent.

 

Breusch-Pagan Test and Hypothesis Testing for Heteroscedasticity

Today I learnt about hypothesis testing and calculating the p-value or, more importantly, what a p-value actually is. The best and simplest explanation I found is:

“The probability of observing a result at least as extreme as the one obtained, if the null hypothesis were true.”

I watched a few videos on YouTube about hypothesis testing and the Breusch-Pagan test, and then implemented the Breusch-Pagan test, with the null hypothesis (H₀) that the model is homoscedastic, to check the p-values.

The Breusch-Pagan test is a diagnostic tool used to determine the presence of heteroscedasticity in a regression model. Heteroscedasticity refers to a situation where the variance of the residuals (or errors) from a regression model is not constant across all levels of the independent variables. The null hypothesis (H₀) for the Breusch-Pagan test is that the error variances are homoscedastic (constant across all levels of the independent variables), whereas the alternative hypothesis (H₁) posits that the error variances are heteroscedastic (not constant). The p-value associated with the test statistic provides the probability of observing the given data (or something more extreme) if the null hypothesis were true. A low p-value indicates evidence against the null hypothesis, suggesting heteroscedasticity.

For our regression model with %Inactivity as the predictor and %Diabetic as the response variable, the Breusch-Pagan test was applied. The Lagrange Multiplier (LM) statistic was found to be significant with a p-value close to 0 (approximately 3.607 × 10^-13), indicating strong evidence against the null hypothesis of homoscedasticity. This suggests the presence of heteroscedasticity in the model. The F-test associated with the Breusch-Pagan test yielded unusual results, with a p-value of 1, which on its own would not suggest heteroscedasticity. However, given the strong evidence from the LM test, it’s recommended to consider the model as having heteroscedastic errors.

This may necessitate adjustments or alternative modelling techniques to ensure valid inference and predictions.
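To make the mechanics of the test concrete, here is a minimal sketch of the LM statistic for a model with a single predictor. It assumes the studentized (Koenker) form, LM = n × R² from regressing the squared residuals on the predictor, with a chi-square reference distribution on 1 degree of freedom; it is not necessarily the exact variant used for the results above. The data are synthetic, built so that the error spread grows with x. In practice a library routine such as `het_breuschpagan` in statsmodels would normally be used instead.

```python
from math import erfc, sqrt

def ols_fit(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y):
    """Share of the variance of y explained by a simple regression on x."""
    intercept, slope = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """Studentized LM statistic and chi-square (1 df) p-value for one predictor."""
    intercept, slope = ols_fit(x, y)
    sq_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of squared residuals on x; guard against tiny negative rounding.
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    p_value = erfc(sqrt(lm / 2))  # survival function of chi-square with 1 df
    return lm, p_value

# Synthetic data: the error magnitude is proportional to x (heteroscedastic by design).
x = list(range(1, 31))
y = [2.0 + 0.5 * xi + 0.2 * xi * (-1) ** xi for xi in x]
lm, p_value = breusch_pagan_lm(x, y)
print("heteroscedasticity suspected" if p_value < 0.05 else "no evidence against homoscedasticity")
```

With more than one predictor, the auxiliary regression uses all predictors and the degrees of freedom equal their number, so the closed-form p-value used here no longer applies.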

Regression analysis on Diabetes Dataset

The provided Excel file contains %diabetes data for 2018 for the counties of the United States, along with %obesity and %inactivity.

According to my initial analysis, 88.45% of the obesity data and 56.39% of the inactivity data were missing relative to the diabetes data.

For %Inactive, the skewness is 1.96. This value indicates a right-skewed distribution, implying that most counties have lower inactivity percentages and a few counties have very high ones. The kurtosis is 4.27, which is above the normal-distribution benchmark (3 for raw kurtosis, 0 for excess kurtosis) and suggests that the distribution has heavier tails and a sharper peak than the normal distribution.

For %Obese, after removing missing data, the skewness is 0.45. This value suggests a slight right skew, but the distribution is close to symmetric. The kurtosis is -0.86. Raw kurtosis cannot be negative, so this figure must be excess kurtosis (raw kurtosis minus 3), for which the benchmark is 0 rather than 3. A negative value indicates that the distribution is platykurtic, meaning it has lighter tails and a less sharp peak than the normal distribution.
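The two shape measures can be computed from the central moments, as in the sketch below. It uses the simple (biased) moment estimators and reports excess kurtosis, so a normal distribution scores 0; statistics packages often apply small-sample corrections, so their output may differ slightly. The sample lists are invented for illustration.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (raw kurtosis minus 3)."""
    n = len(values)
    mean_v = sum(values) / n
    m2 = sum((v - mean_v) ** 2 for v in values) / n  # variance (biased)
    m3 = sum((v - mean_v) ** 3 for v in values) / n
    m4 = sum((v - mean_v) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# A symmetric, flat sample: zero skew and negative excess kurtosis (platykurtic).
flat_skew, flat_kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])

# A sample with one large value: positive skew (long right tail).
tail_skew, tail_kurt = skew_and_excess_kurtosis([1, 1, 1, 2, 10])
print(flat_skew, flat_kurt, tail_skew > 0)
```

Whichever convention is used, it helps to state it next to the number, since a value of 4.27 means something different depending on whether the benchmark is 0 or 3.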

Before analysing the data and calculating these values, I read about simple regression and watched a few videos on the assumptions of multiple linear regression and how they work.

As all three variables are numeric, I initially fitted two separate simple linear regression models, using %diabetic as the dependent variable and %inactivity and %obesity, respectively, as the independent variable. Despite the data loss, I removed the null values and ran the regressions on the remaining counties.

For the model with %inactive, the R^2 is 0.1951 and the F-statistic is 331.6 with a p-value < 0.001, indicating that the model is statistically significant, but only about 19.5% of the variance in diabetes prevalence is explained by inactivity levels in the dataset. For the model with %obese, the R^2 is 0.1484 and the F-statistic is 62.95 with a p-value < 0.001, indicating that the model is statistically significant, but only about 14.8% of the variance in diabetes prevalence is explained by obesity levels in the dataset. The share of variability explained is low in both cases, so I checked the regression assumptions with this data.

Both models exhibit potential violations of the homoscedasticity and normality assumptions. The cone-shaped patterns in the residual plots suggest that transforming the data or considering a different type of regression model may be necessary. The deviations from normality, especially in the tails, also suggest potential outliers or the need for transformation.

Because these assumptions are violated, we are not able to get good regression models. To get better predictions, we can look for more data or use alternative prediction models such as Generalized Linear Models (GLM), Random Forest, Gradient Boosted Trees, or Neural Networks.
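The R^2 and F-statistic of a regression are tied together by the sample size, which gives a quick consistency check on reported figures. The sketch below uses the standard relation F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with k predictors and an intercept; the helper names are hypothetical, and the implied county counts are an inference from the reported numbers, not something stated in the dataset description.

```python
def f_statistic(r2, n, k=1):
    """Overall F statistic for a regression with k predictors fitted on n observations."""
    if not 0 < r2 < 1:
        raise ValueError("r2 must lie strictly between 0 and 1")
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def implied_residual_df(r2, f, k=1):
    """Residual degrees of freedom (n - k - 1) implied by a reported R^2 and F."""
    if not 0 < r2 < 1:
        raise ValueError("r2 must lie strictly between 0 and 1")
    return f * k * (1 - r2) / r2

# Reported values from the two single-predictor models in the text.
df_inactive = implied_residual_df(0.1951, 331.6)
df_obese = implied_residual_df(0.1484, 62.95)
print(round(df_inactive), round(df_obese))
```

Adding 2 to each result (one predictor plus the intercept) gives the approximate number of counties each model was fitted on, which can be compared against the amount of missing data noted earlier.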