Data Analytics on Diabetes Using Multiple Linear Regression

I ran an exploratory analysis of the data to look for insights, tested three hypotheses, and present the results below.

From our state-wise analysis, here are the states with the highest and lowest average percentages in each category:

  1. Diabetes:
    • Highest average percentage: South Carolina
    • Lowest average percentage: Colorado
  2. Obesity:
    • Highest average percentage: Washington
    • Lowest average percentage: Wyoming
  3. Inactivity:
    • Highest average percentage: Florida
    • Lowest average percentage: Colorado
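
A state-wise ranking like the one above can be sketched as follows. This is only an illustration using the Python standard library: the `records` list, its state names, and its values are made up, and `state_extremes` is a hypothetical helper, not the code used for the analysis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical county-level records: (state, % of population with diabetes).
# These values are invented for illustration only.
records = [
    ("Alpha", 9.1), ("Alpha", 10.3),
    ("Beta", 6.2), ("Beta", 7.0),
    ("Gamma", 8.0), ("Gamma", 8.4),
]

def state_extremes(rows):
    """Return (state with the highest mean, state with the lowest mean)."""
    by_state = defaultdict(list)
    for state, pct in rows:
        by_state[state].append(pct)
    # Average the county percentages within each state.
    averages = {state: mean(values) for state, values in by_state.items()}
    highest = max(averages, key=averages.get)
    lowest = min(averages, key=averages.get)
    return highest, lowest

highest, lowest = state_extremes(records)
print(f"Highest average: {highest}, lowest average: {lowest}")
```

The same function can be applied to the obesity and inactivity percentages by swapping in those columns.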

The correlation coefficients between the percentages of diabetes, obesity, and inactivity at the county level are as follows:

  1. Correlation between Diabetes and Obesity:
    • This indicates a moderate positive correlation between the percentage of the population with diabetes and the percentage of the population that is obese.
  2. Correlation between Diabetes and Inactivity:
    • This indicates a positive correlation between the percentage of the population with diabetes and the percentage of the population that is inactive, the strongest of the three pairs.
  3. Correlation between Obesity and Inactivity:
    • This indicates a moderate positive correlation between the percentage of the population that is obese and the percentage of the population that is inactive.

These correlations suggest that counties with higher percentages of inactivity and obesity tend to have higher percentages of diabetes, which aligns with existing knowledge about the risk factors for type 2 diabetes. However, because all of the coefficients are below 0.7, the correlations are positive but not strong.
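
For reference, a Pearson correlation coefficient between two county-level columns could be computed roughly as below. This is a minimal sketch: the sample lists are made up, `pearson_r` and `describe` are hypothetical helpers, the 0.7 cutoff for "strong" follows the text above, and the 0.3 cutoff for "moderate" is an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def describe(r):
    """Label a coefficient; 0.7 is the cutoff from the text, 0.3 is assumed."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical county percentages, invented for illustration only.
obesity = [18.0, 22.5, 25.1, 30.2, 27.4]
diabetes = [6.1, 7.9, 7.2, 9.8, 8.0]

r = pearson_r(obesity, diabetes)
print(f"r = {r:.3f} ({describe(r)})")
```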

Hypotheses:

  1. Diabetes and Obesity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.
    • Distribution Choice: Two-sample t-test (assuming independent samples and approximately normal distribution)
  2. Diabetes and Inactivity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test
  3. Obesity and Inactivity:
    • Null Hypothesis (H₀): There is no difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Alternative Hypothesis (H₁): There is a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.
    • Distribution Choice: Two-sample t-test

For each hypothesis, we’ll use a significance level (α) of 0.05. If the p-value is less than α, we will reject the null hypothesis.
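
The procedure described above (split counties at the average of one variable, compare the means of another, reject when the p-value is below 0.05) might look like the sketch below. Several things here are assumptions rather than the original method: the data are made up, counties exactly at the average are placed in the "below" group, the unequal-variance (Welch) form of the statistic is used, and the p-value comes from a normal approximation to the t distribution, which is only reasonable for large samples such as a full county dataset. With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` would give the exact t-based p-value instead.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

ALPHA = 0.05  # significance level stated in the text

def split_by_average(outcome, grouping):
    """Split outcome values by whether the grouping value exceeds its mean.

    Ties with the mean go to the 'below' group (an assumption).
    """
    cutoff = mean(grouping)
    above = [o for o, g in zip(outcome, grouping) if g > cutoff]
    below = [o for o, g in zip(outcome, grouping) if g <= cutoff]
    return above, below

def welch_test(a, b):
    """Return (t statistic, approximate two-sided p-value).

    Uses sample variances and a normal approximation for the p-value.
    """
    std_err = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t_stat = (mean(a) - mean(b)) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# Hypothetical county percentages, invented for illustration only.
obesity = [18.0, 19.5, 21.0, 22.4, 27.9, 29.3, 30.8, 32.1]
diabetes = [6.0, 6.4, 7.1, 6.8, 8.9, 9.4, 9.1, 10.2]

above, below = split_by_average(diabetes, obesity)
t_stat, p_value = welch_test(above, below)
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3g}: {decision}")
```

The same two functions cover all three tests by changing which column is the outcome and which one defines the groups.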

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates:

Test Statistics:

P-value:

Since the p-value is less than our significance level (α = 0.05), we reject the null hypothesis. This means there is a significant difference in the mean percentage of diabetic population between counties with above-average obesity rates and those with below-average obesity rates.

For the hypothesis test comparing the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value: 2.38 × 10⁻¹⁵

Since the p-value is much less than our significance level (α = 0.05), we reject the null hypothesis. This suggests a significant difference in the mean percentage of diabetic population between counties with above-average inactivity rates and those with below-average inactivity rates.

For the hypothesis test comparing the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates:

Test Statistics:

P-value:

Given that the p-value is well below our chosen significance level (α = 0.05), we reject the null hypothesis. This indicates a significant difference in the mean percentage of obese population between counties with above-average inactivity rates and those with below-average inactivity rates.

Summary:

  • All three hypothesis tests resulted in rejecting the null hypothesis, suggesting significant associations between diabetes, obesity, and inactivity at the county level.
  • The tests were conducted using the two-sample t-test, which assumes that the samples are approximately normally distributed and independent.

 
