Regression analysis on Diabetes Dataset

The provided Excel file contains county-level data for the United States for the year 2018: %diabetes, along with %obesity and %inactivity.

In my initial analysis of the data, 88.45% of the obesity data and 56.39% of the inactivity data were missing relative to the diabetes data.
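A minimal sketch of how such a missing-data percentage could be computed. The column below is a made-up toy example, not the actual dataset, and it assumes a missing value is represented as None.

```python
def pct_missing(values):
    """Return the percentage of entries that are None (missing)."""
    if not values:
        raise ValueError("values must be non-empty")
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)

# Hypothetical toy column: one entry per county that has diabetes data,
# None where that county has no obesity figure.
pct_obese = [18.2, None, None, 19.5, None, None, None, 17.9]
print(f"{pct_missing(pct_obese):.2f}% of obesity values are missing")
```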

For %Inactive, the skewness is 1.96. This value indicates a right-skewed distribution: most counties have lower inactivity percentages and a few counties have very high ones. The kurtosis is 4.27. Whether this is read as raw kurtosis (where the normal distribution scores 3) or as excess kurtosis (where it scores 0), the value is above the normal benchmark, which suggests heavier tails and a sharper peak than the normal distribution.

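For %Obese, after removing missing data, the skewness is 0.45. This value suggests a slight right skew, but the distribution is close to symmetric. The kurtosis is -0.86. Raw kurtosis cannot be negative, so this must be excess kurtosis, for which the normal distribution scores 0 (equivalent to a raw kurtosis of 3). A negative excess kurtosis indicates that the distribution is platykurtic, meaning it has lighter tails and a flatter peak than the normal distribution.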
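To make the two statistics concrete, here is a small sketch of skewness and excess kurtosis using the population (moment-based) formulas. Spreadsheet and statistics packages often apply small-sample corrections, so their output can differ slightly from these functions. The sample lists are invented for illustration.

```python
from statistics import fmean

def central_moment(xs, k):
    """k-th central moment: the mean of (x - mean)^k."""
    m = fmean(xs)
    return fmean((x - m) ** k for x in xs)

def skewness(xs):
    """Population skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Population excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

# Invented samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

A positive skewness corresponds to a long right tail, and a negative excess kurtosis corresponds to a flatter, lighter-tailed shape than the normal distribution.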

Before analyzing the data and calculating any values, I read about simple regression and watched a few videos on the assumptions of multiple linear regression and how they work.

Since all three variables are numeric, I first ran two separate simple linear regressions on the dataset, each with %diabetic as the dependent variable: one with %inactivity as the independent variable and one with %obesity. Despite the data loss, I removed the null values and ran each regression on the counties that remained.
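Removing null values before a regression amounts to keeping only the counties where both variables are present. A minimal sketch, assuming missing values are None and using invented numbers:

```python
def complete_cases(x, y):
    """Keep only the (x, y) pairs in which neither value is missing."""
    pairs = [(xi, yi) for xi, yi in zip(x, y)
             if xi is not None and yi is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return xs, ys

# Hypothetical toy columns, aligned by county.
pct_inactive = [14.1, None, 16.3, 15.0]
pct_diabetic = [7.2, 8.1, None, 7.9]

x, y = complete_cases(pct_inactive, pct_diabetic)
print(x, y)
```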

For the model with %inactive, the R^2 is 0.1951 and the F-statistic is 331.6 with a p-value of < 0.001, indicating that the model is statistically significant but that only about 19.5% of the variance in diabetes prevalence can be explained by inactivity levels in the dataset. For the model with %obese, the R^2 is 0.1484 and the F-statistic is 62.95 with a p-value of < 0.001, indicating that the model is statistically significant but that only about 14.8% of the variance in diabetes prevalence can be explained by obesity levels in the dataset. Neither model explains much of the variability, so I checked the assumptions of regression with this data.
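For readers who want to see where R^2 and the F-statistic come from, here is a sketch of a one-predictor least-squares fit in plain Python. It is not the tool used for the numbers above, and the data points are invented. For a single predictor, F = R^2 / (1 - R^2) * (n - 2), where n is the number of observations.

```python
from statistics import fmean

def simple_ols(x, y):
    """Fit y = intercept + slope * x; return slope, intercept, R^2 and F."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_tot = sum((yi - my) ** 2 for yi in y)                    # total variation in y
    ss_res = sum((yi - intercept - slope * xi) ** 2
                 for xi, yi in zip(x, y))                       # unexplained variation
    r2 = 1.0 - ss_res / ss_tot
    f_stat = r2 / (1.0 - r2) * (n - 2)                          # 1 and n - 2 degrees of freedom
    return slope, intercept, r2, f_stat

# Invented toy data, only to exercise the function.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r2, f_stat = simple_ols(x, y)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f} F={f_stat:.2f}")
```

R^2 is the share of the variation in the dependent variable that the fitted line accounts for, which is why a value such as 0.1951 is read as about 19.5% of the variance explained.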

Both models show potential violations of the homoscedasticity and normality assumptions. The cone-shaped patterns in the residual plots mean that the spread of the residuals changes with the fitted values, which suggests that transforming the data or using a different type of regression model may be necessary. The deviations from normality, especially in the tails, also suggest potential outliers or the need for a transformation.
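A cone shape in a residual plot can also be checked numerically. The sketch below is an illustration and not the check used in this analysis: it computes Koenker's studentized variant of the Breusch-Pagan statistic, which is n times the R^2 obtained by regressing the squared residuals on the predictor. With one predictor, values above roughly 3.84 (the 5% critical value of a chi-square distribution with 1 degree of freedom) point to heteroscedasticity. Both datasets are synthetic.

```python
from statistics import fmean

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y):
    """R^2 of the least-squares line of y on x."""
    slope, intercept = fit_line(x, y)
    my = fmean(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """Koenker's studentized Breusch-Pagan statistic: n * R^2 of resid^2 on x."""
    slope, intercept = fit_line(x, y)
    sq_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)

# Synthetic data: same trend, different noise patterns.
x = list(range(1, 41))
sign = [1 if xi % 2 == 0 else -1 for xi in x]
y_even = [2 * xi + s for xi, s in zip(x, sign)]             # constant spread
y_cone = [2 * xi + 0.5 * xi * s for xi, s in zip(x, sign)]  # spread grows with x

print(breusch_pagan_lm(x, y_even))
print(breusch_pagan_lm(x, y_cone))
```

The second dataset mimics the cone pattern: its residuals fan out as x grows, so its statistic is far above the threshold, while the first stays near zero.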

Because these assumptions are violated, the linear models above cannot be relied on for good predictions. To get better predictions we can look for more data or use alternative prediction models such as generalized linear models (GLMs), random forests, gradient-boosted trees, or neural networks.

Published by mgangakhedkar
