Analysing Police Shootings Data: Logistic Regression and Permutation Tests

The data on police shootings, obtained from the Washington Post, provides valuable insights into various aspects of these unfortunate events. But first, let's begin by understanding some of the methods we used and what they do.

What is Logistic Regression?

Logistic regression is a statistical method for analysing datasets where the outcome variable is binary (e.g., 0/1, Yes/No, True/False). It predicts the probability that a given instance belongs to a particular category.

In our analysis of police shootings data, we used logistic regression to predict if an individual was armed based on several factors: age, gender, race, flee status, and signs of mental illness.

Logistic Regression on Predicting Armed Status: Using age, gender, race, flee status, and signs of mental illness as predictors, our model achieved an accuracy of 93%. However, it showed a strong bias towards predicting that individuals were armed, indicating a potential need for further refinement or balancing techniques.
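To make this concrete, here is a minimal, hedged sketch of the kind of workflow we used. The file name and the column labels (e.g. 'armed', 'flee', 'signs_of_mental_illness') are assumptions for illustration and may not match the exact Washington Post export.

```python
# Minimal sketch of the logistic regression workflow (not the exact notebook).
# Column names and the file name are assumptions about the dataset's labelling.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("police_shootings.csv")  # hypothetical file name

# Binary target: 1 if the individual was recorded as armed, 0 otherwise
y = (df["armed"].fillna("unarmed") != "unarmed").astype(int)

# One-hot encode the categorical predictors; keep age as numeric
X = pd.get_dummies(
    df[["age", "gender", "race", "flee", "signs_of_mental_illness"]],
    drop_first=True,
)
X = X.fillna(X.median())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))  # precision, recall, F1 per class
```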

Why not Linear Regression?

While linear regression predicts a continuous outcome, logistic regression predicts the probability of an event occurring. It ensures that the predicted probabilities are between 0 and 1 using the logistic function (or sigmoid function).
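Concretely, the sigmoid takes any linear combination of the predictors and squeezes it into the (0, 1) range: p = 1 / (1 + e^(−(β₀ + β₁x₁ + … + βₖxₖ))).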

Results & Implications:

Our model achieved a high accuracy of 93%. However, its bias towards predicting that individuals were armed indicates a potential class imbalance in the data.

In real-world scenarios, it’s crucial not just to consider accuracy but also other metrics like precision, recall, and the F1-score. Especially in sensitive contexts like police shootings, false negatives or false positives can have serious implications.


Permutation Tests: Empirical Hypothesis Testing

Permutation tests are a non-parametric method to test hypotheses. By shuffling labels and recalculating the test statistic, we can estimate a p-value based on the proportion of reshuffled datasets that provide as extreme (or more extreme) results as the original observed data.

In our analysis, the permutation test indicated a significant relationship between race and the likelihood of being armed during a police encounter.

Hypothesis Testing with Permutation Tests: We used a permutation test to examine the relationship between race and being armed. Our p-value of approximately 0.0163 suggests a significant relationship.
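For readers curious what such a test looks like in code, the sketch below shows one way to run a permutation test on the difference in the proportion armed between two groups. The column names, group labels, and the two-group comparison are illustrative assumptions; the actual analysis may have used a different test statistic.

```python
# Minimal permutation-test sketch: difference in proportion armed between two groups.
# 'race' and 'armed' column names, and the 'W'/'B' group labels, are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.read_csv("police_shootings.csv")  # hypothetical file name
sub = df[df["race"].isin(["W", "B"])].copy()
sub["is_armed"] = (sub["armed"].fillna("unarmed") != "unarmed").astype(int)

labels = sub["race"].to_numpy()
armed = sub["is_armed"].to_numpy()

def stat(lab, arm):
    """Test statistic: difference in proportion armed between the two groups."""
    return arm[lab == "W"].mean() - arm[lab == "B"].mean()

observed = stat(labels, armed)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(labels)  # shuffling breaks any real association
    if abs(stat(shuffled, armed)) >= abs(observed):
        count += 1

p_value = count / n_perm  # two-sided empirical p-value
print(f"observed diff = {observed:.4f}, p-value ≈ {p_value:.4f}")
```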

Enhancing Model Evaluation: Integrating Bootstrapping with Cross-Validation in Multiple Linear Regression

1. Bootstrapping:

Bootstrapping involves randomly sampling from your dataset with replacement to create many “resamples” of your data. Each resample is used to estimate the model and derive statistics. Bootstrapping provides a way to assess the variability of your model estimates, giving insight into the stability and robustness of your model.

2. Cross-Validation:

Cross-validation (CV) is a technique used to assess the predictive performance of a model. The most common form is k-fold CV, where the data is divided into k subsets (or “folds”). The model is trained on k−1 folds and tested on the remaining fold. This process is repeated k times, each time with a different fold as the test set. The results from all k tests are then averaged to produce a single performance metric.

Combining Bootstrapping and Cross-Validation:

The combination involves performing cross-validation within each bootstrap sample. Here’s a step-by-step breakdown (a code sketch follows the list):

  1. Bootstrap Sample: Draw a random sample with replacement from your dataset.
  2. Cross-Validation on the Bootstrap Sample: Perform k-fold cross-validation on this bootstrap sample.
  3. Aggregate CV Results: After the k iterations of CV, average the performance metrics to get a single performance measure for this bootstrap sample.
  4. Repeat: Repeat steps 1-3 for many bootstrap samples.
  5. Analyze: After all bootstrap iterations, you’ll have a distribution of the cross-validation performance metric. This distribution provides insights into the variability and robustness of your model’s performance.
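As a rough illustration of steps 1–5, here is a hedged sketch using scikit-learn. The predictors and response are synthetic placeholders, and negative mean squared error is just one reasonable choice of metric.

```python
# Sketch: k-fold cross-validation nested inside a bootstrap loop.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def bootstrap_cv_mse(X, y, n_boot=200, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # 1. bootstrap sample (with replacement)
        Xb, yb = X[idx], y[idx]
        cv_mse = -cross_val_score(                  # 2. k-fold CV on that bootstrap sample
            LinearRegression(), Xb, yb, cv=k, scoring="neg_mean_squared_error"
        )
        boot_scores.append(cv_mse.mean())           # 3. aggregate the k fold results
    return np.array(boot_scores)                    # 4-5. distribution across bootstraps

# Example call with synthetic data, just to show the pattern
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=300)
scores = bootstrap_cv_mse(X, y)
print("mean CV MSE:", scores.mean(), " std across bootstraps:", scores.std())
```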

Why Combine Both?

  1. Model Stability: By bootstrapping the data and then performing cross-validation, you can assess how sensitive the model’s performance is to different samples from the dataset. If performance varies greatly across bootstrap samples, the model might be unstable.
  2. Performance Distribution: Instead of a single CV performance metric, you get a distribution, which gives a more comprehensive view of expected model performance.
  3. Model Complexity: For multiple linear regression, you can assess how different combinations of predictors impact model performance across different samples. This can inform decisions about feature selection or model simplification.

Challenges:

  • Computationally Intensive: This approach can be computationally demanding, since a full k-fold cross-validation is performed for every bootstrap sample; the cost grows with the number of bootstrap samples and the size of your dataset.
  • Data Requirements: You need a reasonably sized dataset. If your dataset is too small, bootstrapping might not provide meaningful variability in samples.

In conclusion, combining bootstrapping with cross-validation offers a robust method for evaluating the performance and stability of a multiple linear regression model. However, it’s essential to be aware of the computational demands and ensure that your dataset is suitable for this approach.

Bootstrapping: A Simple Yet Powerful Statistical Tool

What is Bootstrapping?

Bootstrapping is a resampling method that involves taking repeated samples from your original dataset (with replacement) and recalculating the statistic of interest for each sample. This process allows you to simulate the variability of your statistic, as if you were able to conduct the survey many times over.

How Does Bootstrapping Work?

  1. Sample with Replacement: From your original dataset of size n, draw n random observations, allowing the same observation to be selected more than once. This new set is called a bootstrap sample.
  2. Compute the Statistic: Calculate the statistic of interest (e.g., mean, median) for this bootstrap sample.
  3. Repeat: Do this many times (e.g., 10,000 times) to build a distribution of your statistic.
  4. Analyse: From this distribution, derive insights into the central tendency and variability of your statistic.
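To ground those four steps, here is a tiny, self-contained sketch that bootstraps the mean of a toy sample; swap in your own data and statistic of interest.

```python
# Minimal bootstrap sketch: resample with replacement, rebuild the statistic's distribution.
import numpy as np

rng = np.random.default_rng(42)
data = np.array([12, 15, 9, 22, 18, 14, 11, 20, 16, 13])  # toy sample

boot_means = []
for _ in range(10_000):                                       # 3. repeat many times
    sample = rng.choice(data, size=len(data), replace=True)   # 1. sample with replacement
    boot_means.append(sample.mean())                          # 2. compute the statistic

boot_means = np.array(boot_means)
print("bootstrap mean:", boot_means.mean())                   # 4. central tendency
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))      # 4. variability
```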

Why Use Bootstrapping?

  • Flexibility: Bootstrapping makes minimal assumptions about the data. This makes it ideal for datasets that don’t follow well-known distributions.
  • Simplicity: Traditional statistical methods often require complex calculations and assumptions. Bootstrapping offers a computational alternative that’s easy to understand and implement.
  • Versatility: You can use bootstrapping for a wide range of statistics, from means and medians to more complex metrics.

A Real-World Analogy

Imagine you have a big jar of multi-colored jellybeans, and you want to know the proportion of red jellybeans. You take a handful, count the red ones, and note the proportion. Then, you put them back and shake the jar. You take another handful and repeat the process. After many handfuls, you’ll have a good idea of the variability in the proportion of red jellybeans. This process mirrors bootstrapping!

Limitations

While bootstrapping is powerful, it’s not a silver bullet. It provides an estimate based on your sample data, so if your original sample is biased, your bootstrap results will be too. Additionally, bootstrapping can be computationally intensive, especially with large datasets.


Exploring the Relationship Between Obesity and Diabetes Through Bootstrapping

With our CDC dataset in hand, we wanted to understand the difference between the average obesity rate and the average diabetes rate across counties. To do this (a code sketch follows the list), we:

  1. Drew a random sample (with replacement) from our dataset.
  2. Calculated the difference between the average diabetes rate and the average obesity rate for this sample.
  3. Repeated this process 10,000 times!
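The sketch below mirrors that procedure. The file name and the column labels '% OBESE' and '% DIABETIC' are assumptions about how the CDC file is organised.

```python
# Bootstrap the difference between average diabetes rate and average obesity rate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.read_csv("cdc_counties.csv")        # hypothetical file name
obese = df["% OBESE"].to_numpy()
diabetic = df["% DIABETIC"].to_numpy()

n = len(df)
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)        # resample counties with replacement
    diffs.append(diabetic[idx].mean() - obese[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% percentile confidence interval
print(f"95% CI for (mean diabetes - mean obesity): ({lo:.2f}, {hi:.2f})")
```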

The Results

Our bootstrapping procedure revealed a 95% confidence interval for the difference between the average diabetes rate and the average obesity rate of approximately (-11.24%, -11.03%). In layman’s terms, this means that we are 95% confident that, on average, the obesity rate is between 11.03% and 11.24% higher than the diabetes rate in the counties from our dataset.

This data suggests a significantly higher prevalence of obesity compared to diabetes in the studied counties. It highlights the importance of understanding and addressing obesity as a public health concern, given its potential implications for various health conditions, including diabetes.

Bootstrapping has provided a window into the relationship between obesity and diabetes in our dataset, emphasising its utility as a statistical tool. For researchers, policymakers, and health professionals, insights like these underscore the importance of data-driven decision-making in public health.

Back To Basics – What Are Training and Testing Errors and Cross-Validation? Along with a simple Polynomial Regression model test

1. Training Error:

  • Definition: The training error is the error of your model on the data it was trained on. It’s the difference between the predicted values and the actual values in the training dataset.
  • Ideal Value: Ideally, the training error should be low, indicating that the model has learned well from the training data.

2. Testing Error:

  • Definition: The testing error is the error of your model on unseen data (i.e., data it wasn’t trained on). It’s the difference between the predicted values and the actual values in the testing dataset.
  • Ideal Value: Ideally, the testing error should also be low. However, a more important consideration is that the testing error should be close to the training error, indicating good generalisation.

3. Cross-Validation:

  • Definition: Cross-validation is a technique for assessing how well the results of a model will generalize to an independent dataset. It involves splitting the data into subsets, training the model on some subsets (training data), and evaluating it on other subsets (validation data).
  • Values: The values obtained from cross-validation provide insights into how well the model is likely to perform on unseen data. It’s often used to select model parameters that minimize the testing error.

The results from the 5-fold cross-validation for the polynomial regression model are as follows:

  • Average Mean Squared Error (MSE): 3.276
  • Standard Deviation of MSE: 0.363

The Mean Squared Error (MSE) is a measure of the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual values. A lower MSE value indicates a better fit of the model to the data.

The standard deviation of the MSE gives us an idea of the variability in the MSE across the different folds in the cross-validation. A lower standard deviation suggests that the model’s performance is more consistent across different subsets of the data.

These values provide a quantitative measure of how well the polynomial regression model is likely to perform on unseen data.
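A hedged sketch of how numbers like these can be produced with scikit-learn; the X and y below are synthetic stand-ins for the dataset used in the post.

```python
# 5-fold cross-validation for a degree-2 polynomial model: mean and std of MSE.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data with a mildly curved relationship
rng = np.random.default_rng(0)
X = rng.uniform(10, 45, size=(200, 1))
y = 2.0 + 0.8 * X[:, 0] - 0.01 * X[:, 0] ** 2 + rng.normal(scale=1.5, size=200)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
mse_scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

print("Average MSE:", mse_scores.mean())
print("Std of MSE:", mse_scores.std())
```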

Polynomial Regression Model:

We explore the relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC) using a polynomial regression model.

The polynomial regression model is expressed by the equation:

% DIABETIC = β₀ + β₁ · (% OBESE) + β₂ · (% OBESE)² + ε

Here:

  • The intercept (β₀) is approximately −58.13
  • The coefficient for the linear term (β₁) is approximately 7.65
  • The coefficient for the quadratic term (β₂) is approximately −0.22

The negative coefficient for the quadratic term suggests that there is a turning point in the relationship between the percentage of obese individuals and the percentage of diabetic individuals.

After plotting the graph, the polynomial curve suggests a non-linear relationship between the percentage of obese individuals (% OBESE) and the percentage of diabetic individuals (% DIABETIC). As the percentage of obese individuals increases, the percentage of diabetic individuals also increases, but at a decreasing rate, indicating a plateauing effect at higher percentages of obesity.
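For completeness, here is one way the fitted coefficients and the turning point of that curve could be recovered with numpy; the file name and column labels are assumptions.

```python
# Fit the degree-2 polynomial and locate the turning point of the curve.
import numpy as np
import pandas as pd

df = pd.read_csv("cdc_counties.csv")   # hypothetical file name
x = df["% OBESE"].to_numpy()
y = df["% DIABETIC"].to_numpy()

b2, b1, b0 = np.polyfit(x, y, deg=2)   # polyfit returns the highest-degree term first
print(f"intercept ≈ {b0:.2f}, linear ≈ {b1:.2f}, quadratic ≈ {b2:.2f}")

# With a negative quadratic coefficient, the curve peaks at x = -b1 / (2 * b2)
print("turning point at % OBESE ≈", -b1 / (2 * b2))
```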

K-Fold Cross Validation on Diabetes dataset

In this post we perform polynomial regression with diabetes as the response variable and inactivity and obesity as the predictor variables. The error is measured twice: with and without k-fold cross-validation. A code sketch follows the list below.

  1. Model Training (Pre k-fold):
    • A polynomial regression model of degree 2 was trained on the entire dataset.
    • The Mean Squared Error (MSE) was calculated on the entire dataset to obtain a pre k-fold error estimate of 0.326.
  2. Cross-validation (Post k-fold):
    • K-fold cross-validation with k = 5 was applied to the entire dataset to obtain a more robust estimate of the model’s performance.
    • The average MSE across the 5 folds was calculated to obtain a post k-fold error estimate of 0.364.
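A minimal sketch of that comparison, assuming the response and predictors live in columns named '% DIABETIC', '% INACTIVE', and '% OBESE' (these labels are assumptions):

```python
# Compare in-sample MSE (pre k-fold) with 5-fold cross-validated MSE (post k-fold).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cdc_counties.csv")     # hypothetical file name
X = df[["% INACTIVE", "% OBESE"]]
y = df["% DIABETIC"]

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Pre k-fold: train and evaluate on the entire dataset (optimistic)
model.fit(X, y)
print("pre k-fold MSE:", mean_squared_error(y, model.predict(X)))

# Post k-fold: average MSE over 5 folds (more realistic)
cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("post k-fold MSE:", cv_mse.mean())
```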

Results and Discussion:

  • The pre k-fold error was lower compared to the post k-fold error (0.326 vs 0.364). This suggests that the model may have initially overfit to the specific structure of the entire dataset.
  • The post k-fold error, which is higher, indicates a more realistic estimate of the model’s performance across different subsets of the data. This suggests that the model’s performance has some variance across different subsets of the data.
  • K-fold cross-validation provides a more robust and conservative estimate of the model’s performance, which is crucial for understanding the model’s generalisation ability.

The increase in error from pre k-fold to post k-fold does not necessarily indicate that k-fold cross-validation is not good for this data. Instead, it provides a more realistic estimate of the model’s performance across different subsets of the data. Here are a few points to consider:

  1. Overfitting:
    • Without cross-validation, the model might be overfitting to the specific structure of the entire dataset. The lower error in the pre k-fold scenario might be because the model has learned the noise in the data rather than the underlying pattern.
  2. Robustness:
    • Cross-validation provides a more robust estimate of the model’s performance. It does this by training and evaluating the model on different subsets of the data, which gives a better understanding of how the model performs across a variety of data scenarios.
  3. Variance:
    • The higher error in the post k-fold scenario suggests that the model’s performance has some variance across different subsets of the data. This is valuable information as it indicates that the model’s performance is not consistent across the dataset.
  4. Generalisation:
    • Cross-validation helps to assess the model’s ability to generalize to unseen data, which is crucial for building models that perform well in real-world scenarios.
  5. Better Model Selection:
    • Cross-validation can be particularly useful in model selection, where different models or hyperparameters are being compared. It provides a more reliable basis for choosing the model that is likely to perform best on unseen data.


Conclusion: The analysis illustrates the importance of cross-validation in providing a more reliable estimate of model performance. Although the post k-fold error was higher, it offers a more realistic insight into how well the model may perform across various data scenarios, thus underscoring the value of cross-validation in model evaluation and selection.

Unveiling the Mystique: Prediction Error Estimation and Validation Techniques in Model Building

In the realm of data science, building a model is only half the battle won. The crux lies in understanding its performance and reliability when exposed to unseen data. This is where the concepts of estimating prediction error and validation techniques enter the fray.

Estimating Prediction Error and Validation Set Approach:

Prediction error refers to the discrepancy between the predicted values generated by a model and the actual values. It’s crucial to estimate this error to understand how well the model will perform on unseen data. One common approach to estimate prediction error is by splitting the available data into two parts: a training set and a validation set.

  • Training Set: This part of the data is used to train the model.
  • Validation Set: This part of the data is held back and not shown to the model during training. It is used to evaluate the model’s performance and estimate the prediction error.

The validation set approach provides an unbiased estimate of the model’s performance as it’s evaluated on data it hasn’t seen during training.

K-fold Cross-Validation: The Validation Set Approach is a solid starting point but might not be data efficient, especially when the dataset is limited. This is where K-fold Cross-Validation (CV) steps in, providing a more reliable performance estimate. In K-fold CV, the data is divided into k equal-sized “folds.” The model is trained k times, each time using k−1 folds for training and the remaining fold for validation. This mechanism ensures that each data point is used for validation exactly once. However, a meticulous approach is essential. The Right Way is to ensure that data preprocessing steps are applied separately within each fold to prevent data leakage, and parameter tuning should be based solely on the validation sets. On the flip side, The Wrong Way could involve preprocessing the entire dataset upfront or tuning parameters based on the test set, both of which can lead to misleading performance estimates.

Cross-Validation: The Right and Wrong Ways: Implementing cross-validation correctly is a cornerstone for an unbiased evaluation of the model’s performance. A correct implementation entails fitting all preprocessing steps on the training data and tuning parameters based on validation sets. In contrast, incorrect implementations might involve preprocessing the entire dataset before splitting or tuning parameters based on the test set, which can lead to optimistic, and often misleading, performance estimates.
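The difference between the right and the wrong way is easiest to see in code. The sketch below contrasts the two, using a standard scaler as a stand-in for any preprocessing step; the data is synthetic and the ridge model is just an example.

```python
# Right vs. wrong cross-validation when a preprocessing step is involved.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)

# WRONG: the scaler sees the whole dataset before splitting, leaking information
# from the validation folds into training (the effect can be large in practice).
X_leaky = StandardScaler().fit_transform(X)
wrong = cross_val_score(Ridge(), X_leaky, y, cv=5, scoring="neg_mean_squared_error")

# RIGHT: the scaler is refit inside each training fold via a pipeline.
pipe = make_pipeline(StandardScaler(), Ridge())
right = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")

print("wrong-way MSE:", -wrong.mean())
print("right-way MSE:", -right.mean())
```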

I applied these techniques to a sample crime-rate prediction dataset from the internet. Using the Validation Set Approach, the model gave an MSE of 45.0, which might suggest that it performs somewhat poorly on unseen data. However, after applying K-fold Cross-Validation (K = 5), the average MSE across all folds is 38.2, which suggests that the model’s performance is slightly better than initially thought. Moreover, the variation in MSE across the folds in K-fold CV can also provide insight into the consistency of the model’s performance across different subsets of the data.

In conclusion, the journey from model building to deployment is laden with critical assessments, where validation techniques play a pivotal role. By astutely employing techniques like the Validation Set Approach and K-fold CV, and adhering to the principles of correct cross-validation, data scientists can significantly enhance the reliability and efficacy of their models. Through these validation lenses, one can not only measure but also improve the model’s performance, steering it closer to the coveted goal of accurately interpreting unseen data.


Understanding Polynomial Regression and Step Function

What is Polynomial Regression?:

Polynomial regression is an extension of linear regression where we model the relationship between a dependent variable and one or more independent variables using a polynomial. Unlike linear regression, which models the relationship using a straight line, polynomial regression models the relationship using a curve.

The general form of a polynomial regression equation of degree n is:

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε

where:

  • y is the dependent variable,
  • x is the independent variable,
  • β₀, β₁, …, βₙ are the coefficients, and
  • ε is the error term.

Polynomial regression is used when the relationship between the variables is nonlinear and can be captured better with a curved line.

What is Step Function (Piecewise Constant Regression)?:

Step functions allow us to model the relationship between a dependent variable and one or more independent variables using step-like or piecewise constant functions. In this approach, the range of the independent variable(s) is divided into disjoint segments, and a separate constant value is fit to the dependent variable within each segment.

The general form of a step function model is:

y = β₀ + β₁·I(x ∈ C₁) + β₂·I(x ∈ C₂) + … + βₖ·I(x ∈ Cₖ) + ε

where:

  • y is the dependent variable,
  • x is the independent variable,
  • I(·) is an indicator function that equals 1 if the condition inside the parentheses is true and 0 otherwise,
  • C₁, C₂, …, Cₖ are the disjoint segments, and
  • ε is the error term.

Step functions are used when the relationship between the variables is better captured by different constants in different ranges of the independent variable.
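One simple way to fit such a model is to bin the predictor and fit a constant per bin. The sketch below does this with pandas; the file name and column labels are assumptions, and four bins is an arbitrary illustrative choice.

```python
# Piecewise-constant (step function) regression via binning the predictor.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cdc_counties.csv")      # hypothetical file name
x = df["% OBESE"]
y = df["% DIABETIC"]

# Cut the predictor into 4 disjoint segments and one-hot encode the bins
bins = pd.cut(x, bins=4)
X_step = pd.get_dummies(bins, drop_first=True)

step_model = LinearRegression().fit(X_step, y)

# The fitted piecewise-constant levels are simply the per-segment means
print(y.groupby(bins).mean())
```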

Both these plots provide different ways of understanding how the rates of obesity and inactivity relate to the rate of diabetes across different counties.

Polynomial Regression Analysis on the Diabetes Data:
– Our analysis suggested that a linear model (polynomial of degree 1) provided the lowest test error, indicating that the relationships between obesity/inactivity rates and diabetes rates are relatively linear within the examined range.
– However, the analysis didn’t provide a clear indication of a non-linear (polynomial) relationship as higher-degree polynomials did not yield a lower test error.

Step Function (Piecewise Constant Regression) Analysis on the Diabetes Data:
– The plots represent piecewise constant models (step functions) that divide the range of obesity and inactivity rates into distinct segments. Within each segment, the rate of diabetes is approximated as a constant value.
– The breakpoints between segments are determined to minimize the within-segment variance in diabetes rates. This suggests that there could be specific thresholds in obesity and inactivity rates where the rate of diabetes changes.
– For example, in the first plot, as the obesity rate increases, we see a step-wise increase in the diabetes rate. Similarly, in the second plot, as the inactivity rate increases, the diabetes rate also increases in a step-wise manner.

Interpretation
– These analyses help to visualize and understand how obesity and physical inactivity may relate to diabetes rates across different counties.
– While the polynomial regression suggested a linear relationship, the step function analysis revealed that there might be specific ranges of obesity and inactivity rates that correspond to different levels of diabetes rates.
– The step function analysis provides a more segmented view which could be indicative of thresholds beyond which the rate of diabetes significantly increases.

Addressing High Variability, High Kurtosis, and Non-Normality in the Data

High variability (often indicated by a large variance or standard deviation) and high kurtosis (indicating heavy tails or outliers in the distribution) can pose challenges when performing hypothesis testing, as they might violate the assumptions of certain tests. Here are some strategies to deal with these challenges (a brief code sketch follows the list):

  1. Transformation:
    • If the data is positively skewed, consider applying a transformation like the square root, logarithm, or inverse to make the distribution more symmetric. For negatively skewed data, you might consider a squared transformation.
    • The Box-Cox transformation is a general family of power transformations that can stabilize variance and make the data more normally distributed.
  2. Use Non-Parametric Tests:
    • When the assumptions of parametric tests (like the t-test) are violated, consider using non-parametric tests. For comparing two groups, the Mann-Whitney U test can be used instead of an independent t-test. For related samples, the Wilcoxon signed-rank test can be used.
  3. Robust Statistical Methods:
    • Some statistical methods are designed to be “robust” against violations of assumptions. For example, instead of the standard t-test, you can use Yuen’s t-test, which is robust against non-normality and heteroscedasticity.
  4. Bootstrap Methods:
    • Bootstrap resampling involves repeatedly sampling from the observed dataset (with replacement) and recalculating the test statistic for each sample. This method can provide an empirical distribution of the test statistic under the null hypothesis, which can be used to compute p-values.
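As a quick illustration of strategies 1 and 2, the sketch below applies a Box-Cox transformation and a Mann-Whitney U test to two synthetic, skewed samples; the data is made up purely for demonstration.

```python
# Strategies 1 and 2 in practice: Box-Cox transformation and a non-parametric test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=1.0, sigma=0.8, size=80)   # skewed, heavy-tailed data
group_b = rng.lognormal(mean=1.2, sigma=0.8, size=80)

# 1. Box-Cox requires strictly positive data; it returns the transformed values
#    and the estimated lambda of the power transformation.
a_transformed, lam_a = stats.boxcox(group_a)
print("Box-Cox lambda for group A:", lam_a)

# 2. Non-parametric comparison of the two groups (no normality assumption)
u_stat, p_val = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print("Mann-Whitney U:", u_stat, " p-value:", p_val)
```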

The t-test is a parametric hypothesis test used to determine if there is a significant difference between the means of two groups. It’s one of the most commonly used hypothesis tests and comes with its own set of assumptions, which include:

  1. Independence of observations: The observations between and within groups are assumed to be independent of each other.
  2. Normality: The data for each of the two groups should be approximately normally distributed.
  3. Homogeneity of variances: The variances of the two groups should be approximately equal, though the t-test is somewhat robust to violations of this assumption, especially with equal sample sizes.

When dealing with high variability and high kurtosis:

  1. Impact on Normality Assumption: High kurtosis, especially leptokurtosis (kurtosis greater than 3), suggests that data might have heavy tails or sharp peaks, which is an indication of non-normality. Since the t-test assumes the data to be normally distributed, this can be a violation.
  2. Impact on Variance Assumption: High variability might also indicate potential issues with the assumption of homogeneity of variances, especially if the variability is significantly different between the two groups.

Given these challenges, here’s how the t-test fits into the picture (a code sketch follows this list):

  1. Welch’s t-test: If there’s a concern about the equality of variances, you can use Welch’s t-test, which is an adaptation of Student’s t-test and does not assume equal variances.
  2. Transformations: As mentioned, transformations (e.g., logarithmic, square root) can be used to stabilize variance and make the data more normally distributed, making it more suitable for a t-test.
  3. Non-parametric Alternatives: If the data is non-normal and transformations don’t help, consider using non-parametric tests like the Mann-Whitney U test instead of the t-test.
  4. Bootstrap Methods: For data with high variability and kurtosis, bootstrapping can be used to estimate the sampling distribution of the mean difference, and a t-statistic can be computed based on this empirical distribution.
  5. Effect Size: Regardless of the test used, always report the effect size (like Cohen’s d for t-test) as it provides a measure of the magnitude of the difference and is not as dependent on sample size as p-values.
  6. Diagnostic Checks: Before performing a t-test, always check its assumptions using diagnostic tools. For normality, use Q-Q plots or tests like Shapiro-Wilk. For homogeneity of variances, use Levene’s test.

In conclusion, while the t-test is a powerful tool for comparing means, its assumptions must be met for the results to be valid. High variability and kurtosis can challenge these assumptions, but with the right strategies and alternative methods, you can ensure robust and reliable results.

Exploring the Relationship Between Diabetes and Social Vulnerability: A Data-Driven Approach

In this post, we’ll delve into a dataset that represents the diagnosed diabetes percentage in various counties and its relationship with the Social Vulnerability Index (SVI). By understanding this relationship, we can gain insights into how social factors might influence health outcomes. We’ll be using various statistical and machine learning techniques to dissect this relationship.

Interactions and Nonlinearity:

  • We checked for interactions between ‘Diagnosed Diabetes Percentage’ and ‘Overall SVI’. This involves understanding if the effect of one variable changes based on the value of the other.
  • This means we want to see if the effect of ‘Diagnosed Diabetes Percentage’ on a dependent variable changes based on the value of ‘Overall SVI’, and vice versa.

Moving Beyond Linearity

  • We plotted ‘Diagnosed Diabetes Percentage’ against ‘Overall SVI’. From the plot, there seems to be a positive correlation between the two variables, but the relationship doesn’t appear to be strictly linear.


Polynomial Regression:

  • We introduced squared terms for both predictors to capture any quadratic relationships.
  • The mean squared error (MSE) for the polynomial regression model is 0.0. This indicates a perfect fit, which is unusual and could suggest overfitting.

Step Functions:

  • The ‘Diagnosed Diabetes Percentage’ was divided into intervals (bins), and a separate constant was fit for each interval.
  •  The step function model also resulted in an MSE of 0.0, reinforcing concerns about overfitting or data quality.

Our exploration suggests a non-linear relationship between ‘Diagnosed Diabetes Percentage’ and ‘Overall SVI’. However, the unusually perfect fits from our models warrant caution. In real-world scenarios, deeper data diagnostics, validation on separate datasets, and domain expertise are crucial to validate findings.