Exploring Age Patterns in Police Shootings

When we look at the victims of police shootings, their ages vary considerably, not just overall but also within racial groups. This variation matters because it helps us understand who is most affected and where efforts at improvement should focus.

race   Median   Mean    Std. Dev.   Variance   Kurtosis   Skewness
A        35     35.96     11.41      130.18     -0.572      0.333
B        31     32.93     11.26      126.69      0.902      0.974
H        33     33.59     10.59      112.25      0.831      0.814
N        32     32.65      8.91       79.34     -0.061      0.571
O        31     33.47     11.80      139.15     -0.470      0.582
W        38     40.13     13.05      170.30     -0.091      0.536
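As a side note, a table like this can be reproduced directly in pandas. Here is a minimal sketch, assuming the Washington Post data file and its “race” and “age” columns (the file name is my assumption):

```python
# Minimal sketch: per-race descriptive statistics of victim age.
# File name and column names are assumptions based on the
# Washington Post dataset described later in this blog.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["race", "age"])
stats_by_race = df.groupby("race")["age"].agg(
    ["median", "mean", "std", "var", pd.Series.kurt, "skew"]
)
print(stats_by_race)
```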

Asian Victims (Race A)

Asian victims are around 36 years old on average, but individual ages range from much younger to much older, a fairly wide spread. The 95% confidence interval for the mean age runs from about 34 to 38 years, which pins down the average reasonably well even though individual ages vary a lot.

Black Victims (Race B)

For Black victims, the average age is around 33 years old, again with considerable variation in individual ages. The 95% confidence interval is narrower, from about 32.4 to 33.5 years; this tighter estimate mostly reflects the larger number of cases in this group rather than less age variation.

Hispanic Victims (Race H)

Hispanic victims have an average age of about 33.6 years. Their ages vary with a pattern similar to that of Black victims, and the 95% confidence interval for the mean runs from about 33 to 34 years.

Native American Victims (Race N)

Looking at Native American victims, we see an average age of around 33 years old (32.65), with a 95% confidence interval from about 31 to 34 years. The wider interval here likely reflects the smaller number of cases rather than greater age spread; this group actually has the lowest age variance in the table.

Other Races (Race O)

The category of “Other Races” covers a variety of racial backgrounds. The average age here is about 33 years old, with a wide 95% confidence interval from about 28 to 39 years. That width reflects both how few cases fall into this category and the mix of backgrounds it contains.

White Victims (Race W)

White victims tend to be older on average, around 40 years old. The 95% confidence interval for the mean runs from about 39.7 to 40.6 years, a relatively narrow range.

Age Variance and Confidence Intervals:

Here’s a table showing the age variance and the 95% confidence intervals for the average age of victims from different races:

race   Variance   95% CI Lower Bound   95% CI Upper Bound
A       130.18          33.99                37.93
B       126.69          32.40                33.45
H       112.25          32.98                34.20
N        79.34          30.95                34.35
O       139.15          28.17                38.78
W       170.30          39.68                40.57

The “Variance” column shows how much the ages vary within each racial group. The “95% CI Lower Bound” and “95% CI Upper Bound” columns give a range in which we are 95% confident the true average age of victims from each racial group falls.
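For anyone who wants to reproduce these intervals, here is a minimal sketch using a t-based confidence interval for the mean; the file and column names are the same assumptions as in the earlier sketch:

```python
# Minimal sketch: 95% t-based confidence interval for the mean age per race.
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["race", "age"])

for race, ages in df.groupby("race")["age"]:
    mean, sem = ages.mean(), stats.sem(ages)
    lo, hi = stats.t.interval(0.95, df=len(ages) - 1, loc=mean, scale=sem)
    print(f"{race}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```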

Understanding the Bigger Picture

What does all of this tell us? It shows that age patterns in police shootings are complex and vary significantly across different races. By understanding these patterns, we can start asking important questions about why these variations exist.

EDA on Clusters

While analyzing the numbers from the states of Arizona (AZ) and Georgia (GA), I observed some interesting patterns. In Arizona, there were 230 instances recorded for category 0, followed by 64 for category 1. Category 3 came in third with 24 instances, and lastly, category 2 had only 1 instance.

On the other hand, in Georgia, category 2 led with 150 instances. Category 0 had 55 instances, while category 3 had 38. Category 1 recorded the least in Georgia with 26 instances.

These figures provide a snapshot of the distribution of instances across different categories for both states. While Arizona saw a dominant presence in category 0, Georgia had category 2 as its leading category.
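For reference, here is a minimal sketch of how such per-state counts can be tabulated. The tiny data frame is a hypothetical stand-in, since the real labels come from the clustering runs discussed below:

```python
# Minimal sketch: tally cluster labels per state with a crosstab.
# The frame below is hypothetical stand-in data; in practice `cluster`
# would hold the labels produced by k-means or DBSCAN.
import pandas as pd

clustered = pd.DataFrame({
    "state":   ["AZ", "AZ", "AZ", "GA", "GA", "GA"],
    "cluster": [0, 0, 1, 2, 2, 0],
})
print(pd.crosstab(clustered["state"], clustered["cluster"]))
```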

Next, I will analyze the clustering output for the other states.

DBSCAN and K-Means

K-means is a clustering algorithm that aims to partition a set of data points into a specified number of groups or “clusters.” The process starts by randomly selecting “k” initial points called “centroids.” Every data point is then assigned to the nearest centroid, and based on these assignments, new centroids are recalculated as the average of all points in the cluster. This process of assigning points to the closest centroid and recalculating centroids is repeated until the centroids no longer change significantly. The result is “k” clusters where data points in the same cluster are closer to each other than to points in other clusters. The user needs to specify the number “k” in advance, which represents the number of desired clusters.
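Here is a minimal k-means sketch using scikit-learn on toy two-dimensional points; the library choice and data are illustrative, not the project’s exact setup:

```python
# Minimal sketch: k-means clustering with scikit-learn on toy 2-D points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final, recalculated centroids
```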

DBSCAN is a clustering algorithm that groups data points based on their proximity and density. Instead of requiring the user to specify the number of clusters in advance (like k-means), DBSCAN examines the data to find areas of high density and separates them from sparse regions. It works by defining a neighborhood around each data point, and if enough points are close together (indicating high density), they are considered part of the same cluster. Data points in low-density regions, which don’t belong to any cluster, are treated as noise. This makes DBSCAN especially useful for discovering clusters of varying shapes and sizes, and for handling noisy data.
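And an analogous DBSCAN sketch; eps (the neighborhood radius) and min_samples are exactly the density parameters whose pitfalls are listed below:

```python
# Minimal sketch: DBSCAN on toy points; the isolated point gets label -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.1, 1.0], [0.9, 1.0],   # dense region 1
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],   # dense region 2
              [9.0, 0.0]])                          # isolated point -> noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)
```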

Potential pitfalls:

DBSCAN:

  1. Requires selecting density parameters (a neighborhood radius and a minimum point count); a poor choice can miss clusters or merge separate ones.
  2. Struggles when clusters have different densities, and may classify sparse clusters as noise.
  3. Performance can degrade on high-dimensional data, where distance measures become less meaningful.
  4. Points close to two clusters might be assigned to either one arbitrarily.

K-means:

  1. The number of clusters must be specified beforehand; a wrong choice can lead to poor clustering results.
  2. Random initialization can affect the final clusters, and the algorithm may settle into a local optimum depending on the initial points.
  3. Assumes clusters are spherical and roughly the same size, so it struggles with elongated or irregularly shaped clusters.
  4. Sensitive to outliers, which can distort cluster centroids.

EDA on police shootings

A detailed analysis of the data from 2015 to 2023 offers some insights into this pressing issue. Over this period, the data reveals a fairly consistent trend in the number of incidents each month, with minor fluctuations. This consistency underscores the persistence of the issue over time.
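A sketch of how that monthly tally can be checked, assuming the dataset’s “date” column:

```python
# Minimal sketch: incidents per month; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv", parse_dates=["date"])
monthly = df.set_index("date").resample("M").size()
print(monthly.describe())  # spread of monthly counts over 2015-2023
```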

A dive into the racial distribution of these incidents presents a more nuanced picture. Whites, who constitute a significant portion of the U.S. population, account for approximately 50.89% of the fatal police shootings. However, the figures for the Black community are particularly striking. Despite making up around 13% of the U.S. population, they represent a disproportionate 27.23% of the fatal police shootings. Hispanics follow, accounting for roughly 17.98%, while Asians, Native Americans, and others make up a smaller fraction, with 1.99%, 1.62%, and 0.29% respectively.
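Shares like these can be computed in one line; note that value_counts skips rows with a missing race, which is one assumption behind figures like these:

```python
# Minimal sketch: share of incidents by recorded race, as a percentage.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")
print((df["race"].value_counts(normalize=True) * 100).round(2))
```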

The data thus sheds light on the pressing need for a more comprehensive understanding and potential reforms in policing, especially considering the stark disparities in how different racial groups are affected.

Fatal Police Shootings of Black vs. White Individuals

Today, we ran an analysis to discern potential discrepancies between the shootings of White and Black individuals. Our first step was extracting key statistical measures for both groups: minimum, maximum, mean, median, standard deviation, skewness, and kurtosis. These metrics provided a foundational understanding, which we then visualized using histograms.
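A minimal sketch of that first step, assuming the dataset’s “race” and “age” columns and the race codes “B” and “W”:

```python
# Minimal sketch: summary measures and overlaid age histograms for the
# two groups; file, column names, and race codes are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv")
groups = {r: df.loc[df["race"] == r, "age"].dropna() for r in ("B", "W")}

for name, ages in groups.items():
    print(name, ages.min(), ages.max(), round(ages.mean(), 2),
          ages.median(), round(ages.std(), 2),
          round(stats.skew(ages), 2), round(stats.kurtosis(ages), 2))
    plt.hist(ages, bins=30, alpha=0.5, label=name)

plt.xlabel("age")
plt.legend()
plt.show()
```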

Upon bringing age into the analysis, we noticed that the age distributions of both Black and White victims deviate from a normal distribution.

Given the non-normality of the data, we questioned whether a t-test was appropriate for calculating the p-value. Recognizing the limitation that the data is not normally distributed, we used the Monte Carlo method to estimate the p-value. Our results suggest that the observed average age difference between Black and White victims of police shootings is highly unlikely to have occurred merely by chance.
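Here is a sketch of the Monte Carlo estimate, implemented as a permutation test, which is one common way to do it; the exact procedure in our notebook may differ:

```python
# Minimal sketch: Monte Carlo (permutation) p-value for the difference in
# mean age between White and Black victims; file and race codes are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")
ages_w = df.loc[df["race"] == "W", "age"].dropna().to_numpy()
ages_b = df.loc[df["race"] == "B", "age"].dropna().to_numpy()

observed = ages_w.mean() - ages_b.mean()
combined = np.concatenate([ages_w, ages_b])
rng = np.random.default_rng(0)

n_iter, count = 10_000, 0
for _ in range(n_iter):
    rng.shuffle(combined)  # random relabeling of the ages
    diff = combined[:len(ages_w)].mean() - combined[len(ages_w):].mean()
    count += abs(diff) >= abs(observed)

print(f"Monte Carlo p-value: {count / n_iter:.4f}")
```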

To quantify the magnitude of this difference, we used Cohen’s d. The resulting value of 0.577 indicates a medium effect size, pointing to a meaningful disparity between the two groups.
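Cohen’s d itself is just the mean difference scaled by a pooled standard deviation. A sketch, reusing the two age arrays from the previous snippet:

```python
# Minimal sketch: Cohen's d with a pooled standard deviation.
# `ages_w` and `ages_b` are the arrays built in the previous snippet.
import numpy as np

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(round(cohens_d(ages_w, ages_b), 3))  # the text reports 0.577
```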

However, an important question persists: how can we incorporate data from all races to ensure a holistic understanding?

EDA on project – 2

Today, I used the data’s location points to make a map. Some of these points fell outside the main USA area, and I wanted the map to show only the contiguous United States, so I set bounding borders and removed the points outside them. I used the Basemap toolkit in Python for the map and plotted the points with matplotlib.pyplot.
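A sketch of the filtering and plotting step; the bounding-box values are approximate contiguous-US limits (my assumption), not exact borders:

```python
# Minimal sketch: keep points inside an approximate contiguous-US box and
# plot them with Basemap; file name and bounds are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.basemap import Basemap

df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["latitude", "longitude"])
inside = df["latitude"].between(24, 50) & df["longitude"].between(-125, -66)
df = df[inside]

m = Basemap(projection="merc", llcrnrlat=24, urcrnrlat=50,
            llcrnrlon=-125, urcrnrlon=-66, resolution="l")
m.drawcoastlines()
m.drawstates()
x, y = m(df["longitude"].values, df["latitude"].values)
m.scatter(x, y, s=3, color="red")
plt.show()
```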

Below are a few of my observations:

  1. Higher Incidents in the East: More incidents are observed in the eastern half of the USA, likely due to higher population density.
  2. Urban Concentration: Major cities and metropolitan areas, especially on the west coast (like Los Angeles) and in the south (like Houston), have a notable number of incidents.
  3. Central USA is Sparse: Fewer incidents are seen in the central Great Plains region, possibly due to fewer large cities and lower population density.
  4. Dense Northeast: The northeast, including areas around New York and Pennsylvania, shows a high concentration of incidents.
  5. Natural Regions: Mountainous and forested areas have fewer incidents, reflecting lower populations.

Project – 2 Further analysis of the data

In continuation of my previous blog post, I have analyzed the latest sheet, which includes the latitude and longitude data. Looking through it, I found that some information is missing: there are 5,362 missing values, almost 4% of all cells. The “race” column has the most missing entries, with 1,517. Other columns like “flee”, “latitude”, and “longitude” also have many gaps. This could make analysis and prediction tricky, and we may need to fill in the gaps carefully.
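The missing-value counts quoted above can be checked with a short snippet; the file name is again my assumption:

```python
# Minimal sketch: missing values per column and overall share of empty cells.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")
missing = df.isna().sum().sort_values(ascending=False)
print(missing.head(10))
print(f"total missing: {missing.sum()} ({missing.sum() / df.size:.1%} of all cells)")
```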

In the data we have, people’s ages range from 2 to 92 years old, with an average (mean) age of 37.29 years. The latitude and longitude values show where events happened all over the U.S., along with a few points outside the country that should be eliminated. The data covers 2,963 different days, and the day with the most events (9) was February 1, 2018. As for the type of threat, “shoot” was recorded most often, in 2,461 incidents. In 4,703 incidents the person was not fleeing. Lastly, a gun was involved in 5,082 incidents.

Project – 2, analyzing data from the Washington Post

As stated, we have started working on Project 2, which inspects instances of police shootings in the United States, with our data coming from a repository managed by the Washington Post. The data has records starting from January 2, 2015, and is continually updated with new entries every week. A challenge has arisen: approximately 23% of the data is missing, which may complicate our analysis. Regardless, we aim to explore the available data thoroughly to uncover any trends, patterns, or noteworthy insights about these events. As we move forward, we will seek answers to a set of questions that will shape our understanding of the occurrences and potentially inform policy and practice. Our goal is to make the best use of the available information to understand the circumstances, patterns, and potential root causes of fatal police shootings across the country.

  1. What specific analyses and explorations are intended to be conducted on the data related to fatal police shootings?
  2. What strategies and methodologies should be employed to address and manage the missing data within the dataset?
  3. Based on the available data, what predictive models or forecasts might be developed regarding fatal police shootings in the future?
  4. Who constitutes the primary audience for the findings from this data analysis, and how might the insights derived be of utility to them?

Report writing – October 04, 2023

As we approach the concluding phase of Project 1, I have commenced the gathering of information from the team and begun the compilation process for the report. This involves collecting various elements such as graphs, code snippets, charts, and results to ensure they are systematically organized and accurately placed within the report. Our aim is to submit a preliminary copy for review before proceeding to the final submission.

Bootstrapping – October 2, 2023

Bootstrapping is a statistical method that helps to estimate the variability of a statistic by creating numerous re-sampled versions of a dataset, and is especially handy with small sample sizes. Essentially, it involves repeatedly drawing samples, with replacement, from a given dataset, and calculating a statistic (e.g., mean, median) or model parameter for each sample. This is done thousands of times to build a distribution of the statistic, which can then be analyzed to estimate its standard error, confidence intervals, and other properties. In model development, bootstrapping aids in understanding and reducing variability and bias in predictions, enhancing model stability and reliability. By repeatedly training and validating models on different bootstrap samples, we gain insights into the model’s robustness and generalizability, allowing for informed statistical inferences without additional data collection. This technique serves as a practical tool for exploring sample space and deriving meaningful statistical insights when dealing with limited data.
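As a concrete illustration, here is a minimal bootstrap sketch for the mean age in our data, assuming the dataset’s “age” column; the statistic and the number of iterations are choices, not fixed rules:

```python
# Minimal sketch: bootstrap distribution of the mean age, with a standard
# error and a percentile 95% CI; file and column names are assumptions.
import numpy as np
import pandas as pd

ages = pd.read_csv("fatal-police-shootings-data.csv")["age"].dropna().to_numpy()
rng = np.random.default_rng(0)

boot_means = np.array([
    rng.choice(ages, size=len(ages), replace=True).mean()  # resample with replacement
    for _ in range(10_000)
])

print(f"standard error: {boot_means.std(ddof=1):.3f}")
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile CI: ({lo:.2f}, {hi:.2f})")
```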

I plan to apply bootstrapping to our data to estimate the sampling distribution, with the aim of investigating whether this approach enhances model stability.