Project 1 – Resubmission
Anomaly detection
Today, we’re focusing on detecting anomalies within the economic indicators dataset. Anomaly detection is a powerful statistical technique used to identify unusual patterns that do not conform to expected behavior. These outliers can often provide critical insights.
In essence, think of anomaly detection as the process of finding the needles in the haystack. In the context of our economic data, these ‘needles’ could be unusual spikes or dips in indicators like unemployment rates or hotel occupancy. Identifying these anomalies is crucial because they could signal significant economic events, shifts, or even errors in data collection.
We plan to employ the Isolation Forest method, a sophisticated algorithm well-suited for pinpointing anomalies in complex datasets. This technique is especially effective in handling large, multidimensional data, making it ideal for our purpose.
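To make this concrete, here is a minimal sketch of how Isolation Forest could be applied with scikit-learn; the file name, the chosen indicator columns, and the contamination level are assumptions for illustration rather than final modeling choices.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the economic indicators data (file name assumed for illustration)
df = pd.read_csv("economic_indicators.csv")

# Indicator columns assumed for illustration; any numeric indicators could be used
features = df[["unemp_rate", "hotel_occup_rate", "med_housing_price"]].dropna()

# Fit an Isolation Forest; contamination is a rough guess at the anomaly share
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
features["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

print(features[features["anomaly"] == -1])  # the flagged "needles"
```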
Clustering Analysis
Today, we’re planning to perform clustering analysis to discover distinct phases within the economy. This challenging yet accessible approach will allow us to group similar periods in economic indicators, painting a clearer picture of our economic cycles.
Clustering is like grouping different moments in time based on their economic characteristics. Imagine categorizing years or months when the economy showed similar trends in unemployment, hotel occupancy, and housing prices. This method can help us identify periods of growth, stability, or recession in a more structured manner.
By applying clustering techniques to our dataset, we can uncover patterns that might not be immediately evident. It’s like finding hidden stories in the data – stories about when and how our economy thrived or faced challenges. This method goes beyond looking at individual indicators by revealing how they interact over time.
We plan to use a selection of key economic indicators for this analysis. The idea is to see how these indicators cluster together at different times, giving us insight into the economic conditions of those periods. We’ll be looking for correlations and patterns that define distinct economic phases.
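A rough sketch of what this clustering might look like with scikit-learn’s KMeans is shown below; the file name, the selected indicators, and the number of clusters are placeholders we would still need to justify.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("economic_indicators.csv")  # file name assumed

# Indicators assumed for illustration
cols = ["unemp_rate", "hotel_occup_rate", "med_housing_price"]
X = df[cols].dropna()

# Standardize so no single indicator dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Group the months into four tentative economic phases
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df.loc[X.index, "phase"] = kmeans.fit_predict(X_scaled)

print(df.groupby("phase")[cols].mean())  # average indicator profile of each phase
```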
Seasonal trends
In this blog, we take a closer look at the seasonal rhythms of the housing market and tourism sector, grounded in an analysis of median housing prices and hotel occupancy rates.
The first graph gives a picture of how hotel occupancy rates ebb and flow throughout the year. These patterns are a window into the tourism industry’s seasonal heartbeat, where certain months show higher occupancy, possibly due to holidays or favorable weather, while others dip, reflecting off-peak times.
The second graph showcases the average monthly trends in median housing prices. Unlike hotel occupancy, the housing market’s seasonality is subtler but still telling. We might observe higher activity and prices during specific times of the year, aligned with general buying patterns or economic cycles.
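For reference, a minimal sketch of how such monthly averages could be computed with pandas follows; the file and column names are assumptions based on the dataset description.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("economic_indicators.csv")  # file name assumed

# Average each indicator by calendar month to expose seasonal patterns
monthly = df.groupby("Month")[["hotel_occup_rate", "med_housing_price"]].mean()

monthly["hotel_occup_rate"].plot(kind="bar", title="Average hotel occupancy by month")
plt.ylabel("Occupancy rate")
plt.show()
```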
Understanding these seasonal trends is more than an academic exercise. For businesses in the tourism industry, these insights are crucial for planning and strategizing. Similarly, for real estate professionals and homebuyers, knowing when the market tends to peak or cool can inform smarter decisions.
This analysis reminds us that both the housing market and tourism sector dance to the rhythm of seasonal patterns. It highlights the importance of timing in both industries and offers a nuanced view of how different times of the year can shape economic activity.
Foreclosure Trends
In our latest analysis, we’re focusing on foreclosure trends, a critical yet often overlooked aspect of the economy. By examining the changes in foreclosure petitions and deeds, we gain insights into the housing market’s stability and the broader economic situation.
The graph of foreclosure trends shows the ups and downs in petitions and deeds over time. An increase in foreclosures typically points to economic stress, like job losses, affecting homeowners’ ability to pay mortgages. On the flip side, a decrease suggests a healthier economy and stable housing market.
These trends are closely linked to other economic factors. For instance, higher unemployment can lead to more foreclosures, and shifts in housing prices can influence homeowners’ financial decisions.
High foreclosure rates impact more than just numbers; they affect community stability and reflect real challenges faced by individuals and families.
This analysis not only sheds light on the housing market but also offers a unique perspective on the economy’s overall health. Stay tuned as we continue to explore and decipher these economic patterns.
The Economy’s Domino Effect
Today, we’re analyzing the economic indicators dataset, where various elements are interlinked in fascinating ways. For instance, I’m analyzing how passenger traffic at Logan Airport can serve as a barometer for hotel occupancy rates, offering insights into the state of tourism and business travel. The interplay between the job market and the housing sector is another compelling area of focus. A thriving job market often fuels a strong demand in housing, whereas a sluggish job market can lead to a downturn in real estate activities. Additionally, the influence of major development projects on local economies is particularly noteworthy, illustrating how such initiatives can spur job growth and energize the housing market. This post is about untangling these economic threads, revealing how shifts in one sector can ripple through to others, painting a comprehensive picture of our economic landscape.
Housing Market Trends
In this blog, we go through the housing market, focusing on how median housing prices have changed over time. This journey is more than just about prices; it’s a reflection of the economy.
The graph of median housing prices we analyzed is like a roadmap showing the market’s highs and lows. When prices rise, it often signals a strong demand for homes, hinting at a robust economy with confident buyers. On the flip side, dips or plateaus in prices can suggest a cooling market, possibly due to economic challenges or changing buyer sentiments.
But these trends don’t exist in isolation. They’re intertwined with various economic threads like employment rates, interest rates, and overall economic health. For instance, a booming job market might boost people’s ability to buy homes, pushing prices up. Similarly, changes in interest rates can either encourage or discourage buyers, affecting prices.
Interestingly, we also noticed potential seasonal variations in the housing market. Certain times of the year may experience more activity, influencing prices subtly.
Understanding these nuances in housing prices is crucial. It tells us not just about the real estate market but also gives insights into broader economic conditions. This analysis is invaluable for buyers, sellers, investors, and policymakers, helping them make informed decisions in a landscape that’s always evolving.
Trend Analysis
Today, we’re taking a closer look at Boston’s economy through a trend analysis of key economic indicators. It’s a bit like being an economic detective, where we piece together clues to understand the bigger picture.
We focused on three main clues: the unemployment rate, hotel occupancy rates, and median housing prices. Each of these tells us something different. The unemployment rate is like a thermometer for the job market, showing us how many people are out of work. When this number goes down, it usually means more people have jobs, which is great news!
Next, we looked at how full hotels are, which is our hotel occupancy rate. This rate gives us a sneak peek into tourism and business travel. High occupancy often means more visitors and bustling business activities, while lower numbers might suggest the opposite.
Lastly, we delved into the median housing prices. This indicator is a bit like a window into the real estate market. Rising prices can indicate a high demand for homes, possibly signaling a strong economy. On the flip side, if prices drop or stagnate, it might mean the market is cooling down.
By analyzing these trends, we can get a sense of how the economy is faring.
Economic Indicator
For Project 3, we have chosen the Economic Indicators dataset.
The dataset contains various economic indicators, organized by year and month. Below is a summary of what each column represents:
- Year and Month: The time frame for the data, with separate columns for the year and month.
- logan_passengers: The number of passengers at Logan Airport.
- logan_intl_flights: The number of international flights at Logan Airport.
- hotel_occup_rate: The occupancy rate of hotels.
- hotel_avg_daily_rate: The average daily rate for hotel stays.
- total_jobs: The total number of jobs.
- unemp_rate: The unemployment rate.
- labor_force_part_rate: The labor force participation rate.
- pipeline_unit: Information related to housing or development projects, possibly the number of units.
- pipeline_total_dev_cost: The total development cost for projects in the pipeline.
- pipeline_sqft: The total square footage of development projects in the pipeline.
- pipeline_const_jobs: The number of construction jobs created by pipeline projects.
- foreclosure_pet: The number of foreclosure petitions.
- foreclosure_deeds: The number of foreclosure deeds.
- med_housing_price: The median housing price.
- housing_sales_vol: The volume of housing sales.
- new_housing_const_permits: The number of new housing construction permits issued.
- new-affordable_housing_permits: The number of permits issued for new affordable housing.
This dataset offers a comprehensive view of various economic factors including transportation (air travel), hospitality (hotels), employment, real estate, and housing market indicators. Each of these metrics can provide insights into the economic health and trends of the region or area being studied.
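As a starting point, here is a small sketch of how the dataset might be loaded and inspected with pandas; the file name is an assumption.

```python
import pandas as pd

# File name assumed; the dataset comes from Analyze Boston
df = pd.read_csv("economic_indicators.csv")

print(df.shape)         # rows and columns
print(df.dtypes)        # data type of each indicator
print(df.describe())    # summary statistics for the numeric columns
print(df.isna().sum())  # missing values per column
```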
SARIMA Model
The SARIMA model stands as a foundation in the world of time series analysis. An extension of the ARIMA model, SARIMA (Seasonal Autoregressive Integrated Moving Average) brings an added layer of sophistication to forecasting, particularly useful in handling seasonal data.
SARIMA is a statistical model used to predict future points in a time series. It’s particularly adept at handling data with seasonal patterns – like monthly sales data with peaks during holidays, or daily temperatures varying across seasons. The model extends ARIMA by integrating seasonality, making it more versatile.
Components:
The SARIMA model can be understood through its components: Seasonal (S), Autoregressive (AR), Integrated (I), and Moving Average (MA).
- Seasonal: This component models the seasonality in data, capturing regular patterns that repeat over a specific period.
- Autoregressive (AR): This part of the model captures the relationship between an observation and a specified number of lagged observations.
- Integrated (I): Integration involves differencing the time series to make it stationary, a necessary step for many time series models.
- Moving Average (MA): This component models the relationship between an observation and a residual error from a moving average model applied to lagged observations.
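Putting these components together, a minimal sketch of fitting a SARIMA model with statsmodels might look like the following; the chosen series and the (p, d, q)(P, D, Q, s) orders are illustrative placeholders, not tuned values.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("economic_indicators.csv")  # file name assumed

# Illustrative monthly series; for real use, give it a proper DatetimeIndex
series = df["hotel_occup_rate"]

model = SARIMAX(series,
                order=(1, 1, 1),               # non-seasonal AR, I, MA terms
                seasonal_order=(1, 1, 1, 12))  # seasonal terms with a 12-month cycle
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))  # forecast the next 12 months
```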
Stationary and Non-Stationary data in time series analysis
Time series analysis is a fascinating area of statistics and data science, where we study data that changes over time. Two key concepts in this field are ‘stationary’ and ‘non-stationary’ data. Let’s break these down in a way that balances simplicity with some technical insight.
Stationary data in a time series means the data behaves consistently over time. The average value (mean), the variability (variance), and how the data correlates with itself over time (autocorrelation) stay the same. For data scientists and statisticians, stationary data is easier to analyze and predict. Many statistical methods work best when the data is stationary because they assume the underlying patterns in the data don’t change.
We can spot stationary data by looking at graphs over time or using specific statistical tests, like the Augmented Dickey-Fuller test. If the data’s properties look consistent over time, it’s likely stationary.
Non-stationary data is the opposite. Here, the data changes its behavior over time – its mean, variance, or autocorrelation shift.
Non-stationary data can be tricky. It can fool you into seeing trends or patterns that don’t actually help predict future behavior. It’s like trying to guess the river’s flow in summer based on winter observations.
To analyze non-stationary data correctly, experts often transform the data to make it stationary. They might remove trends or seasonal effects or use techniques like differencing, where you focus on how much the data changes from one time point to the next, rather than the data itself.
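Here is a small sketch of both ideas, the Augmented Dickey-Fuller test and first differencing, using statsmodels on a synthetic trending series.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(series: pd.Series, name: str) -> None:
    """Run the Augmented Dickey-Fuller test and print a simple verdict."""
    stat, p_value, *_ = adfuller(series.dropna())
    verdict = "likely stationary" if p_value < 0.05 else "likely non-stationary"
    print(f"{name}: ADF statistic={stat:.3f}, p-value={p_value:.4f} -> {verdict}")

# Synthetic trending series as a stand-in for a real indicator
rng = np.random.default_rng(0)
trend = pd.Series(np.arange(120) * 0.5 + rng.normal(0, 1, 120))

check_stationarity(trend, "original")            # the trend makes it non-stationary
check_stationarity(trend.diff(), "differenced")  # first difference removes the trend
```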
Time Series Analysis
Time series analysis is an integral part of data science that involves examining sequences of data points collected over time. This method is pivotal in various fields, from economics to meteorology, helping to predict future trends based on historical data. This blog aims to simplify time series analysis, making it accessible to beginners while retaining its technical essence.
Time series analysis deals with analyzing data points recorded at different times. It’s used to extract meaningful statistics, identify patterns, and forecast future trends. This analysis is crucial in many areas, such as predicting market trends, weather forecasting, and strategic business planning.
Key Concepts:
Essential concepts include trend analysis (identifying long-term movement), seasonality (recognizing patterns or cycles), noise (separating random variability), and stationarity (assuming statistical properties remain constant over time).
Techniques:
- Descriptive Analysis: Involves visual inspection of data to identify trends, seasonality, and outliers.
- Moving Averages: This technique smooths out short-term fluctuations, highlighting longer-term trends or cycles (a short sketch follows this list).
- ARIMA Models: Widely used for forecasting, especially when data shows a clear trend or seasonal pattern.
- Machine Learning Approaches: Techniques like Random Forests and Neural Networks are increasingly used for complex time series forecasting.
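Here is the promised sketch of the moving-average technique using pandas’ rolling window; the synthetic series and the 12-month window are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for a real indicator
rng = np.random.default_rng(1)
series = pd.Series(rng.normal(100, 10, 60),
                   index=pd.date_range("2018-01-01", periods=60, freq="MS"))

# A 12-month moving average smooths out short-term fluctuations
smoothed = series.rolling(window=12).mean()
print(smoothed.tail())
```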
Project 3 kickoff
As we embark on Project 3, we are faced with a wealth of choices, with 246 datasets available on the Analyze Boston website. Our team is currently engaged in sifting through these options to find the one that best suits our project’s needs. This selection process is critical as it lays the groundwork for our upcoming analysis. Once we’ve chosen a dataset, our next step will be to dive deep into its contents, searching for a unique and intriguing question that emerges naturally from the data. This question will guide our exploration and analysis, driving us to uncover new insights and understandings. It’s a thrilling phase in our project, promising both challenges and discoveries.
Project 2 – Report
Logistic Regression
Logistic regression is a statistical method used primarily for binary classification tasks, where outcomes are dichotomous (like yes/no or true/false). Unlike linear regression that predicts a continuous outcome, logistic regression predicts the probability of a given input belonging to a certain class. This is achieved by using the logistic (or sigmoid) function to convert the output of a linear equation into a probability value between 0 and 1. Common applications include predicting the likelihood of a patient having a disease in the medical field, customer churn in marketing, and credit scoring in finance. While logistic regression is straightforward to implement and interpret, and works well for linearly separable data, it assumes a linear relationship between variables and might not perform well with complex, non-linear data. Despite its limitations, logistic regression remains a popular choice due to its simplicity and effectiveness in various scenarios.
Furthermore, logistic regression’s strength lies in its interpretability and the ease with which it can be implemented. It’s particularly beneficial in fields where understanding the influence of each variable on the outcome is crucial. For instance, in healthcare, it helps in understanding how different medical indicators contribute to the likelihood of a disease. However, its reliance on the assumption of linearity between independent variables and the log odds can be a limitation. In cases where the relationship between variables is more complex, advanced techniques like neural networks or random forests might be more appropriate. Despite these limitations, logistic regression’s ability to provide clear, actionable insights with relatively simple computation makes it a valuable tool in the arsenal of data analysts and researchers.
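For illustration, a minimal sketch of fitting a logistic regression with scikit-learn on a bundled binary-classification dataset (not our project data) might look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled binary-outcome dataset used purely for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling first helps the solver converge; the final step is the logistic model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))
# The predicted probability comes from the sigmoid of the linear combination
print("P(class 1) for the first test sample:", model.predict_proba(X_test[:1])[0, 1])
```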
Decision Tree
In today’s class, I’ve learnt about decision trees. Decision trees are essentially a graphical representation of decision-making processes. Think of them as a series of questions and choices that lead to a final conclusion. At the tree’s outset, you encounter the initial question, and as you answer each question, you progress down the branches until you arrive at the ultimate decision.
The construction of a decision tree entails selecting the most informative questions to ask at each juncture. These questions are based on various attributes or features of the data, and their selection is guided by statistical measures like information gain, Gini impurity, or entropy. The goal is to optimize the decision-making process by selecting the most relevant attributes at each node.
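A small sketch of fitting a decision tree with scikit-learn, using the Gini criterion mentioned above on a bundled example dataset, could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" selects splits by Gini impurity; "entropy" uses information gain
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the learned question-and-answer structure
```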
However, decision trees have limitations, especially in scenarios where the data exhibits a wide spread or deviation from the mean. In our recent Project 2, we encountered a dataset in which the mean was considerably distant from the majority of data points, making the decision tree method less efficient. This highlights the importance of considering the distribution and characteristics of the data when choosing the appropriate statistical method for analysis. Decision trees are a valuable tool, but their efficacy is contingent on the nature of the data they are applied to, and sometimes alternative statistical methods may be more suitable for handling such situations.
Threat levels vs age
We have created box plots to visualize the age distribution within each threat level category (Attack, other threat & undetermined).
The box plot above shows the age distribution within each threat level category in fatal police shootings.
Observations:
- The median age appears to be relatively consistent across different threat levels.
- The “attack” threat level has a slightly wider interquartile range (IQR), indicating more variability in age.
- The “undetermined” category has a higher median age and a narrower IQR compared to the other categories, suggesting that individuals in this category tend to be older.
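For reference, here is a minimal sketch of how a box plot like the one above might be produced with seaborn; the file and column names reflect the Washington Post data but are assumptions here.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File and column names assumed for this version of the dataset
df = pd.read_csv("fatal-police-shootings-data.csv")

sns.boxplot(data=df, x="threat_level", y="age")
plt.title("Age distribution by threat level")
plt.show()
```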
Next, we examine the relationship between threat level and signs of mental illness. We have created a bar plot to visualize the prevalence of signs of mental illness within each threat level category.
The bar plot above displays the prevalence of signs of mental illness within each threat level category in fatal police shootings.
Observations:
- The “attack” threat level has the lowest proportion of individuals showing signs of mental illness.
- The “undetermined” category has the highest proportion of individuals showing signs of mental illness, followed closely by the “other” category.
- This pattern suggests that incidents categorized as “undetermined” or “other” are more likely to involve individuals with signs of mental illness.
These visualizations provide a comprehensive understanding of how threat levels relate to race, age, and signs of mental illness in fatal police shootings.
Threat Level Analysis
In this analysis, we will:
- Analyze the Distribution of Threat Levels: Understand how different threat levels are distributed in the dataset.
- Examine the Relationship between Threat Level and Other Variables: Investigate how threat levels relate to other variables such as race, age, and signs of mental illness.
First part of the analysis: Analyzing the Distribution of Threat Levels.
The bar chart above shows the distribution of threat levels in fatal police shootings.
Observations:
- The majority of incidents are categorized under the “attack” threat level, indicating situations where the police perceived an active threat.
- The “other” threat level category includes a significant number of incidents, suggesting situations that may not have involved a direct attack but still resulted in a fatal shooting.
- The “undetermined” category has the least number of incidents.
Next, we examine the relationship between threat level and other variables. We will start by investigating how threat levels relate to race, age, and signs of mental illness.
Let’s start with the relationship between Threat Level and Race. We will create a cross-tabulation and visualize it to understand this relationship better.
The heatmap above visualizes the relationship between threat level and race in fatal police shootings, with values representing the percentage distribution of races within each threat level category.
Observations:
- Across all threat level categories, White (W) and Black (B) individuals constitute the majority of cases.
- The distribution of races appears to be relatively consistent across different threat levels.
- There is a slight increase in the percentage of White individuals in the “undetermined” category, which might indicate cases where the circumstances were less clear.
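A small sketch of the cross-tabulation and heatmap described above might look like this, with the file and column names assumed:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# Percentage distribution of race within each threat level
ct = pd.crosstab(df["threat_level"], df["race"], normalize="index") * 100

sns.heatmap(ct, annot=True, fmt=".1f", cmap="Blues")
plt.title("Race distribution (%) within each threat level")
plt.show()
```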
Correlations in Police Shootings Data
In our analysis of fatal police shootings, we’ve explored how factors like age, race, and threat levels correlate with signs of mental illness.
Age and Mental Illness: Our analysis revealed a significant correlation between age and signs of mental illness. The t-test showed a distinct age difference between individuals with and without signs of mental illness, with a t-statistic of 8.51 and a p-value near zero. This indicates a clear link between age and mental illness in these incidents.
Race and Mental Illness: Addressing race, we encountered initial data issues but corrected them to perform a chi-square test. The results showed a significant association between race and signs of mental illness, with a chi-square statistic of 171.23 and a p-value of 3.98×10⁻³⁵.
Threat Level and Mental Illness: We also found a significant relationship between threat level and signs of mental illness, with a chi-square statistic of 24.48 and a p-value of 4.82×10⁻⁶.
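For transparency, here is a minimal sketch of how such tests can be run with SciPy; the file and column names are assumptions, and the printed statistics will only match ours if the same data snapshot is used.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# t-test: age of victims with vs. without signs of mental illness (boolean column assumed)
with_illness = df.loc[df["signs_of_mental_illness"] == True, "age"].dropna()
without_illness = df.loc[df["signs_of_mental_illness"] == False, "age"].dropna()
t_stat, t_p = stats.ttest_ind(with_illness, without_illness, equal_var=False)
print(f"t-statistic={t_stat:.2f}, p-value={t_p:.2e}")

# chi-square: association between threat level and signs of mental illness
table = pd.crosstab(df["threat_level"], df["signs_of_mental_illness"])
chi2, chi_p, dof, _ = stats.chi2_contingency(table)
print(f"chi-square={chi2:.2f}, p-value={chi_p:.2e}, dof={dof}")
```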
Conclusion: Our analyses have illuminated significant correlations between age, race, threat level, and signs of mental illness in fatal police shootings. These insights pave the way for further investigation and a deeper understanding of these critical incidents. Our next step will be to analyze threat level distributions and their relationships with other variables.
Exploring Age Patterns in Police Shootings
When we talk about the victims of police shootings, we notice that their ages vary quite a lot, not just overall, but also within different racial groups. This variation is important because it helps us understand who is affected the most and how we can work towards making things better.
race | Median | Mean | Standard_Deviation | Variance | Kurtosis | Skewness |
A | 35 | 35.96 | 11.40956 | 130.1781 | -0.57208 | 0.332968 |
B | 31 | 32.92812 | 11.2556 | 126.6884 | 0.902238 | 0.97427 |
H | 33 | 33.59083 | 10.59493 | 112.2525 | 0.830593 | 0.814415 |
N | 32 | 32.65049 | 8.907331 | 79.34055 | -0.06095 | 0.571045 |
O | 31 | 33.47368 | 11.79627 | 139.152 | -0.47048 | 0.582493 |
W | 38 | 40.12546 | 13.04995 | 170.3013 | -0.09073 | 0.535878 |
Asian Victims (Race A)
Asian victims tend to be around 36 years old on average, but their ages can vary from much younger to much older. The range is quite wide. We are 95% certain that the average age of Asian victims is between 34 and 38 years old. This gives us a pretty good idea, but there’s still a lot of variety.
Black Victims (Race B)
For Black victims, the average age is around 33 years old. However, just like with Asian victims, there’s a lot of variation in age. The confidence interval here is between 32 and 33 years old, which is a bit narrower, showing us that the ages of Black victims are a bit more clustered together.
Hispanic Victims (Race H)
Hispanic victims have an average age of about 34 years old. Their ages vary with a pattern similar to that of Black victims, and we can say with 95% certainty that the average age falls between 33 and 34 years old.
Native American Victims (Race N)
Looking at Native American victims, we see an average age of around 33 years old, with a confidence interval between 31 and 34 years old. This shows us that there’s a bit more variety in age for Native American victims compared to other groups.
Other Races (Race O)
The category of “Other Races” includes a variety of different racial backgrounds. Here, the average age is about 34 years old, but the ages vary quite a bit, with a confidence interval between 28 and 39 years old. This wide interval indicates a lot of diversity in age within this group.
White Victims (Race W)
White victims tend to be older on average, around 40 years old. The ages vary, and we are 95% sure that the average age falls between 40 and 41 years old, which is a relatively narrow range.
Age Variance and Confidence Intervals:
Here’s a table showing the age variance and the 95% confidence intervals for the average age of victims from different races:
race | Age Variance | 95% CI Lower Bound | 95% CI Upper Bound
A | 130.178125 | 33.99107025 | 37.92892975 |
B | 126.6884342 | 32.40315232 | 33.45307956 |
H | 112.2524847 | 32.98268723 | 34.19897062 |
N | 79.34055265 | 30.94672303 | 34.35424785 |
O | 139.1520468 | 28.16943317 | 38.77793525 |
W | 170.3012843 | 39.68020815 | 40.57071663 |
The “Age Variance” column shows how much the ages vary within each racial group. The “95% CI Lower Bound” and “95% CI Upper Bound” columns give us a range where we are 95% confident that the true average age of victims from each racial group falls.
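A small sketch of how these per-group variances and 95% confidence intervals might be computed, assuming the same race and age columns:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

for race, ages in df.dropna(subset=["race", "age"]).groupby("race")["age"]:
    mean = ages.mean()
    sem = stats.sem(ages)  # standard error of the mean
    low, high = stats.t.interval(0.95, len(ages) - 1, loc=mean, scale=sem)
    print(f"{race}: variance={ages.var():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```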
Understanding the Bigger Picture
What does all of this tell us? It shows that age patterns in police shootings are complex and vary significantly across different races. By understanding these patterns, we can start asking important questions about why these variations exist.
EDA on Clusters
When I was analyzing the numbers from the states of Arizona (AZ) and Georgia (GA), I observed some interesting patterns. In Arizona, there were 230 instances recorded for category 0, followed by 64 for category 1. Category 3 came in third with 24 instances, and lastly, category 2 had only 1 instance.
On the other hand, in Georgia, category 2 led with 150 instances. Category 0 had 55 instances, while category 3 had 38. Category 1 recorded the least in Georgia with 26 instances.
These figures provide a snapshot of the distribution of instances across different categories for both states. While Arizona saw a dominant presence in category 0, Georgia had category 2 as its leading category.
Next, I will be analyzing the clustering output for the other states.
DBSCAN and K-Means
K-means is a clustering algorithm that aims to partition a set of data points into a specified number of groups or “clusters.” The process starts by randomly selecting “k” initial points called “centroids.” Every data point is then assigned to the nearest centroid, and based on these assignments, new centroids are recalculated as the average of all points in the cluster. This process of assigning points to the closest centroid and recalculating centroids is repeated until the centroids no longer change significantly. The result is “k” clusters where data points in the same cluster are closer to each other than to points in other clusters. The user needs to specify the number “k” in advance, which represents the number of desired clusters.
DBSCAN is a clustering algorithm that groups data points based on their proximity and density. Instead of requiring the user to specify the number of clusters in advance (like k-means), DBSCAN examines the data to find areas of high density and separates them from sparse regions. It works by defining a neighborhood around each data point, and if enough points are close together (indicating high density), they are considered part of the same cluster. Data points in low-density regions, which don’t belong to any cluster, are treated as noise. This makes DBSCAN especially useful for discovering clusters of varying shapes and sizes, and for handling noisy data.
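To see the difference in behavior, here is a minimal sketch comparing the two algorithms on a synthetic two-moons dataset; the eps and cluster-count values are illustrative.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: hard for k-means, easy for DBSCAN
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # eps chosen by eye

print("k-means cluster sizes:", [list(kmeans_labels).count(c) for c in set(kmeans_labels)])
print("DBSCAN cluster sizes: ", [list(dbscan_labels).count(c) for c in set(dbscan_labels)])
# DBSCAN labels noise points as -1; k-means forces every point into a cluster
```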
Potential pitfalls:
DBSCAN:
- Requires selecting density parameters.
- Poor choice can miss clusters or merge separate ones.
- Struggles when clusters have different densities.
- Might classify sparse clusters as noise.
- Performance can degrade in high-dimensional data.
- Distance measures become less meaningful.
- Points close to two clusters might be arbitrarily assigned.
K-means:
- Need to specify number of clusters beforehand.
- Wrong choice can lead to poor clustering results.
- Random initialization can affect the final clusters.
- Might end up in local optima based on initial points.
- Assumes clusters are spherical and roughly of the same size.
- Struggles with elongated or irregularly shaped clusters.
- Sensitive to outliers, which can distort cluster centroids.
EDA on police shootings
A detailed analysis of the data from 2015 to 2023 offers some insights into this pressing issue. Over this period, the data reveals a fairly consistent trend in the number of incidents each month, with minor fluctuations. This consistency underscores the persistence of the issue over time.
A dive into the racial distribution of these incidents presents a more nuanced picture. Whites, who constitute a significant portion of the U.S. population, account for approximately 50.89% of the fatal police shootings. However, the figures for the Black community are particularly striking. Despite making up around 13% of the U.S. population, they represent a disproportionate 27.23% of the fatal police shootings. Hispanics follow, accounting for roughly 17.98%, while Asians, Native Americans, and others make up a smaller fraction, with 1.99%, 1.62%, and 0.29% respectively.
The data thus sheds light on the pressing need for a more comprehensive understanding and potential reforms in policing, especially considering the stark disparities in how different racial groups are affected.
Fatal Police Shootings of Black vs. White Individuals
Today, we performed an analysis to discern any potential discrepancies in the shootings of white and black individuals. Our initial step involved extracting key statistical measures for both datasets: minimum, maximum, mean, median, standard deviation, skewness, and kurtosis. These metrics provided a foundational understanding, which we further visualized using histograms.
Upon including age into our analysis, we noticed a deviation from the normal distribution in the age profiles of both black and white people killed by the police.
Given the non-normality of the data, we questioned the appropriateness of employing the t-test for calculating the p-value. Recognizing that the data is not normally distributed, we used the Monte Carlo method to estimate the p-value. Our results suggested that the observed average age difference between black and white victims of police shootings is highly improbable to have occurred merely by chance.
To quantify the magnitude of this difference, we utilized Cohen’s d. The resultant value of 0.577 indicates a medium effect size, pointing to a significant disparity between the two groups.
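For completeness, here is a rough sketch of a permutation-style Monte Carlo p-value and Cohen’s d; black_ages and white_ages are hypothetical NumPy arrays standing in for the age columns we extracted.

```python
import numpy as np

def permutation_p_value(a: np.ndarray, b: np.ndarray, n_iter: int = 10_000) -> float:
    """Estimate the p-value for the difference in means by random shuffling."""
    rng = np.random.default_rng(42)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return count / n_iter

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size: mean difference divided by the pooled standard deviation."""
    pooled_std = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                         / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_std

# black_ages and white_ages are hypothetical arrays extracted from the dataset
# print(permutation_p_value(black_ages, white_ages))
# print(cohens_d(white_ages, black_ages))
```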
However, an important question persists: how can we incorporate data from all races to ensure a holistic understanding?
EDA on project – 2
Today, I used data with location points to make a map. Some of these points were outside the main USA area. I wanted the map to show only the main part of the USA, so I set bounding-box limits and removed the points that fell outside. I used a Python tool called Basemap for the map, and plotted the points using matplotlib.pyplot.
Below are my few observations:
- Higher Incidents in the East: More incidents are observed in the eastern half of the USA, likely due to higher population density.
- Urban Concentration: Major cities and metropolitan areas, especially on the west coast (like Los Angeles) and in the south (like Houston), have a notable number of incidents.
- Central USA is Sparse: Fewer incidents are seen in the central Great Plains region, possibly due to fewer large cities and lower population density.
- Dense Northeast: The northeast, including areas around New York and Pennsylvania, shows a high concentration of incidents.
- Natural Regions: Mountainous and forested areas have fewer incidents, reflecting lower populations.
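Here is a minimal sketch of the bounding-box filtering behind this map, using plain matplotlib and assumed coordinate limits for the contiguous USA (the Basemap projection setup is omitted):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

# Rough bounding box for the contiguous United States (assumed values)
lat_min, lat_max = 24.0, 50.0
lon_min, lon_max = -125.0, -66.0

inside = df[df["latitude"].between(lat_min, lat_max) &
            df["longitude"].between(lon_min, lon_max)]

plt.scatter(inside["longitude"], inside["latitude"], s=5, alpha=0.4)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Incidents within the contiguous USA")
plt.show()
```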
Project – 2 Further analysis of the data
In continuation of my previous blog, I have analyzed the latest sheet, which includes the latitude and longitude data. I looked at the data and found some information is missing. There are 5,362 empty spots, which is almost 4% of all the data. The “race” column has the most missing information with 1,517 empty spots. Other columns like “flee”, “latitude”, and “longitude” also have a lot of missing information. This might make analyzing the data or making predictions with it a bit tricky, and we might need to fill in the gaps carefully.
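A quick sketch of how these missing-value counts can be obtained with pandas, with the file name assumed:

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # file name assumed

missing = df.isna().sum().sort_values(ascending=False)
print(missing)                                       # empty spots per column
print("Total missing:", missing.sum())
print(f"Share of all cells: {missing.sum() / df.size:.1%}")
```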
In the data we have, people’s ages range from 2 to 92 years old, with an average (mean) age of 37.29 years. Latitude and longitude numbers tell us where events happened all over the U.S., and a few outside of the country, which should be eliminated. The data covers 2,963 different days, with the day having the most events (9) being February 1, 2018. Talking about the type of threat, “shoot” was mentioned most, in 2,461 incidents. Regarding whether people were running away (“fleeing”) during the incidents, 4,703 times they were not. Lastly, in 5,082 incidents, a gun was involved.
Project – 2, analyzing data from the Washington Post
We have started working on Project 2, which inspects instances of police shootings in the United States, with our data coming from a repository managed by the Washington Post. This data has records starting from January 2, 2015, and it is continually updated with new entries every week. A bit of a challenge has arisen since we have identified that approximately 23% of the data is missing, which might make our analysis a bit tricky. Regardless, we are aiming to explore the available data thoroughly to uncover any trends, patterns, or noteworthy insights about these events. As we move forward, we will be seeking answers to a set of questions which will help shape our understanding of the occurrences and potentially inform policy and practice in the future. Our goal is to navigate through the available information, making the best use of it to understand more about the circumstances, patterns, and potential root causes of fatal police shootings across the country.
- What specific analyses and explorations are intended to be conducted on the data related to fatal police shootings?
- What strategies and methodologies should be employed to address and manage the missing data within the dataset?
- Based on the available data, what predictive models or forecasts might be developed regarding fatal police shootings in the future?
- Who constitutes the primary audience for the findings from this data analysis, and how might the insights derived be of utility to them?
Report – Project 1
Link for the code: diabetes.ipynb
Report writing – October, 04 2023
As we approach the concluding phase of Project 1, I have commenced the gathering of information from the team and begun the compilation process for the report. This involves collecting various elements such as graphs, code snippets, charts, and results to ensure they are systematically organized and accurately placed within the report. Our aim is to submit a preliminary copy for review before proceeding to the final submission.
Bootstrapping – October, 2 2023
Bootstrapping is a statistical method that helps to estimate the variability of a statistic by creating numerous re-sampled versions of a dataset, and is especially handy with small sample sizes. Essentially, it involves repeatedly drawing samples, with replacement, from a given dataset, and calculating a statistic (e.g., mean, median) or model parameter for each sample. This is done thousands of times to build a distribution of the statistic, which can then be analyzed to estimate its standard error, confidence intervals, and other properties. In model development, bootstrapping aids in understanding and reducing variability and bias in predictions, enhancing model stability and reliability. By repeatedly training and validating models on different bootstrap samples, we gain insights into the model’s robustness and generalizability, allowing for informed statistical inferences without additional data collection. This technique serves as a practical tool for exploring sample space and deriving meaningful statistical insights when dealing with limited data.
I plan to apply a specific technique to our data in order to estimate the sampling distribution, with the aim of investigating whether this approach will enhance model stability.
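As a reference point, here is a minimal sketch of bootstrapping the mean with NumPy on a synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100)  # synthetic stand-in for our data

# Resample with replacement many times and record the statistic of interest
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(10_000)])

print("Bootstrap estimate of the mean:", boot_means.mean())
print("Standard error:", boot_means.std(ddof=1))
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))
```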
Cross Validation – September, 29 2023
- Does the Number of Folds Matter in Cross Validation?
K-Fold Cross Validation is an essential technique in machine learning, particularly when the performance of a model shows significant variance based on the train-test split. While many practitioners often use 5 or 10 folds, there is no strict rule or norm dictating these numbers. In essence, one can use as many folds as deemed appropriate for the specific dataset and problem at hand.
- Experimentation with Different Fold Numbers:
I conducted an experiment using 5-fold cross validation and obtained an R-Squared value of 34%.
To delve deeper into the behavior of the model, I aim to experiment with varying numbers of folds. The objective is to observe any changes in the R-Squared value as the number of folds changes.
Beyond this, I intend to take unequal sets for training and testing, then perform regression to see the resulting R-Squared values. By doing so, I hope to gain a deeper understanding of how the model behaves under different conditions.
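A small sketch of this experiment with scikit-learn, using synthetic data in place of the project sheet, might look like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the project sheet
X, y = make_regression(n_samples=354, n_features=2, noise=25.0, random_state=42)

for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
    print(f"{k}-fold: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```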
Cross-validation and K-fold – September, 27 2023
In continuation of my previous post, I performed multi-linear regression using three datasets: %diabetes, %inactivity, and %obesity. From my analysis, I have identified 354 common county codes present across all three sheets. The results of our regression yielded a standard error of 0.59, a multiple R-value of 0.58, and an R-squared value of 0.34. These metrics suggest that our model might not be highly reliable or effective in explaining the observed outcomes.
Upon closer inspection of the data, we noticed an anomaly. Certain values, which aren’t duplicates, were treated as such in the regression model. To address this, I am considering implementing cross-validation. My plan is to label each of these “duplicate” data points uniquely, ensuring that our model recognizes them as distinct variables.
Later, to enhance the robustness of our model, I will be employing k-fold validation. I will divide our dataset into five segments. In this approach, four of the five segments will serve as the training set, while the model will be tested against the remaining one. By rotating the held-out segment and averaging the results from these five models, I aim to obtain a more accurate estimate of the test error.
Multiple Linear Regression in project 1 – September, 25 2023
Update regarding project 1:
Today, I have conducted multiple linear regression analysis on the provided data, specifically focusing on variables common to all sheets, namely “%diabetes,” “%inactivity,” and “%obesity.” The results of this analysis are summarized as follows:
- Multiple R: 0.583729108
- R Square: 0.340739671
- Adjusted R Square: 0.336983202
- Standard Error: 0.593139754
- Observations: 354
The multiple R value indicates the correlation between the independent variables and the dependent variable. The R Square value represents the proportion of variance in the dependent variable that can be explained by the independent variables. The Adjusted R Square value adjusts the R Square for the number of predictors. The Standard Error provides an estimate of the standard deviation of the errors in the regression model, and the number of Observations reflects the sample size used in the analysis.
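For reference, here is a minimal sketch of how this regression could be reproduced with statsmodels; the merged file and column names are assumptions:

```python
import pandas as pd
import statsmodels.api as sm

# Merged sheet assumed to contain the three columns below for the 354 common counties
df = pd.read_csv("merged_cdc_data.csv")  # file name assumed

X = sm.add_constant(df[["%inactivity", "%obesity"]])  # predictors plus intercept
y = df["%diabetes"]

model = sm.OLS(y, X).fit()
print(model.summary())  # reports R-squared, adjusted R-squared, standard error, etc.
```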
Cross-Validation and Validation Set Approach – September, 22 2023
I watched a video today about cross-validation and the bootstrap. I learned that we can estimate a model’s test error using the training error as a rough indicator. Typically, the test error is higher than the training error because the model faces unseen data during testing. To refine this estimate, we can apply methods like the Cp statistic, AIC, and BIC, which adjust the training error mathematically to better reflect the test error.
The video also introduced the Validation Set Approach. It involves splitting the data into two parts: the training set and the validation set. The model is trained on the training set, and then we use this trained model to predict outcomes for the validation set. The resulting validation set error gives us an estimate of the test error.
However, there are some downsides to this approach. The validation set error can vary significantly depending on how we randomly split the data, making it less stable. Additionally, since we only train the model on a subset of the data (the training set), it might not capture the full complexity and diversity of the dataset. This can lead to an overestimate of the test error when we eventually fit the model to the entire dataset.
In summary, while the Validation Set Approach is a useful way to estimate test error, it has limitations due to variability and potential model underfitting. Care should be taken when interpreting its results, especially when applying the model to the entire dataset.
September, 20 2023 – Crab molt model and T-test
In our recent class, we learnt about the Crab Molt Model, a potent linear modeling approach designed for situations where two variables exhibit non-normal distribution, skewness, high variance, and high kurtosis. The central aim of this model is to predict pre-molt size based on post-molt size.
We learnt the concept of statistical significance, particularly focusing on differences in means. Using data from “Stat Labs: Mathematical Statistics Through Applications,” Chapter 7, page 139, we constructed a model and generated a linear plot. While plotting graphs for post-molt and pre-molt sizes, we observed a notable difference in means. Intriguingly, the size and shape of these graphs bore a striking similarity, differing by just 14.68 units.
To assess the statistical significance of this observed difference, we initially considered utilizing the common t-test, typically used for comparing means in two-group scenarios. However, our project introduced a complexity: it involved three variables, rendering the t-test inappropriate for our analysis.
The Crab Molt Model and the exploration of mean differences via the t-test are useful tools for deciphering data complexities. Nonetheless, in the face of intricate, multi-variable scenarios, embracing advanced statistical methodologies becomes crucial for uncovering meaningful insights and advancing our understanding of statistical significance.
The t-test is not applicable in our Project 1, which involves three variables, as it is designed for comparisons between two variables. Instead, we need to explore advanced techniques like ANOVA or regression analysis to assess the significance of differences in means in our complex scenario.
Heteroskedasticity and linear 3D Model – September, 18 2023
Linear regression is a powerful tool in data analysis, but it relies on some crucial assumptions. One of these is homoskedasticity, which means the variance of errors should be constant across different levels of independent variables. If this assumption doesn’t hold, our regression results may not be reliable. This is where the Breusch-Pagan test, available in Python’s statsmodels library, comes in.
To detect heteroskedasticity with the Breusch-Pagan test in Python, I used the following steps:
- Import the necessary libraries, including statsmodels.
- Fit an Ordinary Least Squares (OLS) regression model to your data, specifying the dependent and independent variables.
- Use the het_breuschpagan function from statsmodels to perform the Breusch-Pagan test on the residuals of the regression model.
- The p-value obtained from the Breusch-Pagan test is crucial for identifying heteroskedasticity. If this p-value is below a chosen significance level, typically 0.05, it suggests that heteroskedasticity may be affecting the reliability of your regression analysis.
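A minimal sketch of these steps, using synthetic heteroskedastic data so the example is self-contained:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data whose error variance grows with x (heteroskedastic by design)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, x)  # noise scales with x

X = sm.add_constant(x)
ols_result = sm.OLS(y, X).fit()

# het_breuschpagan takes the residuals and the regressors used in the model
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_result.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # < 0.05 suggests heteroskedasticity
```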
Linear 3D models:
Whenever we must examine the relationship between three variables, in our case %diabetes, %inactivity, and %obesity, we can use this model to visualize and understand the intricate relationships among them in three-dimensional space. Variables don’t always act independently. Sometimes, one variable’s effect on the outcome depends on the value of another variable. These interactions can significantly influence your model’s predictions and are crucial to consider.
September, 15 2023
In data analysis, the first step is to ensure the collection of clear and accurate data. We obtained data on %diabetics, %inactivity, and %obesity from a project sheet and employed Python’s NumPy for essential statistical computations, such as medians, means, and standard deviations. These calculations provide us with a foundational understanding of the dataset.
Our primary objective was to unveil the relationship between the percentage of diabetics (%diabetics) and the percentage of inactive individuals (%inactivity). To achieve this, we constructed a scatterplot representing each region as a data point. This visual aid played a crucial role in assessing the connection between these two variables. Subsequently, we utilized the scatterplot to compute the R-squared value, a metric that quantifies the strength of this relationship. A higher R-squared value signifies a more robust connection, potentially shedding light on the significant contribution of inactivity to diabetes rates. We also meticulously examined residuals to validate our model and ensure the absence of anomalies or outliers.
Furthermore, we included histograms and density plots. These visualizations provided valuable insights into how the data was distributed across our dataset. With these powerful analytical tools, we aimed to gain a comprehensive understanding of the intricate relationship between the percentage of diabetics and the percentage of inactive individuals. In essence, our systematic approach encompassed precise data collection, thorough statistical analysis using NumPy, and insightful visualizations, all contributing to unraveling the connection between %diabetics and %inactivity and enhancing our comprehension of diabetes rates.
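A small sketch of the scatterplot and R-squared computation described here, with the file and column names assumed:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("cdc_project_sheet.csv")  # file name assumed

x = df["%inactivity"]   # column names assumed
y = df["%diabetics"]

# Simple linear regression and R-squared
result = stats.linregress(x, y)
print(f"R-squared: {result.rvalue ** 2:.3f}")

# Scatterplot with the fitted line, plus residuals for a quick sanity check
plt.scatter(x, y, s=10, alpha=0.5)
plt.plot(x, result.intercept + result.slope * x, color="red")
plt.xlabel("%inactivity")
plt.ylabel("%diabetics")
plt.show()

residuals = y - (result.intercept + result.slope * x)
print("Residual mean (should be ~0):", residuals.mean())
```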
September, 13 2023
Have you ever wondered how scientists and researchers determine if the patterns they find in data are real or just random chance? Enter the P-value, a nifty statistical tool that helps us separate the signal from the noise in data analysis.
P-value, short for “probability value,” is a number that tells us the likelihood of something happening by chance. Imagine you are flipping a coin, and you suspect it is rigged to land on heads more often. The P-value helps us figure out if the evidence supports your suspicion or if the results could easily occur randomly.
So, why is P-value important? It helps us decide whether the patterns we see in data are likely due to a real cause or just chance. A small P-value, usually less than 0.05, suggests that our findings are probably not random. This gives us confidence that we are onto something meaningful.
In essence, the P-value is your data analysis sidekick. It tells you if your findings are worth getting excited about or if they could just be a lucky fluke. Remember, though, while a low P-value is a good sign, it is not the only thing to consider in data analysis. Always look at the bigger picture, and use P-values wisely to unlock the secrets hidden within your data.
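As a tiny illustration of the coin example, here is how such a p-value could be computed with SciPy’s binomial test (assuming SciPy 1.7 or newer):

```python
from scipy.stats import binomtest

# Suppose a coin lands heads 60 times in 100 flips; is it rigged toward heads?
result = binomtest(k=60, n=100, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")  # a small p-value suggests it's unlikely to be chance alone
```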
September, 11 2023
Today’s class was about simple linear regression, a technique for understanding relationships between variables. Our key takeaway was the importance of collecting descriptive data to make the analysis process smoother. Once we have gathered this information, we can use it to create informative graphs, and these graphs help us calculate crucial statistics like the median, mean, standard deviation, skewness, and kurtosis.
To put this into practice, I took the data provided in our project sheet and organized it into a new sheet, focusing on %diabetics, %inactivity, and %obesity. I then calculated how the percentage of diabetics relates to both the percentage of inactivity and the percentage of obesity, individually. I used a Python library called NumPy to compute key statistics such as the median, mean, standard deviation, skewness, and kurtosis.
Next on the agenda is creating graphs using these statistics. Specifically, I will be looking at the correlation between %diabetes and %inactivity, visualizing this relationship with a scatterplot. This scatterplot will also help me determine the R-squared value, which is vital in gauging the strength of the connection between these two variables. After that, I’ll dive into analyzing the residuals to gain deeper insights into how well our linear model is performing and whether it’s a valid representation of the data.