The Economy’s Domino Effect

Today, we’re analyzing the economic indicators dataset, where various elements are interlinked in fascinating ways. For instance, I’m analyzing how passenger traffic at Logan Airport can serve as a barometer for hotel occupancy rates, offering insights into the state of tourism and business travel. The interplay between the job market and the housing sector is another compelling area of focus. A thriving job market often fuels strong demand for housing, whereas a sluggish job market can lead to a downturn in real estate activity. Additionally, the influence of major development projects on local economies is particularly noteworthy, illustrating how such initiatives can spur job growth and energize the housing market. This post is about untangling these economic threads, revealing how shifts in one sector can ripple through to others, painting a comprehensive picture of our economic landscape.

Housing Market Trends

In this post, we walk through the housing market, focusing on how median housing prices have changed over time. This journey is about more than prices; it’s a reflection of the broader economy.

The graph of median housing prices we analyzed is like a roadmap showing the market’s highs and lows. When prices rise, it often signals a strong demand for homes, hinting at a robust economy with confident buyers. On the flip side, dips or plateaus in prices can suggest a cooling market, possibly due to economic challenges or changing buyer sentiments.

But these trends don’t exist in isolation. They’re intertwined with various economic threads like employment rates, interest rates, and overall economic health. For instance, a booming job market might boost people’s ability to buy homes, pushing prices up. Similarly, changes in interest rates can either encourage or discourage buyers, affecting prices.

Interestingly, we also noticed potential seasonal variations in the housing market. Certain times of the year may experience more activity, influencing prices subtly.

Understanding these nuances in housing prices is crucial. It tells us not just about the real estate market but also gives insights into broader economic conditions. This analysis is invaluable for buyers, sellers, investors, and policymakers, helping them make informed decisions in a landscape that’s always evolving.

Trend Analysis

Today, we’re taking a closer look at Boston’s economy through a trend analysis of key economic indicators. It’s a bit like being an economic detective, where we piece together clues to understand the bigger picture.

We focused on three main clues: the unemployment rate, hotel occupancy rates, and median housing prices. Each of these tells us something different. The unemployment rate is like a thermometer for the job market, showing us how many people are out of work. When this number goes down, it usually means more people have jobs, which is great news!

Next, we looked at how full hotels are, which is our hotel occupancy rate. This rate gives us a sneak peek into tourism and business travel. High occupancy often means more visitors and bustling business activities, while lower numbers might suggest the opposite.

Lastly, we delved into the median housing prices. This indicator is a bit like a window into the real estate market. Rising prices can indicate a high demand for homes, possibly signaling a strong economy. On the flip side, if prices drop or stagnate, it might mean the market is cooling down.

By analyzing these trends, we can get a sense of how the economy is faring.

Economic Indicator

For Project 3, we have chosen a dataset called Economic Indicators.

The dataset contains various economic indicators, organized by year and month. Below is a summary of what each column represents:

  • Year and Month: The time frame for the data, with separate columns for the year and month.
  • logan_passengers: The number of passengers at Logan Airport.
  • logan_intl_flights: The number of international flights at Logan Airport.
  • hotel_occup_rate: The occupancy rate of hotels.
  • hotel_avg_daily_rate: The average daily rate for hotel stays.
  • total_jobs: The total number of jobs.
  • unemp_rate: The unemployment rate.
  • labor_force_part_rate: The labor force participation rate.
  • pipeline_unit: Information related to housing or development projects, possibly the number of units.
  • pipeline_total_dev_cost: The total development cost for projects in the pipeline.
  • pipeline_sqft: The total square footage of development projects in the pipeline.
  • pipeline_const_jobs: The number of construction jobs created by pipeline projects.
  • foreclosure_pet: The number of foreclosure petitions.
  • foreclosure_deeds: The number of foreclosure deeds.
  • med_housing_price: The median housing price.
  • housing_sales_vol: The volume of housing sales.
  • new_housing_const_permits: The number of new housing construction permits issued.
  • new-affordable_housing_permits: The number of permits issued for new affordable housing.

This dataset offers a comprehensive view of various economic factors including transportation (air travel), hospitality (hotels), employment, real estate, and housing market indicators. Each of these metrics can provide insights into the economic health and trends of the region or area being studied.
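
As a rough sketch of how we work with this table, the snippet below loads it with pandas and builds a monthly datetime index from the separate year and month columns. The file name economic-indicators.csv and the column capitalization (Year, Month) are assumptions about our local export from Analyze Boston.

```python
import pandas as pd

# Assumed local file name for the Analyze Boston export
df = pd.read_csv("economic-indicators.csv")

# Combine the separate year and month columns into a single monthly datetime index
# (column names "Year" and "Month" are assumptions about the export)
df["date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")
df = df.set_index("date").sort_index()

# Quick look at the indicators discussed most in these posts
print(df[["logan_passengers", "hotel_occup_rate", "unemp_rate", "med_housing_price"]].describe())
```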

SARIMA Model

The SARIMA model stands as a cornerstone of time series analysis. An extension of the ARIMA model, SARIMA (Seasonal Autoregressive Integrated Moving Average) brings an added layer of sophistication to forecasting and is particularly useful for handling seasonal data.

SARIMA is a statistical model used to predict future points in a time series. It’s particularly adept at handling data with seasonal patterns – like monthly sales data with peaks during holidays, or daily temperatures varying across seasons. The model extends ARIMA by integrating seasonality, making it more versatile.

Components:

The SARIMA model can be understood through its components: Seasonal (S), Autoregressive (AR), Integrated (I), and Moving Average (MA).

  • Seasonal: This component models the seasonality in data, capturing regular patterns that repeat over a specific period.
  • Autoregressive (AR): This part of the model captures the relationship between an observation and a specified number of lagged observations.
  • Integrated (I): Integration involves differencing the time series to make it stationary, a necessary step for many time series models.
  • Moving Average (MA): This component models the relationship between an observation and a residual error from a moving average model applied to lagged observations.
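
To make the notation concrete, here is a minimal sketch of fitting a SARIMA model with statsmodels on our median housing price series. The (p, d, q)(P, D, Q, s) orders shown are illustrative placeholders rather than tuned values, and df is the economic indicators DataFrame from the loading sketch earlier.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# df: the economic indicators DataFrame with a monthly datetime index (see the loading sketch above)
series = df["med_housing_price"]

# order=(p, d, q) covers the non-seasonal AR, differencing, and MA terms;
# seasonal_order=(P, D, Q, s) repeats the same idea at the seasonal lag s (12 for monthly data)
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

print(result.summary())
forecast = result.get_forecast(steps=12).predicted_mean  # 12-month-ahead forecast
```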

Stationary and Non-Stationary data in time series analysis

Time series analysis is a fascinating area of statistics and data science, where we study data that changes over time. Two key concepts in this field are ‘stationary’ and ‘non-stationary’ data. Let’s break these down in a way that balances simplicity with some technical insight.

Stationary data in a time series means the data behaves consistently over time. The average value (mean), the variability (variance), and how the data correlates with itself over time (autocorrelation) stay the same. For data scientists and statisticians, stationary data is easier to analyze and predict. Many statistical methods work best when the data is stationary because they assume the underlying patterns in the data don’t change.

We can spot stationary data by looking at graphs over time or using specific statistical tests, like the Augmented Dickey-Fuller test. If the data’s properties look consistent over time, it’s likely stationary.
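
For reference, here is a minimal sketch of the Augmented Dickey-Fuller test with statsmodels, applied to the median housing price series as an example; the usual reading is that a small p-value argues against a unit root (non-stationarity).

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis of the ADF test: the series has a unit root (i.e. it is non-stationary)
adf_stat, p_value, *rest = adfuller(df["med_housing_price"].dropna())

print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the series looks stationary.")
else:
    print("Fail to reject the null: the series looks non-stationary.")
```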

Non-stationary data is the opposite. Here, the data changes its behavior over time – its mean, variance, or autocorrelation shift.

Non-stationary data can be tricky. It can fool you into seeing trends or patterns that don’t actually help predict future behavior. It’s like trying to guess a river’s flow in summer based on winter observations.

To analyze non-stationary data correctly, experts often transform the data to make it stationary. They might remove trends or seasonal effects or use techniques like differencing, where you focus on how much the data changes from one time point to the next, rather than the data itself.
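
As a small illustration with pandas, a first difference looks at month-to-month changes (useful for removing a trend), while a 12-step seasonal difference compares each month to the same month a year earlier (useful for removing yearly seasonality).

```python
# First difference: change from one month to the next (helps remove a trend)
first_diff = df["med_housing_price"].diff().dropna()

# Seasonal difference: change relative to the same month one year earlier (helps remove yearly seasonality)
seasonal_diff = df["med_housing_price"].diff(12).dropna()
```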

Time Series Analysis

Time series analysis is an integral part of data science that involves examining sequences of data points collected over time. This method is pivotal in various fields, from economics to meteorology, helping to predict future trends based on historical data. This blog aims to simplify time series analysis, making it accessible to beginners while retaining its technical essence.

Time series analysis deals with analyzing data points recorded at different times. It’s used to extract meaningful statistics, identify patterns, and forecast future trends. This analysis is crucial in many areas, such as predicting market trends, weather forecasting, and strategic business planning.

Key Concepts:

Essential concepts include trend analysis (identifying long-term movement), seasonality (recognizing patterns or cycles), noise (separating random variability), and stationarity (assuming statistical properties remain constant over time).

Techniques:

  • Descriptive Analysis: Involves visual inspection of data to identify trends, seasonality, and outliers.
  • Moving Averages: This technique smooths out short-term fluctuations, highlighting longer-term trends or cycles.
  • ARIMA Models: Widely used for forecasting, especially when data shows a clear trend or seasonal pattern.
  • Machine Learning Approaches: Techniques like Random Forests and Neural Networks are increasingly used for complex time series forecasting.
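
Returning to the moving-average technique in the list above, here is a small pandas sketch that smooths the Logan passenger counts; the 12-month window is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

# df: the economic indicators DataFrame from the loading sketch above
# A 12-month centered rolling mean highlights the longer-term trend
rolling = df["logan_passengers"].rolling(window=12, center=True).mean()

df["logan_passengers"].plot(label="monthly", alpha=0.5)
rolling.plot(label="12-month rolling mean")
plt.legend()
plt.show()
```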

Project 3 kickoff

As we embark on Project 3, we are faced with a wealth of choices, with 246 datasets available on the Analyze Boston website. Our team is currently engaged in sifting through these options to find the one that best suits our project’s needs. This selection process is critical as it lays the groundwork for our upcoming analysis. Once we’ve chosen a dataset, our next step will be to dive deep into its contents, searching for a unique and intriguing question that emerges naturally from the data. This question will guide our exploration and analysis, driving us to uncover new insights and understandings. It’s a thrilling phase in our project, promising both challenges and discoveries.

Logistic Regression

Logistic regression is a statistical method used primarily for binary classification tasks, where outcomes are dichotomous (like yes/no or true/false). Unlike linear regression that predicts a continuous outcome, logistic regression predicts the probability of a given input belonging to a certain class. This is achieved by using the logistic (or sigmoid) function to convert the output of a linear equation into a probability value between 0 and 1. Common applications include predicting the likelihood of a patient having a disease in the medical field, customer churn in marketing, and credit scoring in finance. While logistic regression is straightforward to implement and interpret, and works well for linearly separable data, it assumes a linear relationship between variables and might not perform well with complex, non-linear data. Despite its limitations, logistic regression remains a popular choice due to its simplicity and effectiveness in various scenarios.

Furthermore, logistic regression’s strength lies in its interpretability and the ease with which it can be implemented. It’s particularly beneficial in fields where understanding the influence of each variable on the outcome is crucial. For instance, in healthcare, it helps in understanding how different medical indicators contribute to the likelihood of a disease. However, its reliance on the assumption of linearity between the independent variables and the log odds can be a limitation. In cases where the relationship between variables is more complex, advanced techniques like neural networks or random forests might be more appropriate. Despite these limitations, logistic regression’s ability to provide clear, actionable insights with relatively simple computation makes it a valuable tool in the arsenal of data analysts and researchers.
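
To ground the idea, here is a minimal scikit-learn sketch on synthetic data; the features and labels are made up purely for illustration, not drawn from any of our project datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
print("Class probabilities (sigmoid output between 0 and 1):", clf.predict_proba(X_test[:3]))
print("Coefficients (change in log-odds per unit of each feature):", clf.coef_)
```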

Decision Tree

In today’s class, I’ve learnt about decision trees. Decision trees are essentially a graphical representation of decision-making processes. Think of them as a series of questions and choices that lead to a final conclusion. At the tree’s outset, you encounter the initial question, and as you answer each question, you progress down the branches until you arrive at the ultimate decision.

The construction of a decision tree entails selecting the most informative questions to ask at each juncture. These questions are based on various attributes or features of the data, and their selection is guided by statistical measures like information gain, Gini impurity, or entropy. The goal is to optimize the decision-making process by selecting the most relevant attributes at each node.
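
As a small sketch of this process, scikit-learn’s DecisionTreeClassifier chooses the split at each node using Gini impurity by default (entropy/information gain is an option); the bundled iris data is used here instead of our project data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" (the default) picks the question that best purifies each split;
# criterion="entropy" would use information gain instead
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Print the sequence of questions the tree asks at each node
print(export_text(tree, feature_names=load_iris().feature_names))
```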

However, decision trees have limitations, especially in scenarios where the data exhibits a wide spread or deviation from the mean. In our recent Project 2, we encountered a dataset in which the mean was considerably distant from the majority of data points, making the decision tree method less efficient. This highlights the importance of considering the distribution and characteristics of the data when choosing the appropriate statistical method for analysis. Decision trees are a valuable tool, but their efficacy is contingent on the nature of the data they are applied to, and sometimes alternative statistical methods may be more suitable for handling such situations.

Threat levels vs age

We have created box plots to visualize the age distribution within each threat level category (attack, other, and undetermined).

The box plot above shows the age distribution within each threat level category in fatal police shootings.
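
For reference, a plot like this can be produced with seaborn along the lines of the sketch below; the file name and the column names (threat_level, age) follow the public Washington Post fatal police shootings data, so treat them as assumptions about our local copy.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# File and column names are assumptions about our local copy of the data
shootings = pd.read_csv("fatal-police-shootings-data.csv")

sns.boxplot(data=shootings, x="threat_level", y="age")
plt.title("Age distribution by threat level")
plt.show()
```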

Observations:

  • The median age appears to be relatively consistent across different threat levels.
  • The “attack” threat level has a slightly wider interquartile range (IQR), indicating more variability in age.
  • The “undetermined” category has a higher median age and a narrower IQR compared to the other categories, suggesting that individuals in this category tend to be older.

Next, we examine the relationship between threat level and signs of mental illness. We have created a bar plot to visualize the prevalence of signs of mental illness within each threat level category.

The bar plot above displays the prevalence of signs of mental illness within each threat level category in fatal police shootings.
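
A minimal sketch of how such a bar plot can be built, assuming the same shootings DataFrame and column names as in the box-plot sketch above, with signs_of_mental_illness treated as a boolean column.

```python
import matplotlib.pyplot as plt

# Proportion of incidents showing signs of mental illness within each threat level
prevalence = shootings.groupby("threat_level")["signs_of_mental_illness"].mean()

prevalence.plot(kind="bar")
plt.ylabel("Proportion with signs of mental illness")
plt.show()
```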

Observations:

  • The “attack” threat level has the lowest proportion of individuals showing signs of mental illness.
  • The “undetermined” category has the highest proportion of individuals showing signs of mental illness, followed closely by the “other” category.
  • This pattern suggests that incidents categorized as “undetermined” or “other” are more likely to involve individuals with signs of mental illness.

These visualizations provide a comprehensive understanding of how threat levels relate to race, age, and signs of mental illness in fatal police shootings.

Threat Level Analysis

In this analysis, we will:

  1. Analyze the Distribution of Threat Levels: Understand how different threat levels are distributed in the dataset.
  2. Examine the Relationship between Threat Level and Other Variables: Investigate how threat levels relate to other variables such as race, age, and signs of mental illness.

First part of the analysis: Analyzing the Distribution of Threat Levels.

The bar chart above shows the distribution of threat levels in fatal police shootings.

Observations:

  • The majority of incidents are categorized under the “attack” threat level, indicating situations where the police perceived an active threat.
  • The “other” threat level category includes a significant number of incidents, suggesting situations that may not have involved a direct attack but still resulted in a fatal shooting.
  • The “undetermined” category has the fewest incidents.

Next, we examine the relationship between threat level and other variables. We will start by investigating how threat levels relate to race, age, and signs of mental illness.

Let’s start with the relationship between Threat Level and Race. We will create a cross-tabulation and visualize it to understand this relationship better.

The heatmap above visualizes the relationship between threat level and race in fatal police shootings, with values representing the percentage distribution of races within each threat level category.
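
A heatmap like this can be sketched with a pandas cross-tabulation normalized within each threat level and passed to seaborn; as before, the shootings DataFrame and its column names are assumptions about our local copy of the data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Percentage of each race within each threat level (each row sums to 100)
ct = pd.crosstab(shootings["threat_level"], shootings["race"], normalize="index") * 100

sns.heatmap(ct, annot=True, fmt=".1f", cmap="Blues")
plt.title("Race distribution (%) within each threat level")
plt.show()
```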

Observations:

  • Across all threat level categories, White (W) and Black (B) individuals constitute the majority of cases.
  • The distribution of races appears to be relatively consistent across different threat levels.
  • There is a slight increase in the percentage of White individuals in the “undetermined” category, which might indicate cases where the circumstances were less clear.

Correlations in Police Shootings Data

In our analysis of fatal police shootings, we’ve explored how factors like age, race, and threat levels correlate with signs of mental illness.

Age and Mental Illness: Our analysis revealed a significant correlation between age and signs of mental illness. The t-test showed a distinct age difference between individuals with and without signs of mental illness, with a t-statistic of 8.51 and a p-value near zero. This indicates a clear link between age and mental illness in these incidents.
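
For reference, a comparison of this kind can be run with scipy roughly as sketched below; the column names are assumptions about our copy of the data, and the statistics quoted above come from our actual analysis rather than from this snippet.

```python
from scipy import stats

# signs_of_mental_illness is treated as a boolean column (an assumption about our copy of the data)
with_illness = shootings.loc[shootings["signs_of_mental_illness"], "age"].dropna()
without_illness = shootings.loc[~shootings["signs_of_mental_illness"], "age"].dropna()

t_stat, p_value = stats.ttest_ind(with_illness, without_illness)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```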

Race and Mental Illness: Addressing race, we encountered initial data issues but corrected them to perform a chi-square test. The results showed a significant association between race and signs of mental illness, with a chi-square statistic of 171.23 and a p-value of 3.98×10⁻³⁵.

Threat Level and Mental Illness: We also found a significant relationship between threat level and signs of mental illness, with a chi-square statistic of 24.48 and a p-value of 4.82×10⁻⁶.
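
Both chi-square tests can be sketched the same way: build a contingency table with pandas and pass it to scipy’s chi2_contingency; again, the column names are assumptions about our copy of the data, and the figures above come from our actual analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of threat level vs signs of mental illness
table = pd.crosstab(shootings["threat_level"], shootings["signs_of_mental_illness"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}, dof = {dof}")
```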

Conclusion: Our analyses have illuminated significant correlations between age, race, threat level, and signs of mental illness in fatal police shootings. These insights pave the way for further investigation and a deeper understanding of these critical incidents. Our next step will be to analyze threat level distributions and their relationships with other variables.