Today’s class was about simple linear regression, a technique for modeling the relationship between two variables. Our key takeaway was the importance of gathering descriptive data first, since it makes the rest of the analysis smoother. With that data in hand, we can create informative graphs and calculate summary statistics like the median, mean, standard deviation, skewness, and kurtosis.
To put this into practice, I took the data provided in our project sheet and organized it into a new sheet containing only %diabetics, %inactivity, and %obesity. I then looked at how %diabetics relates to %inactivity and to %obesity, one pair at a time. I used the Python library NumPy to compute key statistics such as the median, mean, standard deviation, skewness, and kurtosis.
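As a rough sketch of that step, here is one way these statistics could be computed with NumPy alone. NumPy has no built-in skewness or kurtosis function, so this version derives them from standardized moments; `scipy.stats.skew` and `scipy.stats.kurtosis` are an alternative. The helper name `describe` and the sample values are made up for illustration and are not the project data.

```python
import numpy as np

def describe(values):
    """Return basic descriptive statistics for a 1-D sequence of numbers.

    Hypothetical helper: uses the population standard deviation (ddof=0)
    and plain (non-excess) kurtosis, so a normal distribution scores about 3.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std()                 # population standard deviation
    z = (x - mean) / std          # standardized values
    return {
        "median": float(np.median(x)),
        "mean": float(mean),
        "std": float(std),
        "skewness": float(np.mean(z ** 3)),   # third standardized moment
        "kurtosis": float(np.mean(z ** 4)),   # fourth standardized moment
    }

# Made-up percentages, standing in for one column of the sheet.
sample_pct_diabetics = [7.1, 8.4, 9.0, 6.5, 10.2, 8.8]
for name, value in describe(sample_pct_diabetics).items():
    print(f"{name}: {value:.3f}")
```

Subtracting 3 from the kurtosis gives the excess kurtosis, which is the convention `scipy.stats.kurtosis` uses by default.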
Next on the agenda is creating graphs of this data. Specifically, I will be looking at the correlation between %diabetics and %inactivity, visualizing this relationship with a scatterplot. Fitting a regression line to that scatterplot will also give me the R-squared value, which measures how much of the variation in %diabetics the linear model explains. After that, I’ll analyze the residuals to see how well our linear model is performing and whether it’s a valid representation of the data.
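To make the plan concrete, here is a minimal sketch of the fitting step, assuming an ordinary least-squares line fitted with `np.polyfit`. The function name `fit_line` and the numbers are invented for illustration; the real analysis would use the columns from the project sheet.

```python
import numpy as np

def fit_line(x, y):
    """Fit y = slope * x + intercept by least squares.

    Hypothetical helper: returns the slope, intercept, R-squared,
    and the residuals (observed minus predicted).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)    # degree-1 polynomial fit
    predicted = slope * x + intercept
    residuals = y - predicted
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
    r_squared = 1.0 - ss_res / ss_tot
    return float(slope), float(intercept), float(r_squared), residuals

# Made-up percentages, not the project data.
pct_inactivity = [14.0, 16.5, 18.2, 20.1, 22.4]
pct_diabetics = [6.8, 7.5, 8.1, 8.3, 9.4]

slope, intercept, r_squared, residuals = fit_line(pct_inactivity, pct_diabetics)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
print("residuals:", np.round(residuals, 3))
```

If the linear model is a reasonable fit, the residuals should scatter around zero with no obvious pattern when plotted against %inactivity.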