In this lab we will learn how to work with linear models in RStudio.
In this lab we will learn how to:
As we did in the previous lab, start a new project:
Lab_7
would be a good name.Lab_7.Rproj
. This is the project configuration file, and you don’t need to do anything with it.Now you need to start a new RMarkdown report file.
Lab7Report
. (Do not use a filename extension.)#
) is already in use in this document for the report title. Use second-level headings (##
) for the main sections of the report. (If you need subsections, use ###
.)library(hanoverbase)
. Use the chunk options dialog to disable warnings, disable messages, and “show nothing (run code)”.In order to add a data file to the project itself, we start by uploading an Excel file:
Upload Files
dialog.guns.xlsx
in your Files pane.Now we need to import the actual data from the Excel file into the RStudio project into the report:
guns.xlsx
file. Choose the option Import Dataset. This should bring up the Import Excel Data dialog.View
line)and then click Cancel to close that window.hanoverbase
chunk). And paste the two lines of code that you copied into it. Edit the long filename string, which probably with something like "~/..."
to leave only the actual file name there, "guns.xlsx"
.View
command in the console to open up the new dataset.We created the file guns.xlsx
from data provided at www.openintro.org/stat/data. The five variables in the data are as follows:
You are now ready to start working on your report. The sections below give you questions to answer and commands to try. Here are a few reminders:
Insert
pulldown) for your R commands. Use the R Cheatsheet for help as needed.To get warmed up and familiarize ourselves with the variables, we will start with some one-variable investigations.
Make a favstats
summary and a histogram for the own_rate
variable; you should adjust the number of breaks in order to get a good view.
Also make a labeled dotplot to show the countries in the dataset, sorted by gun ownership rate:
If the dotplot looks too squeezed in the generated report, put it in its own chunk and use the chunk’s options menu and the option Use custom figure size to specify the desired height for the chunk.
You should see two clusters in the dotplot, a couple of high outliers, and one extremely high outlier.
Describe the distribution of own_rate
. What are the outliers and what are their gun ownership rates (number of guns per 100 population)?
Do the same for mort_rate
. Describe what you find.
Do the same for hdi
. Describe what you find.
We wonder if countries with high gun ownership rates also have high gun-related mortality rates. We can investigate this with a scatterplot. We start by making a scatterplot for mort_rate
(y) vs. own_rate
(x). We attach the name graph1
to the plot so that we can recall the plot in the future:
Based on the scatterplot we just made, describe the overall pattern (including direction, if any) of the data and identify any unusual points.
We’ll do some modeling without removing any observations to start with. Let’s start with a smooth fit curve. Recall that to see all of the available color names in R, you can run the command colors()
in the console.
Does it look like a linear model is a good fit for these data? Explain.
Let’s see what a linear model looks like on this plot:
In order to find the equation of our linear model, we need to calculate the slope and intercept. And to assess how well the model fits the data, we should calculate the square of the correlation (R-square linear).
fit1 <- lm(mort_rate~own_rate, data=guns)
coefficients(fit1)
r1 <- cor(mort_rate~own_rate, data=guns)
c("r"=r1, "rsquared"=r1^2)
The last two numbers in the printout are the correlation \(r\) and its square \(r^2\) respetively. The two previous numbers are the intercept and slope for the linear model.
In order to further assess the appropriateness of the linear model, we look at a residual plot, showing the residual (y minus fitted) vs. fitted. We add a horizontal line at 0 to help us judge the presence of a pattern:
Remember that any remaining pattern in the residual plot indicates an incomplete model.
Do you see that there is a pattern in the residuals, or do they look “unpatterned” for the most part?
Correlation and regression are both susceptible to the effects of outliers and other influential points. Let’s see what happens when we filter out the U.S. entry from the data. The following command will create a new dataset called gunsFiltered
by removing the row for the U.S. from the guns
dataset. You will need to replace the “..change this..” with an expression involving the various variables, to leave out the U.S. (Hint: The U.S. is the only country with a very high gun ownership rate).
If this was done correctly, you should see a new dataset entry in your environment that does not contain the U.S.
Repeat the code for #4 - #7 with the newly filtered dataset (gunsFiltered
), changing the names graph1
, fit1
and r1
to graph2
, fit2
and r2
respectively. In the new scatterplot, for example, you should no longer see the point for the U.S. (own_rate > 80).
Now we will show both linear models on the original (unfiltered) scatterplot. This demonstrates the effect of a single influential observation on the modeling process.
ladd(panel.abline(fit1, col="black", lwd=2), plot=graph1)
ladd(panel.abline(fit2, col="magenta", lwd=2))
Explain why the second line (fit2
) has a smaller slope.
We will now briefly take a look at how the two rates relate to the third variable of interest, hdi
. We can do so by creating a correlation table, that shows the correlations between all possible pairs of selected variables. To make a correlation table, use the cor
command in a new way:
Use a similar command to make a second table using the gunsFiltered
data.
To see the correlation between two variables x and y, for example, find x in the columns and y in the rows. The number for that row and column gives the correlation for that pair.
hdi
most strongly correlated?hdi
tries to measure how “advanced” the country is in terms of education, health, etc. How does it make sense that hdi
is positively correlated with gun ownership rate?We want to investigate whether the relationship between mortality rate and ownership rate varies depending on the hdi
of the various countries. In order to do that, we will split the countries into groups based on their hdi
. To find a good split, draw a histogram for hdi
using a suitable number of breaks. Where would you split the values into different categories?
Create a new column hdicat
in the dataset, which splits the cases depending on whether the hdi
is lower or higher than \(0.85\). We pipe the hdi
values into the cut
command to create the desired breaks.
If this worked correctly, you should see a new column hdicat
added to the guns
dataset.
Now we make a paneled scatterplot for mort_rate
versus own_rate
, paneled by hdicat
. We use ladd
to add the linear models and cor
to find the correlations:
xyplot(mort_rate~own_rate|hdicat, data=guns)
ladd(panel.lmline(x, y))
cor(mort_rate~own_rate, data=guns %>% filter(hdi <= 0.85))
cor(mort_rate~own_rate, data=guns %>% filter(hdi > 0.85))
What do you see in the resulting plot and correlations? Explain as best you can what is going on.
More > Export... > Download
. Do the same for Lab7Report.html.