# intro(9).pdf – Simple linear regression Contents R and Rstudio 1 The basics 1 Fitting regression lines . . . . . . . . . . . . . . . . . . . . . . . . .

Unformatted text preview: Simple linear regression Contents
R and Rstudio 1 The basics 1 Fitting regression lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Linear correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Identifying outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Categorical predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Baseline parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Cell means parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Using the predict function 10 R and Rstudio
In this course, we’ll code in the programming language R and we’ll use Rstudio to do so. Rstudio is a
program that makes coding in R nice. It has features that allow us to import dataset through menus, save
plots easily, etc.
For this course, we don’t need to know all the intricacies of R programming, but it is certainly helpful to learn
a little bit. I recommend the first four chapters of the open access book ModernDive and the DataCamp
courses “Introduction to R” and “Introduction to the Tidyverse”. That should be enough background for
everybody, especially if you have some prior coding background. The basics
We are going to analyze a dataset that has information about pregnancy rates, violent crimes, and poverty
rates in the US in 2002. The data are hosted at and the source is the course STAT462
at Penn State.
Before we read in the data, we’ll need to have library(readr) installed and loaded. If you don’t have it
installed, you can do it with the command install.packages(“readr”). Once you have installed it, you
other library in R (instead of readr, you’d type in the name of the library you’re interested in).
We can read in the data using the Import Dataset menu in Rstudio with the option “From text (readr)”.
Be careful, though, because by default Rstudio will think that the delimiter (column separator) is a comma,
which is wrong in this case. If we change it to white space, it will work (please see the videolecture if this is
unclear). Alternatively, the code below will read in the data for us.
After running the code above (or using the Import Dataset menu), you’ll have a dataset named index that
contains all the information we need.
We can get a quick look at the dataset using summary
summary(index) ##
##
##
##
##
##
##
##
##
##
##
##
##
## Location
Length:51
Class :character
Mode :character ViolCrime
Min.
: 0.900
1st Qu.: 3.900
Median : 6.300
Mean
: 7.855
3rd Qu.: 9.450
Max.
:65.000 PovPct
Min.
: 5.30
1st Qu.:10.25
Median :12.20
Mean
:13.12
3rd Qu.:15.80
Max.
:25.30
TeenBrth
Min.
:20.00
1st Qu.:33.90
Median :39.50
Mean
:42.24
3rd Qu.:52.60
Max.
:69.10 Brth15to17
Min.
: 8.10
1st Qu.:17.25
Median :20.00
Mean
:22.28
3rd Qu.:28.10
Max.
:44.80 Brth18to19
Min.
: 39.00
1st Qu.: 58.30
Median : 69.40
Mean
: 72.02
3rd Qu.: 87.95
Max.
:104.30 The variable PovPct has the percentage of population in each Location that lives under the poverty line.
The variables Brth15to17 and Brth18to19 have pregnancy rates (over 1000 people) among 15 to 17 year
olds and 18 to 19 year olds, respectively. ViolCrime has violent crime rates over 1000 people. The output
from summary tells us, for example, that the average poverty percentage is 13.12.
In this course, we’ll use functions in library(tidyverse). If you don’t have the library, you can install it
with the command install.packages(“tidyverse”). Once you have the library installed, you can load it
with the command
library(tidyverse) Fitting regression lines
Let’s see if poverty rates are associated with birth rates at 15 to 17 years old. The code below gives us a
plot with poverty rates on the x-axis and birth rates on the y-axis. It has a linear trend overlaid which, in
some sense, is the best line we can fit to the cloud of points.
ggplot(index) +
aes(x = PovPct, y = Brth15to17) +
geom_point() +
geom_smooth(method = “lm”) 2 Brth15to17 40 30 20 10
5 10 15 20 25 PovPct
The command is fairly intuitive. First, we tell R that we want to create a plot with the dataset index with
the command ggplot(index). Then, we tell R the variables that we’re going to use on the x- and y-axes
in aes(x = PovPct, y = Brth15to17). After that, geom_point() tells R that we want a scatterplot (that
is, a cloud of points), and geom_smooth(method = “lm”) adds the linear trend. If we run geom_smooth()
without method = “lm”, the fitted trend is non-linear. Try it out if you’re curious. If this is the first time
you see a ggplot command, you might be intimidated. However, all ggplots look the same, conceptually,
so in practice you can copy and paste the code above and adapt it to a new problem when you need to.
The plot we created had a line overlaid. In general, what is the equation of a line? It is
y =m×x+b
In the equation above
• b is the intercept, which is the value that the line takes on at x = 0.
• m is the slope, which is the expected increment in y after increasing x by one unit.
In R, we can find regression lines (that is, the line that you can see on the plot) using the lm function. Here’s
the code
lm(Brth15to17 ~ PovPct, data = index)
##
##
##
##
##
##
## Call:
lm(formula = Brth15to17 ~ PovPct, data = index)
Coefficients:
(Intercept)
4.267 PovPct
1.373 lm works as follows:
1. Put in the variable you want to have on the y-axis (here, it’s Brth15to17).
2. Put a tilde ~.
3 3. Put the variable you want on the x-axis (here, it’s PovPct).
4. Put in data = “name of dataset” (here, it’s index).
The equation of the line that you see on the plot is
Brth15to17 = 4.267 + 1.373 × PovPct
How can we interpret these numbers in context?
• Interpretation of 4.267: If we were in an ideal state with 0% of people living under the poverty line
our line predicts a rate of 4.267 births per 1000 among women between 15 to 17.
• Interpretation of 1.373: If the percentage of people in the state who live under the poverty line increases
by 1%, we expect an increase in pregnancy rates of 1.373 births.
What we have covered so far is called simple linear regression. It’s simply fitting a line through a cloud
of points. With a line, we can use the variable x (here it’s PovPct) to explain or predict the variable y (here,
Brth15to17).
Let’s work through another example. We’ll work with data(“evals”) from library(openintro). Our goal
is investigating whether better-looking professors tend to get better teaching evaluation scores. Let’s read
the data in.
library(openintro)
data(“evals”)
If you don’t have library(openintro), you’ll have to install it first.
Let’s create a plot that shows the teaching evaluations scores on the y-axis and how “beautiful” the professors
are on the x-axis (on a scale from 0 to 10). In the data, teaching evaluation scores are named score and
the “beauty” scores are named bty_avg.
ggplot(evals) +
aes(x = bty_avg, y = score) +
geom_point() +
geom_smooth(method = “lm”)
5 score 4 3 2 4 6 bty_avg
4 8 Compare the code above to the code we wrote before, where PovPct was on the x-axis and Brth15to17 was
on the y-axis. You’ll notice that the only changes are the name of the dataset (now, it’s evals instead of
index) and the variables involved in the plot.
Let’s run lm to find the equation of the line we see on the plot.
lm(score ~ bty_avg, data = evals)
##
##
##
##
##
##
## Call:
lm(formula = score ~ bty_avg, data = evals)
Coefficients:
(Intercept)
3.88034 bty_avg
0.06664 In this case, the line that we see on the plot is
score = 3.88034 + 0.06664 × bty_avg
Let’s interpret the intercept and slope in context:
• Interpretation of intercept: the predicted teaching evaluation score for a professor whose beauty rating
is a 0/10 is 3.88034.
• Interpretation of the slope: the expected increase in teaching evaluation scores by increasing bty_avg
by one unit is 0.06664.
Given the line, we can predict the teaching evaluation scores for any given value of bty_avg. For example,
the predicted score for a professor whose bty_avg is a 10/10 is 3.88034 + 0.06664 × 10 = 4.54674.
Exercise: Try repeating the same analysis, but now with age instead bty_avg as the predictor (that is, as
the variable on the x-axis). Linear correlation coefficient
Pearson’s linear correlation coefficient quantifies thes strength of linear relationships. If it’s near 0, it means
that there isn’t a linear relationship between the variables. If you get something near 1, it means that there
is a positive (line goes up) linear association among the variables. And if you get something close to -1, it
means that there is a negative (line goes down) linear association between the variables.
Here’s how you find the correlation between PovPct and Brth15to17:
cor(index\$PovPct, index\$Brth15to17)
## [1] 0.7302931
The correlation is about 0.73 which is positive and far away from 0.
And here’s how you’d find the correlation between bty_avg and score.
cor(evals\$bty_avg, evals\$score)
## [1] 0.1871424
5 It’s 0.19. It’s smaller than the previous correlation. It makes sense, intuitively, since the relationship
between PovPct and Brth15to17 looks stronger (visually) than the one between bty_avg and score. Both
are positive since, in both cases, the fitted line goes up.
Exercise: Find the correlation between bty_avg and score in the evals dataset. Comment on what you
see and compare it to the correlations we just found.
Caution! The correlation coefficient is good at quantifying the strength of linear relationships. If the
relationship between two variables is non-linear, the linear correlation coefficient is not a good measure.
Take a look at the following example,
x = -10:10
x
## [1] -10
## [20]
9 -9
10 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 y = x^2
qplot(x = x, y = y)
100 y 75 50 25 0
−10 −5 0 5 10 x
cor(x, y)
## [1] 0
The variables x and y are very highly related: y is a function of x. However, the linear correlation coefficient
is equal to 0. In other words, the linear correlation coefficient tells us whether there is a line that is good at
describing the relationship between the variables. In the example above, there isn’t a line that describes the
relationship between x and y well, so the correlation is low (in this case, exactly 0). Identifying outliers
Let’s go back to the index dataset, which is the one that has poverty rates in all 50 states plus DC. Now,
let’s display the relationship between poverty (PovPct) and violent crime rates (ViolCrime): 6 ggplot(index) +
aes(x = PovPct, y = ViolCrime) +
geom_point() +
geom_smooth(method = “lm”) ViolCrime 60 40 20 0
5 10 15 20 25 PovPct
We can see that, in this case, there is an observation that clearly sticks out. In statistics, observations that
very far away from the bulk of the data are referred to as outliers.
There are different ways we can identify the outlier. In lecture, I clicked on index in the Data panel of
Rstudio (top right corner) and then clicked on the variable name ViolCrime in the menu, which sorts the
data. Another way to identify the outlier is the command
index %>% slice_max(ViolCrime)
## # A tibble: 1 x 6
##
Location
PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
##
<chr>
<dbl>
<dbl>
<dbl>
<dbl>
<dbl>
## 1 District_of_Columbia
22
44.8
102.
65
69.1
The command slice_max finds the maximum of a variable. In this case, it finds the maximum of ViolCrime.
The observation that sticks out is DC.
Yet another option is creating an interactive plot with library(plotly). The steps for creating an interactive plot are
2. Create a ggplot.
3. Save the plot into a variable.
4. Run the ggplotly function on the plot you just saved.
Let’s create an interactive plot for this example. First, we load library(plotly) and create a ggplot. We
use the Location (that is, the variable that tells us the state each observation comes from) as a label. 7 library(plotly)
violent_plot = ggplot(index) +
aes(x = PovPct, y = ViolCrime, label = Location) +
geom_point() +
geom_smooth(method = “lm”)
Then, we can create the interactive plot by running
ggplotly(violent_plot)
I can’t display the interactive plot in this document, but if you run the code in your computer (or take a
look at the videolecture), you’ll see that you can identify the points by clicking on them. If you go near the
outlier, you’ll see it’s DC. Categorical predictors
In previous sections, we built models to predict a numerical outcome (y-axis) as a function of a numerical
predictor (x-axis). For example, we built a model that predicted teaching evaluation scores (score) as a
function of how beautiful the professors are (bty_avg).
In this section, we’ll see how to build linear models when the outcome variable is still numerical and the
predictor is categorical. There are different ways we can work with categorical predictors. Here, we will
consider two: the baseline and cell means parametrizations.
Baseline parametrization
Our first example will use the teaching evaluations data we’ve been working with. Recall that the dataset is
available in library(openintro) and can be read in with the command data(evals). We’re going to fit
a linear model that predicts teaching evaluation scores (score) given gender, which here can take on the
values male and female.
In previous sections, we used the function lm to fit the model. The same syntax works here.
lm(score ~ gender, data = evals)
##
##
##
##
##
##
## Call:
lm(formula = score ~ gender, data = evals)
Coefficients:
(Intercept)
gendermale
4.0928
0.1415 How can we interpret the output?
• gendermale is a “dummy” variable created by R which takes on the values 1 for male and 0 otherwise
(that is, it is 0 if gender is female. [In general, in statistics, variables that can take on the values 0
or 1 exclusively are often referred to as “dummy” variables.]
• The predicted values are
Predicted score for female : 4.0928 + 0.1415 × 0 = 4.0928
Predicted score for male : 4.0928 + 0.1415 × 1 = 4.2343
8 • This is still a line, and we can interpret the intercept and slope in context. Here, the intercept is
the predicted score for female and the slope is the difference in predicted scores among males and
females.
This is just an example but, in general, linear models with categorical predictors in the baseline parametrization will look like this. R will create dummy variables. The intercept represents the predictions for a baseline or reference category (both terms are used interchangeably). In our example, the baseline category
is female1 The slopes of the dummy variables represent differences with respect to the baseline category.
Here, the slope represents the predicted difference in scores between male and the baseline, which is female.
Let’s do another example. Now, let’s fit a model that predicts score given the rank of the instructor, which
can be teaching, tenure track or tenured.
lm(score ~ rank, data = evals)
##
## Call:
## lm(formula = score ~ rank, data = evals)
##
## Coefficients:
##
(Intercept) ranktenure track
ranktenured
##
4.2843
-0.1297
-0.1452
We have 2 dummy variables: one for tenure track and another one for tenured. The baseline category
is teaching. In general, the number of dummy variables you will see will be equal to “total number of
categories minus one.” In this example, we have three categories, so we end up having two dummies. How
can we interpret these numbers?
• The intercept is the predicted score for the baseline category, which is teaching.
• ranktenure track is a dummy variable which is equal to 1 if the rank of an instructor is tenure
track and 0 otherwise. Similarly, ranktenured is 1 if rank is tenured and 0 otherwise.
• These are the predicted teaching evaluations score, by rank
Predicted score for teaching : 4.2843 − 0.1297 × 0 − 0.1452 × 0 = 4.2843
Predicted score for tenure track : 4.2843 − 0.1297 × 1 − 0.1452 × 0 = 4.1546
Predicted score for tenured : 4.2843 − 0.1297 × 0 − 0.1452 × 1 = 4.1391.
In our first examples with numerical predictors, it was intuitively clear that lm was finding, in some sense,
the best line that we could fit through the cloud of points. What is lm doing now? The answer is that the
predicted values are means grouped by the categorical variable. For example, in this example with rank:
library(tidyverse)
evals %>% group_by(rank) %>% summarize(mean(score))
##
##
##
##
##
## # A tibble: 3 x 2
rank
‘mean(score)‘
<fct>
<dbl>
1 teaching
4.28
2 tenure track
4.15
3 tenured
4.14
1 Unless we explicitly change it, R will pick the baseline to be the first category in alphabetical order. 9 The same will be true for the example we did by gender: the predicted score for male as given by lm is
going to be the average score for male in the data (the same goes for female). Try it out.
Exercise Load in data(gapminder) in library(gapminder). Use a subset of the data that only includes
information about the year 2007, that is:
library(gapminder)
data(gapminder)
gap2007 = gapminder %>% filter(year == 2007)
Find grouped means of lifeExp (life expectancy in years) by continents and fit a regression model where
the outcome is lifeExp and the predictor is continent. How many dummy variables are there? What is
the interpretation of the intercept and the slopes? Compare your predicted values with lm to the grouped
means.
Cell means parametrization
In the baseline parametrization, R picks a baseline category and the coefficients of the dummy variables
represent differences with respect to the baseline. There is another way to work with categorical variables
which doesn’t require working with differences with respect to a baseline. It is called the cell means
parametrization.
If we go back to our example with data(evals) in library(openintro), where the outcome was score
and the predictor was rank, we could have done the following
lm(score ~ rank + 0, data = evals)
##
## Call:
## lm(formula = score ~ rank + 0, data = evals)
##
## Coefficients:
##
rankteaching ranktenure track
ranktenured
##
4.284
4.155
4.139
The difference is that we’re writing + 0 after rank. The output is fairly easy to interpret: the coefficients
are the predicted values for the different ranks. There is no notion of baseline or differences with respect
to a baseline. In any case, the results are the same, since we get the same predicted values. These are just
different ways of writing the same model.
While this output is easier to interpret (and there is nothing wrong with it), most statisticians use the
baseline parametrization, so I’ll expect you to understand it (i.e. it will be on the tests).
Exercise Repeat the analyses we did in this section, but now with the cell means parametrization. Compare
the results. Using the predict function
So far, we’ve found predictions “manually” with the equation of the line we found with lm. In practice, we
don’t have to do that. We can use the predict function.
For example, suppose that we want to predict math scores for students who scored 10, 30, 50, and 100 in
read. First, we save the output of lm into a variable
10 mod = lm(math ~ read, data = hsb2)
Then, we create a data.frame with the values of read that we want to predict
It is important that variable name (here, read) in this new data.frame is the same as the variable name in
the model.
Then, we can use predict