**Unformatted text preview: **Simple linear regression Contents

R and Rstudio 1 The basics 1 Fitting regression lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Linear correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Identifying outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Categorical predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Baseline parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Cell means parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Using the predict function 10 R and Rstudio

In this course, we’ll code in the programming language R and we’ll use Rstudio to do so. Rstudio is a

program that makes coding in R nice. It has features that allow us to import dataset through menus, save

plots easily, etc.

For this course, we don’t need to know all the intricacies of R programming, but it is certainly helpful to learn

a little bit. I recommend the first four chapters of the open access book ModernDive and the DataCamp

courses “Introduction to R” and “Introduction to the Tidyverse”. That should be enough background for

everybody, especially if you have some prior coding background. The basics

We are going to analyze a dataset that has information about pregnancy rates, violent crimes, and poverty

rates in the US in 2002. The data are hosted at and the source is the course STAT462

at Penn State.

Before we read in the data, we’ll need to have library(readr) installed and loaded. If you don’t have it

installed, you can do it with the command install.packages(“readr”). Once you have installed it, you

can load it with the command library(readr). The same procedure works for installing and loading any

other library in R (instead of readr, you’d type in the name of the library you’re interested in).

We can read in the data using the Import Dataset menu in Rstudio with the option “From text (readr)”.

Be careful, though, because by default Rstudio will think that the delimiter (column separator) is a comma,

which is wrong in this case. If we change it to white space, it will work (please see the videolecture if this is

unclear). Alternatively, the code below will read in the data for us.

1 library(readr)

index = read_table2(” 😉

After running the code above (or using the Import Dataset menu), you’ll have a dataset named index that

contains all the information we need.

We can get a quick look at the dataset using summary

summary(index) ##

##

##

##

##

##

##

##

##

##

##

##

##

## Location

Length:51

Class :character

Mode :character ViolCrime

Min.

: 0.900

1st Qu.: 3.900

Median : 6.300

Mean

: 7.855

3rd Qu.: 9.450

Max.

:65.000 PovPct

Min.

: 5.30

1st Qu.:10.25

Median :12.20

Mean

:13.12

3rd Qu.:15.80

Max.

:25.30

TeenBrth

Min.

:20.00

1st Qu.:33.90

Median :39.50

Mean

:42.24

3rd Qu.:52.60

Max.

:69.10 Brth15to17

Min.

: 8.10

1st Qu.:17.25

Median :20.00

Mean

:22.28

3rd Qu.:28.10

Max.

:44.80 Brth18to19

Min.

: 39.00

1st Qu.: 58.30

Median : 69.40

Mean

: 72.02

3rd Qu.: 87.95

Max.

:104.30 The variable PovPct has the percentage of population in each Location that lives under the poverty line.

The variables Brth15to17 and Brth18to19 have pregnancy rates (over 1000 people) among 15 to 17 year

olds and 18 to 19 year olds, respectively. ViolCrime has violent crime rates over 1000 people. The output

from summary tells us, for example, that the average poverty percentage is 13.12.

In this course, we’ll use functions in library(tidyverse). If you don’t have the library, you can install it

with the command install.packages(“tidyverse”). Once you have the library installed, you can load it

with the command

library(tidyverse) Fitting regression lines

Let’s see if poverty rates are associated with birth rates at 15 to 17 years old. The code below gives us a

plot with poverty rates on the x-axis and birth rates on the y-axis. It has a linear trend overlaid which, in

some sense, is the best line we can fit to the cloud of points.

ggplot(index) +

aes(x = PovPct, y = Brth15to17) +

geom_point() +

geom_smooth(method = “lm”) 2 Brth15to17 40 30 20 10

5 10 15 20 25 PovPct

The command is fairly intuitive. First, we tell R that we want to create a plot with the dataset index with

the command ggplot(index). Then, we tell R the variables that we’re going to use on the x- and y-axes

in aes(x = PovPct, y = Brth15to17). After that, geom_point() tells R that we want a scatterplot (that

is, a cloud of points), and geom_smooth(method = “lm”) adds the linear trend. If we run geom_smooth()

without method = “lm”, the fitted trend is non-linear. Try it out if you’re curious. If this is the first time

you see a ggplot command, you might be intimidated. However, all ggplots look the same, conceptually,

so in practice you can copy and paste the code above and adapt it to a new problem when you need to.

The plot we created had a line overlaid. In general, what is the equation of a line? It is

y =m×x+b

In the equation above

• b is the intercept, which is the value that the line takes on at x = 0.

• m is the slope, which is the expected increment in y after increasing x by one unit.

In R, we can find regression lines (that is, the line that you can see on the plot) using the lm function. Here’s

the code

lm(Brth15to17 ~ PovPct, data = index)

##

##

##

##

##

##

## Call:

lm(formula = Brth15to17 ~ PovPct, data = index)

Coefficients:

(Intercept)

4.267 PovPct

1.373 lm works as follows:

1. Put in the variable you want to have on the y-axis (here, it’s Brth15to17).

2. Put a tilde ~.

3 3. Put the variable you want on the x-axis (here, it’s PovPct).

4. Put in data = “name of dataset” (here, it’s index).

The equation of the line that you see on the plot is

Brth15to17 = 4.267 + 1.373 × PovPct

How can we interpret these numbers in context?

• Interpretation of 4.267: If we were in an ideal state with 0% of people living under the poverty line

our line predicts a rate of 4.267 births per 1000 among women between 15 to 17.

• Interpretation of 1.373: If the percentage of people in the state who live under the poverty line increases

by 1%, we expect an increase in pregnancy rates of 1.373 births.

What we have covered so far is called simple linear regression. It’s simply fitting a line through a cloud

of points. With a line, we can use the variable x (here it’s PovPct) to explain or predict the variable y (here,

Brth15to17).

Let’s work through another example. We’ll work with data(“evals”) from library(openintro). Our goal

is investigating whether better-looking professors tend to get better teaching evaluation scores. Let’s read

the data in.

library(openintro)

data(“evals”)

If you don’t have library(openintro), you’ll have to install it first.

Let’s create a plot that shows the teaching evaluations scores on the y-axis and how “beautiful” the professors

are on the x-axis (on a scale from 0 to 10). In the data, teaching evaluation scores are named score and

the “beauty” scores are named bty_avg.

ggplot(evals) +

aes(x = bty_avg, y = score) +

geom_point() +

geom_smooth(method = “lm”)

5 score 4 3 2 4 6 bty_avg

4 8 Compare the code above to the code we wrote before, where PovPct was on the x-axis and Brth15to17 was

on the y-axis. You’ll notice that the only changes are the name of the dataset (now, it’s evals instead of

index) and the variables involved in the plot.

Let’s run lm to find the equation of the line we see on the plot.

lm(score ~ bty_avg, data = evals)

##

##

##

##

##

##

## Call:

lm(formula = score ~ bty_avg, data = evals)

Coefficients:

(Intercept)

3.88034 bty_avg

0.06664 In this case, the line that we see on the plot is

score = 3.88034 + 0.06664 × bty_avg

Let’s interpret the intercept and slope in context:

• Interpretation of intercept: the predicted teaching evaluation score for a professor whose beauty rating

is a 0/10 is 3.88034.

• Interpretation of the slope: the expected increase in teaching evaluation scores by increasing bty_avg

by one unit is 0.06664.

Given the line, we can predict the teaching evaluation scores for any given value of bty_avg. For example,

the predicted score for a professor whose bty_avg is a 10/10 is 3.88034 + 0.06664 × 10 = 4.54674.

Exercise: Try repeating the same analysis, but now with age instead bty_avg as the predictor (that is, as

the variable on the x-axis). Linear correlation coefficient

Pearson’s linear correlation coefficient quantifies thes strength of linear relationships. If it’s near 0, it means

that there isn’t a linear relationship between the variables. If you get something near 1, it means that there

is a positive (line goes up) linear association among the variables. And if you get something close to -1, it

means that there is a negative (line goes down) linear association between the variables.

Here’s how you find the correlation between PovPct and Brth15to17:

cor(index$PovPct, index$Brth15to17)

## [1] 0.7302931

The correlation is about 0.73 which is positive and far away from 0.

And here’s how you’d find the correlation between bty_avg and score.

cor(evals$bty_avg, evals$score)

## [1] 0.1871424

5 It’s 0.19. It’s smaller than the previous correlation. It makes sense, intuitively, since the relationship

between PovPct and Brth15to17 looks stronger (visually) than the one between bty_avg and score. Both

are positive since, in both cases, the fitted line goes up.

Exercise: Find the correlation between bty_avg and score in the evals dataset. Comment on what you

see and compare it to the correlations we just found.

Caution! The correlation coefficient is good at quantifying the strength of linear relationships. If the

relationship between two variables is non-linear, the linear correlation coefficient is not a good measure.

Take a look at the following example,

x = -10:10

x

## [1] -10

## [20]

9 -9

10 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 y = x^2

qplot(x = x, y = y)

100 y 75 50 25 0

−10 −5 0 5 10 x

cor(x, y)

## [1] 0

The variables x and y are very highly related: y is a function of x. However, the linear correlation coefficient

is equal to 0. In other words, the linear correlation coefficient tells us whether there is a line that is good at

describing the relationship between the variables. In the example above, there isn’t a line that describes the

relationship between x and y well, so the correlation is low (in this case, exactly 0). Identifying outliers

Let’s go back to the index dataset, which is the one that has poverty rates in all 50 states plus DC. Now,

let’s display the relationship between poverty (PovPct) and violent crime rates (ViolCrime): 6 ggplot(index) +

aes(x = PovPct, y = ViolCrime) +

geom_point() +

geom_smooth(method = “lm”) ViolCrime 60 40 20 0

5 10 15 20 25 PovPct

We can see that, in this case, there is an observation that clearly sticks out. In statistics, observations that

very far away from the bulk of the data are referred to as outliers.

There are different ways we can identify the outlier. In lecture, I clicked on index in the Data panel of

Rstudio (top right corner) and then clicked on the variable name ViolCrime in the menu, which sorts the

data. Another way to identify the outlier is the command

index %>% slice_max(ViolCrime)

## # A tibble: 1 x 6

##

Location

PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth

##

<chr>

<dbl>

<dbl>

<dbl>

<dbl>

<dbl>

## 1 District_of_Columbia

22

44.8

102.

65

69.1

The command slice_max finds the maximum of a variable. In this case, it finds the maximum of ViolCrime.

The observation that sticks out is DC.

Yet another option is creating an interactive plot with library(plotly). The steps for creating an interactive plot are

1. Load library(plotly).

2. Create a ggplot.

3. Save the plot into a variable.

4. Run the ggplotly function on the plot you just saved.

Let’s create an interactive plot for this example. First, we load library(plotly) and create a ggplot. We

use the Location (that is, the variable that tells us the state each observation comes from) as a label. 7 library(plotly)

violent_plot = ggplot(index) +

aes(x = PovPct, y = ViolCrime, label = Location) +

geom_point() +

geom_smooth(method = “lm”)

Then, we can create the interactive plot by running

ggplotly(violent_plot)

I can’t display the interactive plot in this document, but if you run the code in your computer (or take a

look at the videolecture), you’ll see that you can identify the points by clicking on them. If you go near the

outlier, you’ll see it’s DC. Categorical predictors

In previous sections, we built models to predict a numerical outcome (y-axis) as a function of a numerical

predictor (x-axis). For example, we built a model that predicted teaching evaluation scores (score) as a

function of how beautiful the professors are (bty_avg).

In this section, we’ll see how to build linear models when the outcome variable is still numerical and the

predictor is categorical. There are different ways we can work with categorical predictors. Here, we will

consider two: the baseline and cell means parametrizations.

Baseline parametrization

Our first example will use the teaching evaluations data we’ve been working with. Recall that the dataset is

available in library(openintro) and can be read in with the command data(evals). We’re going to fit

a linear model that predicts teaching evaluation scores (score) given gender, which here can take on the

values male and female.

In previous sections, we used the function lm to fit the model. The same syntax works here.

lm(score ~ gender, data = evals)

##

##

##

##

##

##

## Call:

lm(formula = score ~ gender, data = evals)

Coefficients:

(Intercept)

gendermale

4.0928

0.1415 How can we interpret the output?

• gendermale is a “dummy” variable created by R which takes on the values 1 for male and 0 otherwise

(that is, it is 0 if gender is female. [In general, in statistics, variables that can take on the values 0

or 1 exclusively are often referred to as “dummy” variables.]

• The predicted values are

Predicted score for female : 4.0928 + 0.1415 × 0 = 4.0928

Predicted score for male : 4.0928 + 0.1415 × 1 = 4.2343

8 • This is still a line, and we can interpret the intercept and slope in context. Here, the intercept is

the predicted score for female and the slope is the difference in predicted scores among males and

females.

This is just an example but, in general, linear models with categorical predictors in the baseline parametrization will look like this. R will create dummy variables. The intercept represents the predictions for a baseline or reference category (both terms are used interchangeably). In our example, the baseline category

is female1 The slopes of the dummy variables represent differences with respect to the baseline category.

Here, the slope represents the predicted difference in scores between male and the baseline, which is female.

Let’s do another example. Now, let’s fit a model that predicts score given the rank of the instructor, which

can be teaching, tenure track or tenured.

lm(score ~ rank, data = evals)

##

## Call:

## lm(formula = score ~ rank, data = evals)

##

## Coefficients:

##

(Intercept) ranktenure track

ranktenured

##

4.2843

-0.1297

-0.1452

We have 2 dummy variables: one for tenure track and another one for tenured. The baseline category

is teaching. In general, the number of dummy variables you will see will be equal to “total number of

categories minus one.” In this example, we have three categories, so we end up having two dummies. How

can we interpret these numbers?

• The intercept is the predicted score for the baseline category, which is teaching.

• ranktenure track is a dummy variable which is equal to 1 if the rank of an instructor is tenure

track and 0 otherwise. Similarly, ranktenured is 1 if rank is tenured and 0 otherwise.

• These are the predicted teaching evaluations score, by rank

Predicted score for teaching : 4.2843 − 0.1297 × 0 − 0.1452 × 0 = 4.2843

Predicted score for tenure track : 4.2843 − 0.1297 × 1 − 0.1452 × 0 = 4.1546

Predicted score for tenured : 4.2843 − 0.1297 × 0 − 0.1452 × 1 = 4.1391.

In our first examples with numerical predictors, it was intuitively clear that lm was finding, in some sense,

the best line that we could fit through the cloud of points. What is lm doing now? The answer is that the

predicted values are means grouped by the categorical variable. For example, in this example with rank:

library(tidyverse)

evals %>% group_by(rank) %>% summarize(mean(score))

##

##

##

##

##

## # A tibble: 3 x 2

rank

‘mean(score)‘

<fct>

<dbl>

1 teaching

4.28

2 tenure track

4.15

3 tenured

4.14

1 Unless we explicitly change it, R will pick the baseline to be the first category in alphabetical order. 9 The same will be true for the example we did by gender: the predicted score for male as given by lm is

going to be the average score for male in the data (the same goes for female). Try it out.

Exercise Load in data(gapminder) in library(gapminder). Use a subset of the data that only includes

information about the year 2007, that is:

library(gapminder)

data(gapminder)

gap2007 = gapminder %>% filter(year == 2007)

Find grouped means of lifeExp (life expectancy in years) by continents and fit a regression model where

the outcome is lifeExp and the predictor is continent. How many dummy variables are there? What is

the interpretation of the intercept and the slopes? Compare your predicted values with lm to the grouped

means.

Cell means parametrization

In the baseline parametrization, R picks a baseline category and the coefficients of the dummy variables

represent differences with respect to the baseline. There is another way to work with categorical variables

which doesn’t require working with differences with respect to a baseline. It is called the cell means

parametrization.

If we go back to our example with data(evals) in library(openintro), where the outcome was score

and the predictor was rank, we could have done the following

lm(score ~ rank + 0, data = evals)

##

## Call:

## lm(formula = score ~ rank + 0, data = evals)

##

## Coefficients:

##

rankteaching ranktenure track

ranktenured

##

4.284

4.155

4.139

The difference is that we’re writing + 0 after rank. The output is fairly easy to interpret: the coefficients

are the predicted values for the different ranks. There is no notion of baseline or differences with respect

to a baseline. In any case, the results are the same, since we get the same predicted values. These are just

different ways of writing the same model.

While this output is easier to interpret (and there is nothing wrong with it), most statisticians use the

baseline parametrization, so I’ll expect you to understand it (i.e. it will be on the tests).

Exercise Repeat the analyses we did in this section, but now with the cell means parametrization. Compare

the results. Using the predict function

So far, we’ve found predictions “manually” with the equation of the line we found with lm. In practice, we

don’t have to do that. We can use the predict function.

For example, suppose that we want to predict math scores for students who scored 10, 30, 50, and 100 in

read. First, we save the output of lm into a variable

10 mod = lm(math ~ read, data = hsb2)

Then, we create a data.frame with the values of read that we want to predict

pred_read = data.frame(read = c(10, 30, 50, 100))

It is important that variable name (here, read) in this new data.frame is the same as the variable name in

the model.

Then, we can use predict

predict(mod, newdata = pred_read) ##

1

2

3

4

## 27.08963 39.19258 51.29552 81.55289

The outputs above are the predicted math scores for students who scored 10, 30, 50, and 100 (in that order). 11 …

View

Full Document *
*

Read more here: Source link