Chapter 2 Practical 1: Getting Started with RStudio, Data and Spatial Data | GEOG5917 Big Data & Consumer Analytics

The preceding sections created a number of R objects. You should see them in the Environment pane in RStudio, or you can list them by entering ls() at the console. There are a number of fundamental data types in R that are the building blocks for data analysis. The sections below explore different data types and illustrate further operations on them.

Vectors

Examples of vectors are:
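A sketch of four such vectors follows. The values are chosen here purely for illustration; in order they are two numeric vectors, one logical vector and one character vector containing NA.

```r
# Illustrative values only (assumed for this sketch)
c(2, 3, 5, 7, 1)                       # numeric, built with c()
3:10                                   # numeric: the integers 3, 4, ..., 10
c(TRUE, FALSE, FALSE, TRUE)            # logical
c("London", "Leeds", "New York", NA)   # character, with a missing value
```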

Vectors may have different modes such as logical, numeric or character. The first two vectors above are numeric, the third is logical (i.e. a vector with elements of mode logical), and the fourth is a string vector (i.e. a vector with elements of mode character). The missing value symbol, which is NA, can be included as an element of a vector.

The c in c(2, 3, 5, 7, 1) above is an abbreviation for concatenate, i.e. the meaning is: join these numbers together into a vector. Existing vectors may be included among the elements that are to be concatenated. In the following code, we form vectors x and y (overwriting any x and y that were defined earlier), which we then concatenate to form a vector z:
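Assignments consistent with the output that follows might look like this. The values are read off that output; the exact original calls are an assumption.

```r
x <- c(2, 3, 5, 2, 7, 1)   # a numeric vector
x
y <- c(10, 15, 12)
y
z <- c(x, y)               # existing vectors can be arguments to c()
z
```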

## [1] 2 3 5 2 7 1
## [1] 10 15 12
## [1]  2  3  5  2  7  1 10 15 12

The concatenate function c() may also be used to join lists.
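As a small illustration (the list contents and the names l1, l2 and l3 are invented for this sketch), c() applied to two lists returns a single, longer list:

```r
l1 <- list(1, "a")   # a list with a number and a string
l2 <- list(TRUE)     # a list with one logical element
l3 <- c(l1, l2)      # joined into one list
length(l3)           # 3
class(l3)            # "list"
```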

Vectors can be subsetted. There are two common ways to extract subsets from vectors. Note in both cases, the use of the square brackets [ ].

  1. Specify or index the elements that are to be extracted, e.g.
## [1] 3 2

Note that negative numbers can be used to omit specific vector elements:

## [1] 2 2 7 1

  2. Specify a vector of logical values to select elements. The elements that are extracted are those for which the logical value is TRUE. Thus suppose we want to extract values of x that are greater than 4.
## [1] 5 7

Examine the logical selection:

## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
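
The results above are consistent with calls such as the following. This is a sketch that assumes x holds the values 2 3 5 2 7 1 printed earlier, with the index values inferred from the printed results.

```r
x <- c(2, 3, 5, 2, 7, 1)   # assumed to match the x shown earlier

x[c(2, 4)]     # elements at positions 2 and 4:  3 2
x[-c(2, 3)]    # everything except positions 2 and 3:  2 2 7 1
x[x > 4]       # elements for which x > 4 is TRUE:  5 7
x > 4          # the logical vector that drives the selection
```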

The relations that may be used in the extraction of subsets of vectors are <, <=, >, >=, == and !=. The first four compare magnitudes, == tests for equality, and != tests for inequality.
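
A brief sketch of some of these relations in use, again assuming the same illustrative x:

```r
x <- c(2, 3, 5, 2, 7, 1)   # assumed to match the x shown earlier

x[x == 2]    # elements equal to 2:               2 2
x[x != 2]    # elements not equal to 2:           3 5 7 1
x[x <= 3]    # elements less than or equal to 3:  2 3 2 1
```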

Matrices and Data Frames

The fundamental difference between a matrix and a data.frame is that a matrix can only contain a single data type (numeric, logical, character, etc.), whereas a data frame can have a different type of data in each column, with all elements of any one column being of the same type, i.e. all numeric, all factors, all logical, all character, etc.

Matrices are easy to define:
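
Calls along the following lines reproduce the three matrices printed below. The argument choices are inferred from that output rather than taken from the original listing.

```r
matrix(1:10, ncol = 2)                 # filled column by column (the default)
matrix(1:10, ncol = 2, byrow = TRUE)   # filled row by row
matrix(letters[1:10], ncol = 2)        # a character matrix
```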

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
##      [,1] [,2]
## [1,] "a"  "f" 
## [2,] "b"  "g" 
## [3,] "c"  "h" 
## [4,] "d"  "i" 
## [5,] "e"  "j"

Many R packages come with datasets. The iris dataset is an internal R dataset and is loaded to your R session with the code below.
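
A call of the usual form, assuming the copy of iris shipped in R's datasets package:

```r
data(iris)   # makes the built-in iris data available in the workspace
```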

This is a data.frame:
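
One way to confirm this, consistent with the output below (the exact original call is an assumption):

```r
class(iris)   # reports the class of the object
```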

## [1] "data.frame"

The code below uses the head() function to print out the first 6 rows and the dim() function to tell us the dimensions of iris.
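
A sketch of the two calls described, using the default arguments:

```r
head(iris)   # first 6 rows (the default n = 6)
dim(iris)    # number of rows, then number of columns
```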

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## [1] 150   5

The str() function can be used to indicate the formats of the attributes (columns, fields) in iris:
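
Applied to iris, the call is simply:

```r
str(iris)   # compact display of each column's type and first few values
```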

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here we can see that 4 of the attributes are numeric and the other is a factor (a categorical variable with a fixed set of levels, here the three species).

The summary() function is also very useful and shows different summaries of the individual attributes (columns) in iris.
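
For the output shown below, the call would be:

```r
summary(iris)   # quartiles and mean for numeric columns, counts for factors
```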

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The main R graphics function is plot(). When it is applied to a data frame it draws a grid of pairwise scatterplots showing how the attribute values relate to each other (pairs() produces the same display for a matrix). There are various other helpful forms of graphical summary. The scatterplot matrix shown in Figure 2.2 is of the first 4 fields (columns) in iris. Note how the inclusion of upper.panel=panel.smooth causes lowess curves to be added to Figure 2.2.
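
A call of the following form matches that description. It is a sketch: any further graphical arguments used for the original figure are not known.

```r
data(iris)
# Scatterplot matrix of the four numeric columns; panel.smooth (from the
# graphics package) adds a lowess curve to each panel above the diagonal.
plot(iris[, 1:4], upper.panel = panel.smooth)
```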


Figure 2.2: A plot of the numeric variables in the iris data.

The individual data types can also be investigated using the sapply() function. Given a data frame, it applies a function to each column in turn:
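
For example, assuming class() is the function of interest (the function used in the original listing is not shown):

```r
sapply(iris, class)   # one class name per column, as a named character vector
```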

A key property of a data.frame is that its columns can be vectors of any type. It is effectively a list (group) of column vectors, all of equal length.
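
This list-like behaviour can be checked on a small example. The data frame df and its columns id, name and ok are hypothetical, made up for this sketch.

```r
# Hypothetical data frame with three columns of different types
df <- data.frame(id   = 1:3,
                 name = c("a", "b", "c"),
                 ok   = c(TRUE, FALSE, TRUE))

is.list(df)          # TRUE: a data frame is a list of columns
sapply(df, length)   # every column has the same length (3)
df$name              # a column is extracted like a list element
```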

Further Data types

There are many more data types in R. Chapter 1 in Comber and Brunsdon (2021) provides a brief introduction to some of the important ones, and Chapter 2 in Brunsdon and Comber (2018) provides a comprehensive overview with worked examples and exercises.