Life Expectancy Data from Kaggle v1

The dataset on life expectancy and health factors for 193 countries was
collected from the WHO data repository website, and the corresponding
economic data was collected from the United Nations website, with the
help of Deeksha Russell and Duan Wang.

Purpose of the project :

  • Understand the relationship between “Life Expectancy” and the other
    variables, based on historical data.
  • Learn to use a linear regression model to predict “Life Expectancy”
    based on the dataset.

Explanation on “Life Expectancy” data :

  • Country = Country Observed.
  • Year = Year Observed.
  • Status = Developed or Developing status.
  • Life.expectancy = Life Expectancy in years.
  • Adult.Mortality = Adult Mortality Rates on both
    sexes (probability of dying between 15-60 years/1000 population).
  • infant.deaths = Number of Infant Deaths per 1000
    population.
  • Alcohol = Alcohol recorded per capita (15+)
    consumption (in litres of pure alcohol).
  • percentage.expenditure = Expenditure on health as a
    percentage of Gross Domestic Product per capita(%).
  • Hepatitis.B = Hepatitis B (HepB) immunization
    coverage among 1-year-olds (%).
  • Measles = Number of reported Measles cases per 1000
    population.
  • BMI = Average Body Mass Index of entire
    population.
  • under.five.deaths = Number of under-five deaths per
    1000 population.
  • Polio = Polio (Pol3) immunization coverage among
    1-year-olds (%).
  • Total.expenditure = General government expenditure
    on health as a percentage of total government expenditure (%).
  • Diphtheria = Diphtheria tetanus toxoid and
    pertussis (DTP3) immunization coverage among 1-year-olds (%).
  • HIV.AIDS = Deaths per 1,000 live births due to
    HIV/AIDS (0-4 years).
  • GDP = Gross Domestic Product per capita (in
    USD).
  • Population = Population of the country.
  • thinness..1.19.years = Prevalence of thinness among
    children and adolescents for Age 10 to 19 (%).
  • thinness.5.9.years = Prevalence of thinness among
    children for Age 5 to 9 (%).
  • Income.composition.of.resources = Human Development
    Index in terms of income composition of resources (index ranging from 0
    to 1).
  • Schooling = Number of years of Schooling (years).

Load required libraries.

# Load libraries
library(caret)
library(GGally)
library(car)
library(lmtest)
library(rmarkdown)
library(dplyr) 

options(scipen = 100, max.print = 1e+06)

Load the dataset.

# Load data
le <- read.csv("assets/le.csv")

# Show data as table
paged_table(le)

Check the structure of the data frame

# Check structure
le %>% glimpse()
## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afghani…
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010, 20…
## $ Status                          <chr> "Developing", "Developing", "Developin…
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58…
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287…
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84…
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.…
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18…
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64…
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861…
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16…
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113…
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,…
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.…
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58…
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1…
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 669.9…
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2…
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18…
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18…
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4…
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8…
# Update to categorical
le <- le %>% 
  mutate_at(vars(Country, Year, Status), as.factor)

Missing (NA) values in our data frame

# Check proportion of missing data
table(is.na(le))
## 
## FALSE  TRUE 
## 62073  2563
le <- le %>% na.omit()
le %>% is.na() %>% colSums()
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                               0                               0 
##                         Alcohol          percentage.expenditure 
##                               0                               0 
##                     Hepatitis.B                         Measles 
##                               0                               0 
##                             BMI               under.five.deaths 
##                               0                               0 
##                           Polio               Total.expenditure 
##                               0                               0 
##                      Diphtheria                        HIV.AIDS 
##                               0                               0 
##                             GDP                      Population 
##                               0                               0 
##            thinness..1.19.years              thinness.5.9.years 
##                               0                               0 
## Income.composition.of.resources                       Schooling 
##                               0                               0

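Only about 4% of the cells in the data are missing (2,563 of 64,636), so
the rows that contain them are deleted. Because na.omit() removes entire
rows, this reduces the data from 2,938 to 1,649 observations.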
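
The 4% figure counts missing cells, while `na.omit()` drops every row that has at least one missing cell, so the share of rows lost can be much larger. A minimal sketch of the two measures on a small made-up data frame (the `toy` data and its values are illustrative assumptions, not taken from the dataset):

```r
# Toy data frame with three missing cells (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, 2, 3, 4, 5),
  c = c(1, 2, 3, NA, 5)
)

# Share of missing cells: the quantity summarised by table(is.na(...))
cell_share <- mean(is.na(toy))

# Share of rows with at least one missing cell: what na.omit() removes
row_share <- mean(!complete.cases(toy))

cell_share  # 0.2
row_share   # 0.6
```

Applied to the data here, `mean(!complete.cases(le))` evaluated before the `na.omit()` call would give the share of rows dropped.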

Take a look at the data summary

le %>% summary()
##         Country          Year            Status     Life.expectancy
##  Afghanistan:  16   2014   :131   Developed : 242   Min.   :44.0   
##  Albania    :  16   2011   :130   Developing:1407   1st Qu.:64.4   
##  Armenia    :  15   2013   :130                     Median :71.7   
##  Austria    :  15   2012   :129                     Mean   :69.3   
##  Belarus    :  15   2010   :128                     3rd Qu.:75.0   
##  Belgium    :  15   2009   :126                     Max.   :89.0   
##  (Other)    :1557   (Other):875                                    
##  Adult.Mortality infant.deaths        Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 77.0   1st Qu.:   1.00   1st Qu.: 0.810   1st Qu.:   37.44      
##  Median :148.0   Median :   3.00   Median : 3.790   Median :  145.10      
##  Mean   :168.2   Mean   :  32.55   Mean   : 4.533   Mean   :  698.97      
##  3rd Qu.:227.0   3rd Qu.:  22.00   3rd Qu.: 7.340   3rd Qu.:  509.39      
##  Max.   :723.0   Max.   :1600.00   Max.   :17.870   Max.   :18961.35      
##                                                                           
##   Hepatitis.B       Measles            BMI        under.five.deaths
##  Min.   : 2.00   Min.   :     0   Min.   : 2.00   Min.   :   0.00  
##  1st Qu.:74.00   1st Qu.:     0   1st Qu.:19.50   1st Qu.:   1.00  
##  Median :89.00   Median :    15   Median :43.70   Median :   4.00  
##  Mean   :79.22   Mean   :  2224   Mean   :38.13   Mean   :  44.22  
##  3rd Qu.:96.00   3rd Qu.:   373   3rd Qu.:55.80   3rd Qu.:  29.00  
##  Max.   :99.00   Max.   :131441   Max.   :77.10   Max.   :2100.00  
##                                                                    
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.740    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:81.00   1st Qu.: 4.410    1st Qu.:82.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.840    Median :92.00   Median : 0.100  
##  Mean   :83.56   Mean   : 5.956    Mean   :84.16   Mean   : 1.984  
##  3rd Qu.:97.00   3rd Qu.: 7.470    3rd Qu.:97.00   3rd Qu.: 0.700  
##  Max.   :99.00   Max.   :14.390    Max.   :99.00   Max.   :50.600  
##                                                                    
##       GDP              Population         thinness..1.19.years
##  Min.   :     1.68   Min.   :        34   Min.   : 0.100      
##  1st Qu.:   462.15   1st Qu.:    191897   1st Qu.: 1.600      
##  Median :  1592.57   Median :   1419631   Median : 3.000      
##  Mean   :  5566.03   Mean   :  14653626   Mean   : 4.851      
##  3rd Qu.:  4718.51   3rd Qu.:   7658972   3rd Qu.: 7.100      
##  Max.   :119172.74   Max.   :1293859294   Max.   :27.200      
##                                                               
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.100     Min.   :0.0000                  Min.   : 4.20  
##  1st Qu.: 1.700     1st Qu.:0.5090                  1st Qu.:10.30  
##  Median : 3.200     Median :0.6730                  Median :12.30  
##  Mean   : 4.908     Mean   :0.6316                  Mean   :12.12  
##  3rd Qu.: 7.100     3rd Qu.:0.7510                  3rd Qu.:14.00  
##  Max.   :28.200     Max.   :0.9360                  Max.   :20.70  
## 

EDA (exploratory data analysis) is the phase in which we explore the
variables to find patterns and insights in each of them. It also lets us
identify correlations between variables.

Check target data distribution

boxplot(le$Life.expectancy, ylab = "Life Expectancy (Age)") 


💡 Insight :

  • “Life.expectancy” has many outlier values. Remember : Regression
    models can be sensitive to outlier values.
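
To list the points that a boxplot draws as outliers, rather than only seeing them, the 1.5 × IQR rule can be applied directly. A minimal sketch on a made-up vector (`x` and its values are illustrative assumptions); `boxplot.stats()` is the base R function that `boxplot()` relies on for its whiskers and outliers:

```r
# Made-up values standing in for a life-expectancy column
x <- c(44, 52, 63, 65, 68, 70, 71, 72, 73, 75, 76, 78, 80)

# Outliers as flagged by the boxplot rule (1.5 times the hinge spread)
out_values <- boxplot.stats(x)$out

# The same rule written out by hand, using the hinges from fivenum()
hinges <- fivenum(x)[c(2, 4)]
lower  <- hinges[1] - 1.5 * diff(hinges)
upper  <- hinges[2] + 1.5 * diff(hinges)
out_manual <- x[x < lower | x > upper]

out_values  # 44
```

On the data here, the equivalent call would be `boxplot.stats(le$Life.expectancy)$out`.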

Check the correlation between variables

ggcorr(le,
       label = TRUE,
       label_size = 3,
       hjust = 1,
       layout.exp = 10)

💡 Insight :

  • “Schooling” and “Income.composition.of.resources” are the predictors
    most correlated with “Life.expectancy”. On the other hand,
    “Life.expectancy” is negatively correlated with “Adult.Mortality”. This
    makes sense: when the adult mortality rate is high, life expectancy is
    low.
  • “Life.expectancy” has weak correlation with “Population”, “Measles”
    and “infant.deaths”.
  • There are 3 pairs of variables with strong correlation :
    • “thinness.5.9.years” and “thinness..1.19.years”.
    • “GDP” and “percentage.expenditure”.
    • “infant.deaths” and “under.five.deaths”.
  • The correlation within each of these pairs is so strong that the two
    variables are essentially measuring the same underlying concept, which
    indicates multicollinearity.
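
Strongly correlated predictor pairs can also be listed programmatically instead of being read off the plot. A minimal sketch in base R on simulated data; the variable names `x1`, `x2`, `x3`, the helper `strong_pairs()` and the 0.9 cut-off are all illustrative assumptions:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
sim <- data.frame(x1, x2, x3)

# Hypothetical helper: return variable pairs whose |correlation| exceeds cutoff
strong_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df, use = "pairwise.complete.obs")
  cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
}

pairs_found <- strong_pairs(sim)
pairs_found
```

For the data here, the helper would be applied to the numeric columns only, for example `strong_pairs(le[sapply(le, is.numeric)])`.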

Check levels on categorical variables

# Country variable
levels(le$Country)
##   [1] "Afghanistan"                                         
##   [2] "Albania"                                             
##   [3] "Algeria"                                             
##   [4] "Angola"                                              
##   [5] "Antigua and Barbuda"                                 
##   [6] "Argentina"                                           
##   [7] "Armenia"                                             
##   [8] "Australia"                                           
##   [9] "Austria"                                             
##  [10] "Azerbaijan"                                          
##  [11] "Bahamas"                                             
##  [12] "Bahrain"                                             
##  [13] "Bangladesh"                                          
##  [14] "Barbados"                                            
##  [15] "Belarus"                                             
##  [16] "Belgium"                                             
##  [17] "Belize"                                              
##  [18] "Benin"                                               
##  [19] "Bhutan"                                              
##  [20] "Bolivia (Plurinational State of)"                    
##  [21] "Bosnia and Herzegovina"                              
##  [22] "Botswana"                                            
##  [23] "Brazil"                                              
##  [24] "Brunei Darussalam"                                   
##  [25] "Bulgaria"                                            
##  [26] "Burkina Faso"                                        
##  [27] "Burundi"                                             
##  [28] "Cabo Verde"                                          
##  [29] "Cambodia"                                            
##  [30] "Cameroon"                                            
##  [31] "Canada"                                              
##  [32] "Central African Republic"                            
##  [33] "Chad"                                                
##  [34] "Chile"                                               
##  [35] "China"                                               
##  [36] "Colombia"                                            
##  [37] "Comoros"                                             
##  [38] "Congo"                                               
##  [39] "Cook Islands"                                        
##  [40] "Costa Rica"                                          
##  [41] "Côte d'Ivoire"                                       
##  [42] "Croatia"                                             
##  [43] "Cuba"                                                
##  [44] "Cyprus"                                              
##  [45] "Czechia"                                             
##  [46] "Democratic People's Republic of Korea"               
##  [47] "Democratic Republic of the Congo"                    
##  [48] "Denmark"                                             
##  [49] "Djibouti"                                            
##  [50] "Dominica"                                            
##  [51] "Dominican Republic"                                  
##  [52] "Ecuador"                                             
##  [53] "Egypt"                                               
##  [54] "El Salvador"                                         
##  [55] "Equatorial Guinea"                                   
##  [56] "Eritrea"                                             
##  [57] "Estonia"                                             
##  [58] "Ethiopia"                                            
##  [59] "Fiji"                                                
##  [60] "Finland"                                             
##  [61] "France"                                              
##  [62] "Gabon"                                               
##  [63] "Gambia"                                              
##  [64] "Georgia"                                             
##  [65] "Germany"                                             
##  [66] "Ghana"                                               
##  [67] "Greece"                                              
##  [68] "Grenada"                                             
##  [69] "Guatemala"                                           
##  [70] "Guinea"                                              
##  [71] "Guinea-Bissau"                                       
##  [72] "Guyana"                                              
##  [73] "Haiti"                                               
##  [74] "Honduras"                                            
##  [75] "Hungary"                                             
##  [76] "Iceland"                                             
##  [77] "India"                                               
##  [78] "Indonesia"                                           
##  [79] "Iran (Islamic Republic of)"                          
##  [80] "Iraq"                                                
##  [81] "Ireland"                                             
##  [82] "Israel"                                              
##  [83] "Italy"                                               
##  [84] "Jamaica"                                             
##  [85] "Japan"                                               
##  [86] "Jordan"                                              
##  [87] "Kazakhstan"                                          
##  [88] "Kenya"                                               
##  [89] "Kiribati"                                            
##  [90] "Kuwait"                                              
##  [91] "Kyrgyzstan"                                          
##  [92] "Lao People's Democratic Republic"                    
##  [93] "Latvia"                                              
##  [94] "Lebanon"                                             
##  [95] "Lesotho"                                             
##  [96] "Liberia"                                             
##  [97] "Libya"                                               
##  [98] "Lithuania"                                           
##  [99] "Luxembourg"                                          
## [100] "Madagascar"                                          
## [101] "Malawi"                                              
## [102] "Malaysia"                                            
## [103] "Maldives"                                            
## [104] "Mali"                                                
## [105] "Malta"                                               
## [106] "Marshall Islands"                                    
## [107] "Mauritania"                                          
## [108] "Mauritius"                                           
## [109] "Mexico"                                              
## [110] "Micronesia (Federated States of)"                    
## [111] "Monaco"                                              
## [112] "Mongolia"                                            
## [113] "Montenegro"                                          
## [114] "Morocco"                                             
## [115] "Mozambique"                                          
## [116] "Myanmar"                                             
## [117] "Namibia"                                             
## [118] "Nauru"                                               
## [119] "Nepal"                                               
## [120] "Netherlands"                                         
## [121] "New Zealand"                                         
## [122] "Nicaragua"                                           
## [123] "Niger"                                               
## [124] "Nigeria"                                             
## [125] "Niue"                                                
## [126] "Norway"                                              
## [127] "Oman"                                                
## [128] "Pakistan"                                            
## [129] "Palau"                                               
## [130] "Panama"                                              
## [131] "Papua New Guinea"                                    
## [132] "Paraguay"                                            
## [133] "Peru"                                                
## [134] "Philippines"                                         
## [135] "Poland"                                              
## [136] "Portugal"                                            
## [137] "Qatar"                                               
## [138] "Republic of Korea"                                   
## [139] "Republic of Moldova"                                 
## [140] "Romania"                                             
## [141] "Russian Federation"                                  
## [142] "Rwanda"                                              
## [143] "Saint Kitts and Nevis"                               
## [144] "Saint Lucia"                                         
## [145] "Saint Vincent and the Grenadines"                    
## [146] "Samoa"                                               
## [147] "San Marino"                                          
## [148] "Sao Tome and Principe"                               
## [149] "Saudi Arabia"                                        
## [150] "Senegal"                                             
## [151] "Serbia"                                              
## [152] "Seychelles"                                          
## [153] "Sierra Leone"                                        
## [154] "Singapore"                                           
## [155] "Slovakia"                                            
## [156] "Slovenia"                                            
## [157] "Solomon Islands"                                     
## [158] "Somalia"                                             
## [159] "South Africa"                                        
## [160] "South Sudan"                                         
## [161] "Spain"                                               
## [162] "Sri Lanka"                                           
## [163] "Sudan"                                               
## [164] "Suriname"                                            
## [165] "Swaziland"                                           
## [166] "Sweden"                                              
## [167] "Switzerland"                                         
## [168] "Syrian Arab Republic"                                
## [169] "Tajikistan"                                          
## [170] "Thailand"                                            
## [171] "The former Yugoslav republic of Macedonia"           
## [172] "Timor-Leste"                                         
## [173] "Togo"                                                
## [174] "Tonga"                                               
## [175] "Trinidad and Tobago"                                 
## [176] "Tunisia"                                             
## [177] "Turkey"                                              
## [178] "Turkmenistan"                                        
## [179] "Tuvalu"                                              
## [180] "Uganda"                                              
## [181] "Ukraine"                                             
## [182] "United Arab Emirates"                                
## [183] "United Kingdom of Great Britain and Northern Ireland"
## [184] "United Republic of Tanzania"                         
## [185] "United States of America"                            
## [186] "Uruguay"                                             
## [187] "Uzbekistan"                                          
## [188] "Vanuatu"                                             
## [189] "Venezuela (Bolivarian Republic of)"                  
## [190] "Viet Nam"                                            
## [191] "Yemen"                                               
## [192] "Zambia"                                              
## [193] "Zimbabwe"
# Year variable
levels(le$Year)
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015"
# Status variable
levels(le$Status)
## [1] "Developed"  "Developing"

💡 Insight :

  • Country has many levels and doesn’t give valuable information for
    predicting “Life.expectancy”.
  • Year is time-series information. It is not suitable as a predictor
    of “Life.expectancy”.
  • Status has 2 levels and is suitable as a categorical predictor of
    “Life.expectancy”.

A glimpse of the vaccination information in the data

# Subset and see the vaccination
le_vaccination <- le %>% 
  select(c(Hepatitis.B,
           Polio,
           Diphtheria))
# Check range for each variable
summary(le_vaccination)
##   Hepatitis.B        Polio         Diphtheria   
##  Min.   : 2.00   Min.   : 3.00   Min.   : 2.00  
##  1st Qu.:74.00   1st Qu.:81.00   1st Qu.:82.00  
##  Median :89.00   Median :93.00   Median :92.00  
##  Mean   :79.22   Mean   :83.56   Mean   :84.16  
##  3rd Qu.:96.00   3rd Qu.:97.00   3rd Qu.:97.00  
##  Max.   :99.00   Max.   :99.00   Max.   :99.00

💡 Insight :

  • For all three variables, the range between the minimum value and the
    1st quartile is very wide. Therefore, these variables should be
    adjusted.
  • We can use the Global Vaccine Action Plan statement to convert these
    variables into categorical variables with two levels, “< 90% Covered”
    and “>= 90% Covered”. The purpose is to get a better view of the
    impact of immunization on “Life.expectancy”.

Feature Engineering

Based on the summary above, we need to :

  • Remove one variable from each strongly correlated pair =
    “thinness..1.19.years”, “GDP” and “infant.deaths”.
  • Remove variables that are not informative or suitable = “Country” and
    “Year”.
  • Convert to categorical data type = “Hepatitis.B”, “Polio” and
    “Diphtheria”.
  • Remove outliers on “Life.expectancy”.

# Drop unsuitable and collinear variables, then bin vaccination coverage
le_clean <- le %>% 
  select(-Country, -Year, -infant.deaths, -GDP, -thinness..1.19.years) %>% 
  mutate(Hepatitis.B = ifelse(Hepatitis.B < 90, "< 90% Covered", ">= 90% Covered"),
         Polio = ifelse(Polio < 90, "< 90% Covered", ">= 90% Covered"),
         Diphtheria = ifelse(Diphtheria < 90, "< 90% Covered", ">= 90% Covered"),
         Hepatitis.B = as.factor(Hepatitis.B),
         Polio = as.factor(Polio),
         Diphtheria = as.factor(Diphtheria))

# Remove outliers: keep only observations with life expectancy above 50
le_clean <- le_clean[le_clean$Life.expectancy > 50, ]

Check the structure of the new data frame

le_clean %>% glimpse()
## Rows: 1,590
## Columns: 17
## $ Status                          <fct> Developing, Developing, Developing, De…
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58…
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287…
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.…
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18…
## $ Hepatitis.B                     <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861…
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16…
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113…
## $ Polio                           <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.…
## $ Diphtheria                      <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1…
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2…
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18…
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4…
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8…

Check “Life.expectancy” again after removing outliers

boxplot(le_clean$Life.expectancy, ylab = "Life Expectancy (Age)")
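
The model in the next step is fitted on `le_train`, but the chunk that creates it is not shown. The model summary reports 1,255 residual degrees of freedom and 17 coefficients, i.e. 1,272 observations, which is 80% of the 1,590 rows in `le_clean`, so an 80/20 split looks likely. A minimal sketch of such a split, where the helper `split_train_test()`, the seed value and the name `le_test` are assumptions rather than the original code:

```r
# Hypothetical helper: random split into training and test sets
split_train_test <- function(df, prop = 0.8, seed = 123) {
  set.seed(seed)                                   # arbitrary seed (assumption)
  idx <- sample(nrow(df), size = round(prop * nrow(df)))
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# Demonstration on a toy data frame with as many rows as le_clean
toy   <- data.frame(id = seq_len(1590), y = rnorm(1590))
parts <- split_train_test(toy)
nrow(parts$train)  # 1272
nrow(parts$test)   # 318

# With the real data this would be:
# parts    <- split_train_test(le_clean)
# le_train <- parts$train
# le_test  <- parts$test
```

Because the original seed and sampling method are unknown, a split made this way will not reproduce the exact coefficients reported in this document.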

Multiple predictors

We create a model using “Life.expectancy” as the target variable.

# Create model
le_model <- lm(formula = Life.expectancy ~ .,
               data = le_train)
# Model summary
summary(le_model)
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = le_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6497  -1.9303  -0.0174   2.3692  11.7388 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     55.805745917713  0.882964812003  63.203
## StatusDeveloping                -0.918239525132  0.377137757239  -2.435
## Adult.Mortality                 -0.017419078387  0.001189654093 -14.642
## Alcohol                         -0.085254722512  0.036409231465  -2.342
## percentage.expenditure           0.000414862898  0.000067217776   6.172
## Hepatitis.B>= 90% Covered       -0.831983940017  0.348766970892  -2.386
## Measles                          0.000026844691  0.000012230715   2.195
## BMI                              0.034367794254  0.006465365766   5.316
## under.five.deaths               -0.003140719911  0.000937012759  -3.352
## Polio>= 90% Covered              0.063306758025  0.473798530450   0.134
## Total.expenditure                0.103002083684  0.045706670500   2.254
## Diphtheria>= 90% Covered         1.119748320645  0.525159100317   2.132
## HIV.AIDS                        -0.556403937803  0.033568046892 -16.575
## Population                       0.000000001844  0.000000001767   1.043
## thinness.5.9.years              -0.013148223109  0.029056477496  -0.453
## Income.composition.of.resources  9.398409439085  0.876786690379  10.719
## Schooling                        0.860327636148  0.065550198898  13.125
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                            0.015040 *  
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                     0.019359 *  
## percentage.expenditure                0.000000000909 ***
## Hepatitis.B>= 90% Covered                   0.017204 *  
## Measles                                     0.028356 *  
## BMI                                   0.000000125660 ***
## under.five.deaths                           0.000827 ***
## Polio>= 90% Covered                         0.893728    
## Total.expenditure                           0.024397 *  
## Diphtheria>= 90% Covered                    0.033183 *  
## HIV.AIDS                        < 0.0000000000000002 ***
## Population                                  0.297081    
## thinness.5.9.years                          0.650983    
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.49 on 1255 degrees of freedom
## Multiple R-squared:  0.7969, Adjusted R-squared:  0.7943 
## F-statistic: 307.7 on 16 and 1255 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Adj. R-squared value is 79.4%, so roughly a fifth of the variance in
    “Life.expectancy” remains unexplained and the model can still be
    improved.
  • Significant predictors : Most of the predictors are significant. Only
    “Polio”, “Population” and “thinness.5.9.years” aren’t significant to
    the target.
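
The adjusted R-squared penalises the plain R-squared for the number of predictors: adj R² = 1 - (1 - R²)(n - 1)/(n - p - 1). As a worked check, the sketch below recomputes it from the figures printed in the model summary; the variable names are illustrative:

```r
# Values read from the model summary
r2 <- 0.7969       # Multiple R-squared
n  <- 1255 + 17    # residual degrees of freedom + estimated coefficients
p  <- 16           # number of predictor terms (numerator df of the F-statistic)

# Adjusted R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.7943
```

With 1,272 observations and only 16 predictor terms the penalty is small, which is why the two values differ only in the third decimal place.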

Stepwise method
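
`step()` ranks candidate models by AIC. For an `lm` fit, the value it prints comes from `extractAIC()`, which computes n·log(RSS/n) + 2k, where n is the number of observations and k the number of estimated coefficients. As a check, the sketch below recomputes the starting AIC of the backward search from numbers printed in this document (RSS 15290 from the `<none>` row of the first step, n = 1272 and k = 17 from the model summary); the variable names are illustrative:

```r
# Values read from the step() output and the model summary
rss <- 15290   # residual sum of squares of the full model
n   <- 1272    # observations in the training data
k   <- 17      # estimated coefficients, including the intercept

# AIC as reported by step() for a linear model
aic <- n * log(rss / n) + 2 * k
round(aic)     # 3197
```

This value differs from what `AIC()` returns by a constant that depends only on n, so the ranking of models fitted to the same data is unaffected.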

Create a model with no predictors (intercept only)

le_none <- lm(formula = Life.expectancy ~ 1,
               data = le_train)

Create the backward stepwise model

# Backward
le_backward <- step(le_model, direction = "backward")
## Start:  AIC=3197
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Polio + 
##     Total.expenditure + Diphtheria + HIV.AIDS + Population + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1       0.2 15291 3195.0
## - thinness.5.9.years               1       2.5 15293 3195.2
## - Population                       1      13.3 15304 3196.1
## <none>                                         15290 3197.0
## - Diphtheria                       1      55.4 15346 3199.6
## - Measles                          1      58.7 15349 3199.9
## - Total.expenditure                1      61.9 15352 3200.1
## - Alcohol                          1      66.8 15357 3200.5
## - Hepatitis.B                      1      69.3 15360 3200.8
## - Status                           1      72.2 15363 3201.0
## - under.five.deaths                1     136.9 15427 3206.3
## - BMI                              1     344.3 15635 3223.3
## - percentage.expenditure           1     464.1 15754 3233.0
## - Income.composition.of.resources  1    1399.9 16690 3306.4
## - Schooling                        1    2098.7 17389 3358.6
## - Adult.Mortality                  1    2612.1 17902 3395.6
## - HIV.AIDS                         1    3347.4 18638 3446.8
## 
## Step:  AIC=3195.01
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Population + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5.9.years               1       2.5 15293 3193.2
## - Population                       1      13.3 15304 3194.1
## <none>                                         15291 3195.0
## - Measles                          1      58.5 15349 3197.9
## - Total.expenditure                1      61.8 15352 3198.1
## - Alcohol                          1      66.7 15357 3198.5
## - Hepatitis.B                      1      69.8 15360 3198.8
## - Status                           1      72.4 15363 3199.0
## - Diphtheria                       1     117.4 15408 3202.7
## - under.five.deaths                1     137.1 15428 3204.4
## - BMI                              1     344.1 15635 3221.3
## - percentage.expenditure           1     463.9 15754 3231.0
## - Income.composition.of.resources  1    1400.9 16692 3304.5
## - Schooling                        1    2104.9 17395 3357.1
## - Adult.Mortality                  1    2613.2 17904 3393.7
## - HIV.AIDS                         1    3372.2 18663 3446.5
## 
## Step:  AIC=3193.22
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Population + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1      12.8 15306 3192.3
## <none>                                         15293 3193.2
## - Measles                          1      62.5 15356 3196.4
## - Total.expenditure                1      64.0 15357 3196.5
## - Alcohol                          1      64.3 15357 3196.6
## - Hepatitis.B                      1      71.1 15364 3197.1
## - Status                           1      72.2 15365 3197.2
## - Diphtheria                       1     116.9 15410 3200.9
## - under.five.deaths                1     171.7 15465 3205.4
## - BMI                              1     411.2 15704 3225.0
## - percentage.expenditure           1     464.3 15757 3229.3
## - Income.composition.of.resources  1    1413.2 16706 3303.6
## - Schooling                        1    2115.2 17408 3356.0
## - Adult.Mortality                  1    2619.9 17913 3392.4
## - HIV.AIDS                         1    3436.4 18729 3449.1
## 
## Step:  AIC=3192.28
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         15306 3192.3
## - Measles                          1      62.7 15369 3195.5
## - Total.expenditure                1      62.9 15369 3195.5
## - Alcohol                          1      65.8 15372 3195.7
## - Status                           1      71.2 15377 3196.2
## - Hepatitis.B                      1      71.6 15377 3196.2
## - Diphtheria                       1     119.1 15425 3200.1
## - under.five.deaths                1     183.2 15489 3205.4
## - BMI                              1     414.2 15720 3224.2
## - percentage.expenditure           1     464.9 15771 3228.3
## - Income.composition.of.resources  1    1414.7 16721 3302.7
## - Schooling                        1    2151.1 17457 3357.6
## - Adult.Mortality                  1    2618.7 17924 3391.2
## - HIV.AIDS                         1    3447.7 18754 3448.7
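
The AIC column in the trace can be reproduced by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below plugs in the rounded values from the first two tables; n = 1272 is inferred from the 1255 residual degrees of freedom plus 17 coefficients, and the helper name step_aic is made up for this illustration.

```r
# Sample size inferred from the full model: 1255 residual DF + 17 coefficients
n <- 1272

# AIC as used by step() for linear models (constant terms are dropped)
step_aic <- function(rss, edf, n) n * log(rss / n) + 2 * edf

# Full model: 16 predictor terms + intercept, RSS from the <none> row
aic_full <- step_aic(rss = 15290, edf = 17, n = n)

# Model without Polio: one coefficient fewer, RSS from the "- Polio" row
aic_no_polio <- step_aic(rss = 15291, edf = 16, n = n)

c(full = aic_full, no_polio = aic_no_polio)
```

Both values should land within rounding of the 3197 and 3195.01 printed in the trace. Dropping Polio barely changes the RSS but saves one parameter, so the AIC falls by about 2, which is why it is removed first.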

Create forward step wise model

le_forward <- step(le_none, scope = list(lower = le_none, upper = le_model), direction = "forward")
## Start:  AIC=5192.52
## Life.expectancy ~ 1
## 
##                                   Df Sum of Sq   RSS    AIC
## + Schooling                        1     41020 34258 4193.1
## + Income.composition.of.resources  1     38051 37226 4298.8
## + Adult.Mortality                  1     32200 43078 4484.5
## + BMI                              1     20882 54396 4781.2
## + HIV.AIDS                         1     20165 55113 4797.9
## + Status                           1     16525 58753 4879.3
## + thinness.5.9.years               1     15837 59441 4894.1
## + percentage.expenditure           1     14391 60887 4924.6
## + Alcohol                          1     14016 61261 4932.4
## + Diphtheria                       1     13255 62023 4948.2
## + Polio                            1     13153 62125 4950.2
## + Hepatitis.B                      1      5836 69442 5091.9
## + Total.expenditure                1      3848 71430 5127.8
## + under.five.deaths                1      3053 72225 5141.9
## + Measles                          1       731 74547 5182.1
## <none>                                         75278 5192.5
## + Population                       1        87 75191 5193.0
## 
## Step:  AIC=4193.11
## Life.expectancy ~ Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## + Adult.Mortality                  1   11389.8 22868 3681.0
## + HIV.AIDS                         1   11286.6 22971 3686.7
## + Income.composition.of.resources  1    3859.3 30399 4043.1
## + BMI                              1    1987.1 32271 4119.1
## + percentage.expenditure           1    1339.9 32918 4144.4
## + thinness.5.9.years               1    1313.2 32945 4145.4
## + Status                           1     853.2 33405 4163.0
## + Polio                            1     711.8 33546 4168.4
## + Diphtheria                       1     664.6 33593 4170.2
## + under.five.deaths                1     182.5 34075 4188.3
## + Hepatitis.B                      1     129.8 34128 4190.3
## + Total.expenditure                1     108.7 34149 4191.1
## + Alcohol                          1      73.8 34184 4192.4
## <none>                                         34258 4193.1
## + Population                       1       7.5 34251 4194.8
## + Measles                          1       0.1 34258 4195.1
## 
## Step:  AIC=3681.01
## Life.expectancy ~ Schooling + Adult.Mortality
## 
##                                   Df Sum of Sq   RSS    AIC
## + HIV.AIDS                         1    4037.3 18831 3435.9
## + Income.composition.of.resources  1    2053.7 20814 3563.3
## + BMI                              1    1003.4 21865 3625.9
## + percentage.expenditure           1     758.0 22110 3640.1
## + thinness.5.9.years               1     700.5 22168 3643.4
## + Diphtheria                       1     347.7 22520 3663.5
## + Polio                            1     330.0 22538 3664.5
## + Status                           1     316.5 22552 3665.3
## + under.five.deaths                1     276.1 22592 3667.6
## + Hepatitis.B                      1      66.4 22802 3679.3
## + Population                       1      36.1 22832 3681.0
## <none>                                         22868 3681.0
## + Total.expenditure                1      25.3 22843 3681.6
## + Alcohol                          1      23.8 22844 3681.7
## + Measles                          1      10.6 22858 3682.4
## 
## Step:  AIC=3435.92
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
## 
##                                   Df Sum of Sq   RSS    AIC
## + Income.composition.of.resources  1   1873.51 16957 3304.6
## + percentage.expenditure           1    905.07 17926 3375.3
## + BMI                              1    686.08 18145 3390.7
## + thinness.5.9.years               1    414.85 18416 3409.6
## + Status                           1    343.59 18487 3414.5
## + under.five.deaths                1    221.64 18609 3422.9
## + Total.expenditure                1    197.80 18633 3424.5
## + Alcohol                          1    127.41 18704 3429.3
## + Diphtheria                       1    107.91 18723 3430.6
## + Polio                            1     72.88 18758 3433.0
## + Population                       1     43.74 18787 3435.0
## <none>                                         18831 3435.9
## + Measles                          1      2.12 18829 3437.8
## + Hepatitis.B                      1      0.57 18830 3437.9
## 
## Step:  AIC=3304.63
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources
## 
##                          Df Sum of Sq   RSS    AIC
## + percentage.expenditure  1    670.31 16287 3255.3
## + BMI                     1    472.83 16485 3270.7
## + under.five.deaths       1    277.61 16680 3285.6
## + thinness.5.9.years      1    265.83 16692 3286.5
## + Status                  1    208.27 16749 3290.9
## + Total.expenditure       1    198.60 16759 3291.6
## + Population              1     62.17 16895 3302.0
## + Diphtheria              1     50.96 16906 3302.8
## + Polio                   1     31.05 16926 3304.3
## <none>                                16957 3304.6
## + Measles                 1     13.68 16944 3305.6
## + Alcohol                 1      6.75 16951 3306.1
## + Hepatitis.B             1      0.18 16957 3306.6
## 
## Step:  AIC=3255.32
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure
## 
##                      Df Sum of Sq   RSS    AIC
## + BMI                 1    479.07 15808 3219.3
## + under.five.deaths   1    272.28 16015 3235.9
## + thinness.5.9.years  1    230.34 16057 3239.2
## + Total.expenditure   1    127.56 16160 3247.3
## + Diphtheria          1     59.14 16228 3252.7
## + Population          1     57.47 16230 3252.8
## + Status              1     39.60 16248 3254.2
## + Polio               1     39.56 16248 3254.2
## <none>                            16287 3255.3
## + Measles             1      9.91 16277 3256.5
## + Alcohol             1      8.26 16279 3256.7
## + Hepatitis.B         1      3.28 16284 3257.1
## 
## Step:  AIC=3219.35
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI
## 
##                      Df Sum of Sq   RSS    AIC
## + under.five.deaths   1   175.895 15632 3207.1
## + Total.expenditure   1    87.448 15721 3214.3
## + Diphtheria          1    81.711 15726 3214.8
## + thinness.5.9.years  1    57.697 15750 3216.7
## + Polio               1    57.324 15751 3216.7
## + Status              1    44.185 15764 3217.8
## + Population          1    31.888 15776 3218.8
## <none>                            15808 3219.3
## + Alcohol             1    11.905 15796 3220.4
## + Hepatitis.B         1     6.632 15801 3220.8
## + Measles             1     0.202 15808 3221.3
## 
## Step:  AIC=3207.12
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths
## 
##                      Df Sum of Sq   RSS    AIC
## + Total.expenditure   1    67.829 15564 3203.6
## + Measles             1    54.065 15578 3204.7
## + Diphtheria          1    48.668 15584 3205.1
## + Status              1    43.343 15589 3205.6
## + Polio               1    29.959 15602 3206.7
## <none>                            15632 3207.1
## + Population          1    13.661 15618 3208.0
## + Alcohol             1     8.856 15623 3208.4
## + thinness.5.9.years  1     4.388 15628 3208.8
## + Hepatitis.B         1     0.534 15632 3209.1
## 
## Step:  AIC=3203.58
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure
## 
##                      Df Sum of Sq   RSS    AIC
## + Measles             1    61.291 15503 3200.6
## + Diphtheria          1    38.595 15526 3202.4
## + Status              1    35.291 15529 3202.7
## <none>                            15564 3203.6
## + Polio               1    23.358 15541 3203.7
## + Population          1    15.017 15549 3204.4
## + Alcohol             1    13.699 15551 3204.5
## + thinness.5.9.years  1     2.203 15562 3205.4
## + Hepatitis.B         1     0.015 15564 3205.6
## 
## Step:  AIC=3200.57
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure + 
##     Measles
## 
##                      Df Sum of Sq   RSS    AIC
## + Diphtheria          1    40.313 15463 3199.3
## + Status              1    35.631 15467 3199.6
## + Polio               1    27.011 15476 3200.3
## <none>                            15503 3200.6
## + Population          1    14.781 15488 3201.4
## + Alcohol             1    13.712 15489 3201.4
## + thinness.5.9.years  1     0.268 15503 3202.5
## + Hepatitis.B         1     0.070 15503 3202.6
## 
## Step:  AIC=3199.25
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure + 
##     Measles + Diphtheria
## 
##                      Df Sum of Sq   RSS    AIC
## + Hepatitis.B         1    58.028 15405 3196.5
## + Status              1    32.965 15430 3198.5
## <none>                            15463 3199.3
## + Alcohol             1    19.933 15443 3199.6
## + Population          1    13.415 15449 3200.1
## + thinness.5.9.years  1     0.815 15462 3201.2
## + Polio               1     0.444 15462 3201.2
## 
## Step:  AIC=3196.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure + 
##     Measles + Diphtheria + Hepatitis.B
## 
##                      Df Sum of Sq   RSS    AIC
## + Status              1    33.058 15372 3195.7
## + Alcohol             1    27.595 15377 3196.2
## <none>                            15405 3196.5
## + Population          1    13.102 15392 3197.4
## + Polio               1     0.239 15404 3198.5
## + thinness.5.9.years  1     0.189 15404 3198.5
## 
## Step:  AIC=3195.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure + 
##     Measles + Diphtheria + Hepatitis.B + Status
## 
##                      Df Sum of Sq   RSS    AIC
## + Alcohol             1    65.775 15306 3192.3
## <none>                            15372 3195.7
## + Population          1    14.289 15357 3196.6
## + Polio               1     0.093 15372 3197.7
## + thinness.5.9.years  1     0.012 15372 3197.7
## 
## Step:  AIC=3192.28
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + under.five.deaths + Total.expenditure + 
##     Measles + Diphtheria + Hepatitis.B + Status + Alcohol
## 
##                      Df Sum of Sq   RSS    AIC
## <none>                            15306 3192.3
## + Population          1   12.8055 15293 3193.2
## + thinness.5.9.years  1    1.9979 15304 3194.1
## + Polio               1    0.1993 15306 3194.3

Create forward & backward step wise model

# Both
le_both <- step(le_model, 
                   scope = list(lower = le_none, 
                                upper = le_model), direction = "both")
## Start:  AIC=3197
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Polio + 
##     Total.expenditure + Diphtheria + HIV.AIDS + Population + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1       0.2 15291 3195.0
## - thinness.5.9.years               1       2.5 15293 3195.2
## - Population                       1      13.3 15304 3196.1
## <none>                                         15290 3197.0
## - Diphtheria                       1      55.4 15346 3199.6
## - Measles                          1      58.7 15349 3199.9
## - Total.expenditure                1      61.9 15352 3200.1
## - Alcohol                          1      66.8 15357 3200.5
## - Hepatitis.B                      1      69.3 15360 3200.8
## - Status                           1      72.2 15363 3201.0
## - under.five.deaths                1     136.9 15427 3206.3
## - BMI                              1     344.3 15635 3223.3
## - percentage.expenditure           1     464.1 15754 3233.0
## - Income.composition.of.resources  1    1399.9 16690 3306.4
## - Schooling                        1    2098.7 17389 3358.6
## - Adult.Mortality                  1    2612.1 17902 3395.6
## - HIV.AIDS                         1    3347.4 18638 3446.8
## 
## Step:  AIC=3195.01
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Population + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5.9.years               1       2.5 15293 3193.2
## - Population                       1      13.3 15304 3194.1
## <none>                                         15291 3195.0
## + Polio                            1       0.2 15290 3197.0
## - Measles                          1      58.5 15349 3197.9
## - Total.expenditure                1      61.8 15352 3198.1
## - Alcohol                          1      66.7 15357 3198.5
## - Hepatitis.B                      1      69.8 15360 3198.8
## - Status                           1      72.4 15363 3199.0
## - Diphtheria                       1     117.4 15408 3202.7
## - under.five.deaths                1     137.1 15428 3204.4
## - BMI                              1     344.1 15635 3221.3
## - percentage.expenditure           1     463.9 15754 3231.0
## - Income.composition.of.resources  1    1400.9 16692 3304.5
## - Schooling                        1    2104.9 17395 3357.1
## - Adult.Mortality                  1    2613.2 17904 3393.7
## - HIV.AIDS                         1    3372.2 18663 3446.5
## 
## Step:  AIC=3193.22
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Population + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1      12.8 15306 3192.3
## <none>                                         15293 3193.2
## + thinness.5.9.years               1       2.5 15291 3195.0
## + Polio                            1       0.2 15293 3195.2
## - Measles                          1      62.5 15356 3196.4
## - Total.expenditure                1      64.0 15357 3196.5
## - Alcohol                          1      64.3 15357 3196.6
## - Hepatitis.B                      1      71.1 15364 3197.1
## - Status                           1      72.2 15365 3197.2
## - Diphtheria                       1     116.9 15410 3200.9
## - under.five.deaths                1     171.7 15465 3205.4
## - BMI                              1     411.2 15704 3225.0
## - percentage.expenditure           1     464.3 15757 3229.3
## - Income.composition.of.resources  1    1413.2 16706 3303.6
## - Schooling                        1    2115.2 17408 3356.0
## - Adult.Mortality                  1    2619.9 17913 3392.4
## - HIV.AIDS                         1    3436.4 18729 3449.1
## 
## Step:  AIC=3192.28
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         15306 3192.3
## + Population                       1      12.8 15293 3193.2
## + thinness.5.9.years               1       2.0 15304 3194.1
## + Polio                            1       0.2 15306 3194.3
## - Measles                          1      62.7 15369 3195.5
## - Total.expenditure                1      62.9 15369 3195.5
## - Alcohol                          1      65.8 15372 3195.7
## - Status                           1      71.2 15377 3196.2
## - Hepatitis.B                      1      71.6 15377 3196.2
## - Diphtheria                       1     119.1 15425 3200.1
## - under.five.deaths                1     183.2 15489 3205.4
## - BMI                              1     414.2 15720 3224.2
## - percentage.expenditure           1     464.9 15771 3228.3
## - Income.composition.of.resources  1    1414.7 16721 3302.7
## - Schooling                        1    2151.1 17457 3357.6
## - Adult.Mortality                  1    2618.7 17924 3391.2
## - HIV.AIDS                         1    3447.7 18754 3448.7
summary(le_backward)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Alcohol + 
##     percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Total.expenditure + Diphtheria + HIV.AIDS + Income.composition.of.resources + 
##     Schooling, data = le_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6587  -1.9273  -0.0234   2.3269  11.8412 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.60390048  0.82344174  67.526
## StatusDeveloping                -0.91154846  0.37671511  -2.420
## Adult.Mortality                 -0.01743226  0.00118823 -14.671
## Alcohol                         -0.08349779  0.03591153  -2.325
## percentage.expenditure           0.00041518  0.00006716   6.182
## Hepatitis.B>= 90% Covered       -0.83487886  0.34417632  -2.426
## Measles                          0.00002749  0.00001211   2.271
## BMI                              0.03545956  0.00607768   5.834
## under.five.deaths               -0.00277041  0.00071399  -3.880
## Total.expenditure                0.10350644  0.04553187   2.273
## Diphtheria>= 90% Covered         1.17656001  0.37612305   3.128
## HIV.AIDS                        -0.55922622  0.03322075 -16.834
## Income.composition.of.resources  9.42925025  0.87443614  10.783
## Schooling                        0.86694378  0.06519956  13.297
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                             0.01567 *  
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.02023 *  
## percentage.expenditure                0.000000000856 ***
## Hepatitis.B>= 90% Covered                    0.01542 *  
## Measles                                      0.02334 *  
## BMI                                   0.000000006859 ***
## under.five.deaths                            0.00011 ***
## Total.expenditure                            0.02318 *  
## Diphtheria>= 90% Covered                     0.00180 ** 
## HIV.AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared:  0.7967, Adjusted R-squared:  0.7946 
## F-statistic: 379.2 on 13 and 1258 DF,  p-value: < 0.00000000000000022
summary(le_forward)
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     HIV.AIDS + Income.composition.of.resources + percentage.expenditure + 
##     BMI + under.five.deaths + Total.expenditure + Measles + Diphtheria + 
##     Hepatitis.B + Status + Alcohol, data = le_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6587  -1.9273  -0.0234   2.3269  11.8412 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.60390048  0.82344174  67.526
## Schooling                        0.86694378  0.06519956  13.297
## Adult.Mortality                 -0.01743226  0.00118823 -14.671
## HIV.AIDS                        -0.55922622  0.03322075 -16.834
## Income.composition.of.resources  9.42925025  0.87443614  10.783
## percentage.expenditure           0.00041518  0.00006716   6.182
## BMI                              0.03545956  0.00607768   5.834
## under.five.deaths               -0.00277041  0.00071399  -3.880
## Total.expenditure                0.10350644  0.04553187   2.273
## Measles                          0.00002749  0.00001211   2.271
## Diphtheria>= 90% Covered         1.17656001  0.37612305   3.128
## Hepatitis.B>= 90% Covered       -0.83487886  0.34417632  -2.426
## StatusDeveloping                -0.91154846  0.37671511  -2.420
## Alcohol                         -0.08349779  0.03591153  -2.325
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## percentage.expenditure                0.000000000856 ***
## BMI                                   0.000000006859 ***
## under.five.deaths                            0.00011 ***
## Total.expenditure                            0.02318 *  
## Measles                                      0.02334 *  
## Diphtheria>= 90% Covered                     0.00180 ** 
## Hepatitis.B>= 90% Covered                    0.01542 *  
## StatusDeveloping                             0.01567 *  
## Alcohol                                      0.02023 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared:  0.7967, Adjusted R-squared:  0.7946 
## F-statistic: 379.2 on 13 and 1258 DF,  p-value: < 0.00000000000000022
summary(le_both)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Alcohol + 
##     percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Total.expenditure + Diphtheria + HIV.AIDS + Income.composition.of.resources + 
##     Schooling, data = le_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6587  -1.9273  -0.0234   2.3269  11.8412 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.60390048  0.82344174  67.526
## StatusDeveloping                -0.91154846  0.37671511  -2.420
## Adult.Mortality                 -0.01743226  0.00118823 -14.671
## Alcohol                         -0.08349779  0.03591153  -2.325
## percentage.expenditure           0.00041518  0.00006716   6.182
## Hepatitis.B>= 90% Covered       -0.83487886  0.34417632  -2.426
## Measles                          0.00002749  0.00001211   2.271
## BMI                              0.03545956  0.00607768   5.834
## under.five.deaths               -0.00277041  0.00071399  -3.880
## Total.expenditure                0.10350644  0.04553187   2.273
## Diphtheria>= 90% Covered         1.17656001  0.37612305   3.128
## HIV.AIDS                        -0.55922622  0.03322075 -16.834
## Income.composition.of.resources  9.42925025  0.87443614  10.783
## Schooling                        0.86694378  0.06519956  13.297
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                             0.01567 *  
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.02023 *  
## percentage.expenditure                0.000000000856 ***
## Hepatitis.B>= 90% Covered                    0.01542 *  
## Measles                                      0.02334 *  
## BMI                                   0.000000006859 ***
## under.five.deaths                            0.00011 ***
## Total.expenditure                            0.02318 *  
## Diphtheria>= 90% Covered                     0.00180 ** 
## HIV.AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared:  0.7967, Adjusted R-squared:  0.7946 
## F-statistic: 379.2 on 13 and 1258 DF,  p-value: < 0.00000000000000022

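💡 Insight :

  • The “backward”, “forward” and “both” stepwise searches all arrive at
    the same 13-predictor model, so they share the same adjusted
    R-squared of 0.7946 (79.5%).
  • That is only marginally higher than the full model (0.7943), so the
    result is still not satisfactory.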
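
One way to confirm that different search directions end at the same model is to compare the sets of selected terms rather than the printed formulas, whose term order differs. A minimal sketch on the built-in mtcars data (a stand-in for le_train; the object names and the helper selected() are illustrative):

```r
# Stand-in full and null models on data that ships with R
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
none <- lm(mpg ~ 1, data = mtcars)

# trace = 0 suppresses the step-by-step tables
bwd <- step(full, direction = "backward", trace = 0)
fwd <- step(none,
            scope = list(lower = formula(none), upper = formula(full)),
            direction = "forward", trace = 0)

# Compare the selected terms independently of the order they were added in
selected   <- function(fit) sort(attr(terms(fit), "term.labels"))
same_model <- identical(selected(bwd), selected(fwd))
same_model
```

Forward and backward searches are not guaranteed to agree in general. Applied to this analysis, the same comparison on le_backward, le_forward and le_both would be expected to return TRUE, since their summaries above list identical coefficients.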

Feature selection

The stepwise model still contains some weaker predictors, so let’s
create a new model that keeps only the most significant ones (those
marked with three asterisks, ***, i.e. p < 0.001).

# Model selection

le_selected <- lm(formula = Life.expectancy ~ Adult.Mortality + under.five.deaths + HIV.AIDS + percentage.expenditure + BMI + Income.composition.of.resources + Schooling,
               data = le_train)
summary(le_selected)
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + under.five.deaths + 
##     HIV.AIDS + percentage.expenditure + BMI + Income.composition.of.resources + 
##     Schooling, data = le_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8367  -2.0608   0.0356   2.3203  12.1137 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.27699936  0.61067502  90.518
## Adult.Mortality                 -0.01839507  0.00117363 -15.674
## under.five.deaths               -0.00231010  0.00061255  -3.771
## HIV.AIDS                        -0.56342891  0.03255143 -17.309
## percentage.expenditure           0.00046564  0.00006319   7.369
## BMI                              0.03374420  0.00606621   5.563
## Income.composition.of.resources  9.48884449  0.86342388  10.990
## Schooling                        0.88661004  0.06080591  14.581
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## under.five.deaths                            0.00017 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## percentage.expenditure             0.000000000000309 ***
## BMI                                0.000000032397113 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.517 on 1264 degrees of freedom
## Multiple R-squared:  0.7923, Adjusted R-squared:  0.7912 
## F-statistic:   689 on 7 and 1264 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • The adjusted R-squared for the selected predictors drops slightly to
    0.7912 (79.1%). Removing the weaker predictors did not improve the
    model.

Model comparison

data.frame(model = c("le_model","le_backward", "le_forward", "le_both", "le_selected"), 
           AdjRsquare = c(summary(le_model)$adj.r.squared,
                          summary(le_backward)$adj.r.squared,
                          summary(le_forward)$adj.r.squared,
                          summary(le_both)$adj.r.squared,
                          summary(le_selected)$adj.r.squared))
##         model AdjRsquare
## 1    le_model  0.7942915
## 2 le_backward  0.7945742
## 3  le_forward  0.7945742
## 4     le_both  0.7945742
## 5 le_selected  0.7911908

💡 Insight :

  • “le_backward”, “le_forward” and “le_both” share the highest Adj.
    R-squared (0.7946), so they are the best of the five models.
    Therefore, we will tune one of them before predicting
    “Life.expectancy”.
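
The comparison table above can also be built from a named list of models, which avoids repeating one summary() call per model. The following is a minimal sketch on the built-in mtcars data; the models `full` and `small` are hypothetical stand-ins, not the life expectancy models.

```r
# Illustrative models on the built-in mtcars data (not the life expectancy data).
models <- list(
  full  = lm(mpg ~ wt + hp + disp, data = mtcars),
  small = lm(mpg ~ wt + hp, data = mtcars)
)

# Pull the adjusted R-squared out of each fitted model.
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)

comparison <- data.frame(model = names(adj_r2),
                         AdjRsquare = unname(adj_r2))

# Sort so that the best-scoring model appears first.
comparison[order(-comparison$AdjRsquare), ]
```

Adding another model to the comparison then only requires adding one entry to the list.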

Log transformation

A log transformation rescales the data with the logarithm. Here we use
log1p(), which computes log(1 + x) and therefore stays defined when a
value is 0. Since we decided above to tune “le_backward”, “le_forward”
or “le_both”, we apply the transformation only to the numeric variables
used in that model.

le_log <- lm(formula = log1p(Life.expectancy) ~ Status + log1p(Adult.Mortality) + log1p(Alcohol) + log1p(percentage.expenditure) + Hepatitis.B + log1p(Measles) + log1p(BMI) + log1p(under.five.deaths) + log1p(Total.expenditure) + Diphtheria + log1p(HIV.AIDS) + log1p(Income.composition.of.resources) + log1p(Schooling), 
             data = le_clean)


summary(le_log)
## 
## Call:
## lm(formula = log1p(Life.expectancy) ~ Status + log1p(Adult.Mortality) + 
##     log1p(Alcohol) + log1p(percentage.expenditure) + Hepatitis.B + 
##     log1p(Measles) + log1p(BMI) + log1p(under.five.deaths) + 
##     log1p(Total.expenditure) + Diphtheria + log1p(HIV.AIDS) + 
##     log1p(Income.composition.of.resources) + log1p(Schooling), 
##     data = le_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.186507 -0.028061  0.001444  0.027619  0.178458 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                             3.9520531  0.0230530 171.434
## StatusDeveloping                       -0.0097076  0.0041594  -2.334
## log1p(Adult.Mortality)                 -0.0097795  0.0012800  -7.640
## log1p(Alcohol)                          0.0062903  0.0018735   3.358
## log1p(percentage.expenditure)           0.0063802  0.0008170   7.809
## Hepatitis.B>= 90% Covered              -0.0126270  0.0041799  -3.021
## log1p(Measles)                         -0.0004827  0.0004812  -1.003
## log1p(BMI)                              0.0030001  0.0018224   1.646
## log1p(under.five.deaths)               -0.0059778  0.0010672  -5.601
## log1p(Total.expenditure)                0.0069510  0.0033860   2.053
## Diphtheria>= 90% Covered                0.0128293  0.0045744   2.805
## log1p(HIV.AIDS)                        -0.0841528  0.0022181 -37.939
## log1p(Income.composition.of.resources)  0.1739592  0.0146800  11.850
## log1p(Schooling)                        0.1018532  0.0091858  11.088
##                                                    Pr(>|t|)    
## (Intercept)                            < 0.0000000000000002 ***
## StatusDeveloping                                   0.019725 *  
## log1p(Adult.Mortality)                   0.0000000000000375 ***
## log1p(Alcohol)                                     0.000805 ***
## log1p(percentage.expenditure)            0.0000000000000104 ***
## Hepatitis.B>= 90% Covered                          0.002561 ** 
## log1p(Measles)                                     0.315981    
## log1p(BMI)                                         0.099919 .  
## log1p(under.five.deaths)                 0.0000000250690925 ***
## log1p(Total.expenditure)                           0.040250 *  
## Diphtheria>= 90% Covered                           0.005100 ** 
## log1p(HIV.AIDS)                        < 0.0000000000000002 ***
## log1p(Income.composition.of.resources) < 0.0000000000000002 ***
## log1p(Schooling)                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04782 on 1576 degrees of freedom
## Multiple R-squared:  0.8265, Adjusted R-squared:  0.8251 
## F-statistic: 577.6 on 13 and 1576 DF,  p-value: < 0.00000000000000022

data.frame(model = c("le_model","le_backward", "le_forward", "le_both", "le_selected", "le_log"), 
           AdjRsquare = c(summary(le_model)$adj.r.squared,
                          summary(le_backward)$adj.r.squared,
                          summary(le_forward)$adj.r.squared,
                          summary(le_both)$adj.r.squared,
                          summary(le_selected)$adj.r.squared,
                          summary(le_log)$adj.r.squared))
##         model AdjRsquare
## 1    le_model  0.7942915
## 2 le_backward  0.7945742
## 3  le_forward  0.7945742
## 4     le_both  0.7945742
## 5 le_selected  0.7911908
## 6      le_log  0.8250929

💡 Insight :

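  • Adj. R-squared is higher, at 82.5%, which indicates a better
    linear fit. Note that “le_log” was fitted on le_clean with a
    log-transformed response, while “le_selected” was fitted on
    le_train, so the values are not strictly comparable.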
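
Because a model with a log1p() response predicts on the log scale, its predictions have to be passed through expm1() before they can be read as original units. The sketch below shows this back-transformation and compares two models by RMSE on the original scale. It assumes simulated data; `x`, `y`, `fit_raw`, `fit_log` and `rmse` are hypothetical names used only for illustration.

```r
set.seed(42)
# Simulated data (an assumption for illustration only).
n <- 200
x <- runif(n, 0, 10)
y <- exp(3.5 + 0.05 * x + rnorm(n, sd = 0.05))

fit_raw <- lm(y ~ x)         # response on the original scale
fit_log <- lm(log1p(y) ~ x)  # response on the log1p scale

# Fitted values of fit_log are on the log1p scale;
# expm1() is the exact inverse of log1p().
pred_raw <- fitted(fit_raw)
pred_log <- expm1(fitted(fit_log))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Both errors are now expressed in the units of y.
data.frame(model = c("fit_raw", "fit_log"),
           RMSE  = c(rmse(y, pred_raw), rmse(y, pred_log)))
```

Comparing errors on the original scale sidesteps the fact that R-squared values computed on different response scales measure different things.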

Normality test

Using histogram

hist(le_log$residuals, breaks = 20)


Most of the residuals are concentrated around the center (zero), which
suggests an approximately normal distribution.

Using QQ Plot

plot(le_log, which = 2)


Most of the residuals lie close to the reference line in the center of
the plot, which suggests an approximately normal distribution.

Shapiro test

shapiro.test(le_log$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  le_log$residuals
## W = 0.99285, p-value = 0.0000005719

The W statistic is 0.99285, which is close to 1, so the departure from
normality is small in size. However, the p-value is 0.0000005719, far
below 0.05, so we reject the null hypothesis that the residuals are
normally distributed. With a sample this large (1,590 observations),
the Shapiro-Wilk test detects even minor deviations, which is why it
can disagree with the visual checks above.
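
To see how the sample size affects this test, the sketch below runs shapiro.test() on a small and a large sample drawn from the same mildly heavy-tailed distribution. The data are simulated (Student's t with 10 degrees of freedom, an assumption chosen for illustration), so the numbers say nothing about the life expectancy residuals themselves.

```r
set.seed(123)
# Mildly heavy-tailed data: Student's t with 10 degrees of freedom (simulated).
small_sample <- rt(50, df = 10)
large_sample <- rt(5000, df = 10)  # shapiro.test() accepts at most 5000 values

res_small <- shapiro.test(small_sample)
res_large <- shapiro.test(large_sample)

# Collect W and the p-value side by side for both sample sizes.
data.frame(n       = c(50, 5000),
           W       = unname(c(res_small$statistic, res_large$statistic)),
           p_value = c(res_small$p.value, res_large$p.value))
```

Reading W together with the p-value, rather than the p-value alone, gives a better sense of how large the deviation from normality is.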

Homoscedasticity

Create a plot to check visually

plot(le_train$Life.expectancy, le_backward$residuals)
abline(h = 0, col = "red")
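
A common alternative is to plot the residuals against the fitted values rather than against the observed response, because OLS residuals are always positively correlated with the observed response and that trend can be mistaken for a pattern. The following is a minimal sketch on simulated data; `x`, `y` and `fit` are hypothetical and not part of the life expectancy analysis.

```r
set.seed(1)
# Simulated data with constant error variance (illustration only).
x <- runif(300, 0, 10)
y <- 2 + 0.5 * x + rnorm(300)
fit <- lm(y ~ x)

# Residuals against fitted values: a patternless band around zero
# is what homoscedastic errors look like.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")

# In OLS with an intercept, residuals are uncorrelated with the fitted
# values but positively correlated with the observed response.
cor(fitted(fit), resid(fit))
cor(y, resid(fit))
```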