Predicting Life Expectancy Using Various Fundamental Global Demographical, Economic, Environmental, and Health Indicators
Our team is interested in modeling Life Expectancy using a combination of various fundamental global indicators available at the World Bank. Among the various indicators for which data was available, we chose only those with sufficient data provided for most countries, had the latest data no older than three years, and were seemingly unrelated to other indicators to mitigate any multi-collinearity issues.
After careful consideration, we chose Population Density, GDP per Capita, Fertility Rate, CO2 emissions, Unemployment Rate, Death Rate, Urban Population, Diabetes and Inflation as our leading indicators to try and predict Life Expectancy. We collated the latest available data for all indicators into one excel sheet and loaded the data set in Rstudio to run our analysis.
Our main goal with this was to apply a Multiple Linear Regression Model (MLRM) using the most useful and significant indicators to accurately model and predict Life Expectancy.
Using nine various fundamental global indicators, i.e., Population Density, GDP Per Capita, Fertility Rate, Unemployment, CO2, Death Rate, Urban Population, Diabetes, and Inflation, which ones are the most significant to model and predict Life Expectancy using Multiple Linear Regression Model (MLRM)?
Life.Exp
: Life expectancy at birth, total (years) (2019) - Life
expectancy at birth indicates the number of years a newborn infant would
live if prevailing patterns of mortality at the time of its birth were
to stay the same throughout its life. Life.Exp
Dataset
Pop.Dens
: Population density (people per sq. km of land area)
(2020) - Population density is midyear population divided by land area
in square kilometers. Population is based on the de facto definition of
population, which counts all residents regardless of legal status or
citizenship–except for refugees not permanently settled in the country
of asylum, who are generally considered part of the population of their
country of origin. Land area is a country’s total area, excluding area
under inland water bodies, national claims to continental shelf, and
exclusive economic zones. In most cases the definition of inland water
bodies includes major rivers and lakes. Pop.Dens
Dataset
GDP.PerCap
: GDP per capita (current US$) (2020) - GDP per capita
is gross domestic product divided by midyear population. GDP is the sum
of gross value added by all resident producers in the economy plus any
product taxes and minus any subsidies not included in the value of the
products. It is calculated without making deductions for depreciation of
fabricated assets or for depletion and degradation of natural resources.
Data are in current U.S. dollars. GDP
Dataset
Fert.Rate
: Fertility rate, total (births per woman) (2020) - Total
fertility rate represents the number of children that would be born to a
woman if she were to live to the end of her childbearing years and bear
children in accordance with age-specific fertility rates of the
specified year. Fert.Rate
Dataset
Unemp
: Unemployment, total (% of total labor force) (modeled ILO
estimate) (2021) - Unemployment refers to the share of the labor force
that is without work but available for and seeking employment. Unemp
Dataset
CO2
: CO2 emissions (metric tons per capita) (2019) - Carbon
dioxide emissions are those stemming from the burning of fossil fuels
and the manufacture of cement.nThey include carbon dioxide produced
during consumption of solid, liquid, and gas fuels and gas flaring. CO2
Dataset
Death.Rate
: Death rate, crude (per 1,000 people) - Crude death
rate indicates the number of deaths occurring during the year, per 1,000
population estimated at midyear. Subtracting the crude death rate from
the crude birth rate provides the rate of natural increase, which is
equal to the rate of population change in the absence of migration.
Death.Rate
Dataset
Urban.Pop
: Urban population (% of total population) (2021) - Urban
population refers to people living in urban areas as defined by national
statistical offices. The data are collected and smoothed by United
Nations Population Division. Urban.Pop
Dataset
Diabetes
: Diabetes prevalence (% of population ages 20 to 79)
(2021) - Diabetes prevalence refers to the percentage of people ages
20-79 who have type 1 or type 2 diabetes. It is calculated by adjusting
to a standard population age-structure. Diabetes
Dataset
Inflation
: Inflation, consumer prices (annual %) (2019) -
Inflation as measured by the consumer price index reflects the annual
percentage change in the cost to the average consumer of acquiring a
basket of goods and services that may be fixed or changed at specified
intervals, such as yearly. Inflation
Dataset
Note: some data points have no values associated to it. In order to fix this, we ran the na.omit() function to eliminate all rows with empty parameters. That being said, it is essential to acknowledge that omitting rows with empty parameters may cause a bias in our data as we will be removing many developing/under-developed from our dataset thereby causing us to only look at those countries which have the data to provide.
Many of the predictor variables are heavily skewed to the right due to the presence of extreme outliers. In order to rectify these issues, our team chose to use the logarithmic scaling for the necessary variables in order to make our predictors resemble a Gaussian distribution and reduce the effect of outliers.
data$Pop.Dens <- log(data$Pop.Dens)
data$GDP <- log(data$GDP)
data$Fert.Rate <- log(data$Fert.Rate)
data$Infl <- sign(data$Infl) * log(abs(data$Infl))
data$CO2 <- log(data$CO2)
data$Diabetes <- log(data$Diabetes)
data$Unemp <- log(data$Unemp)
Life Expectancy ~ Log Unemployment and Life Expectancy ~ Death Rate do not follow a linear trend, i.e., they cannot be effectively modeled by a linear function. Therefore, our team chose to remove those variables from our MLR model as we deemed that they are not useful in predicting Life Expectancy.
model_0 <- lm(Life.Exp ~ Pop.Dens + GDP + Fert.Rate + CO2 + Urban.Pop + Diabetes + Infl, data)
summary(model_0)
##
## Call:
## lm(formula = Life.Exp ~ Pop.Dens + GDP + Fert.Rate + CO2 + Urban.Pop +
## Diabetes + Infl, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.548 -1.554 0.262 1.822 7.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.24551 3.95919 13.701 < 2e-16 ***
## Pop.Dens 0.21206 0.18816 1.127 0.2611
## GDP 2.20161 0.40122 5.487 1.23e-07 ***
## Fert.Rate -7.71249 0.90983 -8.477 5.16e-15 ***
## CO2 -0.48984 0.38295 -1.279 0.2023
## Urban.Pop 0.03247 0.01566 2.074 0.0394 *
## Diabetes 1.84017 0.45015 4.088 6.31e-05 ***
## Infl -0.42070 0.22911 -1.836 0.0678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.074 on 199 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8127
## F-statistic: 128.7 on 7 and 199 DF, p-value: < 2.2e-16
vif(model_0)
## Pop.Dens GDP Fert.Rate CO2 Urban.Pop Diabetes Infl
## 1.200707 6.384505 3.458967 5.642539 2.424800 1.265287 1.175767
As we can see that two predictors have a Variance Inflation Factor (VIF) > 5, i.e., Log GDP Per Capita and Log CO2. This suggests that there is a multi-collinearity issue between those two predictors.
The multi-collinearity can be explained as a high GDP Per Capita implies that, on average, there is more production in the economy by resident producers, which implies that there is more CO2 emissions to facilitate those productions.
Therefore, in order to address this multi-collinearity issue, our team chose to remove Log CO2 from our model and continue our analysis on the MLRM with the remaining variables.
model_1 <- lm(Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes + Infl, data)
summary(model_1)
##
## Call:
## lm(formula = Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop +
## Diabetes + Infl, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4430 -1.6313 0.2067 1.8116 7.2264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.42119 3.58088 15.756 < 2e-16 ***
## Pop.Dens 0.26197 0.18436 1.421 0.156885
## GDP 1.92928 0.34061 5.664 5.09e-08 ***
## Fert.Rate -7.37950 0.87318 -8.451 5.93e-15 ***
## Urban.Pop 0.02815 0.01531 1.838 0.067490 .
## Diabetes 1.62119 0.41699 3.888 0.000138 ***
## Infl -0.42881 0.22939 -1.869 0.063032 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.079 on 200 degrees of freedom
## Multiple R-squared: 0.8176, Adjusted R-squared: 0.8121
## F-statistic: 149.4 on 6 and 200 DF, p-value: < 2.2e-16
vif(model_1)
## Pop.Dens GDP Fert.Rate Urban.Pop Diabetes Infl
## 1.149079 4.586719 3.175784 2.311755 1.082286 1.174867
As all our predictors have VIF < 5, we can conclude that our model no longer shows a multi-collinarity issue. Thus allowing us to run Goodness of Fit and F-tests to see which of our variables are most useful.
In order to know if any of our predictors are useful, we can do the Goodness of Fit Test and calculate an F-statistic:
Y = data$Life.Exp
n = length(Y)
SST = sum((Y - mean(Y))^2)
SSE = sum(resid(model_1)^2)
MSE = SSE / model_1$df.residual
SSR = SST - SSE
MSR = SSR / (n - 1 - model_1$df.residual)
F = MSR / MSE
alpha = 0.05
pval = 1 - pf(F,6,200) # from the summary(model_1) table
out = data.frame(SSE_RM = SST,SSE_FM = SSE, F=F, pval = pval,Fcrit=qf(1-alpha,6,200))
print(out)
## SSE_RM SSE_FM F pval Fcrit
## 1 10392.01 1895.768 149.3897 0 2.144133
As our F-statistic 149.39 is much greater than F-critical 2.14, which has an extremely low corresponding p-value ≈ 0, we favor the alternative hypothesis over the null hypothesis, i.e., at least one of the predictors is useful.
In order to test if each individual predictor is useful, we conduct F-tests (assuming all other predictors remain constant) by running an ANOVA (reduced model, full model) tests where reduced model = full model without jth variable:
For Log Population Density:
RM = lm(Life.Exp ~ GDP + Fert.Rate + Urban.Pop + Diabetes + Infl, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ GDP + Fert.Rate + Urban.Pop + Diabetes + Infl
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 1914.9
## 2 200 1895.8 1 19.139 2.0191 0.1569
For Log GDP Per Capita
RM = lm(Life.Exp ~ Pop.Dens + Fert.Rate + Urban.Pop + Diabetes + Infl, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Pop.Dens + Fert.Rate + Urban.Pop + Diabetes + Infl
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 2199.9
## 2 200 1895.8 1 304.11 32.083 5.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For Log Fertility Rate
RM = lm(Life.Exp ~ Pop.Dens + GDP + Urban.Pop + Diabetes + Infl, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Pop.Dens + GDP + Urban.Pop + Diabetes + Infl
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 2572.8
## 2 200 1895.8 1 677.02 71.424 5.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For Urban Population
RM = lm(Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Diabetes + Infl, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Diabetes + Infl
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 1927.8
## 2 200 1895.8 1 32.035 3.3796 0.06749 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For Log Diabetes
RM = lm(Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Infl, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Infl
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 2039.0
## 2 200 1895.8 1 143.28 15.115 0.0001376 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For Log Inflation
RM = lm(Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes, data)
anova(RM, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 201 1928.9
## 2 200 1895.8 1 33.124 3.4946 0.06303 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since we are conducting a non-trivial number of tests, it is essential to first control our Family-Wise Error Rate (FWER) and Type I error rate before conducting our hypothesis analysis.
In order to control our FWER, we first sort our test p-values from lowest to highest.
## [1] 5.929998e-15 5.090365e-08 1.375582e-04 6.303207e-02 6.748985e-02
## [6] 1.568848e-01
Then, we match the smallest p-value with the significance level α/m, then match the next smallest one to α/(m−1), then match the next to α/(m−2), etc… until we match the largest to α.
pvals2 <- c(0.05/6)
for(i in 1:5)
{
new_value = 0.05/(6 - i)
pvals2 <- c(pvals2, new_value)
}
pvals2
## [1] 0.008333333 0.010000000 0.012500000 0.016666667 0.025000000 0.050000000
We then look for the first time a p-value is greater than a significance level, and reject every p-value before that point
## [1] 5.929998e-15 5.090365e-08 1.375582e-04
Therefore, by conducting F-tests with Holm-Bonferroni method to control our FWER and Type 1 error rate, we conclude that our significant predictors are Log GDP Per Capita, Log Fertility Rate, and Log Diabetes.
confint(model_1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 49.360064886 63.48232090
## Pop.Dens -0.101569838 0.62550621
## GDP 1.257629404 2.60092698
## Fert.Rate -9.101322909 -5.65768374
## Urban.Pop -0.002044203 0.05833624
## Diabetes 0.798933032 2.44345103
## Infl -0.881131650 0.02351654
As the predictors with 0 in their respective 95% confidence interval are not significant, we can conclude that Log Population Density, Urban Population and Log Inflation are not significant, and thus we remove them from our model.
model_2 <- lm(Life.Exp ~ GDP + Fert.Rate + Diabetes, data)
summary(model_2)
##
## Call:
## lm(formula = Life.Exp ~ GDP + Fert.Rate + Diabetes, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9648 -1.8005 0.1864 1.8503 7.6034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.8584 3.1189 17.909 < 2e-16 ***
## GDP 2.3237 0.2707 8.585 2.38e-15 ***
## Fert.Rate -7.6453 0.8442 -9.057 < 2e-16 ***
## Diabetes 1.5410 0.4188 3.680 0.000299 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.122 on 203 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8068
## F-statistic: 287.7 on 3 and 203 DF, p-value: < 2.2e-16
To verify our conclusion, we can run another F-test using anova(model 2, model 1), where:
Null hypothesis (H0) is that the subset of predictors are not useful, i.e., H0 : βj = 0 where j ∈ I
Alternative hypothesis (HA) is that at least one of the subset of predictors is useful, i.e., HA : βj ≠ 0 where j ∈ I and I is our set of subset predictors
anova(model_2, model_1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ GDP + Fert.Rate + Diabetes
## Model 2: Life.Exp ~ Pop.Dens + GDP + Fert.Rate + Urban.Pop + Diabetes +
## Infl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 203 1978.6
## 2 200 1895.8 3 82.874 2.9144 0.03545 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As p-value < 0.05, we reject the null hypothesis and conclude that model 2 is a more useful in predicting Life Expectancy than model 1.
vif(model_2)
## GDP Fert.Rate Diabetes
## 2.816757 2.886539 1.061568
As all our VIF values are < 5, we can conclude that there is no issues of multi-collinearity in our model. This can also be confirmed by looking at the correlation heat map, i.e., all combinations of predictors have different shades of color.
The Residuals vs Fitted plot is uniformly distributed along the 0 line. This confirms our assumption that the model is relatively linear.
The Scale-Location shows no clear pattern among the residuals. This satisfies Homoskedasticity assumption for our multiple linear model.
The Normal Q-Q plot is more or less consistent with the straight line with minor deviations in the bottom left. Therefore, we can conclude that our model has residuals that are approximately normally distributed.
The Residuals vs Leverage plot shows us that there exists no extreme points with high leverage or within the cooks distance which could hugely influence our linear model.
Life.Exp | ||||||||
---|---|---|---|---|---|---|---|---|
Predictors | Estimates | std. Error | std. Beta | standardized std. Error | CI | standardized CI | Statistic | p |
(Intercept) | 55.86 | 3.12 | -0.00 | 0.03 | 49.71 – 62.01 | -0.06 – 0.06 | 17.91 | \<0.001 |
GDP | 2.32 | 0.27 | 0.44 | 0.05 | 1.79 – 2.86 | 0.34 – 0.54 | 8.58 | \<0.001 |
Fert Rate | -7.65 | 0.84 | -0.47 | 0.05 | -9.31 – -5.98 | -0.57 – -0.37 | -9.06 | \<0.001 |
Diabetes | 1.54 | 0.42 | 0.12 | 0.03 | 0.72 – 2.37 | 0.05 – 0.18 | 3.68 | \<0.001 |
Observations | 207 | |||||||
R2 / R2 adjusted | 0.810 / 0.807 |
Therefore, we can see that our MLRM is of the form
We can also see that all our predictors (including the intercept) are significant, and our model explains ≈ 81% of variation in Life Expectancy.
We created a dataset by choosing various fundamental global indicators from the World Bank in an attempt to try and predict Life Expectancy. Our chosen indicators were Population Density, GDP per Capita, Fertility Rate, CO2 emissions, Unemployment Rate, Death Rate, Urban Population, Diabetes and Inflation.
Next, we plotted histogram/boxplots/density plots for all predictors, and did the required log transformation to try and get our data to be more normally distributed and to account for extreme outliers.
After the transformations, we plotted 9 scatter plots, each with Life expectancy in the y-axis and the predictors in the x-axis. We noticed that Life Expectancy ~ Log Unemployment and Life Expectancy ~ Death Rate had negligible R squared (and negative Adjusted R squared), thus we removed those predictors from our model.
With the remaining predictors, we fit a preliminary MLRM and ran the variance Inflation Factor (VIF) function to check for multi-collinearity. As Log GDP and Log CO2 had a VIF > 5, we chose to eliminate Log CO2 as a predictor and checked that the new VIFs < 5.
We used the Goodness of Fit Test using nested models to check the usefulness of each predictor, i.e., to see if the predictors are significant to our model (assuming all other predictors remain fixed). We chose to run Holm-Bonferroni method to control the family-wise error and type 1 error rate. We noticed that Log GDP Per Capita, Log Fertility Rate and Log Diabetes were the significant variables to our model. We also confirmed our analysis by looking at the confidence intervals and running a final Goodness of Fit Test to check if our newest model was more useful in predicting Life Expectancy than our previous model.
Finally, we fit final MLR model, plotted some exploratory graphs, and ran some basic diagnostic plots to confirm that all assumptions of MLR have been met and concluded that our model is a good fit where all predictors (including intercept) are significant and explains approximately 81% of the variation in Life Expectancy.