Predicting Life Expectancy from Immunizations and Socio-Economic Factors
Project for Regression and Forecasting Class
Using a Kaggle dataset, my team built multiple regression models and performed factor analysis to determine which factors were most significant in predicting life expectancy across several countries. First, we cleaned the data. Next, we checked the linear regression assumptions of constant variance and normality, and transformed the data to address collinearity among the features. We used forward and backward selection, stepwise regression, and best subsets regression to identify the most significant features, and we evaluated the models using RMSE and R-squared. The most significant features were mortality rates, deaths under 5 years of age, GDP, infant deaths, country status (developing or developed), total government expenditure percentage, income composition of resources, and alcohol usage. With these features, the team obtained an R-squared value of 83.76 percent. The project was done entirely in R.
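
For readers unfamiliar with the two accuracy measures named above, here is a minimal sketch of how RMSE and R-squared are computed from observed and predicted values. It is written in Python for illustration only and is not the project's R code. The function names and the sample numbers are hypothetical and do not come from the Kaggle dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error, in the units of the response."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of the variance in the observed values that the predictions explain."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        raise ValueError("R-squared is undefined when all observed values are equal")
    return 1.0 - ss_residual / ss_total


# Made-up life expectancies in years, for illustration only.
actual_years = [70.0, 65.0, 80.0, 75.0]
predicted_years = [72.0, 64.0, 78.0, 76.0]

print("RMSE:", rmse(actual_years, predicted_years))
print("R-squared:", r_squared(actual_years, predicted_years))
```

A lower RMSE means predictions sit closer to the observed values, while an R-squared nearer to 1 means the model accounts for more of the variation in life expectancy. Reporting both gives an absolute and a relative view of fit.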