Abstract

In this project, we use several binary classification methods to predict US adult income level from social factors including age, gender, education, and marital status. We first explore descriptive statistics for the dataset and handle missing values. We then examine some widely used classification methods: logistic regression, discriminant analysis, support vector machine, random forest, and boosting. For each method, we also provide suitable R functions to demonstrate its application. Metrics such as ROC curves, accuracy, recall, and F-measure are calculated to compare the performance of these models. We find that boosting is the best method for our data, as it achieves both the highest AUC value and the highest prediction accuracy. In addition, among all predictor variables, we identify the three variables that have the largest impact on US adult income level.
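As a rough illustration of the comparison metrics named above, the following minimal sketch computes accuracy, precision, recall, and F-measure from true and predicted binary labels. It is written in Python for illustration only and is not the authors' R code; the function name `binary_metrics`, the label coding (1 for income above $50K, 0 otherwise), and the toy labels are all assumptions made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels coded as 0/1.

    Hypothetical helper for illustration; 1 is treated as the positive class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy labels, made up for this example (not taken from the dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because accuracy alone can be misleading when one income class is much more common than the other, recall and F-measure are reported alongside it in comparisons of this kind.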

Highlights

  • The inequality of wealth and income is a huge concern around the globe, and governments in different countries are using different interventions to address income inequality

  • Extreme Gradient Boosting (XGBoost) for prediction tasks; [5] implemented Principal Component Analysis (PCA) to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau; [6] tried to replicate Bayesian networks, decision tree induction, and a lazy classifier on the dataset and presented a comparative analysis of their predictive performance

  • In addition to the existing approaches, many machine learning strategies might be suitable for analyzing this dataset, such as discriminant analysis, support vector machine (SVM), random forest, and neural nets [7,8,9]

Summary

Objective

Our strategy is to train a binary classifier, denoted as Y, to predict whether a person earns more than $50K per year based on social factors, and to identify which factors influence income level the most

Description of Dataset and Challenges of the project
Project-related literature
Main Contributions
Data cleaning
Split training set and testing set
Exploratory data analysis
Categorical variables
Numerical variables
Linear discriminant analysis
Quadratic discriminant analysis
Random forest
Model introduction
Model comparison
Boosting prediction
Findings
Conclusion