Abstract

This report examines the impact and significance of individuals' socioeconomic status on their annual income level, using a threshold of $50,000. We applied six machine learning models to this question and report the full procedure: analyzing and cleaning the data, generating variables, and explaining the six models. (1) We analyzed 14 variables to find each variable's correlation with whether income is above or below $50,000 per year. (2) We cleaned the data by filling missing values with the statistical mode of the corresponding variable. (3) Because categorical variables cannot be used directly in the models, we used dummy encoding to convert them into numerical data. (4) We discarded certain variables, for the reasons listed in Section 4, then explained the modeling process in detail using Logistic Regression and briefly described the other five models with visual support. (5) Comparing the six models, we found that Random Forest is the most reliable for this task, with the highest testing score. We also identified the socioeconomic factors that most strongly affect an individual's income, including being married to a civilian spouse, holding a committed role in a relationship, capital investment, and education. The report uses the 1994 Census Bureau database compiled by Ronny Kohavi and Barry Becker, which covers 48,842 individuals with 14 socioeconomic variables each. The analysis relies on Exploratory Data Analysis (EDA), examining correlations in the data and applying several weighting processes.
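As a rough illustration of the cleaning and encoding steps described above, the following minimal sketch fills missing values with the mode and then dummy-encodes a categorical column. It is not the report's actual code: the toy records, the column names `workclass` and `age`, and the helper names `fill_with_mode` and `dummy_encode` are assumptions made for the example, and dropping the first category as the baseline is one common convention for dummy encoding.

```python
from collections import Counter

# Hypothetical toy records; column names are illustrative only.
rows = [
    {"workclass": "Private", "age": 39},
    {"workclass": None, "age": 50},
    {"workclass": "Private", "age": 38},
    {"workclass": "Self-emp", "age": 53},
]


def fill_with_mode(rows, column):
    """Replace missing (None) values in `column` with the most frequent observed value."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]


def dummy_encode(rows, column):
    """Replace a categorical column with 0/1 indicator columns.

    The first category (in sorted order) is dropped and serves as the baseline.
    """
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != column}
        for cat in categories[1:]:
            out[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(out)
    return encoded


cleaned = fill_with_mode(rows, "workclass")
encoded = dummy_encode(cleaned, "workclass")
```

Here the missing `workclass` entry becomes `"Private"` (the mode), and the column is replaced by a single indicator, `workclass_Self-emp`, with `"Private"` as the implicit baseline.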
