Authors: Kaitlin Khong, Anshul Pandas, Allison Kim, and Sharanya Sharma
Background
Voter turnout is pivotal for the health of the US democratic system. With the exponential growth of available data, leveraging it to understand the factors behind voter participation disparities becomes imperative. This study uses the 2020 voter files, which contain every registered voter’s geographic, demographic, and household information across states, to understand voting patterns within the state of Wyoming. Characterized by its distinctive mix of rural expansiveness and demographic homogeneity, Wyoming presents a unique case against the nation’s trend of increasing racial and ethnic diversity. Our analysis aims to uncover the factors that influence electoral engagement.
Dataset
We found that each row in the dataset represents an individual voter and the data collected on them for the 2020 election. We expected to be dealing with a large number of rows, but having to choose features from 726 columns seemed a little daunting at first. Initially, we knew that we wanted to investigate how socioeconomic and demographic factors, such as income, wealth, ethnicity, location, and gender, affect voter participation and political party affiliation in Wyoming. Our next step was to narrow our focus and identify which columns of the dataset represent these factors. We chose the following columns and determined what each one means:
Predictor Variables:
• CommercialData_EstimatedHHIncomeAmount: an individual’s estimated household income (US$)
• CommercialData_EstHomeValue: an individual’s estimated home value (US$)
• County: the county an individual voted from
• Ethnic_Description: an individual’s ethnic background
• Voters_Gender: an individual’s sex (M=‘male’ or F=‘female’)
Response Variables:
• Voters_Active: whether an individual was an ‘active’ voter (A) or an ‘inactive’ voter (I)
• Parties_Description: the political party an individual voted for
Since we knew we wanted to predict voter participation and party affiliation, we settled on the following research questions: 1) In the state of Wyoming, how do income, wealth, location, ethnicity, and gender affect voter participation? 2) Do these same factors affect the political party active voters affiliate with?
Data Preprocessing
We started our cleaning by converting our quantitative columns, ‘CommercialData_EstimatedHHIncomeAmount’ and ‘CommercialData_EstHomeValue,’ to integers, since they were originally stored as strings in the dataset. Then, for convenience, we renamed these columns to ‘HH_Income_Amount’ and ‘Home Value,’ respectively.
All of the columns had under 10% of their values missing except for ‘Ethnic_Description.’ For our quantitative columns, ‘CommercialData_EstimatedHHIncomeAmount’ and ‘CommercialData_EstHomeValue,’ we decided to impute missing values with the median of each column. Since less than 10% of the data was missing, we considered this technique justifiable.
For ‘Voters_Gender,’ a categorical variable, only about 1% of the values were missing, so we removed those rows from the dataset. For ‘Ethnic_Description,’ about 10% of the values were missing, so we added a new category to the column called ‘Unknown,’ representing any type of ‘null’ or ‘None’ value.
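The two imputation strategies described above can be illustrated with a minimal plain-Python sketch. This is not the project's PySpark code: the helper names `impute_median` and `fill_unknown`, the placeholder label, and the sample values are all assumptions made up for illustration.

```python
from statistics import median


def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fill_unknown(values, label="Unknown"):
    """Map missing categorical entries (None, '', 'null', 'None') to one label."""
    missing_markers = (None, "", "null", "None")
    return [label if v in missing_markers else v for v in values]


# Made-up sample data, not taken from the voter file.
incomes = [52000, None, 61000, 48000, None, 75000]
ethnicities = ["English/Welsh", None, "Hispanic", "null"]

imputed_incomes = impute_median(incomes)
filled_ethnicities = fill_unknown(ethnicities)

print(imputed_incomes)
print(filled_ethnicities)
```

The median is used rather than the mean because it is less sensitive to a few very large incomes or home values; the placeholder label is passed as a parameter so that it can match whatever category name the dataset actually uses.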
Exploratory Data Analysis (EDA)
Modeling
One Hot Encoding
Before we can apply any of the models on our dataset, we must prepare our PySpark dataframe in an acceptable format, using one hot encoding. The goal of one hot encoding is to convert the categorical variables in a dataset to a format that is readable by machine learning algorithms.
For now, we one hot encode the columns ‘County’ and ‘Ethnic_Description.’ We also replaced ‘Voters_Gender’ with ‘Voters_Male,’ which indicates whether the individual is male (1) or female (0). For our response variable ‘Voters_Active,’ we changed values of “A” to 1, representing that the voter is active, and all other values to 0, indicating that the voter is inactive. We will format ‘Parties_Description’ when we turn to the models that predict an active voter’s political party affiliation.
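As a rough illustration of the encoding described above, the following plain-Python sketch one hot encodes a categorical column and converts the gender and activity columns to 0/1 indicators. It is a simplified stand-in for the PySpark pipeline, and the `one_hot` helper and the three sample rows are hypothetical.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up rows using the column names from the dataset.
voters = [
    {"County": "Laramie", "Voters_Gender": "M", "Voters_Active": "A"},
    {"County": "Natrona", "Voters_Gender": "F", "Voters_Active": "I"},
    {"County": "Laramie", "Voters_Gender": "F", "Voters_Active": "A"},
]

# Binary columns: Voters_Male is 1 for "M"; Voters_Active is 1 for "A", else 0.
binary = [
    {
        "County": v["County"],
        "Voters_Male": 1 if v["Voters_Gender"] == "M" else 0,
        "Voters_Active": 1 if v["Voters_Active"] == "A" else 0,
    }
    for v in voters
]

encoded = one_hot(binary, "County")
for row in encoded:
    print(row)
```

Each county becomes its own indicator column, so a model sees no artificial ordering among counties.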
Next, we will apply our machine learning models. We have chosen to use four classification models:
– Logistic Regression
– Support Vector Machines
– Random Forest Classification
– Gradient Boosting Trees
For each model we are fitting to our data, we are trying to predict whether a voter is “active” or “inactive”.
We follow this same procedure for every model we run:
• Splitting the data into training and testing sets
• Fitting the model on our data with the chosen parameter(s)
• Analyzing the accuracy and ROC scores
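The procedure above can be sketched in plain Python as follows. PySpark DataFrames provide `randomSplit` for the splitting step; the version below only shows the idea. The 80/20 split, the seed, the synthetic labels, and the majority-class "model" used as a placeholder are all assumptions, not the settings used in this project.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic (id, label) rows: 95 active voters (1) and 5 inactive voters (0).
rows = list(enumerate([1] * 95 + [0] * 5))

# Step 1: split into training and testing sets.
train, test = train_test_split(rows)

# Step 2: "fit" a placeholder model that memorizes the majority training label.
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)

# Step 3: evaluate on the held-out test set.
test_labels = [label for _, label in test]
test_accuracy = accuracy(test_labels, [majority] * len(test_labels))
print(len(train), len(test), majority, test_accuracy)
```

Fixing the seed makes the split reproducible, which keeps the scores of different models comparable across runs.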
Logistic Regression
Logistic regression can be used for classification and is very easily interpretable. Our first step is to encode active voters with the value of ‘1’ and inactive voters with the value of ‘0.’ Then, we can go on and apply the model.
Accuracy: 0.9814485785953178
ROC AUC Score: 0.5
The accuracy of this regression is 0.98, which is fairly high. This is likely because we observed such a high proportion of active voters. As mentioned before, the proportion of active voters is almost 1 across all ethnicities, income brackets, counties, etc. Thus, a model can reach this accuracy simply by predicting that nearly every voter is active. The ROC AUC of 0.5 suggests that the model’s ability to rank active voters above inactive ones is no better than random guessing, whereas an ROC AUC of 1 would indicate a perfect classifier.
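To see how a high accuracy and an ROC AUC of 0.5 can occur together, consider the following small sketch. The labels are synthetic (98 active, 2 inactive, chosen to roughly mirror the class imbalance), and the rank-based `roc_auc` helper is written here for illustration rather than taken from the project's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic labels: 98 active voters (1) and 2 inactive voters (0).
y_true = [1] * 98 + [0] * 2

# A model that gives every voter the same score and always predicts "active".
constant_scores = [0.98] * 100
constant_preds = [1] * 100
constant_accuracy = accuracy(y_true, constant_preds)
constant_auc = roc_auc(y_true, constant_scores)

# A model whose scores separate the two classes perfectly.
perfect_scores = [0.9] * 98 + [0.1] * 2
perfect_auc = roc_auc(y_true, perfect_scores)

print(constant_accuracy, constant_auc, perfect_auc)
```

The constant model is right 98% of the time but cannot rank any active voter above any inactive voter, which is exactly what an ROC AUC of 0.5 expresses.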
Support Vector Machines
Support vector machines are a type of ML algorithm that is great for binary classification, robust to overfitting, memory efficient, and easy to interpret. We thought that it would be a great model to try on our data to predict whether voters are “active” or “inactive” in the state of Wyoming.
In running this model, we chose to give a regularization penalty of 0.2, as that is when the model performed its best:
Accuracy: 0.9818506396905683
Area Under ROC: 0.6134016436415894
The SVM model had a very high accuracy, correctly predicting whether a voter was “active” or “inactive” about 98% of the time. The area under the ROC was about 0.61, indicating that the model had only a moderate ability to discriminate between “active” and “inactive” voters. Although the model’s overall accuracy is high, its capacity to distinguish between these two categories is much weaker.
Random Forest Classification
After running the SVMs, we thought that Random Forest Classification might be a better model for predicting voter activeness. Random Forest Classification is known for high accuracy, reduced overfitting, the ability to handle large datasets, and robustness to outliers. The tree-like nature of the model can also capture complex interactions between the features, which could help us achieve a better area under the ROC. After running the model a number of times, we found that 15 trees and a maximum number of bins of 40 yielded the best results.
Accuracy: 0.9818506396905683
Area Under ROC: 0.6079391653216034
Again, the accuracy was about 98%, but we were surprised that the area under the ROC was only about 0.61, which was not too different from that of our SVM model. Our Random Forest Classification model had only a moderate ability to discriminate between “active” and “inactive” voters. Because of this, we tried using Gradient Boosting Trees, hoping for better results.
Best Model – Gradient Boosting Trees
Gradient Boosting Trees is a technique that can be used to solve classification problems, such as this one. It works by combining multiple decision trees to create a robust model. In every iteration of training, the performance of the model is improved by adding a new tree to the current ensemble. Each new tree is trained to correct the errors of the trees in the current ensemble.
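The idea of each new tree correcting the errors of the current ensemble can be sketched with a toy example. Note that this is a simplified squared-error variant using one-split trees (decision stumps) on a single feature, not the classifier used in this project; the data, the learning rate, and the helper names are all assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree to the residuals by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: one numeric feature and a 0/1 label.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 1, 1]

errors = [mse(boost(xs, ys, n), xs, ys) for n in (0, 1, 10)]
print(errors)
```

The training error shrinks as trees are added because every stump is fitted to what the previous ensemble still gets wrong; the learning rate limits how much each tree contributes.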
Like Random Forests, Gradient Boosting Trees have high accuracy, capture complex patterns, and are robust to overfitting. We thought this model would be useful because of its ability to handle imbalanced data, as there is a much higher proportion of “active” voters than “inactive” voters. Also, since each tree that is added learns from the errors of the current ensemble, we thought this might improve the model’s performance.
After running this model several times, we found that the parameters maxIter=15 and maxBins=60 gave the best improvement in the model’s performance.
Accuracy: 0.9819031450723699
Area Under ROC: 0.6431980964560804
The accuracy of this model was only marginally higher than that of the Random Forest Classification model, but we did see an area under the ROC of about 0.643, which is about 0.04 higher than that of the Random Forest model. The Gradient Boosting Trees model’s ability to discriminate between “active” and “inactive” voters is still moderate, as 0.643 lies between 0.5 and 0.7. However, this model performed the best of the four models we ran, having the highest accuracy and area under the ROC. Therefore, we will analyze this model further.
Party Classification
We plan to apply a Random Forest Classification model as well as a Gradient Boosting Trees Classifier to predict whether Wyoming voters are Democratic or Republican.
Ethnicity and county ended up being the most influential features in the model. Home value, household income, and gender still seemed to be fairly influential, as their feature importance values are not close to 0, although gender was the least influential of the five. Ethnicity being the most influential factor in determining party affiliation makes sense because different ethnic groups have different political leanings shaped by historical, social, and economic factors. Location, in the form of county, makes sense as the second most influential factor given the varying demographics and social dynamics of each county. Wyoming has 23 counties, so this is plausible. Home value and household income could influence the model in some ways, as a voter’s opinions on certain economic policies may reflect their level of income.
Conclusion
Our project has provided insights into the factors influencing voter participation and party affiliation in the state of Wyoming. Using Logistic Regression, Support Vector Machines (SVM), Gradient Boosting, and Random Forest, we analyzed the impact of household income, county, home value, ethnic description, and gender.
Our exploratory data analysis (EDA) revealed challenges such as the scarcity of inactive voters in the dataset, which made visualization difficult and potentially skewed our findings. For instance, the proportion of active voters within each ethnicity and each county is nearly 1, suggesting that almost everyone in each of these subgroups was an active voter. Limitations in the data collection process also suggest biases that could have influenced our results.
In analyzing the models’ performance, we observed remarkably similar accuracies across Logistic Regression (98.145%), Support Vector Machines (98.185%), Random Forest (98.185%), and Gradient Boosting Trees (98.190%). At first glance, this suggests that the models were able to effectively classify voters as “active” or “inactive” in Wyoming. However, it is important to note that these accuracies are likely a result of the fact that most observations in both the training data and the test data came from active voters. A model can therefore score highly simply by predicting “active” for nearly every voter.
Despite the high accuracies, the ROC AUC values were relatively low across all models: Logistic Regression (0.5), SVM (0.613), Random Forest (0.608), and Gradient Boosting Trees (0.643). This discrepancy between accuracy and ROC AUC suggests that while the models labeled most voters correctly, they struggled to discriminate between the “active” and “inactive” classes when considering the trade-off between true positive rate and false positive rate.
Our analysis of the two research questions yielded several notable findings. For the first question, concerning voter participation, precise visualization was hindered by the limited data on inactive voters, but our models still provided insights into the influence of various factors. For the second question, regarding party affiliation, we observed a lower model accuracy of 80% but a higher ROC, indicating improved performance in distinguishing between party affiliations. Interestingly, household income, county, home value, ethnic description, and gender emerged as stronger predictors of party affiliation than of voter participation, suggesting their heightened influence in determining political affiliations.
To conclude, our project shows the complexity behind voter behavior and the multifaceted nature of predictive modeling in electoral contexts. While facing challenges such as data scarcity and potential biases, our analysis shows the interplay between socio-demographic factors and political outcomes, offering insights for future policy considerations.