Voter Behavior Analysis: Predicting Political Preferences with ANES Data

R · tidyverse · ggplot2 · randomForest · Clustering · K-means · Classification · Logistic Regression · Data Cleaning · Data Visualization · ANES Dataset

Project Overview

This project involved a comprehensive analysis of the American National Election Studies (ANES) 2024 Time Series dataset. The goal was to explore voter demographics, political opinions, and behavior, ultimately segmenting voters into distinct groups using clustering and predicting voting preferences using classification models. The workflow covered data gathering, exploration, cleaning, preprocessing, clustering (K-means), classification (Logistic Regression, Random Forest), and model evaluation.

Date: 2025-03-15

Methodology

The analysis followed a structured data science process:

  1. Data Gathering & Integration: Loaded necessary R libraries (tidyverse, ggplot2, skimr) and the ANES dataset. Initial inspection of dimensions and structure.
  2. Data Exploration: Used summary statistics and visualizations (histograms, bar charts) to understand variable distributions and identify potential issues like missing values or outliers. Explored relationships using tables and correlation matrices.
  3. Data Cleaning: Handled missing values (coded as negative numbers) by converting them to NA and imputing using appropriate methods (median for numeric, mode for categorical). Addressed special codes (e.g., 99 for 'don't know') identified during exploration.
  4. Data Preprocessing: Renamed variables for clarity, recoded categorical variables into factors with meaningful labels, created derived variables (Political Leaning, Vote Choice, Vote Binary), and normalized numeric variables using scaling for clustering.
  5. Clustering (K-means): Determined the optimal number of clusters (k=3) using the Elbow and Silhouette methods. Performed K-means clustering on demographic and opinion variables. Visualized clusters using PCA.
  6. Classification: Prepared data to predict binary vote preference (Democrat vs. Republican). Split data into training (70%) and testing (30%) sets. Trained Logistic Regression and Random Forest models.
  7. Evaluation: Assessed classification model performance using accuracy, confusion matrices, precision, recall, F1-score, and ROC/AUC analysis. Compared models to select the best performer.

Key Findings

  • Voter Segmentation: K-means clustering identified three distinct voter segments: "Conservative-Leaning Moderates" (31.6%), "Liberal Democrats" (34.9%), and "Conservative Republicans" (33.5%), each with unique demographic and political profiles.
  • Predictive Model Performance: Logistic Regression achieved high accuracy (89.22%) in predicting vote choice, slightly outperforming Random Forest (89.01%). AUC for Logistic Regression was excellent (0.9602).
  • Strongest Predictors: Political opinions (President Approval, Ideological Placement, Supreme Court Approval) were stronger predictors of vote choice than demographics.
  • Demographic Influence: Higher education levels correlated significantly with Democratic voting. Race also showed significant effects (e.g., Black voters leaning Democrat, Asian/PI voters leaning Republican).
  • Data Quality Issues: Initial exploration revealed significant missing values (coded as negative numbers) and potential labeling issues in some categorical variables, requiring careful cleaning and imputation.
  • Political Alignment: Strong correlations were observed between Party ID, President Approval, and Vote Preference, confirming expected political alignments.

Implementation Details

Data Loading and Initial Exploration

# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(skimr)

# Load the dataset
anes_data <- read.csv("anes_timeseries_2024_csv_20250219.csv")

# Check dimensions and structure
dim(anes_data)
head(anes_data, 3)
summary_stats <- skim(anes_data)
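
The exploration step flagged ANES's convention of coding missing responses as negative numbers. A quick, hedged sketch for surfacing them during inspection (not the report's original code):

# Count negative (missing-data) codes in each numeric column
neg_counts <- sapply(anes_data, function(x) {
  if (is.numeric(x)) sum(x < 0, na.rm = TRUE) else NA_integer_
})

# The ten variables with the most missing codes
head(sort(neg_counts, decreasing = TRUE), 10)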

Data Cleaning (Handling Missing Values)

# Create a clean dataset from analysis_data, the previously selected
# subset of anes_data variables
clean_data <- analysis_data

# ANES codes missing responses as negative values; convert them to NA
# (safe here because all columns in this subset are numeric codes)
clean_data[clean_data < 0] <- NA

# Impute missing values (example: Age with median)
clean_data$V241458x[is.na(clean_data$V241458x)] <- median(clean_data$V241458x, na.rm = TRUE)

# Handle special codes (example: IdeologicalPlacement '99')
clean_data$V241177[clean_data$V241177 == 99] <- NA
clean_data$V241177[is.na(clean_data$V241177)] <- median(clean_data$V241177, na.rm = TRUE)
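
The methodology calls for mode imputation on categorical variables, though only median imputation is shown above. A minimal, type-preserving sketch; the helper name and the example column (V241501x) are assumptions, not from the report:

# Mode imputation for categorical (coded) variables
impute_mode <- function(x) {
  ux <- unique(na.omit(x))
  mode_val <- ux[which.max(tabulate(match(x, ux)))]  # most frequent value
  x[is.na(x)] <- mode_val
  x
}

# Example usage on a hypothetical categorical column
clean_data$V241501x <- impute_mode(clean_data$V241501x)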

Data Preprocessing (Normalization & Recoding)

# Rename variables
names(clean_data) <- c("Age", "Gender", ...) # Assign meaningful names

# Recode categorical variables (example: Gender)
clean_data$Gender <- factor(clean_data$Gender, levels = c(1, 2), labels = c("Male", "Female"))

# Create derived variables (example: VoteChoice)
clean_data$VoteChoice <- "Other"
clean_data$VoteChoice[clean_data$VotePreference %in% c(10, 20, 30)] <- "Democrat"
clean_data$VoteChoice[clean_data$VotePreference %in% c(11, 21, 31)] <- "Republican"
clean_data$VoteChoice <- factor(clean_data$VoteChoice)
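
# VoteBinary (used later as the classification target) is derived from
# VoteChoice; a plausible sketch -- the report's exact coding is not shown
clean_data$VoteBinary <- NA
clean_data$VoteBinary[clean_data$VoteChoice == "Democrat"] <- 1
clean_data$VoteBinary[clean_data$VoteChoice == "Republican"] <- 0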

# Normalize numeric variables (example: Age)
numeric_vars <- c("Age", "Income", ...)
normalized_data <- clean_data
for (var in numeric_vars) {
  # scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
  normalized_data[[var]] <- as.numeric(scale(clean_data[[var]]))
}

Clustering (K-means)

library(cluster)

# Prepare data for clustering
cluster_data <- normalized_data[, cluster_vars] # cluster_vars: demographic and opinion variables selected earlier
cluster_data$Education <- as.numeric(cluster_data$Education) # factor -> numeric so distances can be computed
cluster_data <- na.omit(cluster_data)

# Determine optimal K using the Elbow and Silhouette methods
# (the report's full code is omitted; the sketch below illustrates the approach)
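# Elbow method: total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k)
  kmeans(cluster_data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster SS")

# Silhouette method: average silhouette width for K = 2..10
sil_width <- sapply(2:10, function(k) {
  km <- kmeans(cluster_data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(cluster_data))[, 3])
})
plot(2:10, sil_width, type = "b",
     xlab = "Number of clusters K", ylab = "Average silhouette width")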

# Perform K-means clustering (assuming k=3)
set.seed(123)
kmeans_result <- kmeans(cluster_data, centers = 3, nstart = 25)

# Add cluster assignments
normalized_data$cluster <- as.factor(kmeans_result$cluster)

# Analyze cluster characteristics
aggregate(normalized_data[, numeric_vars], by = list(Cluster = normalized_data$cluster), FUN = mean)
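
Step 5 of the methodology visualizes the clusters with PCA. A minimal sketch of how that plot could be produced (the report's exact plotting code is not shown):

# Project clustered observations onto the first two principal components
pca_result <- prcomp(cluster_data)
pca_df <- data.frame(PC1 = pca_result$x[, 1],
                     PC2 = pca_result$x[, 2],
                     Cluster = factor(kmeans_result$cluster))

# Color points by cluster assignment
ggplot(pca_df, aes(PC1, PC2, color = Cluster)) +
  geom_point(alpha = 0.5) +
  labs(title = "Voter clusters projected onto the first two principal components")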

Classification (Logistic Regression & Random Forest)

library(randomForest)
library(pROC)

# Prepare data (select features, handle NAs in target)
classification_data <- normalized_data[!is.na(normalized_data$VoteBinary), ]
# ... select columns ...

# Split data
set.seed(123)
train_indices <- sample(seq_len(nrow(classification_data)),
                        size = floor(0.7 * nrow(classification_data)))
train_data <- classification_data[train_indices, ]
test_data <- classification_data[-train_indices, ]

# Logistic Regression
logit_model <- glm(VoteBinary ~ ., data = train_data, family = "binomial")
logit_pred_prob <- predict(logit_model, newdata = test_data, type = "response")
logit_pred_class <- ifelse(logit_pred_prob > 0.5, 1, 0)
logit_confusion <- table(Predicted = logit_pred_class, Actual = test_data$VoteBinary)
logit_accuracy <- sum(diag(logit_confusion)) / sum(logit_confusion)

# Random Forest
set.seed(123)
rf_model <- randomForest(as.factor(VoteBinary) ~ ., data = train_data, ntree = 500)
rf_pred_class <- predict(rf_model, newdata = test_data)
rf_confusion <- table(Predicted = rf_pred_class, Actual = test_data$VoteBinary)
rf_accuracy <- sum(diag(rf_confusion)) / sum(rf_confusion)
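
# Variable importance supports the finding that political opinions
# (approval ratings, ideological placement) outrank demographics as predictors
importance(rf_model)
varImpPlot(rf_model)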

# Evaluation: accuracy, precision, recall, F1-score, ROC/AUC
# (the report's full calculations are omitted; a sketch follows)
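
# A minimal sketch, treating class 1 as the positive class
# (the report's exact evaluation code may differ):
tp <- logit_confusion["1", "1"]
fp <- logit_confusion["1", "0"]
fn <- logit_confusion["0", "1"]
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

# ROC curve and AUC for the logistic model's predicted probabilities
logit_roc <- roc(test_data$VoteBinary, logit_pred_prob)
auc(logit_roc) # reported in the analysis as 0.9602
plot(logit_roc)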

Results and Visualization

Key visualizations were generated throughout the analysis to explore data distributions and model results:

Demographic Distributions

Figure: Distributions of age, gender, race, education level, and party identification among survey respondents.

Cluster Determination

Figure: Elbow method (decreasing WSS with increasing K) and silhouette analysis (average width peaking at K=3) guided the selection of K=3 for K-means clustering.

Cluster Visualization (PCA)

Figure: PCA projection of the three voter clusters, showing clear separation between Liberal Democrats, Conservative Republicans, and Conservative-Leaning Moderates.

Vote Choice Relationships

Figure: Vote choice distributions by political leaning and by education level, highlighting partisan voting patterns and educational influences.

Classification Model Evaluation

Figure: ROC curve (AUC = 0.9602) and precision-recall curve demonstrating excellent discrimination ability and the precision-recall trade-off.

Conclusion & Recommendations

This analysis successfully segmented the ANES 2024 electorate into three distinct groups and built a highly accurate model (89.2% accuracy, 0.96 AUC) to predict vote choice. Political opinions, particularly presidential approval and ideological self-placement, proved to be the most potent predictors, outweighing demographic factors, although education level remained a significant demographic influencer.

Data-Driven Recommendations

  • Political Campaigns: Focus messaging on institutional approval ratings. Target swing voters (Cluster 1, the "Conservative-Leaning Moderates") with economic themes. Develop distinct strategies for the three identified segments.
  • Policy Analysis: Note the increasing education polarization. Recognize the complex interplay between income and voting. Utilize institutional trust metrics as strong behavioral predictors.
  • Future Research: Conduct longitudinal analysis for cluster stability. Incorporate geographic data. Perform deeper analysis of the "Conservative-Leaning Moderates" for swing voter insights.

Methodological Strengths and Limitations

Strengths: High predictive accuracy, excellent model discrimination (AUC), balanced precision/recall, consistent findings.

Limitations: Reliance on imputation for missing values, potential self-reporting biases in survey data, cross-sectional design limiting causal claims.

Reflection

Over the course of the term, I developed a strong set of analytical skills that transformed how I handle data. The progression from exploratory data analysis to machine learning gave me both the theoretical background and hands-on experience in applying it. I found the step-by-step preprocessing workflow, covering missing-value treatment, normalization, and feature engineering, especially valuable for turning raw survey data into something analyzable.

Another highlight was the emphasis on visualization with ggplot2, which let me discover and communicate patterns that would not have been obvious from the raw numbers alone. The introduction to machine learning was equally significant: I applied both supervised and unsupervised algorithms and used evaluation metrics to measure their performance.

Working through these tasks in R showed me how every analytical decision shapes the final results. Using a real dataset like ANES built my confidence in extracting meaningful insights and underscored the importance of both statistical rigor and broader context when making data-driven recommendations.

Download Full Report

Interested in the complete analysis and R code implementation? Download the full PDF report below.

Download PDF Report