Lab 5

Data Visualization

Data visualization is the practice of presenting data graphically, so that patterns and trends are easy to spot.

In this lab you will learn how to create three basic types of data visualization in R:

- Line Plots
- Scatter Plots
- Bar Charts

Preparation

You should already have IPUMS_ACS_CA_2014_2019.csv in your RStudio Cloud files directory. If you don’t have this file, check the instructions for Lab 4.

You’ll also need RACHSING_LABELS.csv and DEGFIELD_LABELS.csv. Download these from Canvas and upload them to your RStudio Cloud files directory.

You’ll also need the packages dplyr and ggplot2. dplyr should already be installed, but ggplot2 is not. Install ggplot2 by typing this into the console:

install.packages("ggplot2")
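
If you want to confirm that both packages are available before starting, one possible check is sketched below. It uses only base R; the variable names `required` and `installed` are made up for this example.

```r
# Names of the packages this lab relies on
required <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- sapply(required, requireNamespace, quietly = TRUE)

# Prints a named TRUE/FALSE vector, one entry per package
print(installed)
```

Any package that shows FALSE still needs to be installed with install.packages().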

Instructions

Follow along as I show the class how to conduct today’s lab. If you follow along correctly, you should end up with the following three scripts.

Line Plots

This script makes a line plot showing average income by age for employed individuals, using data from California 2019. The lines are plotted separately for males and for females.

rm(list=ls())   # Clear workspace
library(dplyr)  # Load required packages
library(ggplot2)

# Load the main data
df <- read.csv("IPUMS_ACS_CA_2014_2019.csv")

# Deal with invalid values for INCWAGE and EMPSTAT
df$INCWAGE <- na_if(df$INCWAGE, 999999)
df$INCWAGE <- na_if(df$INCWAGE, 999998)
df$EMPSTAT <- na_if(df$EMPSTAT, 0)
df$EMPSTAT <- na_if(df$EMPSTAT, 9)

# First, calculate average income of employed individuals by
# age and sex, using data from 2019
inc_by_age_sex <- df %>%
  filter(YEAR==2019 & EMPSTAT==1) %>%
  group_by(AGE, SEX) %>%
  summarize(
    AVG_INCOME = weighted.mean(INCWAGE, PERWT, na.rm=TRUE)
  )
  
# Now create the line plot using the dataframe containing the stats
ggplot(data=inc_by_age_sex) +
  geom_line(aes(x=AGE, y=AVG_INCOME, color=as.factor(SEX))) + 
  xlab("Age") + 
  ylab("Average Income") + 
  ggtitle("Average Income of Employed Individuals by Age and Sex, California 2019")
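
In the plot above, the legend is titled as.factor(SEX) and shows the raw numeric codes. The sketch below shows one way to give it readable labels. It assumes the IPUMS coding SEX 1 = Male and 2 = Female (check the codebook for your extract), and it uses a small made-up data frame, `toy`, in place of inc_by_age_sex.

```r
library(ggplot2)

# Toy stand-in for inc_by_age_sex (values are invented for illustration)
toy <- data.frame(
  AGE = c(25, 35, 45, 25, 35, 45),
  SEX = c(1, 1, 1, 2, 2, 2),
  AVG_INCOME = c(40000, 60000, 70000, 35000, 50000, 58000)
)

# Convert the numeric codes to a labeled factor
# (assumes 1 = Male, 2 = Female)
toy$SEX_LABEL <- factor(toy$SEX, levels = c(1, 2), labels = c("Male", "Female"))

# Same line plot as before, but colored by the labeled factor
p <- ggplot(data = toy) +
  geom_line(aes(x = AGE, y = AVG_INCOME, color = SEX_LABEL)) +
  labs(color = "Sex")   # Sets the legend title
print(p)
```

The same two steps (creating a labeled factor, then adding labs(color = ...)) should carry over to the real data frame.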

Scatter Plots

This script makes a scatter plot where each dot is a county. The x axis shows the percent of age 25+ people in the county with 4+ years of college education. The y axis shows the average income of employed individuals in the county. The size of the dot shows the total population of the county.

rm(list=ls())   # Clear workspace
library(dplyr)  # Load required packages
library(ggplot2)

# Load the main data
df <- read.csv("IPUMS_ACS_CA_2014_2019.csv")

# Deal with invalid values for INCWAGE and EMPSTAT
df$INCWAGE <- na_if(df$INCWAGE, 999999)
df$INCWAGE <- na_if(df$INCWAGE, 999998)
df$EMPSTAT <- na_if(df$EMPSTAT, 0)
df$EMPSTAT <- na_if(df$EMPSTAT, 9)

# Create a boolean variable for 4+ years college
df$COLLEGE <- df$EDUC>=10

# Total population by county, using 2019 data
county_pop <- df %>%
  filter(YEAR==2019) %>%
  group_by(COUNTYFIP) %>%
  summarize(
    POPULATION = sum(PERWT, na.rm=TRUE)
  )
  
# Percent of age 25+ people with 4+ years of college,
# using 2019 data
county_educ <- df %>%
  filter(YEAR==2019 & AGE>=25) %>%
  group_by(COUNTYFIP) %>%
  summarize(
    PCT_COLLEGE = 100 * weighted.mean(COLLEGE, PERWT, na.rm=TRUE)  # Multiply by 100 to convert the share to a percent
  )
  
# Average income of employed individuals, using data from 2019
county_inc <- df %>%
  filter(YEAR==2019 & EMPSTAT==1) %>%
  group_by(COUNTYFIP) %>%
  summarize(
    AVG_INCOME = weighted.mean(INCWAGE, PERWT, na.rm=TRUE)
  )

# Merge together to get one county level dataframe
county_df <- inner_join(county_pop, county_educ, by=c("COUNTYFIP"))
county_df <- inner_join(county_df, county_inc, by=c("COUNTYFIP"))

# Finally, create the scatter plot
ggplot(data=county_df) +
  geom_point(aes(x=PCT_COLLEGE, y=AVG_INCOME, size=POPULATION)) +
  xlab("% of age 25+ population with 4+ yrs of college") + 
  ylab("Avg income of employed individuals") + 
  ggtitle("Income and College Education of California Counties, 2019")
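
The college variable in this script is a boolean, so it may help to see what weighted.mean() does with TRUE/FALSE values. In the sketch below, TRUE counts as 1 and FALSE as 0, so the result is a weighted share between 0 and 1; multiplying by 100 turns it into a percent. The vectors `college` and `perwt` are invented toy data, not values from the IPUMS file.

```r
# Toy data: TRUE means the person has 4+ years of college
college <- c(TRUE, FALSE, TRUE, FALSE)

# Person weights (invented values standing in for PERWT)
perwt <- c(100, 300, 100, 500)

# TRUE counts as 1 and FALSE as 0, so this is a weighted share:
# (100 + 100) / (100 + 300 + 100 + 500) = 0.2
share <- weighted.mean(college, perwt)

# Convert the share to a percent
pct <- 100 * share

print(share)  # 0.2
print(pct)    # 20
```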

Bar Charts

This script makes a horizontal bar chart, showing the average income of employed individuals, by race, using data from 2019.

rm(list=ls())   # Clear workspace
library(dplyr)  # Load required packages
library(ggplot2)

# Load the main data
df <- read.csv("IPUMS_ACS_CA_2014_2019.csv")

# Deal with invalid values for INCWAGE and EMPSTAT
df$INCWAGE <- na_if(df$INCWAGE, 999999)
df$INCWAGE <- na_if(df$INCWAGE, 999998)
df$EMPSTAT <- na_if(df$EMPSTAT, 0)
df$EMPSTAT <- na_if(df$EMPSTAT, 9)

# First, calculate average income of employed individuals by race,  
# using data from 2019
inc_by_race <- df %>%
  filter(YEAR==2019 & EMPSTAT==1) %>%
  group_by(RACHSING) %>%
  summarize(
    AVG_INCOME = weighted.mean(INCWAGE, PERWT, na.rm=TRUE)
  )
  
# Second, use RACHSING_LABELS.csv to merge on the human-readable
# labels for RACHSING
labels_data <- read.csv("RACHSING_LABELS.csv")
inc_by_race <- inner_join(inc_by_race, labels_data, by=c("RACHSING"))

# Finally, make the horizontal bar chart
ggplot(data=inc_by_race) + 
  geom_col(aes(x=RACHSING_LABEL, y=AVG_INCOME)) + 
  xlab("Race") + 
  ylab("Average Income") + 
  ggtitle("Average Income by Race, California 2019") +
  coord_flip()
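
By default, ggplot2 orders the bars alphabetically by label. If you would rather sort the bars by value, one option is base R's reorder(), sketched below on a made-up data frame. The names `toy`, `GROUP_LABEL`, and `GROUP_SORTED` are hypothetical, and the income values are invented.

```r
library(ggplot2)

# Toy stand-in for inc_by_race (labels and values are invented)
toy <- data.frame(
  GROUP_LABEL = c("Group A", "Group B", "Group C"),
  AVG_INCOME = c(52000, 71000, 45000)
)

# reorder() turns the labels into a factor whose levels are
# sorted by AVG_INCOME, from smallest to largest
toy$GROUP_SORTED <- reorder(toy$GROUP_LABEL, toy$AVG_INCOME)

# Horizontal bar chart using the sorted factor on the x axis
p <- ggplot(data = toy) +
  geom_col(aes(x = GROUP_SORTED, y = AVG_INCOME)) +
  xlab("Group") +
  ylab("Average Income") +
  coord_flip()
print(p)
```

With coord_flip(), the first factor level is drawn at the bottom, so the largest bar ends up at the top of the chart.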

Assignment

Create a new script that accomplishes the following task:
- Using data from California 2019, calculate the percent of people within each DEGFIELD who are female.
- Create a horizontal bar chart showing the above data. Make sure you use DEGFIELD_LABELS.csv to make the degree fields human-readable in the chart.
- Give the chart the following title: “Percent Female by Degree Field, California 2019”

Show me your script and output to receive your grade and be dismissed. If you aren’t able to complete the assignment in class, you can upload the script to the Lab 05 Script assignment.

Takeaways

You can make line plots, scatter plots, and bar charts in R, using the ggplot2 package.