Summary statistics

One of the most common tasks in data analysis is to compute summary statistics, often for different groups in the data.

For example, we might be interested in calculating the average income of the employed population, broken down by race.

This vignette will present numerous examples of how to calculate different kinds of summary statistics.

The examples will make extensive use of the code pattern below:

some_dataframe_name <- dataframe %>%
  group_by(...) %>%
  summarize(
    ...
  )

This code pattern takes a dataframe, creates groups based on the variables you tell it to create groups by, then calculates a statistic for each of the groups.

You don’t need to fully understand the code pattern. You only need to know how to use it. The examples below will show you how.

The dplyr package is required for most of these examples.

Table of Contents


Stratified Samples

Overall average of a numeric variable

rm(list=ls())

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

weighted.mean(df$AGE, df$PERWT, na.rm=TRUE)

This calculates the average of AGE using PERWT as weights. The result is displayed to the console.


Average of a numeric variable by groups

rm(list=ls())
library(dplyr)

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

age_by_race_sex <- df %>%
  group_by(RACHSING, SEX) %>%
  summarize(
    AVG_AGE = weighted.mean(AGE, PERWT, na.rm=TRUE)
  )

This calculates the average AGE for each grouping of RACHSING and SEX, using PERWT as weights. The resulting table is stored in a dataframe called age_by_race_sex.


Percent of the overall population that has a certain characteristic

rm(list=ls())

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

df$EMPLOYED <- df$EMPSTAT==1

weighted.mean(df$EMPLOYED, df$PERWT, na.rm=TRUE)

This calculates the percent of the overall population that is employed (has EMPSTAT==1). The result is displayed to the console.

As a general rule, the mean of a boolean variable is equal to the percent of individuals who have TRUE for that variable. This is useful for calculating things like employment rate, college education rate, percent of different races in the population, etc.


Percent of a group that has a certain characteristic

rm(list=ls())
library(dplyr)

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

df$EMPLOYED <- df$EMPSTAT==1

emprate_by_race <- df %>%
  group_by(RACHSING) %>%
  summarize(
    EMPLOYMENT_RATE = weighted.mean(EMPLOYED, PERWT, na.rm=TRUE)
  )

This calculates the percent of each racial group (RACHSING) that is employed. The result is stored in a dataframe called emprate_by_race.


Total population

rm(list=ls())

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

sum(PERWT, na.rm=TRUE)

This calcualtes the total population in the dataframe, df, assuming that PERWT is the population weight of each survey unit. The result is displayed to the console.


Total population by group

rm(list=ls())
library(dplyr)

df <- read.csv("IPUMS_ACS2019_CA_1.csv")

pop_by_empstat <- df %>%
  group_by(EMPSTAT) %>% 
  summarize(
    POPULATION = sum(PERWT, na.rm=TRUE)
  )

This calcualtes the total population that is employed, unemployed, and not-in-labor-force. (e.g. It calculates the total population for each value of EMPSTAT.) The result is stored in a dataframe called pop_by_empstat.


Non-Stratified Samples

Overall average of a numeric variable

rm(list=ls())

df <- read.csv("students.csv")

mean(df$test_score, na.rm=TRUE)

This calculates the overall average of test_score. The result is displayed to console.


Average of a numeric variable by groups

rm(list=ls())
library(dplyr)

df <- read.csv("students.csv")

scores_by_race <- df %>%
  group_by(race) %>%
  summarize(
    avg_score = mean(test_score, na.rm=TRUE)
  )

This calculates the average of test_score for each race. The result is stored in a dataframe called scores_by_race.


Percent of the overall population that has a certain characteristic

rm(list=ls())

df <- read.csv("students.csv")

df$IS_EXPERIMENTAL_COHORT <- df$cohort=="EXPERIMENTAL"

mean(df$IS_EXPERIMENTAL_COHORT, na.rm=TRUE)

This calculates the percent of students in the experimental cohort (cohort=="EXPERIMENTAL"). The result is displayed to console.


Percent of a group that has a certain characteristic

rm(list=ls())
library(dplyr)

df <- read.csv("students.csv")

df$IS_EXPERIMENTAL_COHORT <- df$cohort=="EXPERIMENTAL"

experimental_by_race <- df %>%
  group_by(race) %>%
  summarize(
    PCT_EXPERIMENTAL = mean(IS_EXPERIMENTAL_COHORT, na.rm=TRUE)
  )

This calculates the percent of students in the experimental cohort separately by the race of the student. The result is stored in a dataframe called experimental_by_race.


Total number of observations

rm(list=ls())

df <- read.csv("students.csv")

nrow(df)

This simply counts the number of rows, e.g. the number of observations, contained in students.csv. Since students.csv is not a stratified survey, there is no need to calculate how many population units are represented by the data.

The result is displayed to the console.


Total number of observations by group

rm(list=ls())
library(dplyr)

df <- read.csv("students.csv")

nobs_by_race <- df %>%
  group_by(race) %>% 
  summarize(
    num_obs = n()
  )

This counts the number of rows, e.g. the number of observations, for each race contained in the race variable. The result is stored in a dataframe called nobs_by_race.