Easier Cartography

July 18, 2017

R is wonderful for spatial statistics but not so for making beautiful maps. Displaying data on a map in R is straightforward, but making the presentation publication quality can be tedious. Until now. Martijn Tennekes’s tmap package simplifies things alot. And tmap functions accept the new simple features class (sf) of Edzer Pebesma, really sweetening the deal.

Let’s see how this works. We start with getting census data using Kyle Walker’s awesome tidycensus package, which provides an interface to the U.S. Census Bureau’s decennial Census and five-year American Community APIs. The data are returned as data frames (tidyverse-ready) with a simple feature geometry column. You need your own API key. Get it from http://api.census.gov/data/key_signup.html.

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
census_api_key("YOUR KEY HERE")

Our recent research on determinants of tornado casualties has led us to look closely at the proportion of elderly white people in the Mid South. Here we get the number of 65-74 yr-old women (white only) at the state level. The argument sumfile = "acs5" says get the data from the 5-year American Community Survey. By default, tidycensus functions return tidy data frames in which rows represent unit-variable combinations; for a wide data frame with Census variable names in the columns, set output = "wide" in the function call.

elderly.df = get_decennial(geography = "state", 
                    variables = "B01001A_029E",
                    sumfile = "acs5",
                    year = 2010,
                    output = "wide") %>%
  rename(Women = B01001A_029E)
## # A tibble: 6 x 3
##    Women       NAME GEOID
##    <dbl>      <chr> <chr>
## 1 151516    Alabama    01
## 2  11177     Alaska    02
## 3 217121    Arizona    04
## 4 104742   Arkansas    05
## 5 810243 California    06
## 6 135227   Colorado    08

The function returns a data frame with three columns in wide format. GEOID is an identifier for the geographical unit associated with the row and NAME is a descriptive name of the geographical unit. The variable name is changed to Women.

We can then use ggplot to visualize the data with a dot chart.

elderly.df %>%
  ggplot(aes(x = Women/1000, reorder(NAME, Women))) +
  geom_point() +
  xlab("Number of Women (65-74) [thousands]") + ylab("")

With geometry = TRUE in the get_decennial function we get a ‘simple feature’ geometry column in the data frame. Here polygons representing county boundaries (geography = "county"). We get the number of elderly (age 65-74) white women and the total number of white women by county. We use the mutate function to compute a percentage.

elderly.df = get_decennial(geography = "county",
                           variables = c("B01001A_029E", "B01001_026E"),
                           sumfile = "acs5",
                           year = 2010,
                           output = "wide",
                           geometry = TRUE) %>%
  rename(Elderly = B01001A_029E,
         Total = B01001_026E) %>%
  mutate(Percent = Elderly/Total * 100)
## # A tibble: 6 x 6
##   GEOID Elderly Total                      NAME               geometry
##   <chr>   <dbl> <dbl>                     <chr> <S3: sfc_MULTIPOLYGON>
## 1 01029     648  7426  Cleburne County, Alabama <S3: sfc_MULTIPOLYGON>
## 2 01031    1783 24609    Coffee County, Alabama <S3: sfc_MULTIPOLYGON>
## 3 01037     417  5804     Coosa County, Alabama <S3: sfc_MULTIPOLYGON>
## 4 01039    1798 19386 Covington County, Alabama <S3: sfc_MULTIPOLYGON>
## 5 01041     488  7191  Crenshaw County, Alabama <S3: sfc_MULTIPOLYGON>
## 6 01045    1757 25194      Dale County, Alabama <S3: sfc_MULTIPOLYGON>
## # ... with 1 more variables: Percent <dbl>

The data frame contains the county borders (geometry column) as multipolygons in well-known text (WKT) format, so we could go directly to the plotting package. However, the data contains information from U.S. territories that are not relevant to our interests.

We first need a map domain. Here we use the composite counties (Alaska and Hawaii inserted) from Bob Rudis’s nice albersusa package available on github. Make sure you have the devtools package installed.

cty_sf = counties_sf("aeqd")

The plot method for simple features plots the attribute selected (here census_area). This is the map domain we want.

Join data with spatial data and make a choropleth map.

cty_sf$GEOID = as.character(cty_sf$fips)
cty_elderly = left_join(as.data.frame(cty_sf), 
                        as.data.frame(elderly.df), by = "GEOID")
cty_elderly$geometry = cty_elderly$geometry.x
cty_elderly = sf::st_as_sf(cty_elderly)

While we can tool around with basic R graphics to add legends, scale bar, etc, this requires lots of time-sucking trial-and-error.

Instead, the tmap package is a flexible, layer-based, and easy to use approach to create thematic maps including choropleths and bubble maps. It is based on the grammar of graphics, and resembles the syntax of ggplot2.

us_sf = usa_sf("aeqd")

tm_shape(cty_elderly) +
              border.col = NULL,
              title = "% White Women 65-74",
              palette = "Purples") +
tm_shape(us_sf) + tm_borders() +
  tm_compass() + tm_scale_bar() +
  tm_format_NLD(legend.position = c("left", "bottom"),
                attr.position = c("right", "bottom")) 

This plot has two groups of layers from the shape objects cty_elderly and us_sf (state borders). Each layer is called with tm_shape. The shape objects can have different projections and can also cover different areas (bounding boxes).

The order of the layers dictates the plotting order. We first plot the percent white women by county and then add the state borders. The color ramp is specified using the palette names from the RColorBrewer package.

The compass and scale bar are added as separate layers. The layout format is specified by the tm_format_NLD function as a set of prespecified options from the tm_layout function.

Finally, we can plot total number of white women and the percentage of elderly white women for the states of Arkansas and Missouri, only.

tm_shape(cty_elderly[cty_elderly$state == "Arkansas" |
                     cty_elderly$state == "Missouri", ]) +
  tm_polygons(c("Elderly", "Percent"),
              title = c("Elderly White Women", "Percentage of Total"),
              palette = list("Greens", "Purples")) +
  tm_scale_bar() + tm_compass() +
  tm_format_NLD(legend.position = c("left", "bottom"), 
                attr.position = c("right", "top"))

The plot page can be rescaled to better position the compass, scale bar, and legend.

The format of the tmap map objects (meoms) are like those of the ggplot2 geometric objects (geoms) making it easy to get to a publication-quality map. The fine details can be worked out in production.

More information? See: https://cran.r-project.org/web/packages/tmap/vignettes/tmap-nutshell.html