Problem 1: For this problem, we will work with the BA_degrees dataset. It contains the proportions of Bachelor’s degrees awarded in the US between 1970 and 2015.

library(tidyverse)  # provides read_csv(), the dplyr verbs, forcats, and ggplot2

BA_degrees <- read_csv("https://wilkelab.org/SDS375/datasets/BA_degrees.csv")
BA_degrees
## # A tibble: 594 × 4
##    field                                              year  count     perc
##    <chr>                                             <dbl>  <dbl>    <dbl>
##  1 Agriculture and natural resources                  1971  12672 0.0151  
##  2 Architecture and related services                  1971   5570 0.00663 
##  3 Area, ethnic, cultural, gender, and group studies  1971   2579 0.00307 
##  4 Biological and biomedical sciences                 1971  35705 0.0425  
##  5 Business                                           1971 115396 0.137   
##  6 Communication, journalism, and related programs    1971  10324 0.0123  
##  7 Communications technologies                        1971    478 0.000569
##  8 Computer and information sciences                  1971   2388 0.00284 
##  9 Education                                          1971 176307 0.210   
## 10 Engineering                                        1971  45034 0.0536  
## # … with 584 more rows

From the entire dataset, select a subset of 6 fields of study, using arbitrary criteria. Plot a time series of the proportion of degrees (column perc) in this field over time, using facets to show each field. Also plot a straight line fit to the data for each field. You should modify the order of facets to maximize figure appearance and memorability. What do you observe?

BA_degrees %>%
  filter(field %in% c("Engineering", "Architecture and related services", "Computer and information sciences", "Communication, journalism, and related programs", "Biological and biomedical sciences","Agriculture and natural resources")) %>%
  mutate(field = fct_reorder(field, perc, function(x) { max(x) - min(x) })) %>%
  ggplot(
    aes(year, perc)
  ) + 
  geom_point(
  ) +
  geom_line(
    stat = "smooth",
    method = "lm",
    formula = y ~ x,
    alpha = .5,
    size = 1,
    se = FALSE,
    color = "lightcoral"  # constant color, so it is set outside aes()
    ) +
  facet_wrap(
    vars(field), 
    nrow = 3
    ) +
  labs( 
    title="Proportion of Degrees",
    subtitle = "Attained Over Time",
    x = "Year",
    y = "Proportion"
    ) +
  theme_bw( 
    ) +
  theme(
    legend.position = "none",
    axis.line = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.background = element_blank()
    )
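
The facet order in the code above comes from fct_reorder() with a custom summary function. As a minimal sketch of how that ordering behaves, the example below uses made-up values (the toy data frame and the helper name value_range are illustrative assumptions, not part of BA_degrees) and orders the levels by the range of perc within each level, smallest range first:

```r
library(forcats)

# Hypothetical toy values, not taken from BA_degrees
toy <- data.frame(
  field = c("A", "A", "B", "B", "C", "C"),
  perc  = c(0.10, 0.11, 0.02, 0.08, 0.30, 0.33)
)

# Summary function used for the ordering: range of the values in one level
value_range <- function(x) max(x) - min(x)

# Reorder the levels of `field` by the range of `perc` (ascending by default)
ordered_field <- fct_reorder(toy$field, toy$perc, .fun = value_range)

levels(ordered_field)
```

Because facet_wrap() lays out panels in level order, the field whose proportion varied least appears first and the field that varied most appears last.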

After creating this graph, with the facets ordered by the range of perc within each field (max(x) - min(x)), I observe the following trends:

  1. There are generally more data points for the years from 2005 onward.

  2. The lines of best fit (shown in coral) for Computer and information sciences; Communication, journalism, and related programs; and Biological and biomedical sciences trend steadily upward. The lines for Architecture and related services and for Agriculture and natural resources are close to flat, with the most recent years trending differently from the fit. Engineering shows a negative trend over the years.

  3. All of the fields shown, with the exception of Architecture and related services, show a slight increase in the proportion of degrees over the most recent 10 years.
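
The direction of each trend can also be checked numerically by fitting a separate straight line per field and reading off the slope. The sketch below shows the idea on made-up data; the tibble toy_degrees, its field names, and its values are hypothetical stand-ins, and the same pipeline could be applied to the filtered BA_degrees data instead:

```r
library(dplyr)

# Hypothetical toy data standing in for BA_degrees (values are made up)
toy_degrees <- tibble(
  field = rep(c("Rising field", "Falling field"), each = 4),
  year  = rep(c(1980, 1990, 2000, 2010), times = 2),
  perc  = c(0.010, 0.020, 0.030, 0.040,
            0.080, 0.070, 0.060, 0.050)
)

# Fit perc ~ year separately for each field and keep only the slope
slopes <- toy_degrees %>%
  group_by(field) %>%
  summarize(slope = coef(lm(perc ~ year))[["year"]]) %>%
  arrange(desc(slope))

slopes
```

A positive slope corresponds to an upward line of best fit in the facet, a negative slope to a downward one, and a slope near zero to a roughly flat line.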

Problem 2: We will work with the txhousing dataset provided by ggplot2. See here for details: https://ggplot2.tidyverse.org/reference/txhousing.html

Consider the number of houses sold in January 2015. There are records for 46 different cities:

txhousing_jan_2015 <- txhousing %>% 
  filter(year == 2015 & month == 1) %>% 
  arrange(desc(sales))

print(txhousing_jan_2015, n = nrow(txhousing_jan_2015))
## # A tibble: 46 × 9
##    city                   year month sales   volume median listi…¹ inven…²  date
##    <chr>                 <int> <int> <dbl>    <dbl>  <dbl>   <dbl>   <dbl> <dbl>
##  1 Houston                2015     1  4494   1.16e9 189300   18649     2.7  2015
##  2 Dallas                 2015     1  3066   7.74e8 203300    9063     1.8  2015
##  3 Austin                 2015     1  1656   5.12e8 237500    5567     2.2  2015
##  4 San Antonio            2015     1  1485   3.12e8 175900    7717     3.6  2015
##  5 Collin County          2015     1   776   2.42e8 268000    1780     1.3  2015
##  6 Fort Bend              2015     1   686   2.04e8 260300    2414     2.3  2015
##  7 Fort Worth             2015     1   658   1.12e8 143300    2089     2.1  2015
##  8 Montgomery County      2015     1   487   1.47e8 213200    2507     3.3  2015
##  9 NE Tarrant County      2015     1   482   1.27e8 204000    1093     1.4  2015
## 10 Denton County          2015     1   477   1.22e8 216100    1151     1.4  2015
## 11 El Paso                2015     1   406   6.27e7 135200    3995     7.7  2015
## 12 Bay Area               2015     1   401   8.12e7 172200    1910     2.9  2015
## 13 Arlington              2015     1   261   4.61e7 159700     552     1.3  2015
## 14 Tyler                  2015     1   248   3.98e7 139400    2290     6.9  2015
## 15 Corpus Christi         2015     1   241   4.53e7 162800    1872     4.8  2015
## 16 Amarillo               2015     1   204   3.32e7 138500    1120     4.3  2015
## 17 Lubbock                2015     1   202   3.24e7 132400     979     3.1  2015
## 18 Killeen-Fort Hood      2015     1   188   2.44e7 114100    1372     6.1  2015
## 19 Bryan-College Station  2015     1   173   3.82e7 189300     988     3.8  2015
## 20 Abilene                2015     1   158   2.35e7 134100     801     4.4  2015
## 21 Beaumont               2015     1   151   2.17e7 122000    1558     7.3  2015
## 22 McAllen                2015     1   146   1.94e7 118300    2068    11.6  2015
## 23 Waco                   2015     1   144   2.26e7 137500    1034     5    2015
## 24 Longview-Marshall      2015     1   134   1.82e7 131400    1766     9.1  2015
## 25 Garland                2015     1   114   1.76e7 135800     198     1.1  2015
## 26 Temple-Belton          2015     1   107   1.77e7 124500     727     4.9  2015
## 27 Midland                2015     1    91   2.41e7 235900     840     5    2015
## 28 Sherman-Denison        2015     1    88   1.22e7 121700     549     4.4  2015
## 29 Irving                 2015     1    82   2.02e7 157800     278     1.9  2015
## 30 Laredo                 2015     1    82   1.26e7 136200     512     5.2  2015
## 31 San Angelo             2015     1    82   1.38e7 138300     477     3.7  2015
## 32 Texarkana              2015     1    75   9.33e6 101400     317     3.7  2015
## 33 Harlingen              2015     1    74   7.53e6  85000    1560    18.7  2015
## 34 Wichita Falls          2015     1    71   7.52e6  82100     829     7.2  2015
## 35 Brazoria County        2015     1    69   1.04e7 146000     301     2.8  2015
## 36 Odessa                 2015     1    63   1.00e7 156200     308     3    2015
## 37 Victoria               2015     1    54   1.04e7 172500     280     3.6  2015
## 38 Kerrville              2015     1    53   1.35e7 212500     643    11.4  2015
## 39 Galveston              2015     1    43   1.08e7 187500     575     5.8  2015
## 40 Brownsville            2015     1    41   5.40e6  97000     733    10.7  2015
## 41 Lufkin                 2015     1    37   6.87e6 134000     404     7.6  2015
## 42 Port Arthur            2015     1    37   3.96e6  93800     558     7.8  2015
## 43 Paris                  2015     1    25   3.61e6 123300     299     8.1  2015
## 44 South Padre Island     2015     1    22   4.89e6 180000     688    18.5  2015
## 45 Nacogdoches            2015     1    20   3.22e6 140000     284    10.5  2015
## 46 San Marcos             2015     1    18   3.38e6 150000      85     3.4  2015
## # … with abbreviated variable names ¹​listings, ²​inventory

Answer:

The most appropriate plot for this type of analysis is side-by-side bars, because they allow easy comparison of relative proportions (per slide 15 of the Visualizing Proportions lectures). Side-by-side bars also work well for a large number of subsets and are visually appealing.
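
For illustration only, here is a minimal sketch of what such a bar chart could look like for the 46 cities. The object names jan_2015 and bar_plot, the bar color, and the theme are assumptions chosen for this example; the bars are drawn horizontally and sorted by sales so that the long city names stay readable:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Same subset as above: one row per city for January 2015
jan_2015 <- txhousing %>%
  filter(year == 2015, month == 1) %>%
  mutate(city = fct_reorder(city, sales))  # order bars by number of sales

# One bar per city, placed side by side along the y axis
bar_plot <- ggplot(jan_2015, aes(x = sales, y = city)) +
  geom_col(fill = "steelblue") +
  labs(x = "Houses sold, January 2015", y = NULL) +
  theme_minimal()

bar_plot
```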

Problem 3: Now make a pie chart of the txhousing_jan_2015 dataset, but show only the four cities with the most sales, plus all others lumped together into “Other”. (The code to prepare this lumped dataset has been provided for your convenience.) Make sure the pie slices are arranged in a reasonable order. Choose a reasonable color scale and a clean theme that avoids distracting visual elements.

# data preparation
top_four <- txhousing_jan_2015$sales[1:4]

txhousing_lumped <- txhousing_jan_2015 %>%
  mutate(city = ifelse(sales %in% top_four, city, "Other")) %>%
  group_by(city) %>%
  summarize(sales = sum(sales))
  
  
ggplot(
  txhousing_lumped, 
  aes(x="", y=sales, fill=reorder(city,sales))
  ) +
  geom_col(
  ) +
  coord_polar(
    "y",
    start = 0,
    direction = -1
    ) +
  labs( 
    title= "Number of Sales",
    subtitle = "Top Four Texas Cities and All Others, January 2015",
    fill = "City"
    ) +
  geom_text(
    aes(
    label = sales), 
    color = "black", 
    size=3.5,
    position = position_stack(vjust = 0.5)
    ) +
  scale_fill_brewer(
    palette="Set3"
    ) +
  theme_void(
  ) +
  theme(
    legend.position = "right"
    )
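
The data preparation above picks the top four cities by matching sales values, which would also keep any other city whose sales happened to equal one of those four numbers. As a hedged alternative sketch (not the provided method), forcats::fct_lump_n(), available in forcats 0.5.0 and later, can lump by city directly, weighted by sales. The names jan_2015, lumped, and share are assumptions introduced for this example; share gives each slice's proportion of total sales, which could be used for labels instead of raw counts:

```r
library(dplyr)
library(forcats)
library(ggplot2)  # only needed here for the txhousing dataset

jan_2015 <- txhousing %>%
  filter(year == 2015, month == 1)

lumped <- jan_2015 %>%
  # keep the 4 cities with the largest sales; all others become "Other"
  mutate(city = fct_lump_n(city, n = 4, w = sales)) %>%
  group_by(city) %>%
  summarize(sales = sum(sales)) %>%
  # proportion of all January 2015 sales in each slice
  mutate(share = sales / sum(sales))

lumped
```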