Problem 1: For this problem, we will work with the
BA_degrees dataset. It contains the proportions of
Bachelor’s degrees awarded in the US between 1970 and 2015.
BA_degrees <- read_csv("https://wilkelab.org/SDS375/datasets/BA_degrees.csv")
BA_degrees
## # A tibble: 594 × 4
## field year count perc
## <chr> <dbl> <dbl> <dbl>
## 1 Agriculture and natural resources 1971 12672 0.0151
## 2 Architecture and related services 1971 5570 0.00663
## 3 Area, ethnic, cultural, gender, and group studies 1971 2579 0.00307
## 4 Biological and biomedical sciences 1971 35705 0.0425
## 5 Business 1971 115396 0.137
## 6 Communication, journalism, and related programs 1971 10324 0.0123
## 7 Communications technologies 1971 478 0.000569
## 8 Computer and information sciences 1971 2388 0.00284
## 9 Education 1971 176307 0.210
## 10 Engineering 1971 45034 0.0536
## # … with 584 more rows
From the entire dataset, select a subset of 6 fields of study, using
arbitrary criteria. Plot a time series of the proportion of degrees
(column perc) in this field over time, using facets to show
each field. Also plot a straight line fit to the data for each field.
You should modify the order of facets to maximize figure appearance and
memorability. What do you observe?
BA_degrees %>%
filter(field %in% c("Engineering", "Architecture and related services", "Computer and information sciences", "Communication, journalism, and related programs", "Biological and biomedical sciences","Agriculture and natural resources")) %>%
mutate(field = fct_reorder(field, perc, function(x) { max(x) - min(x) })) %>%
ggplot(
aes(year, perc)
) +
geom_point(
) +
geom_line(
stat = "smooth",
method = "lm",
formula = 'y ~ x',
alpha = .5,
size = 1,
se = FALSE,
color = "lightcoral"
) +
facet_wrap(
vars(field),
nrow = 3
) +
labs(
title="Percentage of Degrees",
subtitle = "Attained Over Time",
x = "Year",
y = "Percent"
) +
theme_bw(
) +
theme(
legend.position = "none",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank()
)
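The facet order in the code above comes from the summary function passed to fct_reorder(), which is the range of perc within each field. A minimal base R sketch of that criterion follows, using a hypothetical toy data frame (the field names and values are made up for illustration and are not taken from BA_degrees):

```r
# Hypothetical toy data (not from BA_degrees): three fields, three values each.
toy <- data.frame(
  field = rep(c("A", "B", "C"), each = 3),
  perc  = c(0.10, 0.12, 0.11,   # A: range of about 0.02
            0.30, 0.20, 0.25,   # B: range of about 0.10
            0.05, 0.05, 0.06)   # C: range of about 0.01
)

# Range of perc within each field, the same summary used for the facet order.
perc_range <- tapply(toy$perc, toy$field, function(x) max(x) - min(x))

# fct_reorder() sorts levels in ascending order of the summary by default,
# so the field with the smallest range comes first.
facet_order <- names(sort(perc_range))
facet_order
```

Under this ordering, fields whose share changed least appear in the first facets and the field with the largest swing appears last.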
After creating this graph, with facets ordered by the range of perc in each field (max(x) - min(x)), I observe the following trends:
There are generally more data points for the years from 2005 onward.
The line of best fit, shown in coral, trends steadily upward for three fields: Computer and information sciences; Communication, journalism, and related programs; and Biological and biomedical sciences. For Architecture and related services and for Agriculture and natural resources, the fitted line is nearly flat, although the most recent years deviate from it. Engineering shows a negative trend over the years.
Every field shown, except Architecture and related services, increases slightly in percentage over the most recent 10 years.
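The direction of each trend can also be checked numerically rather than read off the plot. The sketch below fits the same kind of straight line (perc ~ year) separately for each field and extracts the slope. It runs on a hypothetical toy data frame, so the field names and values are assumptions for illustration only; with the real data, BA_degrees (filtered to the six fields) would take the place of toy.

```r
# Hypothetical toy data (not the real BA_degrees values).
toy <- data.frame(
  field = rep(c("Rising", "Falling"), each = 4),
  year  = rep(c(2000, 2005, 2010, 2015), times = 2),
  perc  = c(0.010, 0.020, 0.030, 0.040,   # increases by 0.002 per year
            0.080, 0.070, 0.060, 0.050)   # decreases by 0.002 per year
)

# Fit perc ~ year within each field and keep only the slope coefficient.
slopes <- sapply(
  split(toy, toy$field),
  function(d) coef(lm(perc ~ year, data = d))[["year"]]
)
slopes
```

A positive slope corresponds to an upward-trending fitted line, a negative slope to a downward one, and a slope near zero to a nearly flat line.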
Problem 2: We will work with the txhousing
dataset provided by ggplot2. See here for details: https://ggplot2.tidyverse.org/reference/txhousing.html
Consider the number of houses sold in January 2015. There are records for 46 different cities:
txhousing_jan_2015 <- txhousing %>%
filter(year == 2015 & month == 1) %>%
arrange(desc(sales))
print(txhousing_jan_2015, n = nrow(txhousing_jan_2015))
## # A tibble: 46 × 9
## city year month sales volume median listi…¹ inven…² date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Houston 2015 1 4494 1.16e9 189300 18649 2.7 2015
## 2 Dallas 2015 1 3066 7.74e8 203300 9063 1.8 2015
## 3 Austin 2015 1 1656 5.12e8 237500 5567 2.2 2015
## 4 San Antonio 2015 1 1485 3.12e8 175900 7717 3.6 2015
## 5 Collin County 2015 1 776 2.42e8 268000 1780 1.3 2015
## 6 Fort Bend 2015 1 686 2.04e8 260300 2414 2.3 2015
## 7 Fort Worth 2015 1 658 1.12e8 143300 2089 2.1 2015
## 8 Montgomery County 2015 1 487 1.47e8 213200 2507 3.3 2015
## 9 NE Tarrant County 2015 1 482 1.27e8 204000 1093 1.4 2015
## 10 Denton County 2015 1 477 1.22e8 216100 1151 1.4 2015
## 11 El Paso 2015 1 406 6.27e7 135200 3995 7.7 2015
## 12 Bay Area 2015 1 401 8.12e7 172200 1910 2.9 2015
## 13 Arlington 2015 1 261 4.61e7 159700 552 1.3 2015
## 14 Tyler 2015 1 248 3.98e7 139400 2290 6.9 2015
## 15 Corpus Christi 2015 1 241 4.53e7 162800 1872 4.8 2015
## 16 Amarillo 2015 1 204 3.32e7 138500 1120 4.3 2015
## 17 Lubbock 2015 1 202 3.24e7 132400 979 3.1 2015
## 18 Killeen-Fort Hood 2015 1 188 2.44e7 114100 1372 6.1 2015
## 19 Bryan-College Station 2015 1 173 3.82e7 189300 988 3.8 2015
## 20 Abilene 2015 1 158 2.35e7 134100 801 4.4 2015
## 21 Beaumont 2015 1 151 2.17e7 122000 1558 7.3 2015
## 22 McAllen 2015 1 146 1.94e7 118300 2068 11.6 2015
## 23 Waco 2015 1 144 2.26e7 137500 1034 5 2015
## 24 Longview-Marshall 2015 1 134 1.82e7 131400 1766 9.1 2015
## 25 Garland 2015 1 114 1.76e7 135800 198 1.1 2015
## 26 Temple-Belton 2015 1 107 1.77e7 124500 727 4.9 2015
## 27 Midland 2015 1 91 2.41e7 235900 840 5 2015
## 28 Sherman-Denison 2015 1 88 1.22e7 121700 549 4.4 2015
## 29 Irving 2015 1 82 2.02e7 157800 278 1.9 2015
## 30 Laredo 2015 1 82 1.26e7 136200 512 5.2 2015
## 31 San Angelo 2015 1 82 1.38e7 138300 477 3.7 2015
## 32 Texarkana 2015 1 75 9.33e6 101400 317 3.7 2015
## 33 Harlingen 2015 1 74 7.53e6 85000 1560 18.7 2015
## 34 Wichita Falls 2015 1 71 7.52e6 82100 829 7.2 2015
## 35 Brazoria County 2015 1 69 1.04e7 146000 301 2.8 2015
## 36 Odessa 2015 1 63 1.00e7 156200 308 3 2015
## 37 Victoria 2015 1 54 1.04e7 172500 280 3.6 2015
## 38 Kerrville 2015 1 53 1.35e7 212500 643 11.4 2015
## 39 Galveston 2015 1 43 1.08e7 187500 575 5.8 2015
## 40 Brownsville 2015 1 41 5.40e6 97000 733 10.7 2015
## 41 Lufkin 2015 1 37 6.87e6 134000 404 7.6 2015
## 42 Port Arthur 2015 1 37 3.96e6 93800 558 7.8 2015
## 43 Paris 2015 1 25 3.61e6 123300 299 8.1 2015
## 44 South Padre Island 2015 1 22 4.89e6 180000 688 18.5 2015
## 45 Nacogdoches 2015 1 20 3.22e6 140000 284 10.5 2015
## 46 San Marcos 2015 1 18 3.38e6 150000 85 3.4 2015
## # … with abbreviated variable names ¹listings, ²inventory
Answer:
The most appropriate plot for this type of analysis would be side-by-side bars, because they allow for easy comparison of relative proportions (per slide 15 of the Visualizing Proportions lectures). Side-by-side bars also work well for a large number of subsets and are visually appealing.
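As a rough illustration of this answer, the sketch below draws one bar per city for January 2015, sorted by sales. It is only one possible rendering, assuming the goal is to compare sales across all 46 cities; the object names jan_2015 and p, the axis label, and the theme are choices made here rather than part of the assignment.

```r
library(ggplot2)  # also provides the txhousing dataset

# Same subset as above: one row per city for January 2015.
jan_2015 <- subset(txhousing, year == 2015 & month == 1)

# One bar per city, placed side by side and sorted by sales.
# Horizontal bars keep the long city names readable.
p <- ggplot(jan_2015, aes(x = sales, y = reorder(city, sales))) +
  geom_col() +
  labs(x = "Houses sold, January 2015", y = NULL) +
  theme_minimal()
p
```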
Problem 3: Now make a pie chart of the
txhousing_jan_2015 dataset, but show only the four cities
with the most sales, plus all others lumped together into “Other”. (The
code to prepare this lumped dataset has been provided for your
convenience.) Make sure the pie slices are arranged in a reasonable
order. Choose a reasonable color scale and a clean theme that avoids
distracting visual elements.
# data preparation
top_four <- txhousing_jan_2015$sales[1:4]
txhousing_lumped <- txhousing_jan_2015 %>%
mutate(city = ifelse(sales %in% top_four, city, "Other"), levels = sales) %>%
group_by(city) %>%
summarize(sales = sum(sales))
ggplot(
txhousing_lumped,
aes(x="", y=sales, fill=reorder(city,sales))
) +
geom_col(
) +
coord_polar(
"y",
start = 0,
direction = -1
) +
labs(
title= "Number of Sales",
subtitle = "Top Four Texas Cities",
fill = "City"
) +
geom_text(
aes(
label = sales),
color = "black",
size=3.5,
position = position_stack(vjust = 0.5)
) +
scale_fill_brewer(
palette="Set3"
) +
theme_void(
) +
theme(
legend.position = "right"
)
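One alternative slice order, sketched below under stated assumptions, keeps the four named cities in descending order of sales and places "Other" last regardless of its size. This is not the ordering used in the code above, which sorts every slice, including "Other", by sales. The names jan_2015, lumped, and share are introduced here for illustration.

```r
library(ggplot2)  # provides the txhousing dataset

# January 2015 records, sorted from most to fewest sales.
jan_2015 <- subset(txhousing, year == 2015 & month == 1)
jan_2015 <- jan_2015[order(-jan_2015$sales), ]

# Top four cities kept as they are; all remaining cities summed into "Other".
lumped <- data.frame(
  city  = c(jan_2015$city[1:4], "Other"),
  sales = c(jan_2015$sales[1:4], sum(jan_2015$sales[-(1:4)]))
)

# Fix the factor levels in row order, so "Other" is always the last slice.
lumped$city <- factor(lumped$city, levels = lumped$city)

# Share of total sales per slice, usable as a label instead of raw counts.
lumped$share <- lumped$sales / sum(lumped$sales)
lumped
```

Passing lumped to ggplot() with fill = city would then draw the slices in this fixed order.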