Here we look at the basic functionality offered by the dplyr package for data manipulation.
The main commands in dplyr are the following:
- mutate adds new variables to a dataset.
- transmute creates a new dataset from transformations on an existing dataset. It is like mutate but “forgets” all the old variables.
- filter restricts the rows that we consider.
- select restricts the columns that we consider.
- group_by organizes the data into groups based on a variable.
- summarize aggregates multiple values (typically using a grouping variable).
- arrange reorders the dataset based on a variable.

To see some examples of using these, let’s consider the following situation from the counties dataset: we want the percent of females in each state, with the states ordered by that percent.
We will use the above commands in succession as follows:
1. mutate to add a “count_female” variable that counts the number of females in each county.
2. group_by the state variable.
3. summarize to add up the female count and overall population count for each state. This will result in a dataset with one row for each state.
4. mutate again to compute the percent of females in each state.
5. arrange to reorder the states.

Here is the overall code, though what you really want to do is look at each piece step by step (perhaps by piping it into the View() command):
counties %>%
  # female is a percent, so convert it to a head count per county
  mutate(count_female = female * pop2010 / 100) %>%
  group_by(state) %>%
  summarize(
    counties = n(),               # number of counties in the state
    pop2010 = sum(pop2010),       # total state population
    female = sum(count_female)    # total number of females in the state
  ) %>%
  mutate(femalePercent = female / pop2010 * 100) %>%
  arrange(femalePercent)

We could store the result, or process it further by piping it into a graph. Let’s suppose we had stored it in a variable femalesByState:
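As a minimal sketch, assuming a tiny made-up stand-in for counties (the toy_counties data frame and its numbers are hypothetical) and assuming the ggformula package for the plot, storing the result and graphing it might look like this:

```r
library(dplyr)
library(ggformula)

# Hypothetical stand-in for the counties dataset; all numbers are made up.
# female is the percent of females in the county.
toy_counties <- data.frame(
  state   = c("Ohio", "Ohio", "Iowa", "Iowa", "Utah"),
  name    = c("A", "B", "C", "D", "E"),
  pop2010 = c(1000, 3000, 2000, 2000, 500),
  female  = c(50, 52, 51, 49, 48),
  stringsAsFactors = FALSE
)

# Same pipeline as above, stored in a variable instead of printed
femalesByState <- toy_counties %>%
  mutate(count_female = female * pop2010 / 100) %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    pop2010 = sum(pop2010),
    female = sum(count_female)
  ) %>%
  mutate(femalePercent = female / pop2010 * 100) %>%
  arrange(femalePercent)

# state is a character column here, so the axis lists the states
# alphabetically rather than in the row order produced by arrange
femalesByState %>%
  gf_point(state ~ femalePercent)
```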
Note that this graph does not actually order the states in increasing order, which is what we wanted. In order to achieve that, we must use the fct_reorder command from the forcats package:
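A minimal sketch of that reordering, assuming a small hypothetical femalesByState whose values are made up for illustration:

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for femalesByState; the values are made up
femalesByState <- data.frame(
  state = c("Iowa", "Ohio", "Utah"),
  femalePercent = c(50, 51.5, 48),
  stringsAsFactors = FALSE
)

# fct_reorder turns state into a factor whose levels follow femalePercent
reordered <- femalesByState %>%
  mutate(state = fct_reorder(state, femalePercent))

levels(reordered$state)
# "Utah" "Iowa" "Ohio"
```

Plotting functions lay out a factor along the axis in level order, so piping reordered into gf_point(state ~ femalePercent) should show the states in increasing order of femalePercent.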
In this example we would like to collect one county from each state, namely the county with the smallest population in that state. In order to achieve that, we will perform the following steps:
- fct_reorder to order the cases by population.

counties %>%
  group_by(state) %>%
  arrange(pop2010) %>%             # smallest counties come first
  summarize(
    name = first(name),            # name of the smallest county in the state
    pop2010 = first(pop2010),      # its population
    counties = n()
  ) %>%
  filter(counties > 1) %>%         # drop states with a single county
  arrange(pop2010) %>%
  mutate(
    full = str_c(state, name, sep = " - ")
  ) %>%
  gf_point(fct_reorder(full, pop2010) ~ pop2010) %>%
  gf_refine(scale_x_log10()) %>%
  gf_text(label = ~pop2010, size = 3, alpha = 0.7, nudge_x = 0.3, color = "red")

We build on the previous example. We will now also compute the largest county per state. Some differences:
counties %>%
  group_by(state) %>%
  arrange(pop2010) %>%
  summarize(
    smallest_name = first(name),
    smallest = first(pop2010),
    largest_name = last(name),     # after sorting, the last row per state is the largest county
    largest = last(pop2010),
    median = median(pop2010),
    counties = n()
  ) %>%
  filter(counties > 1) %>%
  mutate(
    state = fct_reorder(state, median)   # order states by median county population
  ) %>%
  gf_point(state ~ smallest) %>%
  gf_point(state ~ largest) %>%
  gf_linerangeh(state ~ smallest + largest) %>%   # segment from smallest to largest
  gf_point(state ~ median, color = "red") %>%
  gf_line(state ~ median, color = "red", group = 1) %>%
  gf_refine(scale_x_log10())
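To check the summarize step in isolation, here is a minimal sketch on a tiny made-up data frame (toy_counties and its values are hypothetical), leaving out the plotting layers:

```r
library(dplyr)

# Hypothetical stand-in for the counties dataset; all numbers are made up
toy_counties <- data.frame(
  state   = c("Ohio", "Ohio", "Ohio", "Iowa", "Iowa", "Utah"),
  name    = c("A", "B", "C", "D", "E", "F"),
  pop2010 = c(3000, 1000, 2000, 500, 4000, 700),
  stringsAsFactors = FALSE
)

extremes <- toy_counties %>%
  group_by(state) %>%
  arrange(pop2010) %>%              # sorts all rows; order within each state follows
  summarize(
    smallest_name = first(name),
    smallest = first(pop2010),
    largest_name = last(name),
    largest = last(pop2010),
    median = median(pop2010),
    counties = n()
  ) %>%
  filter(counties > 1)              # Utah has a single county and is dropped

extremes
```

Because the rows are sorted by pop2010 before summarize runs, first() and last() pick out the smallest and largest county within each state.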