Here we look at the basic functionality offered by the dplyr package for data manipulation.
The main commands in dplyr are the following:
mutate: adds new variables to a dataset.
transmute: creates a new dataset from transformations on an existing dataset. It is like mutate but “forgets” all the old variables.
filter: restricts the rows that we consider.
select: restricts the columns that we consider.
group_by: organizes the data into groups based on a variable.
summarize: aggregates multiple values (typically using a grouping variable).
arrange: allows the reordering of the dataset based on a variable.

To see some examples of using these, let’s consider the following situation from the counties dataset:
We will use the above commands in succession as follows:
mutate to add a “count_female” variable that counts the number of females in each county.
group_by the state variable.
summarize to add up the female count and overall population count for each state. This will result in a dataset with one row for each state.
mutate again to compute the percent of females in each state.
arrange to reorder the states.

Here is the overall code, though what you really want to do is look at each piece step by step (perhaps by piping it into the View() command):
counties %>%
  mutate(count_female = female * pop2010 / 100) %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    pop2010 = sum(pop2010),
    female = sum(count_female)
  ) %>%
  mutate(femalePercent = female / pop2010 * 100) %>%
  arrange(femalePercent)
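To follow the pipeline one step at a time without the full dataset, here is a minimal sketch on a small made-up table. The tibble toy_counties and all of its values are invented for illustration; only the column names (state, pop2010, female) mirror the ones used above, and female is assumed to be a percentage, as the division by 100 suggests.

```r
library(dplyr)

# Hypothetical data: 4 made-up counties in 2 made-up states.
# `female` is assumed to be the percent of females in each county.
toy_counties <- tibble(
  state   = c("A", "A", "B", "B"),
  pop2010 = c(1000, 3000, 2000, 2000),
  female  = c(50, 52, 48, 50)
)

toy_result <- toy_counties %>%
  mutate(count_female = female * pop2010 / 100) %>%  # percent -> head count
  group_by(state) %>%
  summarize(
    counties = n(),               # number of counties in the state
    pop2010  = sum(pop2010),      # total state population
    female   = sum(count_female)  # total number of females in the state
  ) %>%
  mutate(femalePercent = female / pop2010 * 100) %>%
  arrange(femalePercent)          # lowest percentage first

toy_result
```

In this toy table, state A comes out at 51.5 percent rather than the plain average of 50 and 52, because the larger county carries more weight. This is the reason for converting to head counts before summarizing.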
We could store the result, or further process it by piping it into a graph or something of the kind. Let’s suppose we had stored it in a variable femalesByState:
Note that this graph does not actually order the states in increasing order, which is what we wanted. In order to achieve that, we must use the fct_reorder command from the forcats package:
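As a minimal sketch of what fct_reorder does, consider a small made-up table (toy_states and its values are hypothetical and stand in for a summary like femalesByState). Labels are turned into factor levels in alphabetical order by default, whatever the row order is, and plotting functions follow the level order. fct_reorder re-levels the factor by the values of another variable.

```r
library(forcats)

# Hypothetical summary table, rows already sorted by femalePercent.
toy_states <- data.frame(
  state         = c("C", "A", "B"),
  femalePercent = c(48, 50, 52)
)

# Default factor levels are alphabetical, ignoring the row order.
default_levels <- levels(factor(toy_states$state))
default_levels

# fct_reorder makes the levels follow increasing femalePercent instead.
ordered_state <- fct_reorder(toy_states$state, toy_states$femalePercent)
levels(ordered_state)
```

Using the reordered factor in place of the raw state column is what makes an axis follow the computed percentages instead of the alphabet.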
In this example we would like to collect one county from each state, namely the county with the smallest population in that state. In order to achieve that, we will perform the following steps:
fct_reorder to order the cases by population.

counties %>%
  group_by(state) %>%
  arrange(pop2010) %>%
  summarize(
    name = first(name),
    pop2010 = first(pop2010),
    counties = n()
  ) %>%
  filter(counties > 1) %>%
  arrange(pop2010) %>%
  mutate(
    full = str_c(state, name, sep = " - ")
  ) %>%
  gf_point(fct_reorder(full, pop2010) ~ pop2010) %>%
  gf_refine(scale_x_log10()) %>%
  gf_text(label = ~pop2010, size = 3, alpha = 0.7, nudge_x = 0.3, color = "red")
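The data-manipulation part of this pipeline can be checked on a small made-up table before any plotting. The tibble toy_pop, its county names, and its populations are hypothetical; the sketch only assumes the same column names (state, name, pop2010) as above.

```r
library(dplyr)
library(stringr)

# Hypothetical data: state C has a single county and should be dropped.
toy_pop <- tibble(
  state   = c("A", "A", "B", "B", "C"),
  name    = c("Ash", "Birch", "Cedar", "Dale", "Elm"),
  pop2010 = c(5000, 200, 900, 12000, 700)
)

toy_smallest <- toy_pop %>%
  group_by(state) %>%
  arrange(pop2010) %>%          # sorts all rows by increasing population
  summarize(
    name     = first(name),     # first row within each state = smallest county
    pop2010  = first(pop2010),
    counties = n()
  ) %>%
  filter(counties > 1) %>%      # keep only states with more than one county
  arrange(pop2010) %>%
  mutate(full = str_c(state, name, sep = " - "))

toy_smallest
```

Because the rows are sorted by population before summarizing, first() picks the smallest county within each state, which is the role it plays in the pipeline above.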
We build on the previous example. We will now also compute the largest county per state. Some differences:
counties %>%
  group_by(state) %>%
  arrange(pop2010) %>%
  summarize(
    smallest_name = first(name),
    smallest = first(pop2010),
    largest_name = last(name),
    largest = last(pop2010),
    median = median(pop2010),
    counties = n()
  ) %>%
  filter(counties > 1) %>%
  mutate(
    state = fct_reorder(state, median)
  ) %>%
  gf_point(state ~ smallest) %>%
  gf_point(state ~ largest) %>%
  gf_linerangeh(state ~ smallest + largest) %>%
  gf_point(state ~ median, color = "red") %>%
  gf_line(state ~ median, color = "red", group = 1) %>%
  gf_refine(scale_x_log10())