Tidyverse exercises
library(tidyverse)
FORAGE dataset
The FORAGE database (https://doi.org/10.1101/503334) contains data to fit functional response curves. Instead of doing this, which I am sure is obscure to most of you, we just assess how much handling time (h) varies across data sources (Source). We want to do three things:
- Retain only rows with adult prey and predator.
- Retain only the columns Fittted h (day) (rename it to h), Dim (rename it to dimensionality), and Temp (C ) (rename it to temperature).
- Plot h per temperature using dimensionality as grouping variable.
<- read_csv("forage.csv", show_col_types = FALSE) forage
## New names:
## • `Data set` -> `Data set...1`
## • `Vert/invert` -> `Vert/invert...9`
## • `Major grouping 1` -> `Major grouping 1...10`
## • `Major grouping 2` -> `Major grouping 2...11`
## • `Vert/invert` -> `Vert/invert...15`
## • `Major grouping 1` -> `Major grouping 1...16`
## • `Major grouping 2` -> `Major grouping 2...17`
## • `Data set` -> `Data set...36`
forage
## # A tibble: 2,682 × 46
## Data s…¹ Source Preda…² Dim Field…³ Habitat Fresh…⁴ Preda…⁵ Vert/…⁶ Major…⁷
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Soluk… Pred 2.5 <NA> Aquatic <NA> Metazo… Invert… Insect
## 2 2 Soluk… Pred 2.5 <NA> Aquatic <NA> Metazo… Invert… Insect
## 3 3 Hassa… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 4 4 Hassa… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 5 5 Hassa… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 6 6 Hassa… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 7 7 Paraj… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 8 8 Paraj… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 9 9 Enkeg… Pred 2 <NA> Terres… <NA> Metazo… Invert… Insect
## 10 10 Sepul… Pred 2 <NA> Terres… <NA> Metazo… Invert… Arachn…
## # … with 2,672 more rows, 36 more variables: `Major grouping 2...11` <chr>,
## # `Predator scientific name` <chr>, `Predator type` <chr>,
## # `Prey cellularity` <chr>, `Vert/invert...15` <chr>,
## # `Major grouping 1...16` <chr>, `Major grouping 2...17` <chr>, Prey <chr>,
## # `Prey scientific name` <chr>, `Prey type` <chr>, `Data type` <chr>,
## # `Prey replaced?` <chr>, `Hours starved` <chr>, `Temp (C )` <dbl>,
## # Interference <chr>, `Pred per arena` <chr>, …
As a first note, the dataset contains duplicated column names, which tidyverse handles by renaming them (New names). This is something that should not happend in datasets and there are proper ways to avoid it during the dataset design. It’s nice, however, that tidyverse catches these problems and explicitly tells us that some columns were renamed.
Retain only adult observations
We are interested only in predator-prey interactions between adults individuals. The column Predation or Parasitism contains information if the interactions is predatory (Pred) or parasitic (Par). The columns Predator type and Prey type the life stages of individuals. Both these columns are not standardized, i.e. there is no consistencies between entries. To filter this for only adults, you will need to grepl().
<- forage %>%
adult filter(
`Predation or Parasitism` == "Pred",
if_all(c(`Prey type`, `Predator type`), ~grepl("adult", .x, ignore.case = TRUE)),
if_all(c(`Prey mass (mg)`, `Predator mass (mg)`), ~!is.na(.x))
)
It’s good practice to check that what has been left out is possibly not an adult, but coded with a spelling mistake or with another standard:
%>%
forage filter(
`Predation or Parasitism` == "Pred",
if_all(c(`Prey type`, `Predator type`), ~!grepl("adult", .x, ignore.case = TRUE)),
if_all(c(`Prey mass (mg)`, `Predator mass (mg)`), ~is.na(.x))
%>%
) group_by(`Predator type`, `Prey type`) %>%
tally() %>%
arrange(desc(n)) %>%
print(n = 50)
## # A tibble: 18 × 3
## # Groups: Predator type [10]
## `Predator type` `Prey type` n
## <chr> <chr> <int>
## 1 Larva <NA> 4
## 2 Protonymph Protonymph 3
## 3 Female deutonymph 1st instar 2
## 4 Female protonymph 1st instar 2
## 5 Male deutonymph 1st instar 2
## 6 Male protonymph 1st instar 2
## 7 Protonymph Egg 2
## 8 5th instar 5th instar 1
## 9 C1-C2 N1-N2 1
## 10 C1-C2 N3-N4 1
## 11 C3-C4 N1-N2 1
## 12 C3-C4 N3-N4 1
## 13 Deutonymph Deutonymph 1
## 14 Deutonymph Larva 1
## 15 Deutonymph Protonymph 1
## 16 Larva Hatchling 1
## 17 Protonymph Deutonymph 1
## 18 Protonymph Larva 1
There is no solution to this but to check the dataset for all cases where this happens and that you want to catch; this is not a programming problem, but a data quality one. Note also that, after filtering only the interactions we want, we are left with just 7% of the observations:
message(nrow(adult), " kept of the total ", nrow(forage))
## 198 kept of the total 2682
Retain only selected columns
This can be achieved either by piping
mutate() %>% select()
, or by transmute()
,
which combines the two:
<- adult %>%
adult transmute(temperature = `Temp (C )`,
h = `Fittted h (day)`,
dimensionality = Dim)
adult
## # A tibble: 198 × 3
## temperature h dimensionality
## <dbl> <dbl> <dbl>
## 1 20 0.114 2
## 2 NA 0.0761 2
## 3 NA 0.197 2
## 4 NA 0.0862 2
## 5 NA 0.281 2
## 6 20 0.14 2
## 7 25 0.132 2
## 8 30 0.128 2
## 9 35 0.11 2
## 10 25 0.0426 2
## # … with 188 more rows
Plot h per temperature using dimensionality as grouping variable
For simplicitity, I make the plot in ggplot2; as we haven’t covered this, feel free to use whatever you want.
%>%
adult ggplot() +
aes(temperature, h, col = as.factor(dimensionality)) +
geom_point() +
geom_smooth(alpha = .1, method = "lm") +
theme_classic() +
scale_y_log10(labels = scales::label_log()) +
scale_color_manual(name = "Dimensionality",
values = c("darkblue", "darkred", "goldenrod"))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 34 rows containing non-finite values (stat_smooth).
## Warning: Removed 34 rows containing missing values (geom_point).
There is a bit of variability, but not so much. Is it a good thing or not? Honestly, I don’t know.
TetraDENSITY database
The TetraDENSITY database (https://doi.org/10.1111/geb.12756) contains information of vertebrate species densities (number of individual per area) globally. Here, we want to:
- Standardize Density column as number of individual per \(km^2\).
- Make a table with the number of observation (N) per taxonomic class.
- Add N to TetraDENSITY and select only the columns Class, N, and Density.
<- read_csv("tetradensity.csv", show_col_types = FALSE) tetra
Standardize density
Density was recorded as one of the four measures:
table(tetra$Density_unit)
##
## ind/ha ind/km2 males/ha pairs/km2
## 1536 13486 59 3165
To standardize density, we need to do two things: multiply the
density by a scaling factor to get the area to \(km^2\); and to multiply measures counting
only males or pairs to get the number of individuals (assume a
male-female ratio = 1; \(1 ha = 0.01
km^2\)). The easiest way to achieve this is to use
dplyr::case_when(), which allows vectorized if_else()
type of statements. If you are not sure about the syntax, type
help(case_when)
to explore it a bit before trying a
solution.
<- tetra %>%
tetra mutate(Density = case_when(
== "ind/ha" ~ Density * 1e2,
Density_unit == "ind/km2" ~ Density,
Density_unit == "males/ha" ~ Density * 2e2,
Density_unit == "pairs/km2" ~ Density * 2,
Density_unit ))
Number of observation per taxonomic class
This is quite straightforward, and I’ll give you only one hint: use tally().
%>%
tetra group_by(Class) %>%
tally()
## # A tibble: 4 × 2
## Class n
## <chr> <int>
## 1 Amphibia 541
## 2 Aves 8544
## 3 Mammalia 8107
## 4 Reptilia 1054
Add N to TetraDENSITY
We want to add the N column to the TetraDENSITY and retain only the columns Class, N, and Density. There are two ways to achieve this: the first is to use a join:
%>%
tetra left_join(
%>%
tetra group_by(Class) %>%
tally(name = "N")
%>%
) select(Class, N, Density)
and the second (the more tidy one) is to use the dplyr::add_tally() function, which achieve the same thing, but more efficiently:
<- tetra %>%
tetra group_by(Class) %>%
add_tally(name = "N") %>%
select(Class, N, Density)
tetra
## # A tibble: 18,246 × 3
## # Groups: Class [4]
## Class N Density
## <chr> <int> <dbl>
## 1 Amphibia 541 5200
## 2 Amphibia 541 177800
## 3 Amphibia 541 40000
## 4 Amphibia 541 1600
## 5 Amphibia 541 1600
## 6 Amphibia 541 1600
## 7 Amphibia 541 7800
## 8 Amphibia 541 5600
## 9 Amphibia 541 8000
## 10 Amphibia 541 4700
## # … with 18,236 more rows
Fungal record database
The Danish fungal record database (https://doi.org/10.15468/zn159h) contains ca. \(10^6\) occurrence records of Fungi worldwide, even if most of the data pertains to Danish species. The database is hosted on GBIF, hence it is a Darwin Core archive (DwC). DwC has a high standard that follow optimal data science practices and insures quality control of datasets. Most notably DwC archives are already tidy data (one row = one observation) and have already been checked for quality and entries issues.
When you are working with such a high standard as DwC, you will find out that tidyverse becomes a more secondary package. It is still useful though, especially to subset datasets, modify columns and select them. In this exercise, we want to:
- Load the fungal record database.
- Retain only observation that pertain to the kingdom Fungi.
- Select only species, longitude, and latitude columns as species, lon, and lat.
- Drop rows that have at least one NAs, count the number of observation per species and drop species that had less than 5,000 observations (so we don’t need to work with too many species for this exercise).
- Save the table for each species separately in a file called genus-species.csv in the directory fungal-data/.
Load and filter the database
This should be quite straightforward and you should know most of the
verbs to use. One you may not remember is transmute()
,
which combines mutate()
and select()
. Also,
the tidyr drop_na()
is quite useful to drop rows
with at least one NA.
<- read_tsv("fungal-database.csv", show_col_types = FALSE) %>%
fungal filter(kingdom == "Fungi") %>%
transmute(
species,lon = decimalLongitude,
lat = decimalLatitude
%>%
) drop_na()
## Warning: One or more parsing issues, see `problems()` for details
Tally species records
Here, the best is to get a vector of species names that have at least 5,000 records.
<- fungal %>%
sp group_by(species) %>%
tally() %>%
ungroup() %>%
filter(n >= 5000) %>%
pull(species)
This was quite intuitive; remember to ungroup grouped tables when you don’t need groups any more, as this avoids some issues that are quite hard to figure out.
Save tables
We want to save one table for each speceis seprartely. The easiest
thingis to create a list with elements the tables of each species and
than iterate over the list to save them. The dplyr function
group_split()
creates a list of tables from a grouped
tables.
<- fungal %>%
tab_list filter(species %in% sp) %>%
group_by(species) %>%
group_split()
To write you can iterate, e.g. using a for loop.
# better to check species names again
<- sapply(tab_list, function(l) l[["species"]][1])
sp names(tab_list) <- sp
for (x in sp) {
write_csv(tab_list[[x]], file.path("fungal-data", paste0(x, ".csv")))
}list.files("fungal-data/")
## [1] "Amanita rubescens.csv" "Bjerkandera adusta.csv"
## [3] "Fomes fomentarius.csv" "Fomitopsis pinicola.csv"
## [5] "Hypholoma fasciculare.csv" "Stereum hirsutum.csv"
## [7] "Trametes versicolor.csv"
Marriage dataset
The marriage dataset consists of 86 rows and 13 columns that reports the average age at first marriage for females and males in Europe. We want to do three things:
- Fix the first column, which has some issues.
- Make the wide table a long one.
- Plot the time series for each country and sex separately.
<- read_tsv("marriage.tsv", show_col_types = FALSE)
d d
## # A tibble: 86 × 13
## indic…¹ `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 FAGEMA… 25.8 25.7 25.5 : : : : : 26.8 :
## 2 FAGEMA… 29.3 29.5 29.9 : 30.3 30.6 30.7 30.9 31.2 31.3
## 3 FAGEMA… 24.5 24.3 24.3 : : : : 24.4 : :
## 4 FAGEMA… 28.6 28.9 29.3 29.4 : b : : : : :
## 5 FAGEMA… 25.8 25.9 26.2 26.6 26.6 : 26.7 26.9 27.0 27.1
## 6 FAGEMA… 24.0 24.1 : : 24.4 24.4 24.7 24.8 24.8 24.9
## 7 FAGEMA… 29.3 29.6 29.6 29.8 29.9 29.9 30.1 30.1 30.2 30.4
## 8 FAGEMA… 27.0 27.2 27.6 27.9 28.1 28.3 28.5 28.7 28.8 29.0
## 9 FAGEMA… 29.3 29.6 29.8 : 30.2 : 30.5 30.7 30.9 31.1
## 10 FAGEMA… 31.3 31.4 31.1 31.2 31.4 31.8 31.9 31.9 31.9 32.2
## # … with 76 more rows, 2 more variables: `2017` <chr>, `2018` <chr>, and
## # abbreviated variable name ¹`indic_de,geo\\time`
Fix first column
The first thing to notice is that the first column indic_de,geo\time` had some problems, namely it is composed of two features (indic_de and geo) that should be separated. Let’s split the string into two strings and create two new columns, indic_de and geo with those entries. Let’s also drop the original column, which is confusing.
<- d %>%
d mutate(
indic_de = str_split(`indic_de,geo\\time`, ",", simplify = TRUE)[, 1],
geo = str_split(`indic_de,geo\\time`, ",", simplify = TRUE)[, 2]
%>%
) select(-`indic_de,geo\\time`)
Wide to long
We want to gather all column relative to the years of data collection into one column, i.e. we want to make this wide table a long one. We can match the relevant columns with the string “20”. As the values represent ages, we call the new value column age.
<- d %>%
d pivot_longer(cols = contains("20"),
names_to = "year",
values_to = "age")
Also, note that the : is the placeholder for missing values. We don’t want this as it will force the column to be a string, while we want a numeric value. The easiest way to fix this is to replace : with NA and cast the columns as numeric:
<- d %>%
d mutate(age = ifelse(age == ":", NA, age)) %>%
mutate(across(year:age, as.numeric))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
This will throw a warning, but it’s safe to ignore it (NAs is what we want).
Plot time series
The column indic_de has contains the sex information as first letter. First, we want to create a new column sex specifying either “male” or “female” and drop indic_de. Use stringr::str_starts() to grep on the first letter.
<- d %>%
d mutate(sex = ifelse(str_starts(indic_de, "M"), "male", "female")) %>%
select(-indic_de)
and then we can draw the plot we wanted. I use again ggplot2, as it’s easy to achieve the goal, but you can use whatever you want.
%>%
d ggplot() +
aes(year, age, col = geo) +
geom_line() +
geom_smooth(aes(year, age, group = sex),
method = "lm", alpha = .2, size = 3) +
facet_wrap(sex ~ .) +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 258 rows containing non-finite values (stat_smooth).
## Warning: Removed 85 row(s) containing missing values (geom_path).
References
Uiterwaal, S. F., Lagerstrom, I. T., Lyon, S. R., & DeLong, J. P. (2018). Data paper: FoRAGE (Functional Responses from Around the Globe in all Ecosystems) database: a compilation of functional responses for consumers and parasitoids. BioRxiv, 503334.
Santini, L., Isaac, N. J., & Ficetola, G. F. (2018). TetraDENSITY: A database of population density estimates in terrestrial vertebrates. Global Ecology and Biogeography, 27(7), 787-791.