Tidyverse exercises

library(tidyverse)

FORAGE dataset

The FORAGE database (https://doi.org/10.1101/503334) contains data to fit functional response curves. Instead of doing this, which I am sure is obscure to most of you, we just assess how much handling time (h) varies across data sources (Source). We want to do three things:

Retain only rows with adult prey and predator.
Retain only the columns Fittted h (day) (rename it to h), Dim (rename it to dimensionality), and Temp (C ) (rename it to temperature).
Plot h per temperature using dimensionality as grouping variable.

forage <- read_csv("forage.csv", show_col_types = FALSE)

## New names:
## • `Data set` -> `Data set...1`
## • `Vert/invert` -> `Vert/invert...9`
## • `Major grouping 1` -> `Major grouping 1...10`
## • `Major grouping 2` -> `Major grouping 2...11`
## • `Vert/invert` -> `Vert/invert...15`
## • `Major grouping 1` -> `Major grouping 1...16`
## • `Major grouping 2` -> `Major grouping 2...17`
## • `Data set` -> `Data set...36`

forage

## # A tibble: 2,682 × 46
##    Data s…¹ Source Preda…²   Dim Field…³ Habitat Fresh…⁴ Preda…⁵ Vert/…⁶ Major…⁷
##       <dbl> <chr>  <chr>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1        1 Soluk… Pred      2.5 <NA>    Aquatic <NA>    Metazo… Invert… Insect 
##  2        2 Soluk… Pred      2.5 <NA>    Aquatic <NA>    Metazo… Invert… Insect 
##  3        3 Hassa… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  4        4 Hassa… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  5        5 Hassa… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  6        6 Hassa… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  7        7 Paraj… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  8        8 Paraj… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
##  9        9 Enkeg… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Insect 
## 10       10 Sepul… Pred      2   <NA>    Terres… <NA>    Metazo… Invert… Arachn…
## # … with 2,672 more rows, 36 more variables: `Major grouping 2...11` <chr>,
## #   `Predator scientific name` <chr>, `Predator type` <chr>,
## #   `Prey cellularity` <chr>, `Vert/invert...15` <chr>,
## #   `Major grouping 1...16` <chr>, `Major grouping 2...17` <chr>, Prey <chr>,
## #   `Prey scientific name` <chr>, `Prey type` <chr>, `Data type` <chr>,
## #   `Prey replaced?` <chr>, `Hours starved` <chr>, `Temp (C )` <dbl>,
## #   Interference <chr>, `Pred per arena` <chr>, …

As a first note, the dataset contains duplicated column names, which tidyverse handles by renaming them (New names). This is something that should not happend in datasets and there are proper ways to avoid it during the dataset design. It’s nice, however, that tidyverse catches these problems and explicitly tells us that some columns were renamed.

Retain only adult observations

We are interested only in predator-prey interactions between adults individuals. The column Predation or Parasitism contains information if the interactions is predatory (Pred) or parasitic (Par). The columns Predator type and Prey type the life stages of individuals. Both these columns are not standardized, i.e. there is no consistencies between entries. To filter this for only adults, you will need to grepl().

adult <- forage %>%
  filter(
    `Predation or Parasitism` == "Pred",
    if_all(c(`Prey type`, `Predator type`), ~grepl("adult", .x, ignore.case = TRUE)),
    if_all(c(`Prey mass (mg)`, `Predator mass (mg)`), ~!is.na(.x))
  )

It’s good practice to check that what has been left out is possibly not an adult, but coded with a spelling mistake or with another standard:

forage %>%
  filter(
    `Predation or Parasitism` == "Pred",
    if_all(c(`Prey type`, `Predator type`), ~!grepl("adult", .x, ignore.case = TRUE)),
    if_all(c(`Prey mass (mg)`, `Predator mass (mg)`), ~is.na(.x))
  ) %>% 
  group_by(`Predator type`, `Prey type`) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  print(n = 50)

## # A tibble: 18 × 3
## # Groups:   Predator type [10]
##    `Predator type`   `Prey type`     n
##    <chr>             <chr>       <int>
##  1 Larva             <NA>            4
##  2 Protonymph        Protonymph      3
##  3 Female deutonymph 1st instar      2
##  4 Female protonymph 1st instar      2
##  5 Male deutonymph   1st instar      2
##  6 Male protonymph   1st instar      2
##  7 Protonymph        Egg             2
##  8 5th instar        5th instar      1
##  9 C1-C2             N1-N2           1
## 10 C1-C2             N3-N4           1
## 11 C3-C4             N1-N2           1
## 12 C3-C4             N3-N4           1
## 13 Deutonymph        Deutonymph      1
## 14 Deutonymph        Larva           1
## 15 Deutonymph        Protonymph      1
## 16 Larva             Hatchling       1
## 17 Protonymph        Deutonymph      1
## 18 Protonymph        Larva           1

There is no solution to this but to check the dataset for all cases where this happens and that you want to catch; this is not a programming problem, but a data quality one. Note also that, after filtering only the interactions we want, we are left with just 7% of the observations:

message(nrow(adult), " kept of the total ", nrow(forage))

## 198 kept of the total 2682

Retain only selected columns

This can be achieved either by piping mutate() %>% select(), or by transmute(), which combines the two:

adult <- adult %>% 
  transmute(temperature = `Temp (C )`, 
            h = `Fittted h (day)`,
            dimensionality = Dim)
adult

## # A tibble: 198 × 3
##    temperature      h dimensionality
##          <dbl>  <dbl>          <dbl>
##  1          20 0.114               2
##  2          NA 0.0761              2
##  3          NA 0.197               2
##  4          NA 0.0862              2
##  5          NA 0.281               2
##  6          20 0.14                2
##  7          25 0.132               2
##  8          30 0.128               2
##  9          35 0.11                2
## 10          25 0.0426              2
## # … with 188 more rows

Plot h per temperature using dimensionality as grouping variable

For simplicitity, I make the plot in ggplot2; as we haven’t covered this, feel free to use whatever you want.

adult %>% 
  ggplot() +
  aes(temperature, h, col = as.factor(dimensionality)) +
  geom_point() +
  geom_smooth(alpha = .1, method = "lm") +
  theme_classic() +
  scale_y_log10(labels = scales::label_log()) +
  scale_color_manual(name = "Dimensionality",
                     values = c("darkblue", "darkred", "goldenrod"))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 34 rows containing non-finite values (stat_smooth).

## Warning: Removed 34 rows containing missing values (geom_point).

There is a bit of variability, but not so much. Is it a good thing or not? Honestly, I don’t know.

TetraDENSITY database

The TetraDENSITY database (https://doi.org/10.1111/geb.12756) contains information of vertebrate species densities (number of individual per area) globally. Here, we want to:

Standardize Density column as number of individual per \(km^2\).
Make a table with the number of observation (N) per taxonomic class.
Add N to TetraDENSITY and select only the columns Class, N, and Density.

tetra <- read_csv("tetradensity.csv", show_col_types = FALSE)

Standardize density

Density was recorded as one of the four measures:

table(tetra$Density_unit)

## 
##    ind/ha   ind/km2  males/ha pairs/km2 
##      1536     13486        59      3165

To standardize density, we need to do two things: multiply the density by a scaling factor to get the area to \(km^2\); and to multiply measures counting only males or pairs to get the number of individuals (assume a male-female ratio = 1; \(1 ha = 0.01 km^2\)). The easiest way to achieve this is to use dplyr::case_when(), which allows vectorized if_else() type of statements. If you are not sure about the syntax, type help(case_when) to explore it a bit before trying a solution.

tetra <- tetra %>% 
  mutate(Density = case_when(
    Density_unit == "ind/ha" ~ Density * 1e2,
    Density_unit == "ind/km2" ~ Density,
    Density_unit == "males/ha" ~ Density * 2e2,
    Density_unit == "pairs/km2" ~ Density * 2,
  ))

Number of observation per taxonomic class

This is quite straightforward, and I’ll give you only one hint: use tally().

tetra %>% 
  group_by(Class) %>% 
  tally()

## # A tibble: 4 × 2
##   Class        n
##   <chr>    <int>
## 1 Amphibia   541
## 2 Aves      8544
## 3 Mammalia  8107
## 4 Reptilia  1054

Add N to TetraDENSITY

We want to add the N column to the TetraDENSITY and retain only the columns Class, N, and Density. There are two ways to achieve this: the first is to use a join:

tetra %>% 
  left_join(
    tetra %>% 
      group_by(Class) %>% 
      tally(name = "N")
  ) %>% 
  select(Class, N, Density)

and the second (the more tidy one) is to use the dplyr::add_tally() function, which achieve the same thing, but more efficiently:

tetra <- tetra %>% 
  group_by(Class) %>% 
  add_tally(name = "N") %>% 
  select(Class, N, Density)
tetra

## # A tibble: 18,246 × 3
## # Groups:   Class [4]
##    Class        N Density
##    <chr>    <int>   <dbl>
##  1 Amphibia   541    5200
##  2 Amphibia   541  177800
##  3 Amphibia   541   40000
##  4 Amphibia   541    1600
##  5 Amphibia   541    1600
##  6 Amphibia   541    1600
##  7 Amphibia   541    7800
##  8 Amphibia   541    5600
##  9 Amphibia   541    8000
## 10 Amphibia   541    4700
## # … with 18,236 more rows

Fungal record database

The Danish fungal record database (https://doi.org/10.15468/zn159h) contains ca. \(10^6\) occurrence records of Fungi worldwide, even if most of the data pertains to Danish species. The database is hosted on GBIF, hence it is a Darwin Core archive (DwC). DwC has a high standard that follow optimal data science practices and insures quality control of datasets. Most notably DwC archives are already tidy data (one row = one observation) and have already been checked for quality and entries issues.

When you are working with such a high standard as DwC, you will find out that tidyverse becomes a more secondary package. It is still useful though, especially to subset datasets, modify columns and select them. In this exercise, we want to:

Load the fungal record database.
Retain only observation that pertain to the kingdom Fungi.
Select only species, longitude, and latitude columns as species, lon, and lat.
Drop rows that have at least one NAs, count the number of observation per species and drop species that had less than 5,000 observations (so we don’t need to work with too many species for this exercise).
Save the table for each species separately in a file called genus-species.csv in the directory fungal-data/.

Load and filter the database

This should be quite straightforward and you should know most of the verbs to use. One you may not remember is transmute(), which combines mutate() and select(). Also, the tidyr drop_na() is quite useful to drop rows with at least one NA.

fungal <- read_tsv("fungal-database.csv", show_col_types = FALSE) %>% 
  filter(kingdom == "Fungi") %>% 
  transmute(
    species,
    lon = decimalLongitude,
    lat = decimalLatitude
  ) %>% 
  drop_na()

## Warning: One or more parsing issues, see `problems()` for details

Tally species records

Here, the best is to get a vector of species names that have at least 5,000 records.

sp <- fungal %>% 
  group_by(species) %>% 
  tally() %>% 
  ungroup() %>% 
  filter(n >= 5000) %>% 
  pull(species)

This was quite intuitive; remember to ungroup grouped tables when you don’t need groups any more, as this avoids some issues that are quite hard to figure out.

Save tables

We want to save one table for each speceis seprartely. The easiest thingis to create a list with elements the tables of each species and than iterate over the list to save them. The dplyr function group_split() creates a list of tables from a grouped tables.

tab_list <- fungal %>% 
  filter(species %in% sp) %>% 
  group_by(species) %>% 
  group_split()

To write you can iterate, e.g. using a for loop.

# better to check species names again
sp <- sapply(tab_list, function(l) l[["species"]][1])
names(tab_list) <- sp
for (x in sp) {
  write_csv(tab_list[[x]], file.path("fungal-data", paste0(x, ".csv")))
}
list.files("fungal-data/")

## [1] "Amanita rubescens.csv"     "Bjerkandera adusta.csv"   
## [3] "Fomes fomentarius.csv"     "Fomitopsis pinicola.csv"  
## [5] "Hypholoma fasciculare.csv" "Stereum hirsutum.csv"     
## [7] "Trametes versicolor.csv"

Marriage dataset

The marriage dataset consists of 86 rows and 13 columns that reports the average age at first marriage for females and males in Europe. We want to do three things:

Fix the first column, which has some issues.
Make the wide table a long one.
Plot the time series for each country and sex separately.

d <- read_tsv("marriage.tsv", show_col_types = FALSE)
d

## # A tibble: 86 × 13
##    indic…¹ `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
##    <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
##  1 FAGEMA… 25.8   25.7   25.5   :      :      :      :      :      26.8   :     
##  2 FAGEMA… 29.3   29.5   29.9   :      30.3   30.6   30.7   30.9   31.2   31.3  
##  3 FAGEMA… 24.5   24.3   24.3   :      :      :      :      24.4   :      :     
##  4 FAGEMA… 28.6   28.9   29.3   29.4   : b    :      :      :      :      :     
##  5 FAGEMA… 25.8   25.9   26.2   26.6   26.6   :      26.7   26.9   27.0   27.1  
##  6 FAGEMA… 24.0   24.1   :      :      24.4   24.4   24.7   24.8   24.8   24.9  
##  7 FAGEMA… 29.3   29.6   29.6   29.8   29.9   29.9   30.1   30.1   30.2   30.4  
##  8 FAGEMA… 27.0   27.2   27.6   27.9   28.1   28.3   28.5   28.7   28.8   29.0  
##  9 FAGEMA… 29.3   29.6   29.8   :      30.2   :      30.5   30.7   30.9   31.1  
## 10 FAGEMA… 31.3   31.4   31.1   31.2   31.4   31.8   31.9   31.9   31.9   32.2  
## # … with 76 more rows, 2 more variables: `2017` <chr>, `2018` <chr>, and
## #   abbreviated variable name ¹`indic_de,geo\\time`

Fix first column

The first thing to notice is that the first column indic_de,geo\time` had some problems, namely it is composed of two features (indic_de and geo) that should be separated. Let’s split the string into two strings and create two new columns, indic_de and geo with those entries. Let’s also drop the original column, which is confusing.

d <- d %>% 
  mutate(
    indic_de = str_split(`indic_de,geo\\time`, ",", simplify = TRUE)[, 1],
    geo = str_split(`indic_de,geo\\time`, ",", simplify = TRUE)[, 2]
  ) %>% 
  select(-`indic_de,geo\\time`)

Wide to long

We want to gather all column relative to the years of data collection into one column, i.e. we want to make this wide table a long one. We can match the relevant columns with the string “20”. As the values represent ages, we call the new value column age.

d <- d %>% 
  pivot_longer(cols = contains("20"),
               names_to = "year",
               values_to = "age")

Also, note that the : is the placeholder for missing values. We don’t want this as it will force the column to be a string, while we want a numeric value. The easiest way to fix this is to replace : with NA and cast the columns as numeric:

d <- d %>% 
  mutate(age = ifelse(age == ":", NA, age)) %>% 
  mutate(across(year:age, as.numeric))

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

This will throw a warning, but it’s safe to ignore it (NAs is what we want).

Plot time series

The column indic_de has contains the sex information as first letter. First, we want to create a new column sex specifying either “male” or “female” and drop indic_de. Use stringr::str_starts() to grep on the first letter.

d <- d %>% 
  mutate(sex = ifelse(str_starts(indic_de, "M"), "male", "female")) %>% 
  select(-indic_de)

and then we can draw the plot we wanted. I use again ggplot2, as it’s easy to achieve the goal, but you can use whatever you want.

d %>% 
  ggplot() +
  aes(year, age, col = geo) +
  geom_line() +
  geom_smooth(aes(year, age, group = sex),
              method = "lm", alpha = .2, size = 3) +
  facet_wrap(sex ~ .) +
  theme_classic()

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 258 rows containing non-finite values (stat_smooth).

## Warning: Removed 85 row(s) containing missing values (geom_path).

References

Uiterwaal, S. F., Lagerstrom, I. T., Lyon, S. R., & DeLong, J. P. (2018). Data paper: FoRAGE (Functional Responses from Around the Globe in all Ecosystems) database: a compilation of functional responses for consumers and parasitoids. BioRxiv, 503334.

Santini, L., Isaac, N. J., & Ficetola, G. F. (2018). TetraDENSITY: A database of population density estimates in terrestrial vertebrates. Global Ecology and Biogeography, 27(7), 787-791.