Tidyverse crash course @ iDiv

Emilio Berti

23/09/2022

Today’s program

[Figure: sampling equipment]

  1. Introduce packages, functionalities, and data wrangling.
  2. Transform tables into what you want.
  3. Work with more than one table: merge them, compare them, etc.

What is the tidyverse?

[Figure: the tidyverse blog]

[Figure: R for Data Science]

Where to find these slides

https://emilio-berti.github.io/Teaching.html

Why the tidyverse? Because it’s for lazy people.

d
##          date   km    time
## 1  10/27/2021  6.0 0:34:28
## 2  10/27/2021  6.0 0:33:24
## 3  10/31/2021 10.3 0:59:19
## 4   11/6/2021  8.7 0:54:00
## 5  11/13/2021  8.7 0:49:00
## 6  11/17/2021  4.6 0:25:28
## 7  11/21/2021 10.4 1:02:00
## 8  11/30/2021  6.0 0:34:08
## 9  11/30/2021  6.0 0:33:03
## 10  12/5/2021  6.2 0:35:00
## 11  12/9/2021  6.2 0:37:33
## 12 12/19/2021  6.2 0:35:22
## 13 12/22/2021  6.2 0:36:00
## 14 12/28/2021  6.2 0:35:00
## 15 12/31/2021  6.2 0:33:12
## 16   1/2/2022  6.2 0:33:43
## 17  1/12/2022  6.0 0:33:44
## 18  1/12/2022  6.0 0:33:45
## 19  1/16/2022  6.2 0:33:52
## 20  1/19/2022  6.0 0:33:38
## 21  1/19/2022  6.0 0:31:56
## 22  1/23/2022  6.2 0:34:26
## 23  1/26/2022  6.0 0:33:34
## 24  1/26/2022  6.0 0:32:19
## 25  1/30/2022  6.2 0:32:44
## 26   2/2/2022  6.0 0:33:59
## 27   2/2/2022  6.0 0:33:41
## 28   2/6/2022  6.2 0:34:35
## 29   2/9/2022  6.0 0:32:33
## 30   2/9/2022  6.0 0:32:20
## 31  2/12/2022  8.7 0:48:35
## 32  2/16/2022  6.2 0:34:02
## 33  2/19/2022  8.7 0:46:40
## 34  2/23/2022  6.0 0:34:03
## 35  2/23/2022  6.0 0:30:40
## 36  2/28/2022  8.1 0:43:18
## 37   3/2/2022  6.4 0:36:16
## 38   3/3/2022  6.4 0:38:22
## 39   3/9/2022  6.2 0:30:33
## 40  3/12/2022  8.7 0:45:20
## 41  3/16/2022  6.0 0:29:48
## 42  3/16/2022  6.0 0:29:48
## 43  3/19/2022  6.2 0:33:39
## 44  3/21/2022  6.0 0:30:57
## 45  3/21/2022  6.0 0:28:40
## 46  3/27/2022  8.7 0:47:03
## 47  3/30/2022  6.0 0:32:03
## 48  3/30/2022  6.0 0:30:03
## 49   4/3/2022  6.2 0:34:06
## 50   4/7/2022  6.0 0:29:59
## 51   4/7/2022  6.0 0:32:22

Why the tidyverse? Because it’s for lazy people.

as_tibble(d)
## # A tibble: 51 × 3
##    date          km time   
##    <chr>      <dbl> <chr>  
##  1 10/27/2021   6   0:34:28
##  2 10/27/2021   6   0:33:24
##  3 10/31/2021  10.3 0:59:19
##  4 11/6/2021    8.7 0:54:00
##  5 11/13/2021   8.7 0:49:00
##  6 11/17/2021   4.6 0:25:28
##  7 11/21/2021  10.4 1:02:00
##  8 11/30/2021   6   0:34:08
##  9 11/30/2021   6   0:33:03
## 10 12/5/2021    6.2 0:35:00
## # … with 41 more rows

Why the tidyverse? Because it makes the code more readable.

sub_d <- d[d$km > 4, ]
sub_d <- sub_d[sub_d$km < 6, ]
summary(sub_d$km)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.6     4.6     4.6     4.6     4.6     4.6

Why the tidyverse? Because it makes the code more readable.

d %>% 
  as_tibble() %>% 
  filter(km > 4, km < 6) %>% 
  pull(km) %>% 
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.6     4.6     4.6     4.6     4.6     4.6

Why the tidyverse? Because you will learn SQL without knowing it.

d %>% 
  as_tibble() %>% 
  filter(km > 4, km < 6) %>% 
  pull(km) %>% 
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.6     4.6     4.6     4.6     4.6     4.6

In SQL:

SELECT km
FROM d
WHERE km > 4 AND km < 6;

Basic ideas of tidyverse

  1. Tables are the fundamental units of your analysis.
  2. Tidy data: forces good practices for data science.
  3. Complex table manipulations can be evaluated on the fly.
  4. Table manipulations can be organized in modules.
  5. Modules can be pipelined: %>%.

Basic ideas of tidyverse

Total distance and best time for each route - on the fly.

read_csv("kalenji.csv", show_col_types = FALSE) %>% 
  transmute(
    time = minute(time) + hour(time) * 60, #total minutes
    Distance = round(km) #round distance
  ) %>% 
  group_by(Distance) %>% #evaluate by group
  summarize(
    `Total distance` = sum(Distance),
    `Fastest time` = min(time)
  ) %>% 
  arrange(desc(`Total distance`)) %>% 
  knitr::kable()
 Distance   Total distance   Fastest time
        6              246             28
        9               54             45
       10               20             59
        8                8             43
        5                5             25

tibble

tibble - goal

Goal: tabular data that keeps only the good features of data.frame.

Pros:

Cons:
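
A quick sketch (my own, not from the slides) of two data.frame behaviours that tibbles drop:

library(tibble)

df <- data.frame(kilometers = 1:3)
df$km               #partial matching silently returns the kilometers column
df[, "kilometers"]  #single-column subsetting drops to a bare vector

tb <- tibble(kilometers = 1:3)
tb$km               #warning: unknown column, returns NULL
tb[, "kilometers"]  #still a 3 x 1 tibble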

tibble - what it looks like

d
##          date   km    time
## 1  10/27/2021  6.0 0:34:28
## 2  10/27/2021  6.0 0:33:24
## 3  10/31/2021 10.3 0:59:19
## 4   11/6/2021  8.7 0:54:00
## 5  11/13/2021  8.7 0:49:00
## 6  11/17/2021  4.6 0:25:28
## 7  11/21/2021 10.4 1:02:00
## 8  11/30/2021  6.0 0:34:08
## 9  11/30/2021  6.0 0:33:03
## 10  12/5/2021  6.2 0:35:00
## 11  12/9/2021  6.2 0:37:33
## 12 12/19/2021  6.2 0:35:22
## 13 12/22/2021  6.2 0:36:00
## 14 12/28/2021  6.2 0:35:00
## 15 12/31/2021  6.2 0:33:12
## 16   1/2/2022  6.2 0:33:43
## 17  1/12/2022  6.0 0:33:44
## 18  1/12/2022  6.0 0:33:45
## 19  1/16/2022  6.2 0:33:52
## 20  1/19/2022  6.0 0:33:38
## 21  1/19/2022  6.0 0:31:56
## 22  1/23/2022  6.2 0:34:26
## 23  1/26/2022  6.0 0:33:34
## 24  1/26/2022  6.0 0:32:19
## 25  1/30/2022  6.2 0:32:44
## 26   2/2/2022  6.0 0:33:59
## 27   2/2/2022  6.0 0:33:41
## 28   2/6/2022  6.2 0:34:35
## 29   2/9/2022  6.0 0:32:33
## 30   2/9/2022  6.0 0:32:20
## 31  2/12/2022  8.7 0:48:35
## 32  2/16/2022  6.2 0:34:02
## 33  2/19/2022  8.7 0:46:40
## 34  2/23/2022  6.0 0:34:03
## 35  2/23/2022  6.0 0:30:40
## 36  2/28/2022  8.1 0:43:18
## 37   3/2/2022  6.4 0:36:16
## 38   3/3/2022  6.4 0:38:22
## 39   3/9/2022  6.2 0:30:33
## 40  3/12/2022  8.7 0:45:20
## 41  3/16/2022  6.0 0:29:48
## 42  3/16/2022  6.0 0:29:48
## 43  3/19/2022  6.2 0:33:39
## 44  3/21/2022  6.0 0:30:57
## 45  3/21/2022  6.0 0:28:40
## 46  3/27/2022  8.7 0:47:03
## 47  3/30/2022  6.0 0:32:03
## 48  3/30/2022  6.0 0:30:03
## 49   4/3/2022  6.2 0:34:06
## 50   4/7/2022  6.0 0:29:59
## 51   4/7/2022  6.0 0:32:22

tibble - from data.frames

as_tibble(d)
## # A tibble: 51 × 3
##    date          km time   
##    <chr>      <dbl> <chr>  
##  1 10/27/2021   6   0:34:28
##  2 10/27/2021   6   0:33:24
##  3 10/31/2021  10.3 0:59:19
##  4 11/6/2021    8.7 0:54:00
##  5 11/13/2021   8.7 0:49:00
##  6 11/17/2021   4.6 0:25:28
##  7 11/21/2021  10.4 1:02:00
##  8 11/30/2021   6   0:34:08
##  9 11/30/2021   6   0:33:03
## 10 12/5/2021    6.2 0:35:00
## # … with 41 more rows

tibble - from matrices

mat <- matrix(1:100, 50, 2)
colnames(mat) <- c("x", "y")
as_tibble(mat)
## # A tibble: 50 × 2
##        x     y
##    <int> <int>
##  1     1    51
##  2     2    52
##  3     3    53
##  4     4    54
##  5     5    55
##  6     6    56
##  7     7    57
##  8     8    58
##  9     9    59
## 10    10    60
## # … with 40 more rows

tibble - from lists

l <- list(x = 1:10, y = 1:10)
as_tibble(l)
## # A tibble: 10 × 2
##        x     y
##    <int> <int>
##  1     1     1
##  2     2     2
##  3     3     3
##  4     4     4
##  5     5     5
##  6     6     6
##  7     7     7
##  8     8     8
##  9     9     9
## 10    10    10

tibble - create new data

tibble(x = LETTERS[1:10], y = 1:10)
## # A tibble: 10 × 2
##    x         y
##    <chr> <int>
##  1 A         1
##  2 B         2
##  3 C         3
##  4 D         4
##  5 E         5
##  6 F         6
##  7 G         7
##  8 H         8
##  9 I         9
## 10 J        10

tibble - compatibility issues with packages

moveHMM::prepData(as_tibble(l))
## Warning: Unknown or uninitialised column: `ID`.
## <error/vctrs_error_subscript_oob>
## Error in `vectbl_as_col_location()`:
## ! Can't subset columns past the end.
## ℹ Location 2 doesn't exist.
## ℹ There is only 1 column.
## ---
## Backtrace:
##  1. base::tryCatch(moveHMM::prepData(as_tibble(l)), error = function(e) print(e))
##  5. moveHMM::prepData(as_tibble(l))
##  7. tibble:::`[.tbl_df`(x, i)
##  8. tibble:::vectbl_as_col_location(j, length(x), names(x), j_arg = j_arg, assign = FALSE)

tibble - compatibility issues: as.data.frame()

moveHMM::prepData(as.data.frame(l))
##         ID     step        angle  x  y
## 1  Animal1 156.8744           NA  1  1
## 2  Animal1 156.8276 0.0004610422  2  2
## 3  Animal1 156.7574 0.0006155637  3  3
## 4  Animal1 156.6639 0.0007703128  4  4
## 5  Animal1 156.5472 0.0009253809  5  5
## 6  Animal1 156.4073 0.0010808595  6  6
## 7  Animal1 156.2444 0.0012368401  7  7
## 8  Animal1 156.0586 0.0013934143  8  8
## 9  Animal1 155.8500 0.0015506736  9  9
## 10 Animal1       NA           NA 10 10

tibble - printing options

options(tibble.print_max = 4, tibble.print_min = 4)
as_tibble(l)
## # A tibble: 10 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     3
## 4     4     4
## # … with 6 more rows

tibble - practical 1 - problem

  1. Create two vectors x and y with 10 random numbers (e.g. runif()).
  2. Create a tibble with columns x and y.

tibble - practical 1 - solution

  1. Create two vectors x and y with 10 random numbers (e.g. runif()).
  2. Create a tibble with columns x and y.
x <- runif(10)
y <- runif(10)
tibble(x, y)
## # A tibble: 10 × 2
##        x      y
##    <dbl>  <dbl>
## 1 0.771  0.905 
## 2 0.220  0.745 
## 3 0.394  0.0169
## 4 0.0518 0.622 
## # … with 6 more rows

tibble - practical 2 - problem

  1. Create a vector x with 10 random numbers (e.g. runif()).
  2. Create a vector y with 5 random numbers (e.g. runif()).
  3. Create a data.frame with columns x and y.
  4. Create a tibble with columns x and y.

tibble - practical 2 - solution

  1. Create a vector x with 10 random numbers (e.g. runif()).
  2. Create a vector y with 5 random numbers (e.g. runif()).
  3. Create a data.frame with columns x and y.
x <- runif(10)
y <- runif(5)
data.frame(x, y)
##            x         y
## 1  0.8002872 0.7593188
## 2  0.3531292 0.4823047
## 3  0.5712468 0.2036041
## 4  0.2484056 0.9589408
## 5  0.1329216 0.7326067
## 6  0.6193830 0.7593188
## 7  0.3570391 0.4823047
## 8  0.3006154 0.2036041
## 9  0.7815725 0.9589408
## 10 0.5639128 0.7326067

tibble - practical 2 - solution

  1. Create a vector x with 10 random numbers (e.g. runif()).
  2. Create a vector y with 5 random numbers (e.g. runif()).
  3. Create a data.frame with columns x and y.
  4. Create a tibble with columns x and y.
x <- runif(10)
y <- runif(5)
tibble(x, y)
## <error/tibble_error_incompatible_size>
## Error:
## ! Tibble columns must have compatible sizes.
## • Size 10: Existing data.
## • Size 5: Column at position 2.
## ℹ Only values of size one are recycled.
## ---
## Backtrace:
##  1. base::tryCatch(...)
##  5. tibble::tibble(x, y)
##  6. tibble:::tibble_quos(xs, .rows, .name_repair)
##  7. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])

tibble - practical 2 - solution

x <- runif(10)
tibble(x, y = 0)
## # A tibble: 10 × 2
##       x     y
##   <dbl> <dbl>
## 1 0.258     0
## 2 0.810     0
## 3 0.435     0
## 4 0.979     0
## # … with 6 more rows

readr

readr - goal

Goal: load tabular data from delimited text files (e.g. comma-separated).

Pros:

Cons:

readr - read data

d <- read.csv("kalenji.csv")
str(d)
## 'data.frame':    51 obs. of  3 variables:
##  $ date: chr  "10/27/2021" "10/27/2021" "10/31/2021" "11/6/2021" ...
##  $ km  : num  6 6 10.3 8.7 8.7 4.6 10.4 6 6 6.2 ...
##  $ time: chr  "0:34:28" "0:33:24" "0:59:19" "0:54:00" ...

readr - read data

d <- read_csv("kalenji.csv")
## Rows: 51 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): date
## dbl  (1): km
## time (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

readr - read data - suppress messages

d <- read_csv("kalenji.csv", show_col_types = FALSE) #suppress summary

readr - returns a tibble

readr loads the data directly as a tibble:

read_csv("kalenji.csv", show_col_types = FALSE)
## # A tibble: 51 × 3
##   date          km time  
##   <chr>      <dbl> <time>
## 1 10/27/2021   6   34'28"
## 2 10/27/2021   6   33'24"
## 3 10/31/2021  10.3 59'19"
## 4 11/6/2021    8.7 54'00"
## # … with 47 more rows

readr - good column guessing

readr guesses column types well; time is parsed as a proper time, so arithmetic just works:

d <- read_csv("kalenji.csv", show_col_types = FALSE)
d$time[2] - d$time[1]
## Time difference of -64 secs

readr - read general text delimited file

d <- read_delim("kalenji.txt", #general delimited file
                delim = ";",
                show_col_types = FALSE)

readr - write to file

readr can, of course, write to files.

write_csv(d, "copy-of-kalenji.csv")

readr - practical 1 - problem

  1. Create a tibble with 10 random x and 10 random y.
  2. Save it as random-numbers.csv.
  3. Save it as random-numbers.txt using ; as separator.
  4. Load them both.

readr - practical 1 - solution

  1. Create a tibble with 10 random x and 10 random y.
  2. Save it as random-numbers.csv.
  3. Save it as random-numbers.txt using ; as separator.
  4. Load them both.
d <- tibble(x = runif(10), y = runif(10))
# write files
write_csv(d, "random-numbers.csv")
write_delim(d, "random-numbers.txt", delim = ";")
# read files
d_csv <- read_csv("random-numbers.csv", show_col_types = FALSE)
d_semi <- read_delim("random-numbers.txt", show_col_types = FALSE, delim = ";")

readr - really faster?

For a 24 Mb table with 29,000 rows:

        test   replications   elapsed   relative
2       base             10     4.942      5.291
3 data.table             10     0.934      1.000
1      readr             10     3.457      3.701
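
The slides don't show how the timings were obtained; a sketch with rbenchmark (the file name is a placeholder, timings will differ on your machine) would look roughly like:

library(rbenchmark)
library(readr)
library(data.table)

f <- "large-table.csv" #placeholder for a ~24 MB file

benchmark(
  base = read.csv(f),
  readr = read_csv(f, show_col_types = FALSE),
  data.table = fread(f),
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)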

magrittr - pipe

Before going deeper: magrittr pipe %>%.

command_1 PIPE command_2 PIPE command_3

PIPE = take the output from the left and pass it as input to the right.

EXECUTE command_1 AND PASS ITS OUTPUT TO command_2 AND PASS ITS OUTPUT TO command_3.

round(mean(abs(x))) = x %>% abs() %>% mean() %>% round()
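
A minimal check that the two forms are equivalent:

library(magrittr) #or any tidyverse package

x <- c(-2.3, 1.7, -0.4)
round(mean(abs(x)))                 #nested calls, read inside-out
x %>% abs() %>% mean() %>% round()  #piped, read left-to-right
#both return 1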

magrittr - what’s a pipe?

Before going deeper: magrittr pipe %>%.

command PIPE command PIPE command

d <- read_csv("kalenji.csv", show_col_types = FALSE)
d <- d[, c("km", "time")]
plot(d)

magrittr - pipe your way to glory!

Before going deeper: magrittr pipe %>%.

command PIPE command PIPE command

#make sure input/output is of the correct type
read_csv("kalenji.csv", show_col_types = FALSE) %>% #output = tibble
  select(km, time) %>% #output = tibble
  plot() #output = figure

magrittr - pipe practical 1 - problem

  1. Create a tibble with x, y (runif(10)) and plot it without saving objects in memory.

magrittr - pipe practical 1 - solution

  1. Create a tibble with x, y (runif(10)) and plot it without saving objects in memory.
tibble(x = runif(10), y = runif(10)) %>% 
  plot()

magrittr - pipe practical 2 - problem

  1. Load kalenji.csv and plot it without saving objects in memory.

magrittr - pipe practical 2 - solution

  1. Load kalenji.csv and plot it without saving objects in memory.
read_csv("kalenji.csv", show_col_types = FALSE) %>% plot(pch = 20)

dplyr

dplyr - goal

Goal: Standardize data manipulation.

Pros:

Cons:

dplyr - lifecycle

Life cycle of tidyverse functions

dplyr - lifecycle in help()

A superseded example
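
For example (my example, not the one on the slide): sample_n() is superseded, and its help page points to slice_sample() as the current verb.

library(dplyr)
mtcars %>% sample_n(3)         #superseded: still works, no longer updated
mtcars %>% slice_sample(n = 3) #the current replacement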

dplyr - key verbs

Common verbs: filter(), select(), mutate(), pull(), summarize(), group_by(), and the *_join() family.

dplyr - filter()

filter() retains rows conditionally.

dplyr - filter() - example

filter() retains rows conditionally.

d %>% filter(km < 6)
## # A tibble: 1 × 2
##      km time  
##   <dbl> <time>
## 1   4.6 25'28"

dplyr - filter() - more examples

filter() retains rows conditionally.

d %>% filter(km < 6)
d %>% filter(km == 6)
d %>% filter(km > 4, km < 8) #greater than 4 AND less than 8
d %>% filter(km < 4 | km > 8) #less than 4 OR greater than 8
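
For closed intervals, dplyr also has between(); note it is inclusive at both ends, unlike the strict comparisons above:

d %>% filter(between(km, 4, 8)) #km >= 4 & km <= 8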

dplyr - filter() - practical - problem

  1. Retain only observations with even number of km (%% 2 == 0).
  2. Retain only observations between 5 and 7 km.

dplyr - filter() - practical - solution

  1. Retain only observations with even number of km (%% 2 == 0).
d %>% filter(km %% 2 == 0)
## # A tibble: 24 × 2
##      km time  
##   <dbl> <time>
## 1     6 34'28"
## 2     6 33'24"
## 3     6 34'08"
## 4     6 33'03"
## # … with 20 more rows

dplyr - filter() - practical - solution

  1. Retain only observations with even number of km (%% 2 == 0).
  2. Retain only observations between 5 and 7 km.
d %>% filter(km >= 5, km <= 7)
## # A tibble: 41 × 2
##      km time  
##   <dbl> <time>
## 1     6 34'28"
## 2     6 33'24"
## 3     6 34'08"
## 4     6 33'03"
## # … with 37 more rows

dplyr - select()

select() selects some columns and drops the rest.

dplyr - select() - example

select() selects some columns and drops the rest.

d %>% select(km) #d[, "km"]
## # A tibble: 51 × 1
##      km
##   <dbl>
## 1   6  
## 2   6  
## 3  10.3
## 4   8.7
## # … with 47 more rows

dplyr - select() - also reversed

You can also drop columns by prefixing them with -

d %>% select(-km)
## # A tibble: 51 × 1
##   time  
##   <time>
## 1 34'28"
## 2 33'24"
## 3 59'19"
## 4 54'00"
## # … with 47 more rows

dplyr - tidyselect intermezzo

What if we want to select several columns conditionally?

tidysel <- tibble(
  Outcome = runif(10, 2, 3),
  `Predictor 1` = runif(10, -1, 1),
  `Predictor 2` = runif(10, 10, 150),
  `First levels` = rep(c("A", "B"), 5),
  `Second levels` = rep(c("C", "D"), 5)
)
tidysel
## # A tibble: 10 × 5
##   Outcome `Predictor 1` `Predictor 2` `First levels` `Second levels`
##     <dbl>         <dbl>         <dbl> <chr>          <chr>          
## 1    2.37         0.328          72.2 A              C              
## 2    2.78        -0.619          87.7 B              D              
## 3    2.12        -0.924         133.  A              C              
## 4    2.66         0.282          64.6 B              D              
## # … with 6 more rows

tidyselect allows this flexibility.

dplyr - tidyselect intermezzo - select columns starting with “Pred”

tidysel %>% select(starts_with("Pred"))
## # A tibble: 10 × 2
##   `Predictor 1` `Predictor 2`
##           <dbl>         <dbl>
## 1         0.328          72.2
## 2        -0.619          87.7
## 3        -0.924         133. 
## 4         0.282          64.6
## # … with 6 more rows

dplyr - tidyselect intermezzo - select columns containing “levels”

tidysel %>% select(contains("levels"))
## # A tibble: 10 × 2
##   `First levels` `Second levels`
##   <chr>          <chr>          
## 1 A              C              
## 2 B              D              
## 3 A              C              
## 4 B              D              
## # … with 6 more rows

dplyr - tidyselect intermezzo - select only numeric columns

For this, we also need the where() helper, which checks a condition on each column.

Conditions must return TRUE or FALSE, e.g. is.numeric.

tidysel %>% select(where(is.numeric))
## # A tibble: 10 × 3
##   Outcome `Predictor 1` `Predictor 2`
##     <dbl>         <dbl>         <dbl>
## 1    2.37         0.328          72.2
## 2    2.78        -0.619          87.7
## 3    2.12        -0.924         133. 
## 4    2.66         0.282          64.6
## # … with 6 more rows

dplyr - tidyselect intermezzo - where

where is extremely useful for many verbs in dplyr. We will see more cases later.

You can already appreciate its simplicity, though:

# in base R
pca <- prcomp(tidysel[, c("Outcome", "Predictor 1", "Predictor 2")])
# in tidyverse
pca <- tidysel %>%
  select(where(is.numeric)) %>% 
  prcomp()

Application: PCAs on climate data
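
A sketch of that kind of application; the climate variables below are invented for illustration:

clim <- tibble(
  site = paste0("site_", 1:20),
  `Annual temperature` = runif(20, 5, 15),
  `Annual precipitation` = runif(20, 400, 1200),
  `Temperature seasonality` = runif(20, 2, 8)
)

pca <- clim %>% 
  select(where(is.numeric)) %>% 
  prcomp(scale. = TRUE)

summary(pca) #proportion of variance explained by each axis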

dplyr - tidyselect intermezzo - selection alternatives

In base R there are several ways to do the same as:

tidysel %>% select(where(is.numeric)) 

This is not safe:

tidysel[, c(1, 2, 3)]
#what if the position of columns changes?

This is not scalable:

tidysel[, c("Outcome", "Predictor 1", "Predictor 2")]
#what if you add a new numeric column?

The only safe and scalable way besides the tidyverse one is:

coltypes <- sapply(colnames(tidysel), \(x) class(tidysel[[x]]))
tidysel[, coltypes == "numeric"]
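
The same base-R idea can also be written more compactly (my shorthand, not from the slides):

tidysel[, sapply(tidysel, is.numeric)] #sapply() returns one TRUE/FALSE per column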

15-minute break

Don’t drink soap.

dplyr - mutate()

mutate() creates new columns (or modifies existing ones).

dplyr - mutate() - example

mutate() creates new columns (or modifies existing ones).

library(lubridate) #this will save you time
d <- read_csv("kalenji.csv", show_col_types = FALSE)
d <- d %>% mutate(Date = as_date(date, format = "%m/%d/%y"))

dplyr - mutate() - compact it please.

d <- read_csv("kalenji.csv", show_col_types = FALSE)
d <- d %>% mutate(Date = as_date(date, format = "%m/%d/%y"))
d
## # A tibble: 51 × 4
##   date          km time   Date      
##   <chr>      <dbl> <time> <date>    
## 1 10/27/2021   6   34'28" 2020-10-27
## 2 10/27/2021   6   33'24" 2020-10-27
## 3 10/31/2021  10.3 59'19" 2020-10-31
## 4 11/6/2021    8.7 54'00" 2020-11-06
## # … with 47 more rows

Can you compress this even more and make it more readable?

Do we need the old date column?

dplyr - mutate() - compacted

d <- read_csv("kalenji.csv", show_col_types = FALSE) %>% 
  mutate(date = as_date(date, format = "%m/%d/%y"))

dplyr - mutate() - practical - problem

  1. Create three new columns (month, week, day) using lubridate month(), week(), and day().

dplyr - mutate() - practical - solution

  1. Create three new columns (month, week, day) using lubridate month(), week(), and day().
d <- d %>% 
  mutate(
    month = month(date, label = TRUE, abbr = FALSE),
    week = week(date),
    day = day(date)
  )
d
## # A tibble: 51 × 6
##   date          km time   month     week   day
##   <date>     <dbl> <time> <ord>    <dbl> <int>
## 1 2020-10-27   6   34'28" October     43    27
## 2 2020-10-27   6   33'24" October     43    27
## 3 2020-10-31  10.3 59'19" October     44    31
## 4 2020-11-06   8.7 54'00" November    45     6
## # … with 47 more rows

dplyr - mutate() - tidyselect

mutate() can make use of tidyselect, but within the across() helper.

The syntax looks like: mutate(across( selection , fn)), where the selection is any tidyselect helper (where(), starts_with(), …).

tidysel %>% mutate(across(starts_with("Pred"), round, 1))
## # A tibble: 10 × 5
##   Outcome `Predictor 1` `Predictor 2` `First levels` `Second levels`
##     <dbl>         <dbl>         <dbl> <chr>          <chr>          
## 1    2.37           0.3          72.2 A              C              
## 2    2.78          -0.6          87.7 B              D              
## 3    2.12          -0.9         133.  A              C              
## 4    2.66           0.3          64.6 B              D              
## # … with 6 more rows
tidysel %>% mutate(across(where(is.character), tolower))
## # A tibble: 10 × 5
##   Outcome `Predictor 1` `Predictor 2` `First levels` `Second levels`
##     <dbl>         <dbl>         <dbl> <chr>          <chr>          
## 1    2.37         0.328          72.2 a              c              
## 2    2.78        -0.619          87.7 b              d              
## 3    2.12        -0.924         133.  a              c              
## 4    2.66         0.282          64.6 b              d              
## # … with 6 more rows

dplyr - pull()

pull() extracts one column as vector.

dplyr - pull() - example

pull() extracts one column as vector.

d %>% pull(km)
##  [1]  6.0  6.0 10.3  8.7  8.7  4.6 10.4  6.0  6.0  6.2  6.2  6.2  6.2  6.2  6.2
## [16]  6.2  6.0  6.0  6.2  6.0  6.0  6.2  6.0  6.0  6.2  6.0  6.0  6.2  6.0  6.0
## [31]  8.7  6.2  8.7  6.0  6.0  8.1  6.4  6.4  6.2  8.7  6.0  6.0  6.2  6.0  6.0
## [46]  8.7  6.0  6.0  6.2  6.0  6.0

dplyr - pull() - practical - problem

  1. Get the table() of month and sort it by count.

dplyr - pull() - practical - solution

  1. Get the table() of month and sort it by count.
d %>% 
  pull(month) %>% 
  table() %>% 
  sort()
## .
##       May      June      July    August September     April   October  November 
##         0         0         0         0         0         3         3         6 
##  December   January  February     March 
##         6        10        11        12

dplyr - summarize()

summarize() summarizes columns.

dplyr - summarize() - example

summarize() summarizes columns.

d %>% summarize(
  `shortest distance` = min(km),
  `average distance` = mean(km),
  `maximum distance` = max(km)
)
## # A tibble: 1 × 3
##   `shortest distance` `average distance` `maximum distance`
##                 <dbl>              <dbl>              <dbl>
## 1                 4.6               6.58               10.4

dplyr - summarize() - practical 1 - problem

  1. Summarize Total km as sum of all distances and Total time as sum of all times.

dplyr - summarize() - practical 1 - solution

  1. Summarize Total km as sum of all distances and Total time as sum of all times (as hours).
d %>% 
  summarize(
    `Total km` = sum(km),
    `Total time` = (time %>% sum() %>% as.numeric()) / 3600
  )
## # A tibble: 1 × 2
##   `Total km` `Total time`
##        <dbl>        <dbl>
## 1       335.         30.7

dplyr - summarize() - tidyselect

tidysel %>% summarize(
  across(contains("Pred"), #tidyselect
         list(min = min, max = max), #summary stats
         .names = "{col} : {fn}") #names of columns
) 
## # A tibble: 1 × 4
##   `Predictor 1 : min` `Predictor 1 : max` `Predictor 2 : min` `Predictor 2 : m…`
##                 <dbl>               <dbl>               <dbl>              <dbl>
## 1              -0.924               0.679                24.2               150.

dplyr - summarize() - practical 2 - problem

  1. Summarize all numeric variables to get mean and median values.

dplyr - summarize() - practical 2 - solution

  1. Summarize all numeric variables to get mean and median values.
d %>% 
  summarize(across(where(is.numeric), 
                   list(mean = mean, median = median),
                   .names = "{fn} of {col}"))
## # A tibble: 1 × 6
##   `mean of km` `median of km` `mean of week` `median of week` `mean of day`
##          <dbl>          <dbl>          <dbl>            <dbl>         <dbl>
## 1         6.58            6.2           19.6               11          17.1
## # … with 1 more variable: `median of day` <int>

dplyr - group_by()

dplyr operations can also be performed on groups within tables by using group_by().

For instance, the sum of all distances for the whole kalenji dataset is:

d %>% summarize(`Total distance` = sum(km))
## # A tibble: 1 × 1
##   `Total distance`
##              <dbl>
## 1             335.

dplyr - group_by() - example

To get the summary sum for each month separately, we just need to specify month as the grouping variable. dplyr would then perform the next operation on each group separately:

d %>% 
  group_by(month) %>% 
  summarize(`Total distance` = sum(km))
## # A tibble: 7 × 2
##   month    `Total distance`
##   <ord>               <dbl>
## 1 January              60.8
## 2 February             73.9
## 3 March                78.6
## 4 April                18.2
## # … with 3 more rows

dplyr - group_by() - practical 1 - problem

  1. Count the number of observations in each month

dplyr - group_by() - practical 1 - solution

  1. Count the number of observations in each month
d %>% 
  group_by(month) %>% 
  summarize(n = length(km))
## # A tibble: 7 × 2
##   month        n
##   <ord>    <int>
## 1 January     10
## 2 February    11
## 3 March       12
## 4 April        3
## # … with 3 more rows

dplyr - group_by() - practical 1 - tally()

  1. Count the number of observations in each month
d %>% 
  group_by(month) %>% 
  tally()
## # A tibble: 7 × 2
##   month        n
##   <ord>    <int>
## 1 January     10
## 2 February    11
## 3 March       12
## 4 April        3
## # … with 3 more rows
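
group_by() + tally() is so common that dplyr wraps both into count(); this one-liner (not shown on the slide) gives the same table:

d %>% count(month)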

dplyr - group_by() - practical 2 - problem

  1. Calculate the total distance per week within months

dplyr - group_by() - practical 2 - solution

  1. Calculate the total distance per week within months
d %>% 
  group_by(month, week) %>% 
  summarize(`Total km` = sum(km)) %>% 
  filter(week > 3) #simply to show you why week and month
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups:   month [7]
##   month     week `Total km`
##   <ord>    <dbl>      <dbl>
## 1 January      4       18.2
## 2 January      5        6.2
## 3 February     5       12  
## 4 February     6       18.2
## # … with 20 more rows

The output is still grouped.

dplyr - group_by() - practical 2 - ungroup()

If you don’t need groups, ungroup() the table, or you may get unexpected results later on.

d %>% 
  group_by(month, week) %>% 
  summarize(`Total km` = sum(km), .groups = "drop")

or

d %>% 
  group_by(month, week) %>% 
  summarize(`Total km` = sum(km)) %>% 
  ungroup()
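
A sketch of the kind of surprise a leftover group causes: after the grouped summarize(), sum() inside mutate() still runs per month, so share is the share within each month rather than of the grand total.

d %>% 
  group_by(month, week) %>% 
  summarize(`Total km` = sum(km)) %>%          #result is still grouped by month
  mutate(share = `Total km` / sum(`Total km`)) #sum() is evaluated per month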

dplyr - group_by() - practical 2 - lifecycle?

Always read the documentation

Lunch break

We’ll be back at 1.

dplyr - join()

[Figure: two tables to join → joined table]

dplyr - join() - example

left_table <- tibble(Group = LETTERS[1:3], x = rnorm(3))
right_table <- tibble(Group = LETTERS[2:4], y = rnorm(3))
left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11

dplyr - join() - types

The output table contains:

  1. full_join(left_table, right_table) - all levels
  2. inner_join(left_table, right_table) - only levels in both
  3. left_join(left_table, right_table) - only levels in left_table
  4. right_join(left_table, right_table) - only levels in right_table
left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11

dplyr - join() - full join

left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11
full_join(left_table, right_table)
## Joining, by = "Group"
## # A tibble: 4 × 3
##   Group       x      y
##   <chr>   <dbl>  <dbl>
## 1 A      0.0453 NA    
## 2 B      0.586   1.05 
## 3 C      0.770  -0.852
## 4 D     NA       1.11

dplyr - join() - inner join

left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11
inner_join(left_table, right_table)
## Joining, by = "Group"
## # A tibble: 2 × 3
##   Group     x      y
##   <chr> <dbl>  <dbl>
## 1 B     0.586  1.05 
## 2 C     0.770 -0.852

dplyr - join() - left join

left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11
left_join(left_table, right_table)
## Joining, by = "Group"
## # A tibble: 3 × 3
##   Group      x      y
##   <chr>  <dbl>  <dbl>
## 1 A     0.0453 NA    
## 2 B     0.586   1.05 
## 3 C     0.770  -0.852

dplyr - join() - right join

left_table
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     0.0453
## 2 B     0.586 
## 3 C     0.770
right_table
## # A tibble: 3 × 2
##   Group      y
##   <chr>  <dbl>
## 1 B      1.05 
## 2 C     -0.852
## 3 D      1.11
right_join(left_table, right_table)
## Joining, by = "Group"
## # A tibble: 3 × 3
##   Group      x      y
##   <chr>  <dbl>  <dbl>
## 1 B      0.586  1.05 
## 2 C      0.770 -0.852
## 3 D     NA      1.11

dplyr - join() - by?

Joining requires (at least) one joining variable. Avoid ambiguity: specify it.

left_table %>% left_join(right_table, by = "Group")
## # A tibble: 3 × 3
##   Group      x      y
##   <chr>  <dbl>  <dbl>
## 1 A     0.0453 NA    
## 2 B     0.586   1.05 
## 3 C     0.770  -0.852

dplyr - join() - by multiple variables

You can join by multiple variables.

left_table <- tibble(`First level` = sample(LETTERS, 100, replace = TRUE),
                     `Second level` = sample(letters, 100, replace = TRUE),
                     x = rnorm(100))
right_table <- tibble(`First level` = sample(LETTERS, 100, replace = TRUE),
                      `Second level` = sample(letters, 100, replace = TRUE),
                      y = rnorm(100))
inner_join(left_table, right_table, by = c("First level", "Second level"))
## # A tibble: 16 × 4
##   `First level` `Second level`      x      y
##   <chr>         <chr>           <dbl>  <dbl>
## 1 U             s               1.08   0.697
## 2 K             m              -0.106 -0.182
## 3 J             y              -1.26  -0.588
## 4 F             x               0.411  1.03 
## # … with 12 more rows

dplyr - join() - by different names

Variables do not need to have the same name.

left_table <- tibble(`First level` = sample(LETTERS, 100, replace = TRUE),
                     `Second level` = sample(letters, 100, replace = TRUE),
                     x = rnorm(100))
right_table <- tibble(`First group` = sample(LETTERS, 100, replace = TRUE),
                      `Second group` = sample(letters, 100, replace = TRUE),
                      y = rnorm(100))
inner_join(left_table, right_table,
           by = c("First level" = "First group", 
                  "Second level" = "Second group"))
## # A tibble: 26 × 4
##   `First level` `Second level`      x      y
##   <chr>         <chr>           <dbl>  <dbl>
## 1 L             h              -0.764  0.856
## 2 D             j              -0.447 -0.149
## 3 F             f               1.58  -0.118
## 4 T             u              -0.684  1.25 
## # … with 22 more rows

dplyr - join() - practical - problem

I have the names of the routes.

routes <- read_csv("routes.csv", show_col_types = FALSE)
routes
## # A tibble: 6 × 2
##   distance `route name` 
##      <dbl> <chr>        
## 1      4.6 stadium      
## 2      6   work         
## 3      6.4 first bridge 
## 4      8.7 second bridge
## # … with 2 more rows
  1. Assign the route names to the kalenji.csv table.
  2. Count the number of km I ran for each route.
  3. Arrange them in decreasing order.
  4. Rename missing routes as: “no name”.

dplyr - join() - practical - solution

  1. Assign the route names to the kalenji.csv table.
  2. Count the number of km I ran for each route.
  3. Arrange them in decreasing order.
  4. Rename missing routes as: “no name”.
d %>% 
  left_join(routes, by = c("km" = "distance")) %>% 
  group_by(`route name`) %>% 
  summarise(`Total km` = sum(km)) %>% 
  arrange(desc(`Total km`)) %>% 
  mutate(`route name` = ifelse(is.na(`route name`), "no name", `route name`))
## # A tibble: 6 × 2
##   `route name`  `Total km`
##   <chr>              <dbl>
## 1 work               144  
## 2 no name            101. 
## 3 second bridge       52.2
## 4 train lake          20.7
## # … with 2 more rows

The garmin.csv dataset

garmin dataset

garmin <- read_csv("garmin.csv", show_col_types = FALSE)
## New names:
## • `Avg Run Cadence` -> `Avg Run Cadence...10`
## • `Max Run Cadence` -> `Max Run Cadence...11`
## • `Avg Run Cadence` -> `Avg Run Cadence...20`
## • `Max Run Cadence` -> `Max Run Cadence...21`
garmin
## # A tibble: 74 × 41
##   `Activity Type` Date                Favorite Title    Distance Calories Time  
##   <chr>           <dttm>              <lgl>    <chr>       <dbl>    <dbl> <time>
## 1 Running         2022-09-08 08:19:19 FALSE    Stara Z…     5.35      484 36'08"
## 2 Running         2022-09-04 10:48:08 FALSE    Leipzig…     8.41      698 47'50"
## 3 Running         2022-09-02 07:04:01 FALSE    Leipzig…     6.76      524 39'15"
## 4 Running         2022-09-01 16:42:50 FALSE    Leipzig…     5.7       457 31'20"
## # … with 70 more rows, and 34 more variables: `Avg HR` <dbl>, `Max HR` <dbl>,
## #   `Avg Run Cadence...10` <dbl>, `Max Run Cadence...11` <dbl>,
## #   `Avg Pace` <chr>, `Best Pace` <chr>, `Total Ascent` <chr>,
## #   `Total Descent` <chr>, `Avg Stride Length` <dbl>,
## #   `Avg Vertical Ratio` <dbl>, `Avg Vertical Oscillation` <dbl>,
## #   `Avg Ground Contact Time` <dbl>, `Avg Run Cadence...20` <dbl>,
## #   `Max Run Cadence...21` <dbl>, `Training Stress Score®` <dbl>, …

We are interested only in Activity Type = Running and columns: Date, Distance, Time, Avg HR.

garmin dataset - practical 1 - problem

  1. Filter only Running entries.
  2. Select only relevant columns: Date, Distance, Time, Avg HR.
  3. Create new column Speed = Distance / Time (km/h).
  4. Plot them using pairs().

garmin dataset - practical 1 - solution

  1. Filter only Running entries.
  2. Select only relevant columns: Date, Distance, Time, Avg HR.
garmin <- read_csv("garmin.csv", show_col_types = FALSE) %>% 
  filter(`Activity Type` == "Running") %>% 
  select(Date, Distance, Time, `Avg HR`)
## New names:
## • `Avg Run Cadence` -> `Avg Run Cadence...10`
## • `Max Run Cadence` -> `Max Run Cadence...11`
## • `Avg Run Cadence` -> `Avg Run Cadence...20`
## • `Max Run Cadence` -> `Max Run Cadence...21`

garmin dataset - practical 1 - solution

  1. Filter only Running entries.
  2. Select only relevant columns: Date, Distance, Time, Avg HR.
  3. Create new column Speed = Distance / Time (km/h).
garmin <- garmin %>% 
  mutate(Hours = hour(Time) + minute(Time) / 60 + second(Time) / 3600,
         Speed = Distance / Hours) %>% 
  select(-Hours)

garmin dataset - practical 1 - solution

  1. Filter only Running entries.
  2. Select only relevant columns.
  3. Create new column Speed = Distance / Time (km/h).
  4. Plot them using pairs().
pairs(garmin, pch = 20, cex = 1.5)

garmin dataset - practical 2 - problem

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. (Full) join the tables returning Distance, Time, and Average Speed (km/h).

garmin dataset - practical 2 - solution

kalenji <- read_csv("kalenji.csv", show_col_types = FALSE) %>% 
  mutate(Hours = hour(time) + minute(time) / 60 + second(time) / 3600,
         Speed = km / Hours) %>% 
  transmute(Distance = km, Time = time, Speed)

garmin dataset - practical 2 - solution

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. Join the tables returning Distance, Time, and Average Speed (km/h).

For each dataset: round() %>% summarize() %>% join().

garmin dataset - practical 2 - solution

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. Join the tables returning Distance, Time, and Average Speed (km/h).

For each dataset: round() %>% summarize() %>% join().

However, one round() %>% summarize() can be nested into join():

left_table %>% 
  round() %>% 
  summarize() %>% 
  join(
    right_table %>% 
      round() %>% 
      summarize()
  )

garmin dataset - practical 2 - solution

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. Join the tables returning Distance, Time, and Average Speed (km/h).
avg_speed <- kalenji %>% 
  mutate(Distance = round(Distance)) %>% 
  group_by(Distance) %>% 
  summarize(`Average Speed` = mean(Speed)) %>% 
  full_join(
    garmin %>% 
      mutate(Distance = round(Distance)) %>% 
      group_by(Distance) %>% 
      summarize(`Average Speed` = mean(Speed)),
    by = "Distance",
    suffix = c(" - kalenji", " - garmin")
  )

Line indentations (should) matter

garmin dataset - practical 2 - solution

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. Join the tables returning Distance, Time, and Average Speed (km/h).
avg_speed
## # A tibble: 10 × 3
##   Distance `Average Speed - kalenji` `Average Speed - garmin`
##      <dbl>                     <dbl>                    <dbl>
## 1        5                      10.8                    10.2 
## 2        6                      11.0                    10.9 
## 3        8                      11.2                    10.5 
## 4        9                      10.8                     9.68
## # … with 6 more rows

garmin dataset - practical 2 - line indentations matter

avg_speed <- kalenji %>% 
  mutate(Distance = round(Distance)) %>% 
  group_by(Distance) %>% 
  summarize(`Average Speed` = mean(Speed)) %>% 
  full_join(
    garmin %>% 
      mutate(Distance = round(Distance)) %>% 
      group_by(Distance) %>% 
      summarize(`Average Speed` = mean(Speed)),
    by = "Distance",
    suffix = c(" - kalenji", " - garmin")
  )

garmin dataset - practical 2 - no consistent indentations = you’re a monster

avg_speed <- kalenji %>% 
  mutate(Distance = round(Distance)) %>% group_by(Distance) %>% 
  summarize(
    `Average Speed` = mean(Speed)) %>% 
  full_join(garmin %>% 
              mutate(Distance = round(Distance)) %>% group_by(Distance) %>% 
              summarize(`Average Speed` = mean(Speed)), by = "Distance",
            suffix = c(
              " - kalenji",
              " - garmin"
            ))

garmin dataset - practical 2 - how do you even write?

Nel mezzo
del
cammin di      nostra vita, mi
ritrovai perunaselvaoscura
che
la
diritta
via     era       smarrita.

garmin dataset - practical 2 - writing and coding are not that different

Nel mezzo del cammin di nostra vita,
mi ritrovai per una selva oscura
che la diritta via era smarrita.

One line = one operation

One block = one coherent procedure

lines -> blocks -> pipeline

garmin dataset - practical 2 - considerations

  1. Round distances for both datasets kalenji and garmin to integers.
  2. Calculate average Speed per Distance.
  3. Join the tables returning Distance, Time, and Average Speed (km/h).

All manipulations were done on the fly, i.e. without saving intermediate objects unless explicitly assigned: avg_speed is the only new object. The old tables are unchanged:

garmin$Distance #not rounded
##  [1]  5.35  8.41  6.76  5.70  5.70  2.00  3.82  5.69 10.54  5.84  5.72  5.71
## [13]  5.67  9.52  6.01  5.81  5.74  6.03  5.78  7.04  6.06  5.83  5.84  5.79
## [25]  5.81  8.08  6.15  5.38  5.83  5.77  4.38  1.40  9.29  5.71  5.72  5.80
## [37]  6.07  7.02  6.53  6.61  8.42 10.49  6.04  5.96  6.97  8.42  6.01  5.73
## [49]  5.78  5.58  5.04  6.01  5.71  4.50 10.01  5.75  5.84  5.91  5.81 10.10
## [61]  6.00  6.07  6.21  6.05  5.66  5.71  7.89

garmin dataset - practical 2 - more considerations

Let’s say we want to compare Average Speed between the two methods, e.g. boxplot().

How would you manipulate this table?

avg_speed
## # A tibble: 10 × 3
##   Distance `Average Speed - kalenji` `Average Speed - garmin`
##      <dbl>                     <dbl>                    <dbl>
## 1        5                      10.8                    10.2 
## 2        6                      11.0                    10.9 
## 3        8                      11.2                    10.5 
## 4        9                      10.8                     9.68
## # … with 6 more rows

Tidy data: one row = one observation.

tidyr

tidyr - goal

Goal: To achieve tidy data.

Pros:

Cons:

tidyr - pivoting

The most frequent use of tidyr is for pivoting: from long to wide and back again.

tidyr - long and wide tables

The most frequent use of tidyr is for pivoting: from long to wide and back again.

Long table

long <- tibble(
  Group = LETTERS[1:3],
  x = rnorm(3)
)
long
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     -0.418
## 2 B     -0.754
## 3 C     -0.222

Wide table

wide <- tibble(
  A = rnorm(1),
  B = rnorm(1), 
  C = rnorm(1)
)
wide
## # A tibble: 1 × 3
##        A     B     C
##    <dbl> <dbl> <dbl>
## 1 -0.613 -1.78  1.51

tidyr - wide to long

pivot_longer(): wide to long.

wide %>% pivot_longer(cols = everything(),
                      names_to = "Group",
                      values_to = "x")
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     -0.613
## 2 B     -1.78 
## 3 C      1.51

tidyr - wide to long and back again

pivot_longer(): wide to long.

wide %>% pivot_longer(cols = everything(),
                      names_to = "Group",
                      values_to = "x")
## # A tibble: 3 × 2
##   Group      x
##   <chr>  <dbl>
## 1 A     -0.613
## 2 B     -1.78 
## 3 C      1.51

pivot_wider(): long to wide.

long %>% pivot_wider(names_from = Group,
                     values_from = x)
## # A tibble: 1 × 3
##        A      B      C
##    <dbl>  <dbl>  <dbl>
## 1 -0.418 -0.754 -0.222
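
The two functions are inverses; a quick round trip (my addition) recovers the original wide layout:

wide %>% 
  pivot_longer(cols = everything(), names_to = "Group", values_to = "x") %>% 
  pivot_wider(names_from = Group, values_from = x)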

tidyr - practical - problem

avg_speed
## # A tibble: 10 × 3
##   Distance `Average Speed - kalenji` `Average Speed - garmin`
##      <dbl>                     <dbl>                    <dbl>
## 1        5                      10.8                    10.2 
## 2        6                      11.0                    10.9 
## 3        8                      11.2                    10.5 
## 4        9                      10.8                     9.68
## # … with 6 more rows
  1. Make avg_speed table longer: three columns Distance, Speed, and Method.
  2. Method should have values kalenji or garmin only.
  3. Boxplot: Speed ~ Method.

tidyr - practical - solution

  1. Make avg_speed table longer: three columns Distance, Speed, and Method.
avg_speed <- avg_speed %>% 
  pivot_longer(cols = contains("Speed"),
               names_to = "Method",
               values_to = "Speed")

tidyr - practical - solution

  1. Make avg_speed table longer: three columns Distance, Speed, and Method.
  2. Method should have values kalenji or garmin only.
avg_speed <- avg_speed %>% 
  mutate(Method = gsub("[[:alpha:]]+ [[:alpha:]]+ - ", "", Method))

tidyr - practical - solution

  1. Make avg_speed table longer: three columns Distance, Speed, and Method.
  2. Method should have values kalenji or garmin only.
  3. Boxplot: Speed ~ Method.
boxplot(Speed ~ Method, data = avg_speed)
grid(col = "grey20", lwd = .5)

tidyverse - final practical - problem

  1. Plot total distance per week, coloring points according to sampling method.
  2. Plot average speed per week, coloring points according to sampling method.

tidyverse - final practical - solution

d <- read_csv("kalenji.csv", show_col_types = FALSE) %>% 
  mutate(date = as_date(date, format = "%m/%d/%y"),
         week = week(date),
         speed = km / ( hour(time) + minute(time) / 60 )) %>% 
  group_by(week) %>% 
  summarize(`Total km` = sum(km),
            `Avg speed` = mean(speed)) %>% 
  mutate(method = "Kalenji") %>%
  bind_rows(
    read_csv("garmin.csv", show_col_types = FALSE) %>% 
      filter(`Activity Type` == "Running") %>% 
      mutate(Date = as_date(Date),
             week = week(Date),
             speed = Distance / ( hour(Time) + minute(Time) / 60 )) %>% 
      group_by(week) %>% 
      summarize(`Total km` = sum(Distance),
                `Avg speed` = mean(speed)) %>% 
      mutate(method = "Garmin")
  ) %>% 
  mutate(week = ifelse(week > 40, week - 53, week)) #these are last year weeks
## New names:
## • `Avg Run Cadence` -> `Avg Run Cadence...10`
## • `Max Run Cadence` -> `Max Run Cadence...11`
## • `Avg Run Cadence` -> `Avg Run Cadence...20`
## • `Max Run Cadence` -> `Max Run Cadence...21`

tidyverse - final practical - plot

par(mfrow = c(1, 2))
with(d, plot(week, `Total km`, col = as.factor(method)))
with(d, plot(week, `Avg speed`, col = as.factor(method)))

I probably wasn’t so accurate with the Kalenji watch.

The end