13 December 2024

Why multivariate analysis?

Landscape of tools

What we will cover today

Function for plotting

The iris dataset

head(iris, 4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

From left to right: I. setosa, I. versicolor, I. virginica.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA)

PCA is a tool to compress the information from one table into a more manageable space, the PC space.

The PC space is a new coordinate system in which the directions (the principal components) capture the largest variation in the data.

PCA principles

You can code it by hand…

# scale and center table
d <- as.data.frame(scale(iris[, 1:4], center = TRUE, scale = TRUE))

# covariance matrix
covariance_matrix <- cov(d)

# eigenvalues
lambdas <- eigen(covariance_matrix)$values
importance_principal_components <- lambdas / sum(lambdas)

# eigenvectors: these are the principal components
# (eigen() already returns unit-length vectors, so no rescaling is needed)
principal_components <- eigen(covariance_matrix)$vectors

… but you don’t need to in R (see next slide).
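To complete the hand-rolled version, the coordinates of each sample in PC space (the "scores") come from projecting the scaled data onto the eigenvectors; a minimal sketch, reusing d and principal_components from above:

# project the scaled data onto the principal components to get the scores
scores <- as.matrix(d) %*% principal_components
head(scores, 3)  # matches the scores from prcomp() (next slide) up to the sign of each column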

PCA in R

In R, prcomp() runs the lines from the previous slide for you. To perform a PCA, you just need to call prcomp(). The syntax is

prcomp(x = <table>, center = TRUE, scale = TRUE)  # always set center and scale to TRUE

where <table> is the table you want to perform PCA on.

PCA in R

Let’s perform a PCA on the iris dataset, but only selecting the columns that relate to continuous traits (columns 1 to 4).

pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

PCA in R

You can see information about the principal components (PCs) with the summary() function.

summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

Ignore the row Standard deviation for now.

  • The row Proportion of Variance is the fraction of the total variance explained by each PC.
  • The row Cumulative Proportion is the proportion of variance explained by a PC plus the proportions explained by all PCs with a lower index. For instance, the cumulative proportion of PC3 is the proportion of PC1 + the proportion of PC2 + the proportion of PC3.
  • You want to select PCs that, together, explain at least circa 70-80 % of the variance.
  • This means that you want the PCs for which Cumulative Proportion \(\geq 0.70\) or \(\geq 0.80\) (see the sketch after this list).
  • In this example, I am happy selecting only the first two PCs (Cumulative Proportion = 0.96).
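You can also get these numbers programmatically instead of reading them off summary(); a minimal sketch, reusing the pca object from before:

# proportion of variance: squared standard deviation of each PC over the total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop_var)  # cumulative proportion, as in summary(pca)
which(cumsum(prop_var) >= 0.80)[1]  # first PC at which we reach 80 %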

PCA biplot

PCA results can be plotted with the function biplot(), which draws the data points as points and the variables (columns) as arrows.

biplot(pca, xlabs = rep("+", nrow(iris)))

The argument xlabs in biplot() sets the labels used for the data points; here every point is drawn as a "+".

  1. Arrows (variables) that point in the same direction are positively correlated.
  2. Arrows (variables) that point in opposite directions are negatively correlated.
  3. Arrows (variables) at a 90 degree angle are uncorrelated.
  • Petal.Length, Petal.Width, and Sepal.Length are all correlated.
  • Sepal.Width is uncorrelated with the other variables (you can check the numbers with cor(), as sketched below).
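If you want the numbers behind this reading, a quick check with cor():

# correlation matrix of the four traits; compare with the biplot interpretation
round(cor(iris[, 1:4]), 2)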

Petals and Sepals

You can also see these results by plotting the columns against each other. Usually, you do PCA precisely to avoid this, but here it helps to check that we understand what PCA is doing.
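A minimal sketch of such a pairwise plot, using base R's pairs() and the color_for_species() helper from the first slide:

# plot every pair of trait columns against each other, colored by species
pairs(iris[, 1:4], col = color_for_species(iris$Species))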

You can see that our interpretation from PCA is correct.

A color PCA biplot

Base R does not have good dedicated plotting tools for PCA, but you can plot the PC scores yourself. See the first slide for color_for_species().

plot(pca$x[, 1:2], col = color_for_species(iris$Species))
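If you also want a legend, base R's legend() works; a sketch, assuming color_for_species() maps each species name to a single color:

# label the three species in the top-right corner of the plot
legend("topright", legend = levels(iris$Species),
       col = color_for_species(levels(iris$Species)), pch = 1)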

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA), also known as metric Multi-Dimensional Scaling (mMDS), is similar to PCA, but it works on distance matrices instead of raw data.

For example, the Euclidean distance between rows \(i\) and \(j\): \(D_{ij} = \sqrt{\sum_k{(x_{ik} - x_{jk})^2}}\), where the sum runs over the columns \(k\).

PCoA tries to arrange the data on a plot so that the distances between plotted points are proportional to the distances between the data.

In this figure, if the difference in color indicates the dissimilarity between points, then we want points with more similar colors to be closer together.

Principal Coordinate Analysis (PCoA)

Euclidean distance: \(D_{ij} = \sqrt{\sum_k{(x_{ik} - x_{jk})^2}}\)

# calculating by hand Euclidean distance
sqrt(sum( (iris[1, 1:4] - iris[2, 1:4])^2 ))  
[1] 0.5385165
distance <- dist(iris[, 1:4])  # using dist()

The function dist(x = <table>, method = <method>) calculates the pairwise distance between all rows of <table> using the distance metric specified by <method> (default = "euclidean"). Basically, it repeats the manual calculation above for every pair of rows.

The output, once converted with as.matrix(), is a symmetric matrix whose entries are the distances between rows.

as.matrix(distance)[1:3, 1:3]
          1         2        3
1 0.0000000 0.5385165 0.509902
2 0.5385165 0.0000000 0.300000
3 0.5099020 0.3000000 0.000000

See how the entry as.matrix(distance)[1, 2] (equal to as.matrix(distance)[2, 1], since the matrix is symmetric) is the same as what I calculated by hand.
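dist() supports other metrics through its method argument; for example, a sketch with the Manhattan distance:

# Manhattan distance (sum of absolute differences) instead of Euclidean
manhattan_distance <- dist(iris[, 1:4], method = "manhattan")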

Principal Coordinate Analysis (PCoA)

image(as.matrix(distance), col = hcl.colors(100, "Zissou1"))

You can visualise the distance matrix with image(). Unfortunately, image() rotates the matrix by 90 degrees, i.e. the rows and columns of the image are swapped (it does not matter here, as the matrix is symmetric). I use hcl.colors() to change the color palette.

PCoA in R

In base R, cmdscale(d = <distance>) performs a PCoA, where <distance> is the distance matrix (calculated before).

distance <- dist(iris[, 1:4])
pcoa <- cmdscale(distance)
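cmdscale() can also return the eigenvalues of the decomposition, which play a role similar to the variance explained in PCA; a sketch:

# eig = TRUE returns the eigenvalues alongside the points
pcoa_eig <- cmdscale(distance, k = 2, eig = TRUE)
# proportion of variation captured by the two plotted axes
pcoa_eig$eig[1:2] / sum(pcoa_eig$eig[pcoa_eig$eig > 0])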

PCoA in R

plot(pcoa)

PCoA vs PCA

Are PCoA results much different from PCA?

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))  # two plots side by side
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))
# flip PC2: the sign of each PC is arbitrary, and this matches the PCoA orientation
plot(pca$x[, 1], -pca$x[, 2], col = color_for_species(iris$Species))

Not in this example (on Euclidean distances, PCoA is essentially equivalent to a PCA of the unscaled data), but in general they can be.

non-metric Multi-Dimensional Scaling (nMDS)

nMDS

PCoA tries to arrange the data on a plot so that the distances between plotted points are proportional to the distances between the data.

Sometimes this is not possible.

Non-metric Multi-Dimensional Scaling (nMDS) extends PCoA to find the best possible representation of the points on a plot: instead of preserving the distances themselves, it tries to preserve only their rank order.

nMDS in R

There are several nMDS implementations in R, but none in base R. I will use isoMDS() from the MASS package, simply because MASS is typically installed together with R itself (it is a "recommended" package).

I need to add a small amount (1e-9 = 0.000000001) to the distance matrix because MASS::isoMDS() does not allow distances of zero between rows. Two rows in iris are identical and therefore have distance = 0. In general, you don't need to do this.

nMDS <- MASS::isoMDS(distance + 1e-9)  # distances cannot be zero
initial  value 3.025865 
iter   5 value 2.637651
final  value 2.582478 
converged

The algorithm works iteratively and stops when a quantity called "stress" stops improving enough. Don't think too much about it; just remember that nMDS moves points around several times until it decides they are good enough. In this example, isoMDS() moves the points around 5 times (5 iterations) before stopping.
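If you want the final stress back as a number, isoMDS() returns it in its result; a sketch (isoMDS() reports stress as a percentage, and as a common rule of thumb values below roughly 10 indicate a good fit):

nMDS$stress  # final stress of the configuration, in percent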

nMDS in R

plot(nMDS$points, col = color_for_species(iris$Species))

nMDS vs PCoA

Are nMDS results much different from PCoA?

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
plot(pcoa, col = color_for_species(iris$Species))
plot(nMDS$points, col = color_for_species(iris$Species))

Not in this example, but in general they can be.

K-means clustering

K-means clustering

  • Partition \(n\) observations into \(k\) clusters (groups).
  • Often done after PCA or PCoA.
  • Useful to find groups in your data.
  • In base R, use kmeans().
  • Beyond the scope of this course, but useful to remember: if you want to identify groups in your data, do a PCA (prcomp()) or PCoA (cmdscale()) and then a K-means clustering (kmeans()); see the sketch below.

In the figure above, colors show the clusters and shapes the species.
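A minimal sketch of that workflow, reusing the pca object from earlier and asking for three clusters (iris has three species):

set.seed(1)  # kmeans() starts from random centers, so fix the seed for reproducibility
clusters <- kmeans(pca$x[, 1:2], centers = 3)
# colors show the clusters, shapes the species, as in the figure above
plot(pca$x[, 1:2], col = clusters$cluster, pch = as.numeric(iris$Species))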

Take-home messages

  • When you have multiple response variables or too many dimensions, you need multivariate analysis.
  • Multivariate analysis compresses information so that you can work with it more easily.
  • If you have raw values, use PCA.
  • If you have distances or dissimilarities, use PCoA or nMDS.
  • K-means clustering: unsupervised machine learning to find groups.