Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Understand what problem dimension reduction methods are solving

  • Apply principal component analysis (PCA) to the data

    • preprocessing
    • compute PCA with prcomp()
    • extract the first k PCs: broom::augment()
    • inspect the scree plot to choose the number of PCs to retain
    • plot the data projected onto the first two PCs

Motivation - dimension reduction

  • Hand shadow puppet: 3D hand poses reduced to 2D shadows (Projection)

We want the projection to resemble the original data as much as possible.

Principal component analysis

  • Principal component analysis is a linear dimension reduction technique that finds a low-dimensional representation of a dataset retaining as much of the variation as possible.
  • It can be formulated as an optimization problem to maximize the variance of the projected data, subject to some constraint.

  • The solution to this optimization turns out to be the singular value decomposition of the data matrix \(X\) (provided the columns of \(X\) have mean zero).

  • All of these will be taken care of by the function prcomp().
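
As a quick check of this claim (a minimal sketch with simulated toy data), the right singular vectors of a centered and scaled matrix match the rotation returned by prcomp():

set.seed(1)                                   # toy data, for illustration only
X <- scale(matrix(rnorm(100 * 3), ncol = 3))  # center and scale 100 x 3 data
sv <- svd(X)
sv$v                       # right singular vectors
prcomp(X)$rotation         # same columns, up to sign flips
sv$d / sqrt(nrow(X) - 1)   # rescaled singular values = prcomp(X)$sdev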

PCA example - Penguins data again

library(tidyverse)
as_tibble(penguins)
# A tibble: 344 × 8
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
3 Adelie  Torgersen     40.3     18           195      3250 female  2007
4 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
5 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
# ℹ 339 more rows

Do you think missing values would be a problem for PCA?

penguins_clean <- penguins |> na.omit()
penguins_clean |> head()
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
7  Adelie Torgersen     38.9     17.8         181      3625 female 2007

PCA example - Penguins data

The function prcomp() in R can be used to perform PCA. The data is centered by default (center = TRUE).

  • We would also like to scale the data (scale = TRUE) since the features are measured in different units (mm vs g).
  • Alternatively, you may center and scale the data yourself before applying PCA, and then set center = FALSE and scale = FALSE (see the check after the output below).
res_pca <- prcomp(penguins_clean[, 3:6], scale = TRUE)
res_pca
Standard deviations (1, .., p=4):
[1] 1.6569115 0.8821095 0.6071594 0.3284579

Rotation (n x k) = (4 x 4):
                   PC1         PC2        PC3        PC4
bill_len     0.4537532 -0.60019490 -0.6424951  0.1451695
bill_dep    -0.3990472 -0.79616951  0.4258004 -0.1599044
flipper_len  0.5768250 -0.00578817  0.2360952 -0.7819837
body_mass    0.5496747 -0.07646366  0.5917374  0.5846861
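
As a sanity check (a sketch), centering and scaling manually with scale() and then turning off prcomp()'s own preprocessing reproduces the same result:

res_manual <- prcomp(scale(penguins_clean[, 3:6]),
                     center = FALSE, scale = FALSE)
all.equal(res_manual$rotation, res_pca$rotation)  # should be TRUE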

Penguins data - Inspect the PCA results

  • sdev: standard deviations of the principal components - their squares are the eigenvalues, which capture the variance explained by each PC
res_pca$sdev
[1] 1.6569115 0.8821095 0.6071594 0.3284579
  • rotation: principal components (eigenvectors) - directions of maximum variance
res_pca$rotation
                   PC1         PC2        PC3        PC4
bill_len     0.4537532 -0.60019490 -0.6424951  0.1451695
bill_dep    -0.3990472 -0.79616951  0.4258004 -0.1599044
flipper_len  0.5768250 -0.00578817  0.2360952 -0.7819837
body_mass    0.5496747 -0.07646366  0.5917374  0.5846861
  • center / scale: the column means and standard deviations used for centering and scaling (if applied)
res_pca$center
   bill_len    bill_dep flipper_len   body_mass 
   43.99279    17.16486   200.96697  4207.05706 
res_pca$scale
   bill_len    bill_dep flipper_len   body_mass 
   5.468668    1.969235   14.015765  805.215802 

Diagnostics: Choose k

res_pca_var <- tibble(
  n = 1:length(res_pca$sdev), 
  sd = res_pca$sdev, 
  var = res_pca$sdev^2,
  prop = var / sum(var))
res_pca_var
# A tibble: 4 × 4
      n    sd   var   prop
  <int> <dbl> <dbl>  <dbl>
1     1 1.66  2.75  0.686 
2     2 0.882 0.778 0.195 
3     3 0.607 0.369 0.0922
4     4 0.328 0.108 0.0270
res_pca_var |> 
  mutate(cum_prop = cumsum(prop))
# A tibble: 4 × 5
      n    sd   var   prop cum_prop
  <int> <dbl> <dbl>  <dbl>    <dbl>
1     1 1.66  2.75  0.686     0.686
2     2 0.882 0.778 0.195     0.881
3     3 0.607 0.369 0.0922    0.973
4     4 0.328 0.108 0.0270    1    
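Base R's summary() reports the same standard deviations, proportions of variance, and cumulative proportions in one call:

summary(res_pca)  # Importance of components: sdev, proportion, cumulative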
ggplot(res_pca_var, aes(x = n, y = prop)) + 
  geom_line() +
  geom_text(aes(label = round(prop, 3)), 
            vjust = -0.5, hjust = -0.1,
            size = 5) +
  xlab("Number of PCs") + 
  ylab("Percentage of variance explained") +
  theme_bw() + 
  theme(text = element_text(size = 20)) + 
  ylim(0, 1)

This plot is called a scree plot - it is used to decide how many principal components to retain. Choose the number at the “elbow” point: k = 2 here.

Scree plot

The fviz_eig() function (also available as fviz_screeplot()) from the factoextra package produces the scree plot in one command.

factoextra::fviz_eig(res_pca, addlabels = TRUE)

Diagnostics: plot the first 2 PCs

Obtain the projected data:

Option 1: use base R predict()

as_tibble(predict(res_pca))
# A tibble: 333 × 4
    PC1     PC2     PC3    PC4
  <dbl>   <dbl>   <dbl>  <dbl>
1 -1.85 -0.0320  0.235   0.528
2 -1.31  0.443   0.0274  0.401
3 -1.37  0.161  -0.189  -0.528
4 -1.88  0.0123  0.628  -0.472
5 -1.92 -0.816   0.700  -0.196
# ℹ 328 more rows
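
With no new data supplied, predict() simply returns the scores stored in the fit, so this is equivalent to res_pca$x:

all.equal(predict(res_pca), res_pca$x)  # TRUE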

Option 2: use broom::augment()

fitted_df <- res_pca |> broom::augment(penguins_clean)
fitted_df
# A tibble: 333 × 13
  .rownames species island    bill_len bill_dep flipper_len body_mass
  <chr>     <fct>   <fct>        <dbl>    <dbl>       <int>     <int>
1 1         Adelie  Torgersen     39.1     18.7         181      3750
2 2         Adelie  Torgersen     39.5     17.4         186      3800
3 3         Adelie  Torgersen     40.3     18           195      3250
4 5         Adelie  Torgersen     36.7     19.3         193      3450
5 6         Adelie  Torgersen     39.3     20.6         190      3650
# ℹ 328 more rows
# ℹ 6 more variables: sex <fct>, year <int>, .fittedPC1 <dbl>,
#   .fittedPC2 <dbl>, .fittedPC3 <dbl>, .fittedPC4 <dbl>
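
One way to plot the projected data from the augmented data frame (a sketch; coloring by species anticipates the discussion below):

ggplot(fitted_df, aes(x = .fittedPC1, y = .fittedPC2, color = species)) +
  geom_point() +
  labs(x = "PC1", y = "PC2") +
  theme_bw()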

Diagnostics: plot the first 2 PCs

factoextra::fviz_pca_var(res_pca)

Diagnostics: plot the first 2 PCs

factoextra::fviz_pca_biplot(res_pca)

Diagnostics: plot the first 2 PCs

  • Gentoo can be separated from the other two species by bill depth (PC1).

  • Between Adelie and Chinstrap, bill length helps separate them (PC2).

This matches the summary statistics:

# A tibble: 3 × 3
  species   bill_dep bill_len
  <fct>        <dbl>    <dbl>
1 Adelie        18.3     38.8
2 Chinstrap     18.4     48.8
3 Gentoo        15.0     47.6
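
The code for this table is not shown on the slide; one way to reproduce these group means (a sketch):

penguins_clean |>
  group_by(species) |>
  summarise(bill_dep = mean(bill_dep),
            bill_len = mean(bill_len))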

What’s more

  • Principal component analysis (PCA) is a linear dimension reduction method; there are other linear methods, e.g.
    • Projection Pursuit (PP)
  • There are also non-linear dimension reduction methods, e.g. 
    • Multidimensional Scaling (MDS),
    • t-distributed Stochastic Neighbor Embedding (t-SNE), and
    • Uniform Manifold Approximation and Projection (UMAP)

Appendix

Data:

\[\begin{aligned} X_{n\times p} & = [X_{1} \enspace X_{2} \enspace \dots \enspace X_{p}]_{n\times p} \\ & = \left[ \begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right]_{n\times p} \end{aligned}\]

We require the column means of \(X\) to be zero (centered data): \(\bar{X}_j = 0\) for \(j = 1, \dots, p\)

Projected data: \[\begin{aligned} Z_{n\times k} = X\Phi &=[Z_{1} \enspace \dots \enspace Z_{k}]_{n\times k} \\ & = \left[ \begin{array}{cccc} z_{11} & \dots & z_{1k} \\ z_{21} & \dots & z_{2k}\\ \vdots & & \vdots \\ z_{n1} & \dots & z_{nk} \end{array} \right]_{n\times k} \end{aligned}\]
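
On the penguins fit, this product can be checked directly (a sketch; scale() reproduces the centering and scaling prcomp() applied):

X <- scale(penguins_clean[, 3:6])   # centered, scaled data matrix
Z <- X %*% res_pca$rotation         # Z = X Phi
all.equal(Z, predict(res_pca), check.attributes = FALSE)  # should be TRUE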

Principal component analysis

In PCA, \(\Phi\) is called the rotation or the loading matrix:

\[\begin{aligned} \Phi_{p \times k} & = [\phi_{1} \enspace \phi_{2} \enspace \dots \enspace \phi_{k}]_{p\times k} \\ & = \left[ \begin{array}{cccc} \phi_{11} & \phi_{12} & \dots & \phi_{1k} \\ \phi_{21} & \phi_{22} & \dots & \phi_{2k}\\ \vdots & \vdots & & \vdots \\ \phi_{p1} & \phi_{p2} & \dots & \phi _{pk} \end{array} \right]_{p\times k} \end{aligned}\]

\(\Phi\) satisfies two conditions:

  • It is normalized: \(\sum _{i=1}^{p} \phi_{ij}^2 = 1\) for \(j = 1, \dots, k\) and
  • its columns are orthogonal to each other: \(\sum_{i=1}^{p} \phi_{ij}\phi_{il} = 0\) for \(j \neq l\).

(We will talk about why this is the case in a second.)
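
Both conditions together say the columns of \(\Phi\) are orthonormal, i.e. \(\Phi^T\Phi = I\), which we can verify on the penguins fit:

round(crossprod(res_pca$rotation), 10)  # t(Phi) %*% Phi = identity matrix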

Principal component analysis

Principal components are the linear combinations of the original variables:

The first principal component (PC1):

\[\begin{aligned} Z_1 = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{n1}\end{bmatrix} &= \phi_{11} \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1}\end{bmatrix} + \phi_{21} \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2}\end{bmatrix} + \dots + \phi_{p1}\begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np}\end{bmatrix} \\ &= \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p \\ \end{aligned}\]

where \(\sum _{i=1}^{p} \phi_{i1}^2 = 1\) and \(\bar{Z}_1 = 0\) (since \(\bar{X}_j = 0\))

Principal component analysis

For each element in \(Z_1\): \(z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \dots + \phi_{p1}x_{ip} = \sum_{j = 1}^p \phi_{j1} x_{ij}\)

Write down the variance of \(Z_1\):

\[\text{Var}(Z_1) = E(Z_1^2) = \frac{1}{n}\sum_{i = 1}^n z_{i1}^2=\frac{1}{n}\sum_{i = 1}^n (\sum_{j = 1}^p \phi_{j1} x_{ij})^2 \]

We would like to find the direction (the coefficients \(\phi_{j1}\)’s) that maximizes the variance of \(Z_1\):

\[\max_{\phi_{11}, \dots, \phi_{p1}} \text{Var}(Z_1) = \max_{\phi_{11}, \dots, \phi_{p1}} \frac{1}{n}\sum_{i = 1}^n (\sum_{j = 1}^p \phi_{j1} x_{ij})^2 \] subject to the normalization constraint \(\sum _{i=1}^{p} \phi_{i1}^2 = 1\).

Optimization

\[\text{Var}(Z_1) = E(Z_1^2) = \frac{1}{n}\sum_{i = 1}^n z_{i1}^2 = \frac{1}{n}\left\lVert Z_{1} \right\rVert^2 = \frac{1}{n}\left\lVert X\phi_1 \right\rVert^2 = \frac{1}{n}\phi_1^T X^T X \phi_1\]

subject to \(\phi_1^T \phi_1 = 1\). The constant \(\frac{1}{n}\) does not change the maximizer, so we drop it below.

Use Lagrange multipliers to solve the constrained optimization problem:

\[ L(\phi_1, \lambda) = \phi_1^T X^T X \phi_1 - \lambda (\phi_1^T \phi_1 - 1)\] Taking the derivative and setting it to zero gives \[X^TX\phi_1 = \lambda \phi_1\]

This is the eigenvalue equation for the matrix \(X^TX\).

Recognizing that \(X^TX\) is related to the covariance matrix \(S = \frac{1}{n-1}X^TX\), the first principal component (\(\phi_1\)) is the eigenvector of \(X^TX\) corresponding to the largest eigenvalue.
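
A quick numerical check of this statement on the penguins data (a sketch):

X <- scale(penguins_clean[, 3:6])
eigen(crossprod(X))$vectors[, 1]  # first eigenvector of t(X) %*% X
res_pca$rotation[, 1]             # same direction, possibly sign-flipped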

Computation

The first principal component (\(\phi_1\)) is the eigenvector of \(X^TX\) corresponding to the largest eigenvalue. This can be obtained through:

  1. Singular value decomposition (SVD) of \(X\) (always applies):

\[X = UDV^T\]

  2. Eigen-decomposition of \(X^TX = (n-1) S\) (since \(X^TX\) is square):

\[X^TX = V D^2 V^T\]

where

  • \(X\) is an \(n \times p\) matrix
  • \(U\) is an \(n \times n\) orthogonal matrix (\(U^TU = I\))
  • \(D\) is an \(n \times p\) diagonal matrix with non-negative elements (the square roots of the eigenvalues of \(X^TX\), arranged in descending order)
  • \(V\) is a \(p \times p\) orthogonal matrix (the eigenvectors: \(V^TV = I\))
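
A sketch verifying the decomposition and its link to prcomp() on the penguins data:

X <- scale(penguins_clean[, 3:6])
sv <- svd(X)
max(abs(X - sv$u %*% diag(sv$d) %*% t(sv$v)))  # ~ 0, so X = U D V^T
sv$d / sqrt(nrow(X) - 1)                       # matches res_pca$sdev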

Principal component analysis

The second principal component (PC2): \(Z_2 = \phi_{12}X_1 + \phi_{22}X_2 + \dots + \phi_{p2}X_p\)

The second principal component is the linear combination of \(X_1, X_2, \dots, X_p\) that has the largest variance out of all linear combinations that are uncorrelated with \(Z_1\).

That is, we solve the same maximization problem, subject to \(Z_2\) being uncorrelated with \(Z_1\).

This is equivalent to constraining \(\phi_2\) to be orthogonal to \(\phi_1\), hence

\[\max_{\phi_{12}, \dots, \phi_{p2}} \text{Var}(Z_2) = \max_{\phi_{12}, \dots, \phi_{p2}} \frac{1}{n}\sum_{i = 1}^n (\sum_{j = 1}^p \phi_{j2} x_{ij})^2 \] subject to the normalization constraint \(\sum _{i=1}^{p} \phi_{i2}^2 = 1\) and the orthogonality constraint \(\sum_{i=1}^{p} \phi_{i1}\phi_{i2} = 0\).
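
Equivalently, the resulting scores are uncorrelated, which can be checked on the penguins fit:

round(cor(predict(res_pca)), 10)  # off-diagonal correlations are all zero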