Introduction

2017-09-22

As the number of cyber-attacks continues to grow on a daily basis, so does the delay in threat detection. For instance, in 2015, the Office of Personnel Management (OPM) discovered that approximately 21.5 million individual records of Federal employees and contractors had been stolen. On average, the time between an attack and its discovery is more than 200 days. In the case of the OPM breach, the attack had been going on for almost a year. Currently, cyber analysts must inspect numerous potential incidents on a daily basis, but have neither the time nor the resources to examine each one thoroughly. anomalyDetection aims to shorten the time frame in which cyber-attacks go unnoticed and to aid in the discovery of these attacks among the millions of daily logged events, while minimizing the number of false positives and negatives. By incorporating a tabulated vector approach along with multivariate analysis functionality, anomalyDetection gives cyber analysts the ability to effectively and efficiently identify time periods associated with suspected anomalies for further evaluation.

Functions

anomalyDetection provides 13 functions to aid in the detection of potential cyber anomalies:

| Function | Purpose |
|---|---|
| tabulate_state_vector | Employs a tabulated vector approach to transform security log data into unique counts of data attributes based on time blocks. |
| block_inspect | Creates a list where the original data has been divided into blocks denoted in the state vector. |
| mc_adjust | Handles issues with multicollinearity. |
| mahalanobis_distance | Calculates the distance between the elements in data and the mean vector of the data for outlier detection. |
| bd_row | Indicates which variables in data are driving the Mahalanobis distance for a specific row, relative to the mean vector of the data. |
| horns_curve | Computes Horn’s Parallel Analysis to determine the factors to retain within a factor analysis. |
| factor_analysis | Reduces the structure of the data by relating the correlation between variables to a set of factors, using the eigen-decomposition of the correlation matrix. |
| factor_analysis_results | Provides easy access to factor analysis results. |
| kaisers_index | Computes scores designed to assess the quality of a factor analysis solution. It measures the tendency towards unifactoriality for both a given row and the entire matrix as a whole. |
| principal_components | Relates the data to a set of components through the eigen-decomposition of the correlation matrix, where each component explains some variance of the data. |
| principal_components_result | Provides easy access to principal component analysis results. |
| get_all_factors | Finds all factor pairs for a given integer. |
| hmat | Plots a histogram matrix using data, a state vector, or Mahalanobis distances. |

anomalyDetection also incorporates the pipe operator (%>%) from the magrittr package for streamlining function composition. To illustrate the functionality of anomalyDetection, we will use the security_logs data set, which ships with the package and mimics the kind of information that commonly appears in security logs.

# we'll use several tidyverse packages for some common manipulations
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(anomalyDetection)

security_logs
## # A tibble: 300 x 10
##    Device_Vendor Device_Product Device_Action          Src_IP
##            <chr>          <chr>         <chr>           <chr>
##  1        McAfee            NSP       Attempt   223.70.128.61
##  2         CISCO            ASA       Failure 174.110.206.174
##  3           IBM          SNIPS       Success 174.110.206.174
##  4        McAfee            NSP       Success   227.12.127.87
##  5       Juniper            SRX       Success     28.9.24.154
##  6        McAfee            NSP       Success     28.9.24.154
##  7        McAfee            NSP       Attempt     28.9.24.154
##  8        McAfee            ePO       Attempt   223.70.128.61
##  9        McAfee            ePO       Attempt 174.110.206.174
## 10         CISCO            ASA       Attempt   227.12.127.87
## # ... with 290 more rows, and 6 more variables: Dst_IP <chr>,
## #   Src_Port <int>, Dst_Port <int>, Protocol <chr>, Country_Src <chr>,
## #   Bytes_TRF <int>

State Vector Creation

To apply the statistical methods that we’ll see in the sections that follow, we employ the tabulated vector approach. This approach transforms the security log data into unique counts of data attributes based on pre-defined time blocks. As each time block is generated, the categorical fields are separated by their levels and a count of occurrences for each level is recorded into a vector. All numerical fields, such as bytes in and bytes out, are recorded as a summation within the time block. The result is what we call a “state vector matrix”.

Thus, for our security_logs data we can create our state vector matrix by dividing the 300 observations into 30 blocks of 10 observations each. The result summarizes the number of instances of each categorical level within each time block. For example, row one represents the first time block, in which there were 2 instances of ASA as the Device_Product, 5 instances of Attempt as the Device_Action, etc.

tabulate_state_vector(security_logs, 10)
## Some variables contain more than 50 levels. Only the 10 most popular levels of these variables will be tabulated.
## # A tibble: 30 x 54
##      ASA Attempt Bytes_TRF_102 Bytes_TRF_120 Bytes_TRF_160 Bytes_TRF_200
##    <dbl>   <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
##  1     2       5             0             1             1             0
##  2     0       4             1             0             0             1
##  3     2       3             0             0             1             0
##  4     5       0             0             1             0             0
##  5     3       3             0             0             0             0
##  6     2       4             0             0             0             0
##  7     2       3             1             0             0             0
##  8     3       6             0             0             0             1
##  9     0       4             0             1             0             0
## 10     2       3             1             0             0             0
## # ... with 20 more rows, and 48 more variables: Bytes_TRF_208 <dbl>,
## #   Bytes_TRF_60 <dbl>, Bytes_TRF_64 <dbl>, Bytes_TRF_70 <dbl>,
## #   Bytes_TRF_72 <dbl>, Bytes_TRF_80 <dbl>, Bytes_TRF_90 <dbl>,
## #   CISCO <dbl>, China <dbl>, Dst_IP_145.114.4.203 <dbl>,
## #   Dst_IP_151.194.233.198 <dbl>, Dst_IP_219.142.109.8 <dbl>,
## #   Dst_IP_32.73.26.223 <dbl>, Dst_IP_56.137.121.203 <dbl>,
## #   Dst_Port_20000 <dbl>, Dst_Port_25 <dbl>, Dst_Port_593 <dbl>,
## #   Dst_Port_80 <dbl>, Dst_Port_90 <dbl>, Failure <dbl>, Firewall <dbl>,
## #   IBM <dbl>, India <dbl>, Juniper <dbl>, Korea <dbl>, McAfee <dbl>,
## #   NSP <dbl>, Netherlands <dbl>, `Palo Alto Networks` <dbl>,
## #   Russia <dbl>, SNIPS <dbl>, SRX <dbl>, Src_IP_174.110.206.174 <dbl>,
## #   Src_IP_223.70.128.61 <dbl>, Src_IP_227.12.127.87 <dbl>,
## #   Src_IP_28.9.24.154 <dbl>, Src_IP_89.130.69.91 <dbl>,
## #   Src_Port_113 <dbl>, Src_Port_135 <dbl>, Src_Port_21 <dbl>,
## #   Src_Port_25 <dbl>, Src_Port_80 <dbl>, Success <dbl>, TCP <dbl>,
## #   UDP <dbl>, US <dbl>, `United Kingdom` <dbl>, ePO <dbl>
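The tabulation idea can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy `logs` data frame, the `block_length` value, and the helper `count_levels` are all hypothetical, and the sketch only covers the counting of categorical levels per block.

```r
# Hypothetical toy log: 6 events with two categorical fields.
logs <- data.frame(
  Device_Action = c("Attempt", "Success", "Attempt", "Failure", "Success", "Success"),
  Protocol      = c("TCP", "TCP", "UDP", "TCP", "UDP", "TCP"),
  stringsAsFactors = FALSE
)

block_length <- 3  # events per time block (assumed value)

# Assign each row to a consecutive block: 1, 1, 1, 2, 2, 2
block_id <- ceiling(seq_len(nrow(logs)) / block_length)

# For one field, count the occurrences of each level within each block.
count_levels <- function(values, block_id) {
  as.data.frame.matrix(table(block_id, values))
}

# Bind the per-field counts side by side: one row per block,
# one column per categorical level.
state_matrix <- do.call(cbind, unname(lapply(logs, count_levels, block_id = block_id)))
print(state_matrix)
```

Each row of `state_matrix` plays the role of one state vector, in the same spirit as the rows of the tibble shown above.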

Multicollinearity Adjustment

Prior to proceeding with any multivariate statistical analyses we should inspect the state vector for multicollinearity and remove any columns that pose an issue, to avoid problems such as matrix singularity, rank deficiency, and strong correlation values. We can use mc_adjust to handle multicollinearity in three steps. First, it removes any columns whose variance is close to or less than a minimum level of variance (min_var). Then, it removes linearly dependent columns. Finally, it removes any columns that have an absolute correlation value equal to or greater than max_cor. In order to perform multivariate analysis, we need the number of blocks to exceed the number of features, so instead of blocks of size 10, we downsize to blocks of 6 observations each.

(state_vec <- security_logs %>%
  tabulate_state_vector(6) %>%
  mc_adjust())
## Some variables contain more than 50 levels. Only the 10 most popular levels of these variables will be tabulated.
## # A tibble: 50 x 38
##      ASA Attempt China Dst_IP_145.114.4.203 Dst_IP_151.194.233.198
##    <dbl>   <dbl> <dbl>                <dbl>                  <dbl>
##  1     1       1     1                    3                      1
##  2     1       5     0                    0                      2
##  3     0       2     1                    1                      1
##  4     0       2     0                    0                      2
##  5     2       2     0                    0                      2
##  6     3       0     4                    0                      1
##  7     2       1     2                    3                      1
##  8     2       2     1                    2                      1
##  9     2       2     0                    2                      0
## 10     1       2     1                    2                      0
## # ... with 40 more rows, and 33 more variables:
## #   Dst_IP_219.142.109.8 <dbl>, Dst_IP_32.73.26.223 <dbl>,
## #   Dst_IP_56.137.121.203 <dbl>, Dst_Port_20000 <dbl>, Dst_Port_25 <dbl>,
## #   Dst_Port_593 <dbl>, Dst_Port_80 <dbl>, Dst_Port_90 <dbl>,
## #   Failure <dbl>, Firewall <dbl>, IBM <dbl>, India <dbl>, Juniper <dbl>,
## #   Korea <dbl>, McAfee <dbl>, NSP <dbl>, Netherlands <dbl>, Russia <dbl>,
## #   Src_IP_174.110.206.174 <dbl>, Src_IP_223.70.128.61 <dbl>,
## #   Src_IP_227.12.127.87 <dbl>, Src_IP_28.9.24.154 <dbl>,
## #   Src_IP_89.130.69.91 <dbl>, Src_Port_113 <dbl>, Src_Port_135 <dbl>,
## #   Src_Port_21 <dbl>, Src_Port_25 <dbl>, Src_Port_80 <dbl>,
## #   Success <dbl>, TCP <dbl>, US <dbl>, `United Kingdom` <dbl>, ePO <dbl>

By default, mc_adjust removes all columns that violate the variance, dependency, and correlation thresholds. Alternatively, we can use action = "select" as an argument, which provides interactivity where the user can select the variables that violate the correlation threshold that they would like to retain.
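To make the three removal steps concrete, here is a rough base R sketch on simulated data. It is an illustration under assumptions, not the internals of mc_adjust: the thresholds `min_var = 0.1` and `max_cor = 0.9` are chosen for the example, and the pivoted QR decomposition is just one possible way to detect linearly dependent columns.

```r
set.seed(1)
x <- data.frame(a = rnorm(20), b = rnorm(20), constant = rep(1, 20))
x$dup    <- x$a + x$b                    # exact linear combination of a and b
x$near_a <- x$a + rnorm(20, sd = 0.01)   # almost a copy of a

min_var <- 0.1   # assumed variance threshold
max_cor <- 0.9   # assumed correlation threshold

# Step 1: drop columns with (near) zero variance.
x1 <- x[, sapply(x, var) > min_var, drop = FALSE]

# Step 2: drop linearly dependent columns. The pivoted QR decomposition
# moves dependent columns to the end, so we keep the first `rank` pivots.
qr_fit <- qr(as.matrix(x1))
x2 <- x1[, sort(qr_fit$pivot[seq_len(qr_fit$rank)]), drop = FALSE]

# Step 3: for each highly correlated pair, drop the later column.
cors <- abs(cor(x2))
cors[lower.tri(cors, diag = TRUE)] <- 0
too_correlated <- apply(cors >= max_cor, 2, any)
x3 <- x2[, !too_correlated, drop = FALSE]

names(x3)
```

In this toy example only the columns `a` and `b` survive: `constant` fails the variance check, `dup` is linearly dependent, and `near_a` is too strongly correlated with `a`.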

Multivariate Statistical Analyses

With our data adjusted to eliminate multicollinearity concerns, we can now proceed with multivariate analyses to identify anomalies. First we’ll use the mahalanobis_distance function to measure how far each observation lies from the data mean, independent of scale. This is computed as

\[MD = \sqrt{(x - \bar{x})'C^{-1}(x-\bar{x})} \tag{1}\]

where \(x\) is an observation vector of \(p\) variables, \(x=(x_1, \dots, x_p)\), \(\bar{x}\) is the mean vector of the data, \(\bar{x}=(\bar{x}_1, \dots, \bar{x}_p)\), and \(C^{-1}\) is the inverse of the data covariance matrix.
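As a quick check of Eq. 1, the following base R sketch computes the distance by hand on simulated data and compares it with stats::mahalanobis, which returns the squared distance. The matrix `x` is made up for illustration and has nothing to do with the security log data.

```r
set.seed(42)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)  # 20 observations, 3 variables

x_bar <- colMeans(x)   # mean vector
C     <- cov(x)        # covariance matrix

# Eq. 1 applied to every row: sqrt((x - x_bar)' C^{-1} (x - x_bar))
centered  <- sweep(x, 2, x_bar)
md_manual <- sqrt(rowSums((centered %*% solve(C)) * centered))

# stats::mahalanobis() returns the squared distance, so take the root.
md_builtin <- sqrt(mahalanobis(x, center = x_bar, cov = C))

all.equal(md_manual, md_builtin)
```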

Here, we include output = "both" to return both the Mahalanobis distance and the absolute breakdown distances and normalize = TRUE so that we can compare relative magnitudes across our data.

state_vec %>%
  mahalanobis_distance("both", normalize = TRUE) %>%
  as_tibble
## # A tibble: 50 x 39
##          MD       ASA_BD   Attempt_BD    China_BD Dst_IP_145.114.4.203_BD
##       <dbl>        <dbl>        <dbl>       <dbl>                   <dbl>
##  1 30.57851 0.0004576264 0.0084686395 0.002386113             0.037495757
##  2 33.93982 0.0004576264 0.0228966921 0.021475020             0.022337898
##  3 29.41206 0.0224236919 0.0006273066 0.002386113             0.002393346
##  4 31.08585 0.0224236919 0.0006273066 0.021475020             0.022337898
##  5 28.77484 0.0233389447 0.0006273066 0.021475020             0.022337898
##  6 32.19485 0.0462202630 0.0163099724 0.073969512             0.022337898
##  7 27.71047 0.0233389447 0.0084686395 0.026247246             0.037495757
##  8 32.34490 0.0233389447 0.0006273066 0.002386113             0.017551205
##  9 26.15937 0.0233389447 0.0006273066 0.021475020             0.017551205
## 10 24.57524 0.0004576264 0.0006273066 0.002386113             0.017551205
## # ... with 40 more rows, and 34 more variables:
## #   Dst_IP_151.194.233.198_BD <dbl>, Dst_IP_219.142.109.8_BD <dbl>,
## #   Dst_IP_32.73.26.223_BD <dbl>, Dst_IP_56.137.121.203_BD <dbl>,
## #   Dst_Port_20000_BD <dbl>, Dst_Port_25_BD <dbl>, Dst_Port_593_BD <dbl>,
## #   Dst_Port_80_BD <dbl>, Dst_Port_90_BD <dbl>, Failure_BD <dbl>,
## #   Firewall_BD <dbl>, IBM_BD <dbl>, India_BD <dbl>, Juniper_BD <dbl>,
## #   Korea_BD <dbl>, McAfee_BD <dbl>, NSP_BD <dbl>, Netherlands_BD <dbl>,
## #   Russia_BD <dbl>, Src_IP_174.110.206.174_BD <dbl>,
## #   Src_IP_223.70.128.61_BD <dbl>, Src_IP_227.12.127.87_BD <dbl>,
## #   Src_IP_28.9.24.154_BD <dbl>, Src_IP_89.130.69.91_BD <dbl>,
## #   Src_Port_113_BD <dbl>, Src_Port_135_BD <dbl>, Src_Port_21_BD <dbl>,
## #   Src_Port_25_BD <dbl>, Src_Port_80_BD <dbl>, Success_BD <dbl>,
## #   TCP_BD <dbl>, US_BD <dbl>, `United Kingdom_BD` <dbl>, ePO_BD <dbl>

We can use this information in a modified heatmap visualization to identify outlier values across our security log attributes and time blocks. The larger and brighter the dot, the more significant the outlier and the more attention it deserves.

state_vec %>%
  mahalanobis_distance("both", normalize = TRUE) %>%
  as_tibble %>%
  mutate(Block = 1:n()) %>% 
  gather(Variable, BD, -c(MD, Block)) %>%
  ggplot(aes(factor(Block), Variable, color = MD, size = BD)) +
  geom_point()

This process could be streamlined by using the hmat command to create the histogram matrix. This command can take raw data, a state vector, or calculated Mahalanobis distances as input. Additional ggplot2 aesthetics can be added using the + operator. By default, hmat displays the top 20 most anomalous blocks. The user can change the number of blocks kept using the top argument. The arguments needed to feed into tabulate_state_vector, mc_adjust, and/or mahalanobis_distance can be entered directly into hmat.

state_vec %>% 
  hmat(input = "SV", top = 15, normalize = TRUE) +
  ggtitle("Histogram Matrix of Anomalous Blocks") +
  ylab(NULL)

We can build onto this concept with the bd_row command to identify which security log attributes in the data are driving the Mahalanobis distance. bd_row measures the relative contribution of each variable, \(x_i\), to \(MD\) by computing

\[BD_i = \Bigg|\frac{x_i - \bar{x}_i}{\sqrt{C_{ii}}} \Bigg| \tag{2}\]

where \(C_{ii}\) is the variance of \(x_i\). Furthermore, bd_row will look at a specified row and rank-order the columns by those that are driving the Mahalanobis distance. For example, the plot above identified block 17 as having the largest Mahalanobis distance, suggesting some abnormal activity may be occurring during that time block. We can drill down into that block and look at the top 10 security log attributes that are driving the Mahalanobis distance, as these may be areas that require further investigation.

state_vec %>%
  mahalanobis_distance("bd", normalize = TRUE) %>%
  bd_row(17, 10)
##                  India_BD            Src_Port_80_BD 
##                  2.974807                  2.201276 
##           Dst_Port_593_BD Dst_IP_151.194.233.198_BD 
##                  2.149510                  1.880467 
##               Firewall_BD                    NSP_BD 
##                  1.338744                  1.326790 
##   Dst_IP_219.142.109.8_BD            Dst_Port_25_BD 
##                  1.249429                  1.161628 
##         Dst_Port_20000_BD   Dst_IP_145.114.4.203_BD 
##                  1.153349                  1.141101
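Eq. 2 is simply the absolute standardized deviation of each variable, so it can be sketched in a few lines of base R. The small matrix `x` below is invented for illustration, and the sketch does not reproduce whatever additional scaling the normalize argument applies.

```r
# Toy data: 5 blocks, 2 attributes; block 5 is unusual in attribute "a".
x <- cbind(a = c(1, 2, 3, 4, 10), b = c(5, 5, 6, 4, 5))

x_bar <- colMeans(x)
sds   <- sqrt(diag(cov(x)))   # sqrt(C_ii) for each variable

# Eq. 2: |x_i - x_bar_i| / sqrt(C_ii), computed for every row and column.
bd <- abs(sweep(sweep(x, 2, x_bar), 2, sds, "/"))

# Rank the attributes that drive the distance for one row of interest.
row_of_interest <- 5
top_drivers <- sort(bd[row_of_interest, ], decreasing = TRUE)
top_drivers
```

The same matrix can be obtained with `abs(scale(x))`, which is a handy sanity check.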

Next, we can use factor analysis by first exploring the factor loadings (correlations between the columns of the state vector matrix and the suggested factors) and then comparing the factor scores against one another for anomaly detection. Factor analysis is another dimensionality reduction technique designed to identify underlying structure of the data. Factor analysis relates the correlations between variables through a set of factors to link together seemingly unrelated variables. The basic factor analysis model is

\[ X = \Lambda f + e \tag{3}\]

where \(X\) is the vector of responses \(X = (x_1, \dots, x_p)\), \(f\) are the common factors \(f = (f_1, \dots, f_q)\), \(e\) are the unique factors \(e = (e_1, \dots, e_p)\), and \(\Lambda\) is the matrix of factor loadings. For the desired results, anomalyDetection uses the correlation matrix. Factor loadings are correlations between the factors and the original variables and can thus range from -1 to 1; they indicate how much each factor affects each variable. Values close to 0 imply a weak effect on the variable.

A factor loadings matrix can be computed to understand how each original data variable is related to the resultant factors. This can be computed as

\[ \Lambda = \bigg[\sqrt{\lambda_1} e_1,\dots,\sqrt{\lambda_p} e_p \bigg] \tag{4}\]

where \(\lambda_i\) is the eigenvalue for each factor, \(e_i\) is the eigenvector for each factor, and \(p\) is the number of columns. Factor scores are used to examine the behavior of the observations relative to each factor and can be used to identify anomalies. Factor scores are calculated as

\[ \hat{f} = X_s R^{-1} \Lambda \tag{5}\]

where \(X_s\) is the standardized observations, \(R^{-1}\) is the inverse of the correlation matrix, and \(\Lambda\) is the factor loadings matrix. To simplify the results for interpretation, the factor loadings can undergo an orthogonal or oblique rotation. Orthogonal rotations assume independence between the factors while oblique rotations allow the factors to correlate. anomalyDetection utilizes the most common rotation option, known as varimax. Varimax rotates the factors orthogonally to maximize the variance of the squared factor loadings, which forces large loadings to increase and small ones to decrease, providing easier interpretation.
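Equations 4 and 5 and the varimax rotation can be sketched with base R alone. This is a minimal illustration on simulated data, not the code behind factor_analysis: the number of retained factors `q = 2` is an arbitrary assumption, and stats::varimax is used here as a stand-in for the rotation step.

```r
set.seed(7)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.5)   # induce some correlation

R   <- cor(x)                      # correlation matrix
eig <- eigen(R, symmetric = TRUE)  # eigenvalues and eigenvectors
q   <- 2                           # assumed number of retained factors

# Eq. 4: each loading column is an eigenvector scaled by sqrt(eigenvalue).
loadings <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]), nrow = q)

# Eq. 5: factor scores from the standardized observations.
x_s    <- scale(x)
scores <- x_s %*% solve(R) %*% loadings

# Orthogonal varimax rotation of the loadings.
rot <- varimax(loadings)
loadings_rotated <- unclass(rot$loadings)

dim(scores)
```

Because varimax is an orthogonal rotation, the row sums of squared loadings (the communalities) are the same before and after rotation; only their distribution across factors changes.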

To begin using factor analysis, the dimensions of the reduced state vector matrix are first passed to the horns_curve function to find the recommended set of eigenvalues.

horns_curve(state_vec)
##  [1] 3.26660240 2.92445020 2.66884991 2.45582260 2.26959714 2.10340776
##  [7] 1.95249077 1.81129927 1.68706078 1.56764718 1.45163848 1.34655602
## [13] 1.24763334 1.15488272 1.06756927 0.98415770 0.90523552 0.83150235
## [19] 0.75944722 0.69276476 0.63096733 0.57230665 0.51779921 0.46508305
## [25] 0.41633473 0.36907261 0.32447345 0.28362938 0.24595147 0.21167487
## [31] 0.18026801 0.14955726 0.12173730 0.09739073 0.07526632 0.05578220
## [37] 0.03790831 0.02186762

Next, the dimensionality is determined by finding the eigenvalues of the correlation matrix of the state vector matrix and retaining only those factors whose eigenvalues are greater than or equal to those produced by horns_curve. We use factor_analysis to reduce the state vector matrix into resultant factors. The factor_analysis function generates a list containing five outputs:

  1. fa_loadings: numerical matrix with the original factor loadings
  2. fa_scores: numerical matrix with the row scores for each factor
  3. fa_loadings_rotated: numerical matrix with the varimax rotated factor loadings
  4. fa_scores_rotated: numerical matrix with the row scores for each varimax rotated factor
  5. num_factors: numeric vector identifying the number of factors

state_vec %>%
  horns_curve() %>%
  factor_analysis(state_vec, hc_points = .) %>%
  str
## List of 5
##  $ fa_loadings        : num [1:38, 1:11] 0.2618 -0.3515 0.3082 0.0393 -0.3154 ...
##  $ fa_scores          : num [1:50, 1:11] 0.437 -0.139 -1.27 -0.599 -1.158 ...
##  $ fa_loadings_rotated: num [1:38, 1:11] 0.07471 -0.08339 0.17689 0.15318 -0.00127 ...
##  $ fa_scores_rotated  : num [1:50, 1:11] -0.0196 0.717 -0.8017 1.8066 -2.1938 ...
##  $ num_factors        : int 11
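The retention rule can be sketched as follows. This is one common formulation of Horn's parallel analysis, assumed here for illustration: the reference curve is the mean of the sorted eigenvalues obtained from many random normal data sets of the same shape. The simulated data, `n_sims`, and the exact comparison rule are assumptions and may differ from what horns_curve and factor_analysis do internally.

```r
set.seed(123)
n <- 50; p <- 6

# Toy data: the first three columns share a common signal.
common <- rnorm(n)
x <- matrix(rnorm(n * p), n, p)
x[, 1:3] <- x[, 1:3] * 0.5 + common

# Reference curve: average sorted eigenvalues of correlation matrices
# computed from pure-noise data with the same dimensions.
n_sims <- 200
sim_eigs <- replicate(
  n_sims,
  eigen(cor(matrix(rnorm(n * p), n, p)), symmetric = TRUE, only.values = TRUE)$values
)
horn <- rowMeans(sim_eigs)

# Observed eigenvalues of the data's correlation matrix.
observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Retain factors whose eigenvalue is at least the reference value.
num_factors <- sum(observed >= horn)
num_factors
```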

For easy access to these results we can use the factor_analysis_results parsing function, which parses the results either by their list name or by location. For instance, to extract the rotated factor scores you can use factor_analysis_results(data, results = fa_scores_rotated) or factor_analysis_results(data, results = 4) as demonstrated below.

state_vec %>%
  horns_curve() %>%
  factor_analysis(state_vec, hc_points = .) %>%
  factor_analysis_results(4) %>%
  as_tibble
## # A tibble: 50 x 11
##             V1         V2         V3         V4          V5         V6
##          <dbl>      <dbl>      <dbl>      <dbl>       <dbl>      <dbl>
##  1 -0.01964907 -0.4757135  1.1548628  0.2608188 -0.25520493 -0.6814042
##  2  0.71697221 -0.8379020  0.1102633  1.0096914  3.10101892 -0.2800242
##  3 -0.80167537 -0.2840280 -0.7967628  0.1489798 -0.39472050  1.0957468
##  4  1.80657967 -1.0533456  0.4324990  0.2658135 -0.04538516  1.3877732
##  5 -2.19379756 -0.3234772  1.4051072 -0.4427774  1.36299647  0.3069633
##  6  0.29429297  2.6352866 -0.8200113  0.6586836 -1.20162288  0.9783532
##  7  0.12497873  1.4739631  0.5696657 -0.4803574  0.05914320  1.2107143
##  8  1.85862960  0.1498621 -0.0714478 -0.1751160 -0.11299312 -0.4373385
##  9  0.22433178 -0.9120156  2.6645448 -1.5902883  0.25151611  0.1997990
## 10  0.30121850 -0.0585642 -0.1341263  1.3655895  0.64825915 -0.4868956
## # ... with 40 more rows, and 5 more variables: V7 <dbl>, V8 <dbl>,
## #   V9 <dbl>, V10 <dbl>, V11 <dbl>

To evaluate the quality of a factor analysis solution, Kaiser proposed the Index of Factorial Simplicity (IFS). The IFS is computed as

\[ IFS = \frac{\sum_j\big[q \sum_s v_{js}^4-(\sum_s v_{js}^2)^2\big]}{\sum_j\big[(q-1)(\sum_s v_{js}^2)^2 \big]} \tag{6}\]

where \(q\) is the number of factors, \(j\) the row index, \(s\) the column index, and \(v_{js}\) is the value in the loadings matrix. Furthermore, Kaiser created the following evaluations of the score produced by the IFS as shown below:

  1. In the .90s: Marvelous
  2. In the .80s: Meritorious
  3. In the .70s: Middling
  4. In the .60s: Mediocre
  5. In the .50s: Miserable
  6. < .50: Unacceptable

Thus, to assess the quality of our factor analysis results we apply kaisers_index to the rotated factor loadings. As the results below show, our output value of 0.624 falls in the .60s, which suggests that our results are “mediocre”.

state_vec %>%
  horns_curve() %>%
  factor_analysis(data = state_vec, hc_points = .) %>%
  factor_analysis_results(fa_loadings_rotated) %>%
  kaisers_index()
## [1] 0.6236816
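Eq. 6 translates almost directly into base R. The function name `index_of_factorial_simplicity` and the two small loadings matrices below are hypothetical and serve only to show the two extremes of the index; this is not the source of kaisers_index.

```r
# Eq. 6 for a loadings matrix with rows j (variables) and columns s (factors).
index_of_factorial_simplicity <- function(loadings) {
  q  <- ncol(loadings)
  v2 <- loadings^2
  numerator   <- sum(q * rowSums(v2^2) - rowSums(v2)^2)
  denominator <- sum((q - 1) * rowSums(v2)^2)
  numerator / denominator
}

# Each variable loads on exactly one factor: perfectly simple structure.
simple <- rbind(c(0.9, 0), c(0.8, 0), c(0, 0.7))

# Each variable loads equally on both factors: no simplicity at all.
mixed <- rbind(c(0.6, 0.6), c(0.5, 0.5))

ifs_simple <- index_of_factorial_simplicity(simple)
ifs_mixed  <- index_of_factorial_simplicity(mixed)
c(simple = ifs_simple, mixed = ifs_mixed)
```

The first matrix yields an index of 1 and the second an index of 0, which brackets the scale used in Kaiser's evaluations above.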

We can visualize the factor analysis results to show the correlation between the columns of the reduced state vector and the rotated factors. Strong negative correlations are depicted as red while strong positive correlations are shown as blue. This helps to identify which factors are correlated with each security log data attribute. Furthermore, this helps to identify two or more security log data attributes whose occurrences appear to be related. For example, this shows that Russia is highly correlated with IP address 223.70.128. If there is an abnormally large number of instances with Russian occurrences, this would be the logical IP address to start investigating.

fa_loadings <- state_vec %>%
  horns_curve() %>%
  factor_analysis(state_vec, hc_points = .) %>%
  factor_analysis_results(fa_loadings_rotated)

row.names(fa_loadings) <- colnames(state_vec)

gplots::heatmap.2(fa_loadings, dendrogram = 'both', trace = 'none', 
            density.info = 'none', breaks = seq(-1, 1, by = .25), 
            col = RColorBrewer::brewer.pal(8, 'RdBu'))

We can also visualize the rotated factor score plots to see which time blocks appear to be outliers and deserve closer attention.

state_vec %>%
  horns_curve() %>%
  factor_analysis(state_vec, hc_points = .) %>%
  factor_analysis_results(fa_scores_rotated) %>%
  as_tibble() %>%
  mutate(Block = 1:n()) %>%
  gather(Factor, Score, -Block) %>%
  mutate(Absolute_Score = abs(Score)) %>%
  ggplot(aes(Factor, Absolute_Score, label = Block)) +
  geom_text() +
  geom_boxplot(outlier.shape = NA)

This allows us to look across the factors and identify outlier blocks that may require further intra-block analysis. If we assume that an absolute rotated factor score \(\geq\) 2 represents our outlier cut-off, then we see that time blocks 4, 13, 15, 17, 24, 26, and 27 require further investigation. We saw block 17 being highlighted with the mahalanobis_distance earlier, but these other time blocks were not as obvious. By performing and comparing multiple anomaly detection approaches, we can gain greater insight and confirmation.
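The cut-off rule can also be applied programmatically rather than read off the plot. The sketch below uses a small made-up `scores` matrix in place of the rotated factor scores; the cut-off of 2 is the assumption stated above.

```r
# Hypothetical rotated factor scores: 4 blocks (rows) by 2 factors (columns).
scores <- rbind(
  c( 0.3, -0.5),
  c( 2.4,  0.1),
  c(-0.2, -2.1),
  c( 1.9,  1.0)
)

cutoff <- 2  # assumed outlier cut-off on the absolute score

# A block is flagged if any of its factor scores reaches the cut-off.
outlier_blocks <- which(apply(abs(scores) >= cutoff, 1, any))
outlier_blocks
```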

An alternative, yet similar, approach to factor analysis is principal component analysis. The goal in factor analysis is to explain the covariances or correlations between the variables. By contrast, the goal of principal component analysis is to explain as much of the total variance in the variables as possible. The first principal component of a set of features \(X_1, X_2,\dots,X_p\) is the normalized linear combination of the features

\[ Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p \tag{7}\]

that has the largest variance. By normalized, we mean that \(\sum^p_{j=1} \phi^2_{j1} = 1\). We refer to the elements \(\phi_{11},\dots,\phi_{p1}\) as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \(\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^T\). The loadings are constrained so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance. After the first principal component \(Z_1\) of the features has been determined, we can find the second principal component \(Z_2\). The second principal component is the linear combination of \(X_1,\dots,X_p\) that has maximal variance out of all linear combinations that are uncorrelated with \(Z_1\). The second principal component scores \(z_{12}, z_{22},\dots,z_{n2}\) take the form

\[z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \cdots + \phi_{p2}x_{ip} \tag{8} \]

where \(\phi_2\) is the second principal component loading vector, with elements \(\phi_{12}, \phi_{22}, \dots, \phi_{p2}\). This continues until all principal components have been computed. To perform a principal components analysis we use principal_components, which creates a list containing:

  1. pca_sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix).
  2. pca_loadings: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
  3. pca_rotated: the value of the rotated data (the centered, and scaled if requested, data multiplied by the rotation matrix) is returned.
  4. pca_center: the centering used.
  5. pca_scale: a logical response indicating whether scaling was used.

principal_components(state_vec) %>% str
## List of 5
##  $ pca_sdev    : num [1:38] 1.94 1.86 1.66 1.59 1.5 ...
##  $ pca_loadings: num [1:38, 1:38] 0.1735 -0.5479 0.2626 0.0674 -0.1162 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:38] "ASA" "Attempt" "China" "Dst_IP_145.114.4.203" ...
##   .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
##  $ pca_rotated : num [1:50, 1:38] -0.0217 -3.7951 -1.1869 -1.296 -1.8197 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
##  $ pca_center  : Named num [1:38] 0.98 2.08 0.9 1.12 1.26 1.34 1.1 1.18 1.16 1.26 ...
##   ..- attr(*, "names")= chr [1:38] "ASA" "Attempt" "China" "Dst_IP_145.114.4.203" ...
##  $ pca_scale   : logi FALSE

For easy access to these results we can use the principal_components_result parsing function, which parses the results either by their list name or by location. For instance, to extract the computed component scores as outlined in Eq. 8 you can use principal_components_result(data, results = pca_rotated) or principal_components_result(data, results = 3) as demonstrated below.

state_vec %>%
  principal_components() %>%
  principal_components_result(pca_rotated) %>%
  as_tibble
## # A tibble: 50 x 38
##           PC1        PC2        PC3         PC4        PC5          PC6
##         <dbl>      <dbl>      <dbl>       <dbl>      <dbl>        <dbl>
##  1 -0.0216638 -1.6883758 -3.4143797 -0.07627033 -0.1724198  1.984738191
##  2 -3.7951191 -1.3291247  2.8195597 -0.18815367 -2.7642938 -0.358152156
##  3 -1.1869087  1.6090340  1.7741351 -1.34876744  3.1122510  1.341390730
##  4 -1.2959757 -0.2664863 -1.6904617 -1.41708479  1.0017278 -1.384788576
##  5 -1.8197383  1.5963172 -0.3817940 -1.61250995 -1.7474613  0.287549590
##  6  4.1050528 -0.4262763  1.8513243 -2.54171914 -0.2668373  0.004921558
##  7  2.9973990  1.8193573 -1.1730791 -1.15406442 -0.3602044 -1.214368957
##  8  0.9097784 -2.5230807 -0.1668170  0.91489062 -0.4197324 -1.752438034
##  9 -0.1971346  0.3522211 -4.4512593  1.31041660  0.8168886 -0.281709611
## 10  0.1557614 -2.0685964  0.2749452  0.27561529 -1.7229484 -0.223817608
## # ... with 40 more rows, and 32 more variables: PC7 <dbl>, PC8 <dbl>,
## #   PC9 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
## #   PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>,
## #   PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>,
## #   PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
## #   PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>,
## #   PC35 <dbl>, PC36 <dbl>, PC37 <dbl>, PC38 <dbl>
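The definitions in Eqs. 7 and 8 can be verified with stats::prcomp on simulated data. This sketch assumes centered but unscaled data, matching the pca_scale value of FALSE shown earlier; whether principal_components calls prcomp internally is not confirmed here.

```r
set.seed(99)
x <- matrix(rnorm(150), nrow = 30, ncol = 5)

pca <- prcomp(x, center = TRUE, scale. = FALSE)

# The loading vector of the first component has unit sum of squares.
phi1 <- pca$rotation[, 1]
sum(phi1^2)

# Scores are the centered data multiplied by the loading vector (Eq. 7).
centered <- scale(x, center = TRUE, scale = FALSE)
z1 <- centered %*% phi1
all.equal(as.numeric(z1), as.numeric(pca$x[, 1]))

# The component variances equal the eigenvalues of the covariance matrix.
eig <- eigen(cov(x), symmetric = TRUE)
all.equal(pca$sdev^2, eig$values)

# Successive components are uncorrelated.
cor(pca$x[, 1], pca$x[, 2])
```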

We could then follow up the principal component analysis with similar visualization activities as performed post factor analysis to identify potential anomalies.