
1 Introduction

Estimation of covariance or correlation matrices is used across a broad spectrum of statistical applications. The most commonly used estimator, the sample covariance or correlation matrix, is rank deficient and hence unstable when the dimensionality of the problem (p) is greater than the number of samples (n). This problem has driven statisticians to propose various alternative estimators for such settings, and their theoretical properties and relative performance have been studied comprehensively (Touloumis 2015; Ledoit and Wolf 2003; Bickel and Levina 2008; Rothman, Levina, and Zhu 2009). Some of these methods are already available as R packages, for example corpcor (Schäfer and Strimmer 2004), glasso (Friedman, Hastie, and Tibshirani 2008) and PDSCE (Rothman, Levina, and Zhu 2009).

These approaches, however, are not well suited to handling large-scale missingness in data. Some of them also work well only under specific assumptions about the underlying matrix; for example, thresholding estimators assume a banded structure of the correlation matrix. In this package, we introduce a method, CorShrink, that adapts to the varying degree of missingness in the observations for each pair of features. CorShrink can be applied directly to data containing missing values, as well as to derived quantities such as vectors and matrices of correlations between features, and it allows for two formulations: an asymptotic approach and a resampling-based approach. Even in examples with no missing data, CorShrink-estimated correlations are visibly closer to the true correlations than those from the standard methods. CorShrink can also be applied to other correlation-like quantities such as partial correlations, rank correlations and cosine similarity values from word2vec models.


2 CorShrink Installation

CorShrink is a companion package to the ashr R package (Stephens 2016). Before installing CorShrink, please make sure you have the latest version of ashr.

devtools::install_github("stephens999/ashr")

The other dependencies of this package include SQUAREM, reshape2 and Matrix. Next, we install the released version of CorShrink.

install.packages("CorShrink")

The development version can be installed from GitHub as well.

library(devtools)
install_github("kkdey/CorShrink")

Then load the package with:

library(CorShrink)

3 Methods

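CorShrink proceeds in three main steps.

First, each pairwise correlation \(R_{ij}\) is converted to a Fisher z-score:

\[ Z_{ij} = 0.5 \log \left( \frac{1 + R_{ij}}{1 - R_{ij}} \right) \]

Second, the z-scores are adaptively shrunk using ash from the ashr package, given their standard errors \(s_{ij}\):

\[ Z^{\star}_{ij} := \mathrm{ash} \left( Z_{ij}, s_{ij} \right) \]

The matrix format shrinkage is performed by the CorShrinkMatrix function while the vector format shrinkage is performed by the CorShrinkVector function.

Third, the shrunk z-scores are transformed back to the correlation scale:

\[ R^{\star}_{ij} = \frac{\exp(2 Z^{\star}_{ij}) - 1}{\exp(2 Z^{\star}_{ij}) + 1} \]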
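The transform and back-transform formulas above can be checked numerically in base R. The following is a minimal sketch, not the package's internal code; the helper names fisher_z and inverse_fisher_z are hypothetical and chosen only for illustration.

```r
# Fisher z-transform of a correlation, written exactly as in the formula
# for Z_ij above. (Hypothetical helper name, not a CorShrink function.)
fisher_z <- function(r) 0.5 * log((1 + r) / (1 - r))

# Back-transform from the z scale to the correlation scale, written as in
# the formula for R*_ij above. (Hypothetical helper name.)
inverse_fisher_z <- function(z) (exp(2 * z) - 1) / (exp(2 * z) + 1)

r <- c(-0.56, 0.02, 0.9)
z <- fisher_z(r)

# Base R's atanh() and tanh() compute the same two maps.
stopifnot(isTRUE(all.equal(z, atanh(r))))
stopifnot(isTRUE(all.equal(inverse_fisher_z(z), tanh(z))))

# With no shrinkage applied in between, the round trip recovers the input.
stopifnot(isTRUE(all.equal(inverse_fisher_z(z), r)))
```

In CorShrink the shrinkage step sits between these two maps, so the back-transformed values generally differ from the input correlations.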


4 Illustration

We load an example data matrix: the person (544) by tissue sample (53) gene expression data for the gene ENSG00000166819, collected by the Genotype Tissue Expression (GTEx) Project.

data <- get(load(system.file("extdata", "sample_by_feature_data.rda",
                             package = "CorShrink")))

Checking just the first few rows and columns shows that the data contain many missing values.

data[1:5,1:5]
##            Adipose - Subcutaneous Adipose - Visceral (Omentum) Adrenal Gland
## GTEX-111CU              10.472332                     10.84006      2.721234
## GTEX-111FC               7.335392                           NA            NA
## GTEX-111VG               9.118889                           NA            NA
## GTEX-111YS              10.806459                     11.26113      3.454823
## GTEX-1122O              11.040446                     11.71497      1.522667
##            Artery - Aorta Artery - Coronary
## GTEX-111CU             NA                NA
## GTEX-111FC             NA                NA
## GTEX-111VG             NA                NA
## GTEX-111YS       1.162059                NA
## GTEX-1122O       1.674467          4.188002

4.1 Standard version CorShrink

4.1.1 CorShrinkData

We estimate the adaptively shrunk correlation matrix for this data using CorShrink.

par(mfrow=c(1,2))
out <- CorShrinkData(data, sd_boot = FALSE, image_original = TRUE, 
                     image_corshrink = TRUE, optmethod = "mixEM",
                     image.control = list(x.cex = 0.3, y.cex = 0.3))

The function returns a list with two elements, ash_cor_only and ash_cor_PD, which are two versions of the CorShrink-estimated matrix. The ash_cor_only version may not be positive definite, while ash_cor_PD is the nearest positive definite approximation to ash_cor_only.

out$ash_cor_only[1:5,1:5]
##                              Adipose - Subcutaneous
## Adipose - Subcutaneous                   1.00000000
## Adipose - Visceral (Omentum)             0.24049763
## Adrenal Gland                           -0.04421764
## Artery - Aorta                           0.01350303
## Artery - Coronary                        0.21607852
##                              Adipose - Visceral (Omentum) Adrenal Gland
## Adipose - Subcutaneous                        0.240497629  -0.044217641
## Adipose - Visceral (Omentum)                  1.000000000   0.002133414
## Adrenal Gland                                 0.002133414   1.000000000
## Artery - Aorta                                0.004513620  -0.001106213
## Artery - Coronary                             0.012460325   0.038592275
##                              Artery - Aorta Artery - Coronary
## Adipose - Subcutaneous          0.013503026        0.21607852
## Adipose - Visceral (Omentum)    0.004513620        0.01246032
## Adrenal Gland                  -0.001106213        0.03859228
## Artery - Aorta                  1.000000000        0.03911927
## Artery - Coronary               0.039119272        1.00000000
out$ash_cor_PD[1:5, 1:5]
##                              Adipose - Subcutaneous
## Adipose - Subcutaneous                   1.00000000
## Adipose - Visceral (Omentum)             0.23988372
## Adrenal Gland                           -0.04233340
## Artery - Aorta                           0.01383807
## Artery - Coronary                        0.21428301
##                              Adipose - Visceral (Omentum) Adrenal Gland
## Adipose - Subcutaneous                        0.238557101 -0.0420762179
## Adipose - Visceral (Omentum)                  1.000000000  0.0017261516
## Adrenal Gland                                 0.001727098  1.0000000000
## Artery - Aorta                                0.003347350 -0.0008208815
## Artery - Coronary                             0.013558812  0.0378046615
##                              Artery - Aorta Artery - Coronary
## Adipose - Subcutaneous         0.0137667663        0.21315748
## Adipose - Visceral (Omentum)   0.0033486200        0.01356260
## Adrenal Gland                 -0.0008216432        0.03783595
## Artery - Aorta                 1.0000000000        0.03871327
## Artery - Coronary              0.0387171456        1.00000000
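To illustrate what a nearest positive definite approximation does, the sketch below applies nearPD from the Matrix package (one of the dependencies listed above) to a small toy matrix. Whether CorShrink calls nearPD with these settings internally is an assumption here, and the toy matrix is made up for illustration.

```r
library(Matrix)  # listed among the CorShrink dependencies

# Toy symmetric matrix with unit diagonal that is NOT a valid correlation
# matrix: its smallest eigenvalue is negative. (Illustrative values only.)
toy <- matrix(c(1.0,  0.9,  0.9,
                0.9,  1.0, -0.9,
                0.9, -0.9,  1.0), nrow = 3, byrow = TRUE)
min_eig_before <- min(eigen(toy, symmetric = TRUE)$values)

# Nearest positive definite correlation matrix; corr = TRUE keeps the
# diagonal equal to 1. Assumption: these settings are for illustration and
# may not match what CorShrink uses internally.
toy_pd <- as.matrix(nearPD(toy, corr = TRUE)$mat)
min_eig_after <- min(eigen(toy_pd, symmetric = TRUE)$values)

c(before = min_eig_before, after = min_eig_after)
```

The same eigenvalue check can be run on out$ash_cor_only and out$ash_cor_PD to see whether the first estimate needed any adjustment.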

4.1.2 CorShrinkMatrix

CorShrink accepts as input not just the samples by features data matrix but also a matrix of pairwise correlations, together with a matrix giving the number of samples that contributed to each pairwise correlation.

cormat <- get(load(system.file("extdata", "corr_matrix.rda",
                             package = "CorShrink")))
nsamp <- get(load(system.file("extdata", "common_samples.rda",
                             package = "CorShrink")))
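The two files above ship with the package. For other data sets, inputs of the same shape could be derived from a samples by features matrix along the following lines. This is a sketch in base R on made-up toy data, and the object names toy, toy_cormat and toy_nsamp are hypothetical.

```r
# Toy samples-by-features matrix with missing values (illustrative only).
toy <- cbind(a = c(1.0, 2.0, 3.5, 4.0,  NA, 6.0),
             b = c(2.1,  NA, 3.0, 4.2, 5.0, 6.3),
             c = c( NA, 1.0,  NA, 2.5, 2.0, 4.0))

# Pairwise correlations, each computed from the samples in which both
# features of the pair are observed.
toy_cormat <- cor(toy, use = "pairwise.complete.obs")

# Indicator matrix: 1 if a value is observed, 0 if it is missing.
observed <- (!is.na(toy)) * 1

# Entry (i, j) counts the samples in which features i and j are both observed.
toy_nsamp <- crossprod(observed)

toy_nsamp
```

The diagonal of toy_nsamp gives the number of observed samples per feature, and the off-diagonal entries play the role of nsamp in the CorShrinkMatrix call.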

Besides the EM algorithm mixEM used for the optimization above, another option is to use a variational EM analog mixVBEM.

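par(mfrow=c(1,2))
# Shrink with the EM algorithm (mixEM)
out <- CorShrinkMatrix(cormat, nsamp, image_corshrink = TRUE, optmethod = "mixEM")
# Shrink with the variational EM analog (mixVBEM); this overwrites out
out <- CorShrinkMatrix(cormat, nsamp, image_corshrink = TRUE, optmethod = "mixVBEM")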

4.1.3 CorShrinkVector

CorShrink can be applied to vectors of correlations as well.

cor_vec <- c(-0.56, -0.4, 0.02, 0.2, 0.9, 0.8, 0.3, 0.1, 0.4)
nsamp_vec <- c(10, 20, 30, 4, 50, 60, 20, 10, 3)
out <- CorShrinkVector(corvec = cor_vec, nsamp_vec = nsamp_vec,
                       optmethod = "mixEM")
out
## [1] -0.1008374131 -0.0593356711  0.0006075647  0.0127976613  0.8944288298
## [6]  0.7935548958  0.0236649340  0.0051466351  0.0250207584

Note that for correlations computed from an adequate amount of data, such as the 5th and 6th entries above, the amount of shrinkage is minimal, while it is substantial for the 4th and 9th entries, which correspond to small numbers of samples.
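The link between sample size and shrinkage can be made concrete with a simplified calculation. The sketch below rests on two assumptions that are not taken from the package: the classical large-sample standard error 1 / sqrt(n - 3) for a Fisher z-score, and a single normal prior with a hypothetical standard deviation tau in place of the mixture prior that ash fits. It therefore illustrates the direction of the effect and does not reproduce the numbers above.

```r
nsamp_vec <- c(10, 20, 30, 4, 50, 60, 20, 10, 3)

# Classical large-sample standard error of a Fisher z-score.
# Assumption: used for illustration; CorShrink's own formula may differ.
# Note that n = 3 gives an infinite standard error under this formula.
z_se <- 1 / sqrt(nsamp_vec - 3)

# Under a single N(0, tau^2) prior (a simplification of the ash prior),
# the posterior mean of a z-score is the observed z-score times this factor.
tau <- 0.5  # hypothetical prior standard deviation
retained <- tau^2 / (tau^2 + z_se^2)

round(data.frame(nsamp = nsamp_vec, z_se = z_se, retained = retained), 3)
```

Entries with larger nsamp have smaller standard errors and keep more of their observed value, while the entries with 4 and 3 samples keep very little.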

4.2 Re-sampling version CorShrink

We have so far looked at CorShrinkData, CorShrinkMatrix and CorShrinkVector, three functions that provide adaptive shrinkage of correlations at the level of the data matrix, matrix of correlations and vector of correlations respectively. In the above examples, we have used the asymptotic version of our algorithm (see Methods). Next we show example usage of a resampling based version of CorShrink.

4.2.1 CorShrinkData - resampling

par(mfrow=c(1,2))
out <- CorShrinkData(data, sd_boot = TRUE, image_original = TRUE, 
                     image_corshrink = TRUE, optmethod = "mixEM",
                     image.control = list(x.cex = 0.3, y.cex = 0.3))
## Finished Bootstrap : 1 
## Finished Bootstrap : 2 
## ...
## Finished Bootstrap : 50

The algorithm works by first computing a Bootstrap estimate of the standard error of the Fisher z-scores for each pair and then using this estimate together with the correlations to shrink the latter.
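The idea of a bootstrap standard error for a single pair of features can be sketched in base R as follows. This is an illustration on simulated toy data, not the package's bootcorSE_calc implementation; the helper boot_z and all object names are hypothetical.

```r
set.seed(1)

# Toy data: two correlated features, with some values of y set to missing.
n <- 40
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)
y[sample(n, 8)] <- NA
toy <- cbind(x = x, y = y)

# One bootstrap replicate: resample rows with replacement, then compute the
# Fisher z-score of the correlation over rows where both features are observed.
# (Hypothetical helper, not a CorShrink function.)
boot_z <- function(dat) {
  resampled <- dat[sample(nrow(dat), replace = TRUE), , drop = FALSE]
  atanh(cor(resampled[, 1], resampled[, 2], use = "pairwise.complete.obs"))
}

# The standard deviation across replicates estimates the standard error
# of the Fisher z-score for this pair.
z_boot <- replicate(50, boot_z(toy))
z_se_boot <- sd(z_boot)
z_se_boot
```

Repeating this for every pair of features would give a matrix of standard errors of the kind passed to the shrinkage step.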

4.2.2 CorShrinkMatrix - resampling

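The same two-step breakdown can also be carried out at the level of a correlation matrix: first compute the bootstrap standard errors of the Fisher z-scores, then pass them to CorShrinkMatrix together with the correlations.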

par(mfrow = c(1,2))
zscoreSDmat <- bootcorSE_calc(data, verbose = FALSE)
out <- CorShrinkMatrix(cormat, zscore_sd = zscoreSDmat, image_original = TRUE,
                       image_corshrink = TRUE, optmethod = "mixEM")


5 Extras

So far, in all our examples, the estimated correlation between any pair of variables has been shrunk towards 0. CorShrink also allows the user to choose a non-zero shrinkage target, estimated from the data, using the mode option in the ash.control input.

One can choose a fixed non-zero target in mode as well.

par(mfrow=c(1,2))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE, 
                     optmethod = "mixEM",
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                      main_corshrink = "CorShrink (target = 0)"))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE, 
                     optmethod = "mixEM",
                     ash.control = list(mode = "estimate"),
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                      main_corshrink = "CorShrink (target = estimated)"))

In general, CorShrink assumes a normal prior for the population Fisher z-scores. Under specific settings, however, a different and possibly non-symmetric distribution, such as uniform or half-uniform, could be a better fit. This can be selected using the mixcompdist option in ash.control.

par(mfrow=c(2,2))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE, 
                     optmethod = "mixEM",
                     ash.control = list(mixcompdist = "normal"),
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                            main_corshrink = "CorShrink (normal)"))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE, 
                     optmethod = "mixEM",
                     ash.control = list(mixcompdist = "uniform"),
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                            main_corshrink = "CorShrink (uniform)"))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE, 
                     optmethod = "mixEM",
                     ash.control = list(mixcompdist = "halfuniform"),
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                        main_corshrink = "CorShrink (halfuniform)"))
out <- CorShrinkData(data, sd_boot = FALSE, image_corshrink = TRUE,  
                     optmethod = "mixEM",
                     ash.control = list(mixcompdist = "+uniform"),
                     image.control = list(x.cex = 0.3, y.cex = 0.3,
                        main_corshrink = "CorShrink (+uniform)"))


6 Acknowledgements

We would like to thank the GTEx Consortium, John Blischak, Sarah Urbut, Chiaowen Joyce Hsiao, Peter Carbonetto and all members of the Stephens Lab.


7 Session Info

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CorShrink_0.1.1 knitr_1.17      BiocStyle_2.6.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13      magrittr_1.5      MASS_7.3-47       doParallel_1.0.11
##  [5] pscl_1.5.2        SQUAREM_2017.10-1 lattice_0.20-35   foreach_1.4.3    
##  [9] plyr_1.8.4        ashr_2.1-27       stringr_1.2.0     tools_3.4.2      
## [13] parallel_3.4.2    grid_3.4.2        htmltools_0.3.6   iterators_1.0.8  
## [17] yaml_2.1.14       rprojroot_1.2     digest_0.6.12     bookdown_0.5     
## [21] Matrix_1.2-12     reshape2_1.4.2    codetools_0.2-15  evaluate_0.10.1  
## [25] rmarkdown_1.8     stringi_1.1.5     compiler_3.4.2    backports_1.1.0  
## [29] truncnorm_1.0-7

References

Bickel, Peter J, and Elizaveta Levina. 2008. “Covariance Regularization by Thresholding.” The Annals of Statistics. JSTOR, 2577–2604.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics 9 (3). Oxford University Press: 432–41.

Higham, Nicholas J. 2002. “Computing the Nearest Correlation Matrix—a Problem from Finance.” IMA Journal of Numerical Analysis 22 (3). Oxford University Press: 329–43.

Ledoit, Olivier, and Michael Wolf. 2003. “Improved Estimation of the Covariance Matrix of Stock Returns with an Application to Portfolio Selection.” Journal of Empirical Finance 10 (5). Elsevier: 603–21.

Rothman, Adam J, Elizaveta Levina, and Ji Zhu. 2009. “Generalized Thresholding of Large Covariance Matrices.” Journal of the American Statistical Association 104 (485). Taylor & Francis: 177–86.

Schäfer, Juliane, and Korbinian Strimmer. 2004. “An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks.” Bioinformatics 21 (6). Oxford University Press: 754–64.

Stephens, Matthew. 2016. “False Discovery Rates: A New Deal.” Biostatistics 18 (2). Oxford University Press: 275–94.

Touloumis, Anestis. 2015. “Nonparametric Stein-Type Shrinkage Covariance Matrix Estimators in High-Dimensional Settings.” Computational Statistics & Data Analysis 83. Elsevier: 251–61.