BranchGLM Vignette

1 Description

BranchGLM is a package for fitting GLMs and performing variable selection. Most functions in this package make use of RcppArmadillo, and some of them can also make use of OpenMP to perform parallel computations. This vignette introduces the package, provides examples of how to use its main functions, and briefly describes the methods those functions employ.

2 Installation

BranchGLM can be installed using the install_github() function from the devtools package.


devtools::install_github("JacobSeedorff21/BranchGLM")

3 Fitting GLMs

3.1 Optimization methods
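
The optimizer used to maximize the likelihood is chosen through the method argument of BranchGLM(). The examples in the next section use Fisher scoring (the default reported in the printed output) and L-BFGS. A minimal sketch follows, assuming only the method = "LBFGS" value that appears in the gamma example below; for the gaussian/identity case the choice of optimizer may not matter, and ?BranchGLM lists the full set of accepted values.

### Hedged sketch: selecting the optimizer via the method argument
### (method = "LBFGS" is taken from the gamma example below)

LBFGSLinearFit <- BranchGLM(mpg ~ ., data = mtcars, family = "gaussian",
                            link = "identity", method = "LBFGS")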

3.2 Examples

### Using mtcars

library(BranchGLM)

cars <- mtcars

### Fitting linear regression model with Fisher scoring

LinearFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

LinearFit
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ .
#> 
#>             Estimate      SE       t p.values  
#> (Intercept)  12.3034 49.6061  0.6573   0.5110  
#> cyl          -0.1114  2.7695 -0.1066   0.9151  
#> disp          0.0133  0.0473  0.7468   0.4552  
#> hp           -0.0215  0.0577 -0.9868   0.3237  
#> drat          0.7871  4.3341  0.4813   0.6303  
#> wt           -3.7153  5.0206 -1.9612   0.0499 *
#> qsec          0.8210  1.9369  1.1234   0.2613  
#> vs            0.3178  5.5774  0.1510   0.8800  
#> am            2.5202  5.4505  1.2254   0.2204  
#> gear          0.6554  3.9574  0.4389   0.6607  
#> carb         -0.1994  2.1964 -0.2406   0.8098  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 7.0235
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 166
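
Because a gaussian family with an identity link is ordinary linear regression, the estimates above can be checked against base R; the comparison below is only a sanity check and is not part of BranchGLM.

### Sanity check with ordinary least squares: coefficients should match the fit above

coef(lm(mpg ~ ., data = cars))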

### Fitting gamma regression with an inverse link using L-BFGS

GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse",
                      method = "LBFGS", grads = 5)

GammaFit
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ .
#> 
#>             Estimate      SE       z p.values   
#> (Intercept)  -0.0679  0.0029 -2.2012   0.0277 * 
#> cyl           0.0018  0.0002  0.9049   0.3655   
#> disp          0.0000  0.0000 -0.2273   0.8202   
#> hp           -0.0001  0.0000 -1.6589   0.0971 . 
#> drat         -0.0004  0.0003 -0.1565   0.8756   
#> wt           -0.0092  0.0003 -2.7445   0.0061 **
#> qsec          0.0017  0.0001  1.4919   0.1357   
#> vs           -0.0003  0.0003 -0.0933   0.9256   
#> am            0.0006  0.0003  0.1823   0.8554   
#> gear          0.0049  0.0002  1.8936   0.0583 . 
#> carb         -0.0010  0.0001 -0.6930   0.4883   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0087
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 21 degrees of freedom
#> AIC: 152
#> Algorithm converged in 2 iterations using L-BFGS
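
The grads argument above is only used with L-BFGS. Interpreting it as the number of past gradients kept for the limited-memory approximation is an assumption here; as a hedged illustration, the same model can be refit with a larger value.

### Hedged sketch: refitting with more stored gradients for L-BFGS

GammaFit10 <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse",
                        method = "LBFGS", grads = 10)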

3.3 Useful functions

### Predict method

predict(GammaFit)
#>  [1] 21.18615 20.58497 25.08262 19.73161 16.84319 19.66668 14.40835 22.08084
#>  [9] 23.72765 18.71060 19.08313 15.34283 16.20932 16.27113 12.61543 12.23133
#> [17] 12.09623 28.31611 31.25422 32.12053 22.43103 17.18572 17.63337 13.73915
#> [25] 15.78662 29.52933 26.61283 30.53478 16.86586 19.39727 14.08581 21.53070
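
The predict method can also be applied to new observations. The sketch below assumes it accepts a newdata argument, as most predict methods in R do.

### Predicting for the first five cars (the newdata argument is an assumption)

predict(GammaFit, newdata = cars[1:5, ])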

### Accessing the coefficients matrix

GammaFit$coefficients
#>                  Estimate           SE           z    p.values
#> (Intercept) -6.792915e-02 2.876056e-03 -2.20120066 0.027721822
#> cyl          1.760788e-03 1.813520e-04  0.90486903 0.365534775
#> disp        -7.992537e-06 3.276651e-06 -0.22732922 0.820167753
#> hp          -6.719661e-05 3.774991e-06 -1.65894623 0.097126627
#> drat        -4.270142e-04 2.542503e-04 -0.15652428 0.875619781
#> wt          -9.224571e-03 3.132502e-04 -2.74445162 0.006061209
#> qsec         1.738884e-03 1.086275e-04  1.49187199 0.135732706
#> vs          -3.101872e-04 3.096972e-04 -0.09334419 0.925630130
#> am           6.273623e-04 3.207733e-04  0.18227243 0.855368935
#> gear         4.880930e-03 2.402244e-04  1.89359104 0.058279314
#> carb        -1.026295e-03 1.380129e-04 -0.69303204 0.488289447
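
Individual columns of this matrix can be pulled out with ordinary subsetting; for example, just the point estimates (assuming the column names match the printed header above).

### Extracting only the point estimates

GammaFit$coefficients[, "Estimate"]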

4 Performing variable selection

4.1 Stepwise methods

4.1.1 Forward selection example

### Forward selection with mtcars

VariableSelection(GammaFit, type = "forward")
#> Variable Selection Info:
#> -------------------------------------------
#> Variables were selected using forward selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 27
#> 
#> Order the variables were added to the model:
#> 
#> 1). wt
#> 2). hp
#> -------------------------------------------
#> Final Model:
#> -------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt
#> 
#>             Estimate      SE       z p.values    
#> (Intercept)  -0.0089  0.0003 -3.1800   0.0015 ** 
#> hp           -0.0001  0.0000 -4.2117   <2e-16 ***
#> wt           -0.0098  0.0001 -7.0999   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0104
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 29 degrees of freedom
#> AIC: 142
#> Algorithm converged in 2 iterations using Fisher's scoring
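
The information criterion used for selection is set with the metric argument, which appears as metric = "AIC" in the keep example later in this vignette. The sketch below assumes "BIC" is also an accepted value.

### Hedged sketch: forward selection with BIC instead of AIC

VariableSelection(GammaFit, type = "forward", metric = "BIC")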

4.1.2 Backward elimination example

### Backward elimination with mtcars

VariableSelection(GammaFit, type = "backward")
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using backward elimination with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 49
#> 
#> Order the variables were removed from the model:
#> 
#> 1). vs
#> 2). drat
#> 3). am
#> 4). disp
#> 5). carb
#> 6). cyl
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>             Estimate      SE       z p.values    
#> (Intercept)  -0.0469  0.0017 -2.6323   0.0085 ** 
#> hp           -0.0001  0.0000 -2.0838   0.0372 *  
#> wt           -0.0095  0.0002 -5.2124   <2e-16 ***
#> qsec          0.0013  0.0001  1.7385   0.0821 .  
#> gear          0.0027  0.0002  1.6114   0.1071    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 2 iterations using Fisher's scoring

4.2 Branch and bound

4.2.1 Branch and bound example

  • If showprogress is TRUE, then the progress of the branch and bound algorithm will be reported occasionally.
  • Parallel computation can be used with this method and can lead to very large speedups (a hedged sketch of this appears after the example below).

### Branch and bound with mtcars

VariableSelection(GammaFit, type = "branch and bound", showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 112
#> 
#> 
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>             Estimate      SE       z p.values    
#> (Intercept)  -0.0469  0.0017 -2.6323   0.0085 ** 
#> hp           -0.0001  0.0000 -2.0838   0.0372 *  
#> wt           -0.0095  0.0002 -5.2124   <2e-16 ***
#> qsec          0.0013  0.0001  1.7385   0.0821 .  
#> gear          0.0027  0.0002  1.6114   0.1071    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 2 iterations using Fisher's scoring
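
As noted above, branch and bound can be run in parallel. The sketch below assumes the arguments are named parallel and nthreads, a guess based on the package's use of OpenMP; see ?VariableSelection for the actual argument names.

### Hedged sketch: branch and bound in parallel
### (the parallel and nthreads argument names are assumptions)

VariableSelection(GammaFit, type = "branch and bound", showprogress = FALSE,
                  parallel = TRUE, nthreads = 4)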

### Can also use a formula and data

FormulaVS <- VariableSelection(mpg ~ ., data = cars, family = "gamma",
                               link = "inverse", type = "branch and bound",
                               showprogress = FALSE)

### Number of models fit divided by the number of possible models

FormulaVS$numchecked / 2^(length(FormulaVS$variables))
#> [1] 0.109375

### Extracting final model

FormulaVS$finalmodel
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>             Estimate      SE       z p.values    
#> (Intercept)  -0.0469  0.0017 -2.6323   0.0085 ** 
#> hp           -0.0001  0.0000 -2.0838   0.0372 *  
#> wt           -0.0095  0.0002 -5.2124   <2e-16 ***
#> qsec          0.0013  0.0001  1.7385   0.0821 .  
#> gear          0.0027  0.0002  1.6114   0.1071    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 2 iterations using Fisher's scoring

4.3 Using keep

### Example of using keep

VariableSelection(mpg ~ ., data = cars, family = "gamma",
                  link = "inverse", type = "branch and bound",
                  keep = c("hp", "cyl"), metric = "AIC",
                  showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 143
#> Number of models fit: 49
#> Variables that were kept in each model:  hp, cyl
#> 
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ cyl + hp + wt + qsec + gear
#> 
#>             Estimate      SE       z p.values    
#> (Intercept)  -0.0646  0.0026 -2.3934   0.0167 *  
#> cyl           0.0014  0.0002  0.8601   0.3897    
#> hp           -0.0001  0.0000 -2.2650   0.0235 *  
#> wt           -0.0104  0.0002 -4.9820   <2e-16 ***
#> qsec          0.0018  0.0001  1.9186   0.0550 .  
#> gear          0.0039  0.0002  1.7910   0.0733 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0089
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 26 degrees of freedom
#> AIC: 143
#> Algorithm converged in 2 iterations using Fisher's scoring

4.4 Convergence issues

5 Utility functions for binomial GLMs

5.1 Table

### Predicting supplement type with the ToothGrowth data

catData <- ToothGrowth

catFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")

Table(catFit)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

5.2 ROC


catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

5.3 Cindex/AUC


Cindex(catFit)
#> [1] 0.7127778

AUC(catFit)
#> [1] 0.7127778

5.4 MultipleROCCurves

### Showing ROC plots for logit, probit, and cloglog

probitFit <- BranchGLM(supp ~ ., data = catData, family = "binomial",
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ ., data = catData, family = "binomial",
                        link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit), 
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

5.5 Using predictions


preds <- predict(catFit)

Table(preds, catData$supp)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

AUC(preds, catData$supp)
#> [1] 0.7127778

ROC(preds, catData$supp) |> plot(main = "ROC Curve", col = "deepskyblue")