Poisson Regression

Poisson regression involves regression models in which the response variable is a count, for example the number of car accidents or the number of customers in line at a reception desk. The response variables are assumed to be generated from Poisson distributions.

The general mathematical equation for Poisson regression is \[ \log(E(y_i|x_i)) = x_i^T \beta, \] so that \(E(y_i|x_i) = \exp(x_i^T \beta)\). We generate some artificial data following this model. Consider a dataset containing information on complaint calls about 100 companies over a period of 10 years. The variable count gives the number of complaints, and the dataset also contains other variables: age, sex, job, education, region, and marriage. The generate.data() function allows you to generate simulated data. By specifying support.size = 3, we let only 3 of the 6 variables mentioned above affect the expectation of the response count.

library(abess)
# simulate 100 samples with 6 covariates, 3 of which are truly active
dat <- generate.data(n = 100, p = 6, support.size = 3, family = "poisson")
colnames(dat$x) <- c("age", "sex", "job", 
                     "education", "region", "marriage")
dat$beta
## [1] 0.0000000 0.0000000 1.3161226 0.7623454 0.6557793 0.0000000
head(dat$x)
##             age         sex        job  education     region    marriage
## [1,] -0.6264538 -0.62036668  0.4094018  0.8936737  1.0744410  0.07730312
## [2,]  0.1836433  0.04211587  1.6888733 -1.0472981  1.8956548 -0.29686864
## [3,] -0.8356286 -0.91092165  1.5865884  1.9713374 -0.6029973 -1.18324224
## [4,]  1.5952808  0.15802877 -0.3309078 -0.3836321 -0.3908678  0.01129269
## [5,]  0.3295078 -0.65458464 -2.2852355  1.6541453 -0.4162220  0.99160104
## [6,] -0.8204684  1.76728727  2.4976616  1.5122127 -0.3756574  1.59396745
complain <- data.frame('count'=dat$y, dat$x)
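
As a quick check that the simulation behaves as described, the response should consist of non-negative counts:

head(complain$count)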

Best Subset Selection for Poisson Regression

The abess() function in the abess package allows you to perform best subset selection in a highly efficient way. You can call abess() with a formula, just as you do with lm(), or you can specify the design matrix x and the response y. To carry out Poisson regression, set family = "poisson".

library(abess)
# two equivalent calls: matrix interface (x, y) or formula interface
abess_fit <- abess(x = dat$x, y = dat$y, family = "poisson")
abess_fit <- abess(count ~ ., complain, family = "poisson")
class(abess_fit)
## [1] "abess"

Interpret the Result

Hold on, we aren't finished yet. After obtaining the estimator, we can explore it further. The output of the abess() function contains the best model for every candidate support size in support.size. You can use generic functions to quickly extract information about those estimators.

# draw the estimated coefficients on all candidate support size
coef(abess_fit)
## 7 x 7 sparse Matrix of class "dgCMatrix"
##                    0         1         2          3          4           5
## (intercept) 1.581038 0.4281173 0.2558214 -0.1046763 -0.1048038 -0.10377213
## age         .        .         .          .         -0.1220315 -0.12420272
## sex         .        .         .          .          .          .         
## job         .        1.5000590 1.3764397  1.3352379  1.3066533  1.30608518
## education   .        .         .          0.7481296  0.7402970  0.74057564
## region      .        .         0.5200980  0.6772817  0.7321942  0.72343088
## marriage    .        .         .          .          .         -0.01863175
##                        6
## (intercept) -0.112594621
## age         -0.150054463
## sex         -0.052978982
## job          1.310728269
## education    0.749821752
## region       0.733104826
## marriage     0.004436532
# get the deviance of the estimated model on all candidate support size
deviance(abess_fit)
## [1]  -282.3847  -792.8596  -936.6450 -1011.8841 -1013.1902 -1013.2601 -1013.6465
# print the fitted model
print(abess_fit)
## Call:
## abess.formula(formula = count ~ ., data = complain, family = "poisson")
## 
##   support.size        dev        GIC
## 1            0  -282.3847  -564.7694
## 2            1  -792.8596 -1582.9829
## 3            2  -936.6450 -1867.8174
## 4            3 -1011.8841 -2015.5593
## 5            4 -1013.1902 -2015.4350
## 6            5 -1013.2601 -2012.8386
## 7            6 -1013.6465 -2010.8750

Prediction is available for all the estimated models: just call the predict.abess() function with support.size set to the model size(s) you are interested in. If support.size is not provided, prediction is made with the model that has the best tuning value.

head(predict(abess_fit, newx = dat$x, support.size = c(3, 4)))
##              3         4
## [1,]  1.838255  1.954873
## [2,]  2.650749  2.692234
## [3,]  3.080214  3.088154
## [4,] -1.098251 -1.302053
## [5,] -2.200394 -2.211221
## [6,]  4.107201  4.103329
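
When support.size is omitted, prediction is made with the model selected by the tuning rule described above, for example:

# uses the model with the best tuning value
head(predict(abess_fit, newx = dat$x))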

The plot.abess() function helps to visualize how the fitted models change with the support size. Several types of graphs can be generated, including coef for the coefficient values, l2norm for the L2-norm of the coefficients, dev for the deviance, and tune for the tuning value. The default is coef.

plot(abess_fit, label = TRUE)

The graph shows that, starting from the densest model and shrinking the support size, the 3rd variable (job) remains in the active set until the support size reaches 0.
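
The other plot types are requested the same way; for example, the deviance path:

plot(abess_fit, type = "dev")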

We can also generate a graph of the tuning value. Remember that we used the default GIC to tune the support size.

plot(abess_fit, type = "tune")

The tuning value reaches its lowest point at a support size of 3, so we might choose the estimated model with support size 3 as our final model. In fact, the tuning values for the different model sizes are stored in tune.value of the abess object. You can get the best model size through the following call.

abess_fit$support.size[which.min(abess_fit$tune.value)]
## [1] 3

To extract any model from the abess object, we can call the extract() function with a given support.size. If support.size is not provided, the model with the best tuning value will be returned. Here we extract the model with support size equal to 3.

best.model <- extract(abess_fit, support.size = 3)
str(best.model)
## List of 7
##  $ beta        :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:3] 2 3 4
##   .. ..@ p       : int [1:2] 0 3
##   .. ..@ Dim     : int [1:2] 6 1
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : chr [1:6] "age" "sex" "job" "education" ...
##   .. .. ..$ : chr "3"
##   .. ..@ x       : num [1:3] 1.335 0.748 0.677
##   .. ..@ factors : list()
##  $ intercept   : num -0.105
##  $ support.size: num 3
##  $ support.vars: chr [1:3] "job" "education" "region"
##  $ support.beta: num [1:3] 1.335 0.748 0.677
##  $ dev         : num -1012
##  $ tune.value  : num -2016
best.model$beta
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                   3
## age       .        
## sex       .        
## job       1.3352379
## education 0.7481296
## region    0.6772817
## marriage  .

The return value is a list containing the basic information of the estimated model. The best model's estimated coefficients are very close to the true coefficients, and it successfully recovers the correct support.
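
We can verify this by comparing the fit against the truth used to generate the data:

# true coefficients (nonzero entries correspond to job, education, region)
dat$beta
# variables selected in the extracted model
best.model$support.vars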

Gamma Regression

Gamma regression can be used when you have a positive continuous response variable, such as payments for insurance claims or the lifetime of a redundant system. It is well known that the density of the Gamma distribution can be represented as a function of a mean parameter (\(\mu\)) and a shape parameter (\(\alpha\)); specifically, \[ f(y \mid \mu, \alpha)=\frac{1}{y \Gamma(\alpha)}\left(\frac{\alpha y}{\mu}\right)^{\alpha} e^{-\alpha y / \mu} {I}_{(0, \infty)}(y), \] where \(I(\cdot)\) denotes the indicator function. In the Gamma regression model, the response variables are assumed to follow Gamma distributions. Specifically, \[ y_i \sim Gamma(\mu_i, \alpha), \] where \(1/\mu_i = x_i^T\beta\).
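
Under this parameterization, the conditional mean and variance follow directly from standard properties of the Gamma distribution: \[ E(y_i \mid x_i) = \mu_i = \frac{1}{x_i^T \beta}, \qquad \operatorname{Var}(y_i \mid x_i) = \frac{\mu_i^2}{\alpha}. \]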

We apply the above procedure to Gamma regression simply by changing family = "poisson" to family = "gamma". This time we regard the response variable as a (continuous) level of satisfaction. Also, instead of GIC, we carry out cross-validation to tune the support size by setting tune.type = "cv" in abess().

# generate data
dat <- generate.data(n = 100, p = 6, support.size = 3, family = "gamma")
colnames(dat$x) <- c("age", "sex", "job", 
                     "education", "region", "marriage")
dat$beta
## [1]  0.00000  0.00000 80.37275 66.66677 64.02925  0.00000
head(dat$x)
##             age         sex        job  education     region    marriage
## [1,] -0.6264538 -0.62036668  0.4094018  0.8936737  1.0744410  0.07730312
## [2,]  0.1836433  0.04211587  1.6888733 -1.0472981  1.8956548 -0.29686864
## [3,] -0.8356286 -0.91092165  1.5865884  1.9713374 -0.6029973 -1.18324224
## [4,]  1.5952808  0.15802877 -0.3309078 -0.3836321 -0.3908678  0.01129269
## [5,]  0.3295078 -0.65458464 -2.2852355  1.6541453 -0.4162220  0.99160104
## [6,] -0.8204684  1.76728727  2.4976616  1.5122127 -0.3756574  1.59396745
# keep the response column named 'count' so the same formula can be reused
complain <- data.frame('count' = dat$y, dat$x)

abess_fit <- abess(count ~ ., complain, family = "gamma", tune.type = "cv")

# draw the estimated coefficients on all candidate support size
coef(abess_fit)
## 7 x 7 sparse Matrix of class "dgCMatrix"
##                    0        1         2         3         4         5
## (intercept) 342.3786 433.5399 449.07955 453.59162 454.62486 456.86280
## age           .        .        .         .         .       -12.73044
## sex           .        .        .         .       -17.05833 -15.30942
## job           .        .      104.53483  83.99608  85.32251  87.43632
## education     .        .        .        72.05160  71.94211  71.31821
## region        .      132.5486  73.37383  55.16201  58.87711  61.05067
## marriage      .        .        .         .         .         .      
##                       6
## (intercept) 456.9214598
## age         -12.6859059
## sex         -15.1305799
## job          87.4267021
## education    71.2965875
## region       61.0173033
## marriage      0.8783947
# get the deviance of the estimated model on all candidate support size
deviance(abess_fit)
## [1] -4.835917 -4.979948 -5.041090 -5.058884 -5.059648 -5.060127 -5.060129
# print the fitted model
print(abess_fit)
## Call:
## abess.formula(formula = count ~ ., data = complain, family = "gamma", 
##     tune.type = "cv")
## 
##   support.size       dev        cv
## 1            0 -4.835917 -4.782249
## 2            1 -4.979948 -4.879505
## 3            2 -5.041090 -5.009830
## 4            3 -5.058884 -4.568354
## 5            4 -5.059648 -4.568300
## 6            5 -5.060127 -4.569411
## 7            6 -5.060129 -4.569221
# predict results for given support sizes
head(predict(abess_fit, newx = dat$x, support.size = c(3, 4)))
##             3        4
## [1,] 611.6387 627.6912
## [2,] 624.5590 634.2712
## [3,] 695.6343 711.8548
## [4,] 376.5943 373.0828
## [5,] 357.8650 365.3057
## [6,] 751.6207 724.2588
plot(abess_fit, label = TRUE)

# tuning plot
plot(abess_fit, type = "tune")

# extract fitted model
best.model <- extract(abess_fit, support.size = 3)
str(best.model)
## List of 7
##  $ beta        :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:3] 2 3 4
##   .. ..@ p       : int [1:2] 0 3
##   .. ..@ Dim     : int [1:2] 6 1
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : chr [1:6] "age" "sex" "job" "education" ...
##   .. .. ..$ : chr "3"
##   .. ..@ x       : num [1:3] 84 72.1 55.2
##   .. ..@ factors : list()
##  $ intercept   : num 454
##  $ support.size: num 3
##  $ support.vars: chr [1:3] "job" "education" "region"
##  $ support.beta: num [1:3] 84 72.1 55.2
##  $ dev         : num -5.06
##  $ tune.value  : num -4.57
# estimated coefficients
best.model$beta
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                  3
## age        .      
## sex        .      
## job       83.99608
## education 72.05160
## region    55.16201
## marriage   .
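
As in the Poisson example, the selected support matches the variables with nonzero true coefficients:

# compare the recovered support with the truth
dat$beta
best.model$support.vars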