Based on the manuscript entitled “Objective Bayes regression using I-priors” by Wicher Bergsma [2016, unpublished]. In a linear regression setting, priors can be assigned to the regression function using a vector space framework, and the posterior estimate of the regression function obtained. I-priors are a class of such priors based on the principle of maximum entropy.
This package performs linear regression modelling using I-priors in R. It is intuitively designed to be similar to lm, with both formula and non-formula based input. The parameters of an I-prior model are the scale parameters of the reproducing kernel Hilbert space (RKHS) over the set of covariates, lambda, and the standard deviation of the model errors, sigma. While the main interest of I-prior modelling is prediction, inference is also possible, e.g. via log-likelihood ratio tests.
For installation instructions and some examples of I-prior modelling, continue reading below. The package is documented with help files, and the wiki is a good source to view some discussion topics and further examples.
R/iprior makes use of C++ code, so as a prerequisite, you must have a working C++ compiler. On Debian or Ubuntu systems, get it with:
sudo apt-get install r-base-dev
or similar.

To fit an I-prior model mod, regressing y against x, where these are contained in the data frame dat, the following two commands are equivalent:
mod <- iprior(y ~ x, data = dat) # formula based input
mod <- iprior(y = dat$y, x = dat$x) # non-formula based input
The call to iprior() can be accompanied by model options in the form of model = list(), such as the choice of RKHS, the number of scale parameters, and others. Control options for the EM algorithm fit are passed through the option control = list(). Find the full list of options by typing ?iprior in R, or by visiting this wiki page.
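For instance, using only options that appear later in this README (see ?iprior for the full list), both lists can be supplied directly in the call:

library(iprior)
# request the fractional Brownian motion kernel and suppress the EM reports
mod <- iprior(y ~ x, data = dat,
              model = list(kernel = "FBM"),
              control = list(silent = TRUE))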
We will be analysing Brownlee’s stack loss plant data, which is built into R. For more information and a description of the dataset, consult the help section ?stackloss.
str(stackloss)
## 'data.frame': 21 obs. of 4 variables:
## $ Air.Flow : num 80 80 75 62 62 62 62 62 58 58 ...
## $ Water.Temp: num 27 27 25 24 22 23 24 24 23 18 ...
## $ Acid.Conc.: num 89 88 90 87 87 87 93 93 87 80 ...
## $ stack.loss: num 42 37 37 28 18 18 19 20 15 14 ...
We can fit a multiple regression model on the dataset, regressing stack.loss against the other three variables. The I-prior for our regression function lives in a “straight line” RKHS which we call the Canonical RKHS. We fit an I-prior model as follows:
mod.iprior <- iprior(stack.loss ~ ., data = stackloss)
## Iteration 0: Log-likelihood = -61.242212 .....
## Iteration 65: Log-likelihood = -56.347909
## EM complete.
The iprior package estimates the model using an EM algorithm, and by default prints a report for every 100 iterations completed. Several options are available to tweak this by supplying a list of control options (see the package help files for more details).
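For example, the per-iteration reports can be switched off entirely (the silent control option is also used later in this README):

mod.iprior <- iprior(stack.loss ~ ., data = stackloss,
                     control = list(silent = TRUE))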
The summary output was designed to look similar to an lm output. The only differences are the inclusion of the RKHS information, an EM convergence report and the final log-likelihood value.
summary(mod.iprior)
##
## Call:
## iprior(formula = stack.loss ~ ., data = stackloss)
##
## RKHS used:
## Canonical (Air.Flow, Water.Temp, Acid.Conc.)
## with multiple scale parameters.
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -7.4240 -1.7040 -0.3859 1.8700 5.7690
##
## Estimate S.E. z P[|Z>z|]
## (Intercept) 17.5238 0.6710 26.118 <2e-16 ***
## lam1.Air.Flow 0.0408 0.0250 1.634 0.102
## lam2.Water.Temp 0.2223 0.1369 1.623 0.105
## lam3.Acid.Conc. -0.0123 0.0104 -1.181 0.238
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## EM converged to within 1e-07 tolerance. No. of iterations: 65
## Standard deviation of errors: 3.075 with S.E.: 0.5045
## Log-likelihood value: -56.34791
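As mentioned in the introduction, the log-likelihood value reported here can be used for inference via log-likelihood ratio tests. The sketch below compares the full model against one that omits Acid.Conc.; it assumes a logLik() method is available for iprior objects (if not, the log-likelihood values printed in the two summaries can be used instead).

# assumption: logLik() works on iprior objects; otherwise read the values off summary()
mod.full <- iprior(stack.loss ~ ., data = stackloss,
                   control = list(silent = TRUE))
mod.red  <- iprior(stack.loss ~ Air.Flow + Water.Temp, data = stackloss,
                   control = list(silent = TRUE))
lr.stat  <- 2 * (as.numeric(logLik(mod.full)) - as.numeric(logLik(mod.red)))
pchisq(lr.stat, df = 1, lower.tail = FALSE)  # one scale parameter dropped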
The object mod.iprior is of class iprior and contains a number of things of interest that we can extract. Among other things, these include the fitted values, fitted(mod.iprior); the residuals, residuals(mod.iprior); and the model coefficients, coef(mod.iprior).
To compare the I-prior model against a regular linear regression model, we could look at the fitted versus residual plot.
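A minimal sketch of such a comparison using base R graphics and the extractor functions above (the lm fit and the plotting choices here are illustrative):

mod.lm <- lm(stack.loss ~ ., data = stackloss)
plot(fitted(mod.lm), residuals(mod.lm), pch = 1,
     xlab = "Fitted values", ylab = "Residuals")
points(fitted(mod.iprior), residuals(mod.iprior), pch = 19)
abline(h = 0, lty = 2)
legend("topleft", legend = c("lm", "iprior"), pch = c(1, 19))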
The High School and Beyond is a national longitudinal survey of students from public and private high schools in the United States, collecting information such as students’ cognitive and non-cognitive skills, high school experiences, work experiences and future plans. Papers such as Raudenbush and Bryk (2002) and Raudenbush et al. (2004) have analysed this particular dataset, as mentioned in Rabe-Hesketh and Skrondal (2008).
data(hsbsmall)
str(hsbsmall)
## 'data.frame': 661 obs. of 3 variables:
## $ mathach : num 16.663 -2.155 0.085 18.804 2.409 ...
## $ ses : num 0.322 0.212 0.682 -0.148 -0.468 0.842 0.072 0.332 -0.858 0.902 ...
## $ schoolid: Factor w/ 16 levels "1374","1433",..: 1 1 1 1 1 1 1 1 1 1 ...
This dataset contains the variables mathach, a measure of mathematics achievement; ses, the socioeconomic status of the students based on parental education, occupation and income; and schoolid, the school identifier for the students. The original dataset contains 160 groups with a varying number of observations per group (n = 7185 in total). However, this smaller set contains only 16 randomly selected schools, giving a total sample size of n = 661. This was done mainly for computational reasons when illustrating this example.
We fit an I-prior model with the aim of predicting mathach from ses, under the assumption that the effect of ses on mathach varies by schoolid. This is achieved by adding an interaction term between the two variables.
(mod.iprior <- iprior(mathach ~ ses + schoolid + ses:schoolid, data = hsbsmall))
## Iteration 0: Log-likelihood = -2871.5026 .....
## Iteration 71: Log-likelihood = -2137.7988
## EM complete.
##
## Call:
## iprior(formula = mathach ~ ses + schoolid + ses:schoolid, data = hsbsmall)
##
## RKHS used: Pearson & Canonical, with multiple scale parameters.
##
##
## Parameter estimates:
## (Intercept) lambda1 lambda2 psi
## 13.68325416 0.41778770 0.13231532 0.02804771
On a technical note, the vector space of functions over a set of nominal-type variables (such as schoolid) is called the Pearson RKHS.
A plot of the fitted lines, one for each school, is produced using the plot() function. The option plots = "fitted" produces the plot of interest, but there are other options for this as well.
plot(mod.iprior, plots = "fitted")
Instead of just “straight-line” regression functions, we could also use smoothed curves. The vector space which contains such curves is called the Fractional Brownian Motion (FBM) RKHS, with a Hurst coefficient Hurst which defaults to 0.5. The Hurst coefficient can be thought of as a smoothing parameter, but it is treated as a fixed parameter in the iprior package.
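Assuming the Hurst coefficient is set through the model list under the name Hurst, as the description above suggests (consult ?iprior for the exact argument), a smoother or rougher fit could be requested along these lines:

# the Hurst argument name is an assumption based on the description above; see ?iprior
mod <- iprior(y ~ x, data = datfbm,
              model = list(kernel = "FBM", Hurst = 0.7))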
Consider a simulated set of points in datfbm, which were generated using the mixed Gaussian function fx <- function(x) 65 * dnorm(x, mean = 2) + 35 * dnorm(x, mean = 7, sd = 1.5).
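For reference, data resembling datfbm could be simulated along these lines (the seed, design points and noise level are assumptions; the actual dataset shipped with the package is loaded below):

fx <- function(x) 65 * dnorm(x, mean = 2) + 35 * dnorm(x, mean = 7, sd = 1.5)
set.seed(1)                                # assumed seed
x <- sort(runif(100, min = -1, max = 10))  # assumed design points
y <- fx(x) + rnorm(100, sd = 1.7)          # noise level roughly matching sigma estimated below
datfbm.sim <- data.frame(y = y, x = x)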
data(datfbm)
str(datfbm)
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 3.85 8.78 6.75 6.99 4.59 ...
## $ x: num -0.168 0.133 0.255 0.412 0.424 ...
To illustrate one-dimensional smoothing, we fit an I-prior model with the FBM kernel and produce the plot of fitted values.
mod.iprior <- iprior(y ~ x, data = datfbm, model = list(kernel = "FBM"), control = list(silent = TRUE))
summary(mod.iprior)
##
## Call:
## iprior(formula = y ~ x, data = datfbm)
##
## RKHS used:
## Fractional Brownian Motion with Hurst coef. 0.5 (x)
## with a single scale parameter.
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.4880 -1.1460 -0.1998 1.0970 3.1390
##
## Estimate S.E. z P[|Z>z|]
## (Intercept) 9.9961 0.1694 58.993 < 2.2e-16 ***
## lambda 5.8860 1.2468 4.721 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## EM converged to within 1e-07 tolerance. No. of iterations: 499
## Standard deviation of errors: 1.694 with S.E.: 0.1354
## Log-likelihood value: -222.8153
plot(mod.iprior, plots = "fitted")