This vignette demonstrates the basic usage of the melt package, including replication code from a paper by Kim, MacEachern, and Peruggia (2024) published in the Journal of Statistical Software. For more details on the package and its applications, readers are encouraged to refer to the paper.
For a simple illustration of building a model, we apply el_mean() to the synthetic classification problem data synth.tr from the MASS package. The synth.tr object is a data.frame with 250 rows and three columns. We select two columns, xs and ys, the \(x\) and \(y\) coordinates, to build an EL model with a two-dimensional mean parameter.
library(melt)
library(MASS)
library(dplyr)
data("synth.tr", package = "MASS")
data <- dplyr::select(synth.tr, c(xs, ys))
We specify c(0, 0.5) as par in el_mean() and build an EL object with the data. The data object is implicitly coerced into a matrix since el_mean() takes a numeric matrix as an input for the data. Basic print() and show() methods display relevant information about an EL object.
fit_mean <- el_mean(data, par = c(0, 0.5))
fit_mean
#>
#> Empirical Likelihood
#>
#> Model: mean
#>
#> Maximum EL estimates:
#> xs ys
#> -0.07276 0.50436
#>
#> Chisq: 6.158, df: 2, Pr(>Chisq): 0.04601
#> EL evaluation: converged
The asymptotic chi-square statistic is displayed, along with the associated degrees of freedom and the \(p\) value.
Next, we consider an infeasible parameter value c(1, 0.5) outside the convex hull of the data to show how el_control() interacts with the model fitting functions through the control argument. The evaluation algorithm continues until the number of iterations reaches maxit_l or the negative empirical log-likelihood ratio exceeds th. Setting a large th for the infeasible value, we observe that the algorithm hits maxit_l with each element of lambda diverging quickly.
ctrl <- el_control(maxit_l = 50, th = 10000)
fit2_mean <- el_mean(data, par = c(1, 0.5), control = ctrl)
logL(fit2_mean)
#> [1] -10001.14
logLR(fit2_mean)
#> [1] -8620.776
getOptim(fit2_mean)
#> $par
#> xs ys
#> 1.0 0.5
#>
#> $lambda
#> [1] -9.908531e+14 2.757135e+14
#>
#> $iterations
#> [1] 50
#>
#> $convergence
#> [1] FALSE
#>
#> $cstr
#> [1] FALSE
In addition, melt contains another function, el_eval(), to perform the EL evaluation for other general estimating functions.
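As a sketch of this interface (the encoding of the estimating function below reflects our reading of help("el_eval"), where the columns of the input matrix hold the estimating function values), evaluating the EL for the mean of a univariate sample could look like:

```r
# Sketch (assumes melt is installed): EL evaluation via el_eval().
# For the mean, the estimating function is g(X_i, theta) = X_i - theta,
# supplied row-wise as a numeric matrix.
library(melt)
set.seed(1)
x <- rnorm(100)
theta <- 0
g <- as.matrix(x - theta)
res <- el_eval(g)
# res is a list of evaluation results, including the chi-square
# statistic, degrees of freedom, and p-value.
res$statistic
```

This mirrors what el_mean() does internally for the mean parameter, but el_eval() accepts any user-computed estimating function values.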
A similar process applies to the other model fitting functions, except that el_lm() and el_glm() require a formula object for model specification. We illustrate the use of el_lm() for regression analysis with the crime rates data UScrime available in MASS. Here, we update the control parameters for significance tests of the coefficients.
data("UScrime", package = "MASS")
ctrl <- el_control(maxit = 1000, nthreads = 2)
(fit_lm <- el_lm(y ~ Pop + Ineq, data = UScrime, control = ctrl))
#>
#> Empirical Likelihood
#>
#> Model: lm
#>
#> Maximum EL estimates:
#> (Intercept) Pop Ineq
#> 1046.749 3.251 -1.344
#>
#> Chisq: 13.95, df: 2, Pr(>Chisq): 0.0009332
#> Constrained EL: converged
The print() method also applies and shows the MELE, the overall model test result, and the convergence status. The estimates are obtained from lm.fit(). The hypothesis for the overall test is that all the parameters except the intercept are zero. The convergence status shows that a constrained optimization is performed in testing the hypothesis. If the model does not include an intercept, the overall test reduces to an EL evaluation, and the convergence status refers to that evaluation. The large chi-square value above implies that the data do not support the hypothesis, regardless of the convergence.
Note that failure to converge does not necessarily indicate unreliable test results. Most commonly, the algorithm fails to converge when the additional constraint imposed by a hypothesis is incompatible with the convex hull constraint. The control parameters affect the test results as well. The summary() method reports more details, such as the results of significance tests, where each test involves solving a constrained EL problem.
summary(fit_lm)
#>
#> Empirical Likelihood
#>
#> Model: lm
#>
#> Call:
#> el_lm(formula = y ~ Pop + Ineq, data = UScrime, control = ctrl)
#>
#> Number of observations: 47
#> Number of parameters: 3
#>
#> Parameter values under the null hypothesis:
#> (Intercept) Pop Ineq
#> 1047 0 0
#>
#> Lagrange multipliers:
#> [1] 3.504e-03 1.420e-05 -2.618e-05
#>
#> Maximum EL estimates:
#> (Intercept) Pop Ineq
#> 1046.749 3.251 -1.344
#>
#> logL: -187.9 , logLR: -6.977
#> Chisq: 13.95, df: 2, Pr(>Chisq): 0.0009332
#> Constrained EL: converged
#>
#> Coefficients:
#> Estimate Chisq Pr(>Chisq)
#> (Intercept) 1046.749 447.645 < 2e-16 ***
#> Pop 3.251 4.925 0.02647 *
#> Ineq -1.344 13.654 0.00022 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These tests are all asymptotically pivotal without explicit studentization. As a result, the output does not have standard errors.
By iteratively solving constrained EL problems over a grid of parameter values, confidence intervals for the parameters can be computed with confint(). The chi-square calibration is the default, but the user can optionally supply a critical value through the cv argument. Below, we compute asymptotic 95% confidence intervals.
confint(fit_lm)
#> lower upper
#> (Intercept) 579.7584201 1698.919267
#> Pop 0.3491718 6.352967
#> Ineq -1.9453327 -0.687159
Similarly, we obtain confidence regions for two parameters with confreg().
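As an illustrative sketch (the parm and npoints arguments follow our reading of help("confreg")), a joint 95% confidence region for the Pop and Ineq coefficients could be traced and plotted as follows:

```r
# Sketch (assumes melt and MASS are installed): joint confidence region
# for two regression coefficients via confreg().
library(melt)
data("UScrime", package = "MASS")
fit_lm <- el_lm(y ~ Pop + Ineq, data = UScrime)
# parm selects the two parameters; npoints controls the boundary resolution.
cr <- confreg(fit_lm, parm = c("Pop", "Ineq"), level = 0.95, npoints = 50)
plot(cr)  # draws the boundary of the 95% confidence region
```

Each boundary point is again obtained by solving a constrained EL problem, so a finer grid (larger npoints) costs proportionally more computation.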
Now we consider elt() for hypothesis testing, where the arguments rhs and lhs define a linear hypothesis. At least one of them must be provided. The argument lhs takes a numeric matrix or a vector. Alternatively, a character vector can be supplied to specify a hypothesis symbolically, which is convenient when there are many variables. When lhs is NULL, elt() performs the EL evaluation at rhs. When rhs is NULL, on the other hand, rhs is set to the zero vector automatically, and the EL optimization is performed with lhs. Technically, elt() can reproduce the test results from fit_mean and fit_lm. Note the equivalence between the optimization results.
elt_mean <- elt(fit_mean, rhs = c(0, 0.5))
all.equal(getOptim(elt_mean), getOptim(fit_mean))
#> [1] TRUE
elt_lm <- elt(fit_lm, lhs = c("Pop", "Ineq"))
all.equal(getOptim(elt_lm), getOptim(fit_lm))
#> [1] TRUE
In addition to specifying an arbitrary linear hypothesis through rhs and lhs, the extra arguments alpha and calibrate expand the options for testing. The argument alpha controls the significance level that determines the critical value, and calibrate chooses the calibration method. We apply the \(F\) and bootstrap calibrations to fit_mean at a significance level of 0.05. The number of threads is increased to four, with 100000 bootstrap replicates set in el_control().
ctrl <- el_control(
maxit = 10000, tol = 1e-04, nthreads = 4, b = 100000, step = 1e-05
)
(elt_mean_f <- elt(fit_mean,
rhs = c(0, 0.5), calibrate = "F", control = ctrl
))
#>
#> Empirical Likelihood Test
#>
#> Hypothesis:
#> xs = 0.0
#> ys = 0.5
#>
#> Significance level: 0.05, Calibration: F
#>
#> Statistic: 6.158, Critical value: 6.089
#> p-value: 0.04835
#> EL evaluation: converged
(elt_mean_boot <- elt(fit_mean,
rhs = c(0, 0.5), calibrate = "boot", control = ctrl
))
#>
#> Empirical Likelihood Test
#>
#> Hypothesis:
#> xs = 0.0
#> ys = 0.5
#>
#> Significance level: 0.05, Calibration: Bootstrap
#>
#> Statistic: 6.158, Critical value: 6.064
#> p-value: 0.04756
#> EL evaluation: converged
We illustrate performing multiple comparisons and constructing simultaneous confidence intervals with the thiamethoxam data, a data.frame with 165 observations and 11 variables. We fit a quasi-Poisson regression model with a log link function using el_glm() to obtain a QGLM model object.
data("thiamethoxam")
fit_glm <- el_glm(visit ~ trt + var + fruit + defoliation,
family = quasipoisson(link = "log"), data = thiamethoxam,
control = ctrl
)
print(summary(fit_glm), width.cutoff = 50)
#>
#> Empirical Likelihood
#>
#> Model: glm (quasipoisson family with log link)
#>
#> Call:
#> el_glm(formula = visit ~ trt + var + fruit + defoliation,
#> family = quasipoisson(link = "log"), data = thiamethoxam,
#> control = ctrl)
#>
#> Number of observations: 165
#> Number of parameters: 7
#>
#> Parameter values under the null hypothesis:
#> (Intercept) trtSpray trtFurrow trtSeed varGZ fruit
#> 1.972 0.000 0.000 0.000 0.000 0.000
#> defoliation phi
#> 0.000 1.726
#>
#> Lagrange multipliers:
#> [1] -0.20319 -0.18634 0.01835 0.14497 -0.17456 0.10961 -0.04870 -0.08773
#>
#> Maximum EL estimates:
#> (Intercept) trtSpray trtFurrow trtSeed varGZ fruit
#> 1.97228 -0.11281 0.08001 0.31794 -0.21088 0.05142
#> defoliation
#> -0.02044
#>
#> logL: -909.6 , logLR: -67.16
#> Chisq: 134.3, df: 6, Pr(>Chisq): < 2.2e-16
#> Constrained EL: converged
#>
#> Coefficients:
#> Estimate Chisq Pr(>Chisq)
#> (Intercept) 1.97228 421.866 < 2e-16 ***
#> trtSpray -0.11281 1.680 0.194885
#> trtFurrow 0.08001 1.014 0.314039
#> trtSeed 0.31794 11.951 0.000546 ***
#> varGZ -0.21088 9.498 0.002057 **
#> fruit 0.05142 14.470 0.000142 ***
#> defoliation -0.02044 27.147 1.89e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion for quasipoisson family: 1.726303
We assess the significance of trt by testing whether the coefficients are all zero. The output of summary() reports a small \(p\) value, with a solution different from that of the overall model test.
elt_glm <- elt(fit_glm, lhs = c("trtSpray", "trtFurrow", "trtSeed"))
summary(elt_glm)
#>
#> Empirical Likelihood Test
#>
#> Hypothesis:
#> trtSpray = 0
#> trtFurrow = 0
#> trtSeed = 0
#>
#> Significance level: 0.05, Calibration: Chi-square
#>
#> Parameter values under the null hypothesis:
#> (Intercept) trtSpray trtFurrow trtSeed varGZ fruit
#> 1.97324 0.00000 0.00000 0.00000 -0.21019 0.05958
#> defoliation phi
#> -0.02535 1.72700
#>
#> Lagrange multipliers:
#> [1] -0.097865 -0.158722 0.123355 0.251704 0.009850 -0.002071 0.007687
#> [8] 0.020678
#>
#> logL: -849.8, logLR: -7.34
#> Statistic: 14.68, Critical value: 7.815
#> p-value: 0.002112
#> Constrained EL: converged
Finally, we extend the framework to multiple testing with elmt(), which can be directly applied to the fitted model object. Its syntax is similar to elt(), where rhs and lhs now specify multiple hypotheses. For general hypotheses involving separate matrices, elmt() accepts list objects for rhs and lhs. The elmt() function employs a multivariate chi-square calibration technique based on Monte Carlo simulation to determine the common critical value. Details of the multiple testing procedures are provided in Kim, MacEachern, and Peruggia (2023). Continuing from the previous test result, we perform comparisons with the control, with the overall significance level at 0.05.
elmt_glm <- elmt(fit_glm, lhs = list("trtSpray", "trtFurrow", "trtSeed"))
summary(elmt_glm)
#>
#> Empirical Likelihood Multiple Tests
#>
#> Overall significance level: 0.05
#>
#> Calibration: Multivariate chi-square
#>
#> Hypotheses:
#> Estimate Chisq Df p.adj
#> trtSpray = 0 -0.11281 1.680 1 0.46470
#> trtFurrow = 0 0.08001 1.014 1 0.66341
#> trtSeed = 0 0.31794 11.951 1 0.00171 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Common critical value: 5.646
Note the use of a list for lhs by elmt(). While a character vector lhs acts as a single hypothesis for elt(), the elements of lhs in elmt() define distinct hypotheses for convenience. The Df column shows the marginal chi-square degrees of freedom for each hypothesis. For an object of class ELMT, confint() uses the common critical value computed by elmt().
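As a sketch of this last point (assuming confint() accepts an ELMT object as described above), simultaneous 95% confidence intervals for the three treatment coefficients follow directly from the multiple testing result:

```r
# Sketch (assumes melt is installed): simultaneous confidence intervals
# from an ELMT object. confint() reuses the common critical value from
# elmt(), so the intervals hold jointly at the overall level.
library(melt)
data("thiamethoxam", package = "melt")
fit_glm <- el_glm(visit ~ trt + var + fruit + defoliation,
  family = quasipoisson(link = "log"), data = thiamethoxam
)
elmt_glm <- elmt(fit_glm, lhs = list("trtSpray", "trtFurrow", "trtSeed"))
confint(elmt_glm)  # one interval per hypothesis, jointly calibrated
```

Because the common critical value exceeds the marginal chi-square critical value, these intervals are wider than the per-parameter intervals from confint() applied to fit_glm.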